[vfio-users] VFIO and random host crashes

Colin Godsey crgodsey at gmail.com
Wed May 18 15:46:37 UTC 2016


@alex

I was thinking along the same lines. I initially had a very highly tuned
setup, but for debugging I’m back to the most generic configuration I can
get: as many defaults as possible, no tuning at all. I even recently rebuilt
the whole host from scratch to make sure I didn’t have any weird modprobe
configs lying around.
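
For reference, something like this is what I mean by checking for leftovers
(the paths are just the standard Ubuntu locations):

    # look for leftover VFIO/KVM tweaks in module configs and boot parameters
    grep -rniE 'vfio|kvm|hugepages|isolcpus' /etc/modprobe.d/ /etc/modules-load.d/ 2>/dev/null
    grep -E 'vfio|kvm|hugepages|isolcpus|nohz' /etc/default/grub
    cat /proc/cmdline    # what the running kernel actually booted with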

As far as overlapping cores go: I have the Skylake i7, so 8 logical cores
total on the host. One VM runs 4 virtual cores and the other only 3, so
I’m actually under-committed by one core.
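
For completeness, checking the topology and pinning things explicitly would
look something like this (the domain name and CPU numbers are placeholders):

    lscpu -e                                    # logical CPU -> core/socket mapping
    virsh vcpupin win10-a                       # show current pinning for a guest
    virsh vcpupin win10-a 0 2 --live --config   # pin vCPU 0 to host CPU 2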

The nohz thing is really interesting… I have tried both the lowlatency and
generic Ubuntu kernels. I can’t remember for sure, but I think the
lowlatency kernel didn’t crash as much; I’ll have to confirm.
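
When I do confirm, something like this should show which flavour is running
and what the relevant timer options are (standard Ubuntu paths assumed):

    uname -r                              # which kernel flavour is booted
    grep -E 'CONFIG_NO_HZ|CONFIG_HZ=|CONFIG_PREEMPT' /boot/config-$(uname -r)
    cat /proc/cmdline | tr ' ' '\n' | grep -E 'nohz|isolcpus|rcu_nocbs'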

On Wed, May 18, 2016 at 9:35 AM Colin Godsey <crgodsey at gmail.com> wrote:

> I’ve been running as much monitoring as possible through these last few
> crashes; thankfully the SSH sessions lock up rather than disconnecting, so
> I can still see the last stats.
>
> top: looks totally normal when it crashes, maybe 60% CPU util,
> swap/cache/sys all look normal.
> context switches: seem mostly normal: maybe ~4k voluntary, ~300
> non-voluntary in total.
> disk usage: swinging up and down constantly… I use ZFS for the VMs, which
> I’m not entirely ruling out yet, but I think if anything it may contribute
> to power fluctuations via the disks (4 magnetic total). The entire VM host
> is on its own regular ext4 drive though, so I’m hoping that helps rule out
> ZFS kernel/software issues.
> interrupts: normal
>
>
> On Wed, May 18, 2016 at 9:24 AM Brett Peckinpaugh <bp10 at erylflynn.com>
> wrote:
>
>> Are you monitoring processor utilization? Two systems like you describe
>> could tax a host. Maybe it’s CPU starvation?
>>
>> On May 18, 2016 7:47:11 AM PDT, Colin Godsey <crgodsey at gmail.com> wrote:
>>
>>> I’ve been running a dual gaming VM rig (2x dedicated GPUs) for a little
>>> while now, and everything works perfectly except that when both VMs are
>>> under load, after an hour or so I get a hard crash and/or reboot. It will
>>> either reboot itself, or hang so badly that the physical ‘reset’ button
>>> on the box doesn’t work.
>>>
>>> There is zero evidence in the Linux logs about the crash: I literally
>>> just see one of a few standard cron jobs as the last syslog entry, then
>>> the next line is the kernel boot/start-up. The only real evidence I get
>>> is that, rarely, I can hear Windows crash first. Or Windows will crash
>>> and I’ll get maybe another second or two of ’top’ before the whole system
>>> goes down. I find it extremely odd that there’s some sort of (albeit
>>> fast) degradation, but absolutely nothing interesting in the logs.
>>>
>>> So I’m pretty sure it’s something hardware related: either the PSU, or my
>>> mobo is crap and underpowered somewhere. Under load there are about 5
>>> drives, 2 GTX GPUs, and GbE (~200 Mbps) all going constantly, so it seems
>>> plausible it could be something chipset related.
>>>
>>> *So my question really is: is there ANY kind of kernel/VFIO software-level
>>> issue that could cause this crash? Or does this just sound like
>>> hardware?* I’ve tried several different power configurations at this
>>> point; I just want to be as sure as possible that it’s hardware before I
>>> start replacing more things =\
>>>
>>> This is an up-to-date Ubuntu Xenial, not really running anything
>>> special. I’ve gotten away with running my VMs about as plain as possible,
>>> no funny workarounds or anything: OVMF, Windows 10, Hyper-V flags. Skylake
>>> i7 on a Z170M board.
>>>
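
A side note on the "nothing in the logs" problem above: one way to catch the
last kernel messages past a hard lockup is netconsole, which streams the
kernel log to another box over UDP. A rough sketch, with placeholder
addresses, MAC, and interface names:

    # on the crashing host: forward kernel messages to 192.168.1.50:6666
    # (IPs, MAC, port, and interface name below are all placeholders)
    modprobe netconsole netconsole=6665@192.168.1.20/eth0,6666@192.168.1.50/00:11:22:33:44:55
    # on the receiving machine:
    nc -u -l 6666    # or 'nc -u -l -p 6666', depending on the netcat variant

Messages that never make it to disk can sometimes still make it out over the
wire before the box dies.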