[libvirt-users] Windows 2008 guest causing rcu_sched to emit NMI

Andrey Korolyov andrey at xdel.ru
Thu Jan 31 17:40:05 UTC 2013


On Thu, Jan 31, 2013 at 12:11 AM, Marcelo Tosatti <mtosatti at redhat.com> wrote:
> On Wed, Jan 30, 2013 at 11:21:08AM +0300, Andrey Korolyov wrote:
>> On Wed, Jan 30, 2013 at 3:15 AM, Marcelo Tosatti <mtosatti at redhat.com> wrote:
>> > On Tue, Jan 29, 2013 at 02:35:02AM +0300, Andrey Korolyov wrote:
>> >> On Mon, Jan 28, 2013 at 5:56 PM, Andrey Korolyov <andrey at xdel.ru> wrote:
>> >> > On Mon, Jan 28, 2013 at 3:14 AM, Marcelo Tosatti <mtosatti at redhat.com> wrote:
>> >> >> On Mon, Jan 28, 2013 at 12:04:50AM +0300, Andrey Korolyov wrote:
>> >> >>> On Sat, Jan 26, 2013 at 12:49 AM, Marcelo Tosatti <mtosatti at redhat.com> wrote:
>> >> >>> > On Fri, Jan 25, 2013 at 10:45:02AM +0300, Andrey Korolyov wrote:
>> >> >>> >> On Thu, Jan 24, 2013 at 4:20 PM, Marcelo Tosatti <mtosatti at redhat.com> wrote:
>> >> >>> >> > On Thu, Jan 24, 2013 at 01:54:03PM +0300, Andrey Korolyov wrote:
>> >> >>> >> >> Thank you Marcelo,
>> >> >>> >> >>
>> >> >>> >> >> The host node locked up somewhat later than yesterday, but the problem
>> >> >>> >> >> is still here; please see the attached dmesg. The stuck process looks like
>> >> >>> >> >> root     19251  0.0  0.0 228476 12488 ?        D    14:42   0:00
>> >> >>> >> >> /usr/bin/kvm -no-user-config -device ? -device pci-assign,? -device
>> >> >>> >> >> virtio-blk-pci,? -device
>> >> >>> >> >>
>> >> >>> >> >> on the fourth VM by count.
>> >> >>> >> >>
>> >> >>> >> >> Should I try an upstream kernel instead of applying the patch to the
>> >> >>> >> >> latest 3.4, or is that useless?
>> >> >>> >> >
>> >> >>> >> > If you can upgrade to an upstream kernel, please do that.
>> >> >>> >> >
>> >> >>> >>
>> >> >>> >> With vanilla 3.7.4 there is almost no change, and the NMIs started
>> >> >>> >> firing again. The external symptoms look like this: starting at some
>> >> >>> >> VM count, maybe the third or sixth, the qemu-kvm process allocates its
>> >> >>> >> memory very slowly and in jumps, 20M-200M-700M-1.6G over minutes. The
>> >> >>> >> patch helps, of course - on both patched 3.4 and vanilla 3.7 I am able
>> >> >>> >> to kill the stuck kvm processes and the node returns to normal,
>> >> >>> >> whereas on 3.2 sending SIGKILL to the process produces zombies and a
>> >> >>> >> hung ``ps'' output (the problem, and a workaround for the case where
>> >> >>> >> no scheduler is involved, is described here:
>> >> >>> >> http://www.spinics.net/lists/kvm/msg84799.html).
>> >> >>> >
>> >> >>> > Try disabling pause loop exiting with ple_gap=0 kvm-intel.ko module parameter.
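>> >> >>> >
>> >> >>> > For reference, a minimal way to apply that (assuming kvm-intel can be
>> >> >>> > reloaded, i.e. no VMs are running at the time) is something like:
>> >> >>> >
>> >> >>> >   # persist the option, then reload the module
>> >> >>> >   echo "options kvm-intel ple_gap=0" > /etc/modprobe.d/kvm-intel.conf
>> >> >>> >   modprobe -r kvm-intel && modprobe kvm-intel
>> >> >>> >   # verify it took effect
>> >> >>> >   cat /sys/module/kvm_intel/parameters/ple_gap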
>> >> >>> >
>> >> >>>
>> >> >>> Hi Marcelo,
>> >> >>>
>> >> >>> thanks, this parameter helped to increase the number of working VMs by
>> >> >>> about half an order of magnitude, from 3-4 to 10-15. A very high SY
>> >> >>> load, 10 to 15 percent, persists at those numbers for a long time,
>> >> >>> whereas Linux guests in the same configuration do not go above one
>> >> >>> percent even under a stress benchmark. After I disabled HT, the crash
>> >> >>> happens only in long runs, and now it is a kernel panic :)
>> >> >>> The stair-like memory allocation behaviour disappeared, but another
>> >> >>> symptom leading to the crash, which I had not noticed previously,
>> >> >>> persists: if the VM count is ``enough'' for a crash, some qemu
>> >> >>> processes start to eat one core each, and they will panic the system
>> >> >>> after running in that state for tens of minutes, or as soon as I try
>> >> >>> to attach a debugger to one of them. If needed, I can log the entire
>> >> >>> crash output via netconsole; for now I have some tail of it, almost
>> >> >>> the same every time:
>> >> >>> http://xdel.ru/downloads/btwin.png
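>> >> >>>
>> >> >>> (In case it is useful: netconsole can be loaded with something along
>> >> >>> these lines, where the addresses, ports and MAC are placeholders for
>> >> >>> this setup:
>> >> >>>
>> >> >>>   modprobe netconsole \
>> >> >>>     netconsole=6665@10.0.0.1/eth0,6666@10.0.0.2/00:11:22:33:44:55
>> >> >>>
>> >> >>> and the receiving host just listens with e.g. "nc -u -l -p 6666".)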
>> >> >>
>> >> >> Yes, please log entire crash output, thanks.
>> >> >>
>> >> >
>> >> > Here you go: 3.7.4 vanilla, 16 VMs, ple_gap=0:
>> >> >
>> >> > http://xdel.ru/downloads/oops-default-kvmintel.txt
>> >>
>> >> Just an update: I was able to reproduce this on pure Linux VMs using
>> >> qemu-1.3.0 with the ``stress'' benchmark running in them - the panic
>> >> occurs at the start of a VM (with ten machines already running at that
>> >> moment). Qemu-1.1.2 generally does not reproduce it, but the host node
>> >> with the older version crashes with fewer Windows VMs (three to six
>> >> instead of ten to fifteen) than with 1.3; please see the trace below:
>> >>
>> >> http://xdel.ru/downloads/oops-old-qemu.txt
>> >
>> > Single-bit memory error, apparently. Try:
>> >
>> > 1. memtest86.
>> > 2. Boot with slub_debug=ZFPU kernel parameter.
>> > 3. Reproduce on a different machine.
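>> >
>> > For 2., with GRUB2 that is roughly: add slub_debug=ZFPU (Z=red zoning,
>> > F=sanity checks, P=poisoning, U=user tracking) to the kernel command
>> > line, e.g.:
>> >
>> >   # in /etc/default/grub:
>> >   #   GRUB_CMDLINE_LINUX="... slub_debug=ZFPU"
>> >   update-grub   # or grub2-mkconfig -o /boot/grub2/grub.cfg
>> >   reboot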
>> >
>> >
>>
>> Hi Marcelo,
>>
>> I always follow the rule: if some weird bug exists, check it on an
>> ECC-enabled machine and check the IPMI logs too before starting to
>> complain :) I have finally managed to ``fix'' the problem, but my
>> solution seems a bit strange:
>> - I noticed that if the virtual machines are started without any cgroup
>> settings, they do not trigger this bug under any conditions,
>> - I had assumed, quite wrongly, that CONFIG_SCHED_AUTOGROUP would only
>> regroup tasks that are not in any cgroup and would not touch tasks
>> already inside an existing cpu cgroup. A first look at the 200-line
>> patch shows that autogrouping always applies to all tasks, so I tried
>> to disable it (see the sketch below),
>> - wild magic: the VMs no longer crash the host; even with 30+ of them
>> they work fine.
>> I still don't know what exactly triggered this, or whether I will face
>> it again under different conditions, so my solution is more likely a
>> patch of mud in the wall of a dam than a proper fix.
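>>
>> (A quick sketch of the usual knobs for disabling autogrouping without
>> a rebuild, assuming the kernel was built with CONFIG_SCHED_AUTOGROUP:
>>
>>   sysctl kernel.sched_autogroup_enabled=0   # runtime switch
>>   # or boot with the "noautogroup" kernel parameter
>> )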
>>
>> There seem to be two possible origins of such an error - either a very
>> hideous race condition involving cgroups and processes like qemu-kvm
>> that cause frequent context switches, or a simple incompatibility
>> between NUMA, the logic of CONFIG_SCHED_AUTOGROUP, and qemu VMs already
>> doing work inside a cgroup - since I have not observed these errors on
>> a single NUMA node (that is, a desktop) under relatively heavier load.
>
> Yes, it would be important to track it down, though. Enabling the
> slub_debug=ZFPU kernel parameter should help.
>
>

Hi Marcelo,

I have finally beaten this one. As I mentioned before in the off-list
message, the nested cgroups that libvirt creates for vcpu/emulator
threads were the root cause of this problem. Today we disabled creation
of cgroups deeper than the qemu/vm/ level, and the trace did not show
up under various workloads. So for libvirt itself, it may be a feature
request to create thread-level cgroups only if some element of the VM's
config actually requires them. As for cgroups, it seems fatal to have a
very large number of nested elements inside the cpu controller with
qemu-kvm, or a very large number of threads - since I have a limited
number of cores on each node, I cannot prove which of the two, the
complicated cgroup hierarchy or some side effect of putting threads
into dedicated cgroups, caused all this pain. And, of course, without
Windows(tm) the bug is very hard to observe in the wild, since almost
no synthetic test I have run on the Linux VMs is able to show it.
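
For anyone who wants to check how deep their own hierarchy goes, a
rough way to list the cgroups libvirt created (the exact mount point
and layout depend on the distribution and libvirt version) is:

  # per-VM directories live under the qemu group; vcpu*/emulator
  # subdirectories below that level are the nesting in question
  find /sys/fs/cgroup/cpu/libvirt/qemu -mindepth 1 -type d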



