[rhos-list] Nova-network v.s. Quantum in Openstack preview

Sat Feb 16 18:55:56 UTC 2013

Hi, Eduardo:

Thanks a lot for the comments! They are really helpful! Based on your suggestion, I did a quick verification of the CPU flags, and the result is very ugly. KVM crashes with most of the flags I tested:

No crash: fxsr
Crash:  sse2, sse, mmx, clflush, pse36, pat, cmov, mca

The test was conducted using qemu-kvm-rhev-0.12.1.2-2.351.el6.x86_64. I ran yum update last night before going to bed, and qemu-kvm-0.12.1.2-2.335.el6.x86_64 was obsoleted by qemu-kvm-rhev-0.12.1.2-2.351.el6.x86_64.

I didn't exhaust the list, but all of these flags should be supported by Nehalem. At this point, do you think we may have a CPU defect? Please see the attached TXT file for details.
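
One quick way to confirm that the host CPU advertises the flags tested above is to compare them against the "flags" line of /proc/cpuinfo. The following is only a minimal sketch: the helper name missing_flags is hypothetical, and it assumes the flag names are spelled the same way in /proc/cpuinfo as on the qemu-kvm command line, which holds for the nine flags listed here but not for every QEMU feature name.

```python
# Flags tested in this message (one did not crash, the rest did).
TESTED_FLAGS = ["fxsr", "sse2", "sse", "mmx", "clflush",
                "pse36", "pat", "cmov", "mca"]


def missing_flags(cpuinfo_text, wanted):
    """Return the flags in `wanted` that the first CPU entry does not list.

    `cpuinfo_text` is the raw text of /proc/cpuinfo (Linux, x86).
    """
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            present = set(value.split())
            return sorted(set(wanted) - present)
    raise ValueError("no 'flags' line found in cpuinfo text")
```

On the host, this could be used as missing_flags(open("/proc/cpuinfo").read(), TESTED_FLAGS); an empty list would mean the host reports every tested flag, which would point away from a simple "flag not supported" explanation.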

Thanks!

Shixiong

On Feb 15, 2013, at 10:52 PM, Eduardo Habkost <ehabkost at redhat.com> wrote:

Hi, all,

I'm Eduardo from the KVM team. Some comments and questions below:

On Sat, Feb 16, 2013 at 02:31:24AM +0000, Shixiong Shang (shshang) wrote:
Hi, Perry and Karen:

I did some further investigation tonight. The VM instance was
initiated with a lot of parameters. Here is the one line
related to the CPU model:

-cpu Nehalem,+rdtscp,+vmx,+ht,+ss,+acpi,+ds,+vme -enable-kvm

Based on the qemu-kvm command and the cpu_map.xml file, Nehalem and all of
the flags are supported. However, when I tried to perform a CPU check, KVM
crashed again. The backtrace is identical to the ones I saw in the failed
VM instance log:

The "check" parameter asks QEMU to print warnings if some CPU features
are not supported by the host CPU, but QEMU will start the guest
normally after that. So, if you got to the "VNC server running" stage,
it means all CPU features from the QEMU "Nehalem" CPU model should be
supported by your host CPU + kernel, and the crash happened while the
guest was already running, not during the CPU feature check.

[root at as-cmp1 libvirt]# /usr/libexec/qemu-kvm -cpu Nehalem,check

I am assuming you used just the above command with no extra parameters
(meaning you don't even need a disk image to reproduce the bug), right?

VNC server running on `::1:5900'
KVM internal error. Suberror: 2

How long does the error message take to appear, after starting qemu-kvm?

extra data[0]: 80000003
extra data[1]: 80000603

The data above is weird: the CPU is reporting that it was trying to
deliver an int3 (but with the interrupt type bits set to "external
interrupt", which doesn't make sense), and got another int3 interrupt
generated when trying to deliver it.

It doesn't look right (the codes don't seem to make sense), and even if
it was right, simply running qemu-kvm with no arguments shouldn't end up
generating int3 interrupts at all.
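
The reading of the two "extra data" words can be reproduced with a small decoder. This is a sketch under stated assumptions: it assumes both values use the VMX interruption-information layout from the Intel SDM (bits 7:0 vector, bits 10:8 event type, bit 11 error-code valid, bit 31 valid), and that extra data[0] is the event that was being delivered while extra data[1] is the event raised during delivery. The function name decode_event_info is made up for illustration.

```python
# Event type encodings from the VMX interruption-information format.
EVENT_TYPES = {
    0: "external interrupt",
    2: "NMI",
    3: "hardware exception",
    4: "software interrupt",
    5: "privileged software exception",
    6: "software exception",
}


def decode_event_info(value):
    """Split a 32-bit interruption-information word into its fields."""
    return {
        "valid": bool((value >> 31) & 1),
        "vector": value & 0xFF,
        "type": EVENT_TYPES.get((value >> 8) & 0x7, "reserved/other"),
        "error_code_valid": bool((value >> 11) & 1),
    }
```

With these assumptions, 0x80000003 decodes to vector 3 with type "external interrupt", and 0x80000603 decodes to vector 3 with type "software exception", which matches the description above: vector 3 is the breakpoint (int3) vector, and pairing it with the external-interrupt type is the part that does not make sense.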

I would test this on other machines, to make sure this is really not a
hardware defect. Could you send the contents of /proc/cpuinfo? If you
are able to install the x86info package, the output of 'x86info -v -a'
would be useful, too.

rax 00000000000003c3 rbx 00000000000008f2 rcx 000000000000013f rdx 000000000000ffdf
rsi 0000000000000006 rdi 000000000000c993 rsp 00000000000003aa rbp 000000000000f000
r8  0000000000000000 r9  0000000000000000 r10 0000000000000000 r11 0000000000000000
r12 0000000000000000 r13 0000000000000000 r14 0000000000000000 r15 0000000000000000
rip 00000000000010e2 rflags 00000286

Interesting, RIP is different from your previous report. Does the value
change if you run "/usr/libexec/qemu-kvm -cpu Nehalem,check" again?

cs c000 (000c0000/0000ffff p 1 dpl 3 db 0 s 1 type 3 l 0 g 0 avl 0)
ds c000 (000c0000/0000ffff p 1 dpl 3 db 0 s 1 type 3 l 0 g 0 avl 0)
es f000 (000f0000/0000ffff p 1 dpl 3 db 0 s 1 type 3 l 0 g 0 avl 0)
ss 0000 (00000000/0000ffff p 1 dpl 3 db 0 s 1 type 3 l 0 g 0 avl 0)
fs 0000 (00000000/0000ffff p 1 dpl 3 db 0 s 1 type 3 l 0 g 0 avl 0)
gs 0000 (00000000/0000ffff p 1 dpl 3 db 0 s 1 type 3 l 0 g 0 avl 0)
tr 0000 (feffd000/00002088 p 1 dpl 0 db 0 s 0 type b l 0 g 0 avl 0)
ldt 0000 (00000000/0000ffff p 1 dpl 0 db 0 s 0 type 2 l 0 g 0 avl 0)
gdt fc558/37
idt 0/3ff
cr0 10 cr2 0 cr3 0 cr4 0 cr8 0 efer 0
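
For reference, the guest address where execution stopped can be worked out from the dump above. This is a rough sketch that assumes the guest was still in real mode (cr0 is 10, so the protection-enable bit is clear), in which case the linear address is simply the segment base plus the 16-bit offset. The helper name linear_address is hypothetical.

```python
def linear_address(segment_base, offset):
    """Real-mode style address: segment base plus a 16-bit offset."""
    return segment_base + (offset & 0xFFFF)


# Values copied from the register dump in this message.
CS_BASE = 0x000C0000
RIP = 0x10E2
SS_BASE = 0x00000000
RSP = 0x03AA

code_address = linear_address(CS_BASE, RIP)    # where the guest was executing
stack_address = linear_address(SS_BASE, RSP)   # top of the guest stack
```

Under that assumption, cs:rip resolves to 0xc10e2. The 0xC0000 region conventionally holds the video BIOS option ROM on PC-compatible systems, so the guest may have been running option ROM code at the time; the dump alone does not confirm this.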

FYI, I am using this qemu-kvm version:
qemu-kvm-0.12.1.2-2.335.el6.x86_64

Thanks. What are the versions of the kernel, seabios, vgabios, and gpxe
packages?

A potential workaround is to use a generic CPU model, such as KVM64,
at the cost of a performance penalty. I will give it a try and keep you
posted. In the meantime, if you can think of anything else, please let
me know at your earliest convenience.

If other CPU models work, it may simply indicate that some feature bit
enabled by the Nehalem CPU model may be triggering the problem.

If that's the case, one way to find out which feature is causing the
problem is to try:

$ /usr/libexec/qemu-kvm -cpu qemu64,+sse2,+sse,+fxsr,+mmx,+clflush,+pse36,+pat,+cmov,+mca,+pge,+mtrr,+sep,+apic,+cx8,+mce,+pae,+msr,+tsc,+pse,+de,+fpu,+popcnt,+x2apic,+sse4.2,+sse4.1,+cx16,+ssse3,+sse3,+i64,+syscall,+xd,+lahf_lm,model=26

I expect the bug to be reproduced easily using the above command-line.
After that, you can gradually remove features from the command-line,
until we find which one is triggering the problem.
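
The "remove features gradually" step can be made less error-prone by generating the -cpu strings instead of editing them by hand. The following is a minimal sketch, not a tested procedure: the helper names cpu_arg and drop_one_variants are hypothetical, the feature list is copied from the command line above, and the code only builds argument strings without launching qemu-kvm.

```python
# Feature list copied from the suggested qemu64 command line.
FEATURES = (
    "sse2 sse fxsr mmx clflush pse36 pat cmov mca pge mtrr sep apic cx8 "
    "mce pae msr tsc pse de fpu popcnt x2apic sse4.2 sse4.1 cx16 ssse3 "
    "sse3 i64 syscall xd lahf_lm"
).split()


def cpu_arg(features, base="qemu64", model=26):
    """Build the value for -cpu: base model, +feature flags, then model=N."""
    parts = [base] + ["+" + name for name in features]
    parts.append("model=%d" % model)
    return ",".join(parts)


def drop_one_variants(features):
    """Yield (dropped_feature, cpu_argument) with one feature left out."""
    for dropped in features:
        remaining = [name for name in features if name != dropped]
        yield dropped, cpu_arg(remaining)
```

Each generated string can be passed to qemu-kvm after -cpu. If the crash disappears only when one particular feature is left out, that feature is the likely trigger; if it never disappears, more than one feature (or something other than the feature bits) is probably involved.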

Thanks for your help!

Shixiong

--
Eduardo

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: qemu-kvm cpu checks.txt
URL: <http://listman.redhat.com/archives/rhos-list/attachments/20130216/072a33c1/attachment.txt>