[libvirt] [RFC PATCH 0/2] nodeinfo: PPC64: Fix topology and siblings info on capabilities and nodeinfo

Tue May 31 06:08:41 UTC 2016

On Tue, 17 May 2016 11:49:22 +0200
Andrea Bolognani <abologna at redhat.com> wrote:

> On Tue, 2016-05-10 at 17:59 -0400, Cole Robinson wrote:
> > On 05/05/2016 02:48 PM, Andrea Bolognani wrote:  
> > > On Fri, 2016-01-29 at 01:32 -0500, Shivaprasad G Bhat wrote:
> > > 
> > > ** Guest threads limit **
> > > 
> > > My dual-core laptop will happily run a guest configured with
> > > 
> > >   <cpu>
> > >     <topology sockets='1' cores='1' threads='128'/>
> > >   </cpu>
> > > 
> > > but POWER guests are limited to 8/subcores_per_core threads.  
> > 
> > How is it limited? Does something explicitly fail (libvirt, qemu, guest OS)?
> > Or are the threads just not usable in the VM?
> > 
> > Is it specific to PPC64 KVM, or PPC64 emulated as well?  
> 
> QEMU fails with errors like
> 
>   qemu-kvm: Cannot support more than 8 threads on PPC with KVM
>   qemu-kvm: Cannot support more than 1 threads on PPC with TCG
> 
> depending on the guest type.

Note that in a sense the two errors come about for different reasons.

On Power, to a much greater degree than x86, threads on the same core
have observably different behaviour from threads on different cores.
Because of that, there's no reasonable way for KVM to present more
guest threads-per-core than there are host threads-per-core.

The limit of 1 thread on TCG is simply because no-one's ever bothered
to implement SMT emulation in qemu.
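
To make the two limits concrete, here is a rough sketch of the arithmetic.
It is not QEMU code: the function name and its parameters are made up for
illustration, and the default of 8 host threads per core is simply taken
from the KVM error message quoted above.

```python
def max_guest_threads_per_core(accel, subcores_per_core=1, host_threads_per_core=8):
    """Illustrative upper bound for threads='...' in a ppc64 guest topology.

    Hypothetical helper, not part of QEMU or libvirt.
    """
    if accel == "tcg":
        # TCG has no SMT emulation, so only one thread per core is possible.
        return 1
    if accel == "kvm":
        if host_threads_per_core % subcores_per_core != 0:
            raise ValueError("subcores_per_core must divide host_threads_per_core")
        # KVM cannot present more guest threads per core than the host
        # has threads per (sub)core.
        return host_threads_per_core // subcores_per_core
    raise ValueError("unknown accelerator: %r" % (accel,))


print(max_guest_threads_per_core("kvm", subcores_per_core=2))  # 4
print(max_guest_threads_per_core("tcg"))                       # 1
```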

> > > We need to report this information to the user somehow, and
> > > I can't see an existing place where it would fit nicely. We
> > > definitely don't want to overload the meaning of an existing
> > > element/attribute with this. It should also only appear in
> > > the (dom)capabilities XML of ppc64 hosts.
> > > 
> > > I don't think this is too problematic or controversial, we
> > > just need to pick a nice place to display this information.  
> 
> Adding to the above: we already have
> 
>   <vcpu max='...'/>
> 
> in the domcapabilities XML, and there was some recent
> discussion about improving the information reported there.
> 
> Possibly a good match?
> 
> > > ** Efficient guest topology **
> > > 
> > > To achieve optimal performance, you want to match guest
> > > threads with host threads.
> > > 
> > > On x86, you can choose suitable host threads by looking at
> > > the capabilities XML: the presence of elements like
> > > 
> > >   <cpu id='2' socket_id='0' core_id='1' siblings='2-3'/>
> > >   <cpu id='3' socket_id='0' core_id='1' siblings='2-3'/>
> > > 
> > > means you should configure your guest to use
> > > 
> > >   <vcpu placement='static' cpuset='2-3'>2</vcpu>
> > >   <cpu>
> > >     <topology sockets='1' cores='1' threads='2'/>
> > >   </cpu>
> > > 
> > > Notice how siblings can be found either looking at the
> > > attribute with the same name, or by matching them using the
> > > value of the core_id attribute. Also notice how you are
> > > supposed to pin as many vCPUs as the number of elements in
> > > the cpuset - one guest thread per host thread.  
> > 
> > Ahh, I see that threads are implicitly reported by the fact that socket_id and
> > core_id are identical across the different cpu ids... that took me a couple
> > minutes :)  
> 
> Yup :)
> 
> thread_siblings_list, the sysfs topology file we read to fill
> in the 'siblings' attribute, actually contains the internal
> information the kernel has gathered by matching socket_id (aka
> physical_package_id in sysfs) and core_id[1].
> 
> > > On POWER, this gets much trickier: only the *primary* thread
> > > of each (sub)core appears to be online in the host, but all
> > > threads can actually have a vCPU running on them. So
> > > 
> > >   <cpu id='0' socket_id='0' core_id='32' siblings='0,4'/>
> > >   <cpu id='4' socket_id='0' core_id='32' siblings='0,4'/>
> > > 
> > > which is what you'd get with subcores_per_core=2, is very
> > > confusing.  
> >
> > Okay, this bit took me _more_ than a couple minutes. Is this saying topology of
> > 
> > socket #0
> >   core #32
> >     subcore #1
> >       cpu id='0' thread #1
> >       cpu id='1' thread #2 (offline)
> >       cpu id='2' thread #3 (offline)
> >       cpu id='3' thread #4 (offline)
> >     subcore #2
> >       cpu id='4' thread #1
> >       cpu id='5' thread #2 (offline)
> >       cpu id='6' thread #3 (offline)
> >       cpu id='7' thread #4 (offline)
> > ...
> > 
> > what would the hypothetical physical_core_id value look like in that example?  
> 
> physical_core_id would be 32 for all of the above - it would
> just be the very value of core_id the kernel reads from the
> hardware and reports through sysfs.
> 
> The tricky bit is that, when subcores are in use, core_id and
> physical_core_id would not match. They will always match on
> architectures that lack the concept of subcores, though.

Yeah, I'm still not terribly convinced that we should even be
presenting physical core info instead of *just* logical core info.  If
you care that much about physical core topology, you probably
shouldn't be running your system in subcore mode.
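
As a sketch of what "just logical core info" would mean for a consumer of
the capabilities XML, the snippet below groups <cpu> entries by
(socket_id, core_id). The helper names are hypothetical; the two XML
fragments are the subcores_per_core=2 example and the proposed variant,
both copied from the quoted text in this thread.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def parse_cpulist(text):
    """Expand a sysfs-style CPU list such as '2-3' or '0,4' into integers."""
    cpus = []
    for part in text.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            cpus.extend(range(int(lo), int(hi) + 1))
        else:
            cpus.append(int(part))
    return cpus


def group_by_core(cpus_xml):
    """Map (socket_id, core_id) to the list of cpu ids reported for it."""
    # Wrap the fragment so it parses as a single XML document.
    root = ET.fromstring("<cpus>" + cpus_xml + "</cpus>")
    cores = defaultdict(list)
    for cpu in root.iter("cpu"):
        key = (int(cpu.get("socket_id")), int(cpu.get("core_id")))
        cores[key].append(int(cpu.get("id")))
    return dict(cores)


# What is reported today with subcores_per_core=2: one core, two cpus.
CURRENT_XML = """
<cpu id='0' socket_id='0' core_id='32' siblings='0,4'/>
<cpu id='4' socket_id='0' core_id='32' siblings='0,4'/>
"""

# The proposed variant: each subcore shows up as its own logical core.
PROPOSED_XML = """
<cpu id='0' socket_id='0' core_id='0' siblings='0' capacity='4'/>
<cpu id='4' socket_id='0' core_id='4' siblings='4' capacity='4'/>
"""

print(group_by_core(CURRENT_XML))   # {(0, 32): [0, 4]}
print(group_by_core(PROPOSED_XML))  # {(0, 0): [0], (0, 4): [4]}
```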

> > > The optimal guest topology in this case would be
> > > 
> > >   <vcpu placement='static' cpuset='4'>4</vcpu>
> > >   <cpu>
> > >     <topology sockets='1' cores='1' threads='4'/>
> > >   </cpu>  
> > 
> > So when we pin to logical CPU #4, ppc KVM is smart enough to see that it's a
> > subcore thread, and will then make use of the offline threads in the same subcore?
> > Or does libvirt do anything fancy to facilitate this case?  
> 
> My understanding is that libvirt shouldn't have to do anything
> to pass the hint to kvm, but David will have the authoritative
> answer here.

Um.. I'm not totally certain.  It will be one of two things:
   a) you just bind the guest thread to the representative host thread
   b) you bind the guest thread to a cpumask with all of the host
      threads on the relevant (sub)core - including the offline host
      threads

I'll try to figure out which one it is.
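
To spell out the two candidates, here is a small sketch. It only computes
the CPU sets and does not touch any affinity API. The function name is made
up, and it assumes the secondary threads of a subcore are numbered
contiguously after the primary one, as in Cole's diagram above.

```python
def binding_candidates(primary_cpu, threads_per_subcore):
    """Return the CPU sets for options (a) and (b) described above.

    Hypothetical helper; assumes host thread ids within a subcore are
    contiguous, starting at the primary (online) thread.
    """
    # (a) bind to the representative host thread only
    option_a = {primary_cpu}
    # (b) bind to every host thread of the subcore, offline ones included
    option_b = set(range(primary_cpu, primary_cpu + threads_per_subcore))
    return option_a, option_b


a, b = binding_candidates(primary_cpu=4, threads_per_subcore=4)
print(sorted(a))  # [4]
print(sorted(b))  # [4, 5, 6, 7]
```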

> > > but neither approach mentioned above works to figure out the
> > > correct value for the cpuset attribute.
> > > 
> > > In this case, a possible solution would be to alter the values
> > > of the core_id and siblings attribute such that both would be
> > > the same as the id attribute, which would naturally make both
> > > approaches described above work.
> > > 
> > > Additionally, a new attribute would be introduced to serve as
> > > a multiplier for the "one guest thread per host thread" rule
> > > mentioned earlier: the resulting XML would look like
> > > 
> > >   <cpu id='0' socket_id='0' core_id='0' siblings='0' capacity='4'/>
> > >   <cpu id='4' socket_id='0' core_id='4' siblings='4' capacity='4'/>
> > > 
> > > which contains all the information needed to build the right
> > > guest topology. The capacity attribute would have value 1 on
> > > all architectures except for ppc64.  
> > 
> > capacity is pretty generic sounding... not sure if that's good or not in this
> > case. maybe thread_capacity?  
> 
> Yeah, I'm not in love with the name either, but I've been unable
> to come up with a better one myself. thread_capacity might be a
> tiny bit better, but in the end I think there's little chance
> we'll be able to find a good, short name for "you can pin this
> number of guest threads to this host thread" - let's pick
> something not horrible and document the heck out of it.
> 
> > > We could arguably use the capacity attribute to cover the
> > > use case described in the first part as well, by declaring that
> > > any value other than 1 means there's a limit to the number of
> > > threads a guest core can have. I think doing so has the
> > > potential to produce much grief in the future, so I'd rather
> > > keep them separate - even if it means inventing a new element.
> > > 
> > > It's also been proposed to add a physical_core_id attribute,
> > > which would contain the real core id and allow tools to figure
> > > out which subcores belong to the same core - it would be the
> > > same as core_id for all other architectures and for ppc64
> > > when subcores_per_core=1. It's not clear whether having this
> > > attribute would be useful or just confusing.  
> > 
> > IMO it seems like something worth adding since it is a pertinent piece of the
> > topology, even if there isn't a clear programmatic use for it yet.  
> 
> It is a piece of information that we would not be reporting,
> that much is clear. However, as mentioned above, I'm afraid it
> might make things more confusing, especially for architectures
> that do not have subcores - basically all of them.
> 
> So maybe we should only add this information once its usefulness
> has been proven.
> 
> > > This is all I have for now. Please let me know what you think
> > > about it.  
> > 
> > FWIW virt-manager basically doesn't consume the host topology XML, so there's
> > no concern there.  
> 
> That's good to know :)
> 
> > A quick grep seems to indicate that both nova (openstack) and vdsm
> > (ovirt/rhev) _do_ consume this XML for their numa magic (git grep sibling),
> > but I can't speak to the details of how it's consumed.  
> 
> We won't know whether the proposal is actually sensible until
> David weighs in, but I'm adding Martin back in the loop so
> he can maybe give us the oVirt angle in the meantime.

TBH, I'm not really sure what you want from me.  Most of the questions
seem to be libvirt design decisions which are independent of the layers
below.

-- 
David Gibson <dgibson at redhat.com>
Senior Software Engineer, Virtualization, Red Hat