[libvirt] [RFC PATCH 0/2] nodeinfo: PPC64: Fix topology and siblings info on capabilities and nodeinfo

Andrea Bolognani abologna at redhat.com
Wed Jul 20 12:33:33 UTC 2016


On Tue, 2016-07-19 at 15:35 +0100, Daniel P. Berrange wrote:
> On Thu, May 05, 2016 at 08:48:05PM +0200, Andrea Bolognani wrote:
> >
> > On Fri, 2016-01-29 at 01:32 -0500, Shivaprasad G Bhat wrote:
> > > 
> > > The nodeinfo output was fixed earlier to reflect the actual cpus available in
> > > KVM mode on PPC64. The earlier fixes covered the aspect of not making a host
> > > look overcommitted when it's not. The current fixes are aimed at helping the
> > > users make better decisions on the kind of guest cpu topology that can be
> > > supported on the given subcores_per_core setting of KVM host and also hint the
> > > way to pin the guest vcpus efficiently.
> > >  
> > > I am planning to add some test cases once the approach is accepted.
> > >  
> > > With respect to Patch 2:
> > > The second patch adds a new element to the cpus tag and I need your input on
> > > whether that is okay, or if there is a better way. I am not sure if the existing
> > > clients have RNG checks that might fail with the approach, or if the checks
> > > are not enforced on the elements but only on the tags.
> > >  
> > > With my approach if the rng checks pass, the new element "capacity" even if
> > > ignored by many clients would have no impact except for PPC64.
> > >  
> > > To the extent I looked at code, the siblings changes don't affect existing
> > > libvirt functionality. Please do let me know otherwise.
> >
> > So, I've been going through this old thread trying to figure out
> > a way to improve the status quo. I'd like to collect as much
> > feedback as possible, especially from people who have worked in
> > this area of libvirt before or have written tools based on it.
> >
> > As hinted above, this series is really trying to address two
> > different issues, and I think it's helpful to reason about them
> > separately.
> >
> >
> > ** Guest threads limit **
> >
> > My dual-core laptop will happily run a guest configured with
> >
> >   <cpu>
> >     <topology sockets='1' cores='1' threads='128'/>
> >   </cpu>
> >
> > but POWER guests are limited to 8/subcores_per_core threads.
> >
> > We need to report this information to the user somehow, and
> > I can't see an existing place where it would fit nicely. We
> > definitely don't want to overload the meaning of an existing
> > element/attribute with this. It should also only appear in
> > the (dom)capabilities XML of ppc64 hosts.
> >
> > I don't think this is too problematic or controversial, we
> > just need to pick a nice place to display this information.
> >
> >
> > ** Efficient guest topology **
> >
> > To achieve optimal performance, you want to match guest
> > threads with host threads.
> >
> > On x86, you can choose suitable host threads by looking at
> > the capabilities XML: the presence of elements like
> >
> >   <cpu id='2' socket_id='0' core_id='1' siblings='2-3'/>
> >   <cpu id='3' socket_id='0' core_id='1' siblings='2-3'/>
> >
> > means you should configure your guest to use
> >
> >   <vcpu placement='static' cpuset='2-3'>2</vcpu>
> >   <cpu>
> >     <topology sockets='1' cores='1' threads='2'/>
> >   </cpu>
> >
> > Notice how siblings can be found either looking at the
> > attribute with the same name, or by matching them using the
> > value of the core_id attribute. Also notice how you are
> > supposed to pin as many vCPUs as the number of elements in
> > the cpuset - one guest thread per host thread.
> >
> > On POWER, this gets much trickier: only the *primary* thread
> > of each (sub)core appears to be online in the host, but all
> > threads can actually have a vCPU running on them. So
> >
> >   <cpu id='0' socket_id='0' core_id='32' siblings='0,4'/>
> >   <cpu id='4' socket_id='0' core_id='32' siblings='0,4'/>
> >
> > which is what you'd get with subcores_per_core=2, is very
> > confusing.
> >
> > The optimal guest topology in this case would be
> >
> >   <vcpu placement='static' cpuset='4'>4</vcpu>
> >   <cpu>
> >     <topology sockets='1' cores='1' threads='4'/>
> >   </cpu>
> >
> > but neither approach mentioned above works to figure out the
> > correct value for the cpuset attribute.
> >
> > In this case, a possible solution would be to alter the values
> > of the core_id and siblings attributes such that both would be
> > the same as the id attribute, which would naturally make both
> > approaches described above work.
> >
> > Additionally, a new attribute would be introduced to serve as
> > a multiplier for the "one guest thread per host thread" rule
> > mentioned earlier: the resulting XML would look like
> >
> >   <cpu id='0' socket_id='0' core_id='0' siblings='0' capacity='4'/>
> >   <cpu id='4' socket_id='0' core_id='4' siblings='4' capacity='4'/>
> >
> > which contains all the information needed to build the right
> > guest topology. The capacity attribute would have value 1 on
> > all architectures except for ppc64.
> 
> I don't really like the fact that with this design, we effectively
> have a bunch of <cpu> elements which are invisible and whose existence
> is only implied by the 'capacity=4' attribute.
> 
> I also don't like tailoring output of capabilities XML for one
> specific use case.
> 
> IOW, I think we should explicitly represent all the CPUs in the
> node capabilities, even if they are offline in the host. We could
> introduce a new attribute to indicate the status of CPUs. So
> instead of
> 
>   <cpu id='0' socket_id='0' core_id='0' siblings='0' capacity='4'/>
>   <cpu id='4' socket_id='0' core_id='4' siblings='4' capacity='4'/>
> 
> I'd like to see
> 
>   <cpu id='0' socket_id='0' core_id='0' siblings='0-3' state="online"/>
>   <cpu id='0' socket_id='0' core_id='0' siblings='0-3' state="offline"/>
>   <cpu id='0' socket_id='0' core_id='0' siblings='0-3' state="offline"/>
>   <cpu id='0' socket_id='0' core_id='0' siblings='0-3' state="offline"/>
>   <cpu id='4' socket_id='0' core_id='4' siblings='4-7' state="online"/>
>   <cpu id='4' socket_id='0' core_id='4' siblings='4-7' state="offline"/>
>   <cpu id='4' socket_id='0' core_id='4' siblings='4-7' state="offline"/>
>   <cpu id='4' socket_id='0' core_id='4' siblings='4-7' state="offline"/>

I assume you meant

  <cpu id='0' socket_id='0' core_id='0' siblings='0-3' state="online"/>
  <cpu id='1' socket_id='0' core_id='0' siblings='0-3' state="offline"/>
  <cpu id='2' socket_id='0' core_id='0' siblings='0-3' state="offline"/>
  <cpu id='3' socket_id='0' core_id='0' siblings='0-3' state="offline"/>
  <cpu id='4' socket_id='0' core_id='4' siblings='4-7' state="online"/>
  <cpu id='5' socket_id='0' core_id='4' siblings='4-7' state="offline"/>
  <cpu id='6' socket_id='0' core_id='4' siblings='4-7' state="offline"/>
  <cpu id='7' socket_id='0' core_id='4' siblings='4-7' state="offline"/>

and that this would represent a configuration where
subcores_per_core=2, i.e. CPUs 0-7 belong to the same physical
core but to different logical cores (subcores).
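
To make the layout above concrete, here is a rough Python sketch of how
a tool might read such a listing and group CPUs by core_id. Note that
the state attribute is only the proposal being discussed in this thread,
so the XML fragment and the group_by_core() helper are assumptions made
for illustration, not something libvirt is known to emit or provide.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: the 'state' attribute is only proposed in this
# thread and is assumed here purely for illustration.
XML = """
<cpus num='8'>
  <cpu id='0' socket_id='0' core_id='0' siblings='0-3' state='online'/>
  <cpu id='1' socket_id='0' core_id='0' siblings='0-3' state='offline'/>
  <cpu id='2' socket_id='0' core_id='0' siblings='0-3' state='offline'/>
  <cpu id='3' socket_id='0' core_id='0' siblings='0-3' state='offline'/>
  <cpu id='4' socket_id='0' core_id='4' siblings='4-7' state='online'/>
  <cpu id='5' socket_id='0' core_id='4' siblings='4-7' state='offline'/>
  <cpu id='6' socket_id='0' core_id='4' siblings='4-7' state='offline'/>
  <cpu id='7' socket_id='0' core_id='4' siblings='4-7' state='offline'/>
</cpus>
"""


def group_by_core(xml_text):
    """Map each core_id to its online and offline CPU ids."""
    cores = {}
    for cpu in ET.fromstring(xml_text).iter("cpu"):
        core = cores.setdefault(int(cpu.get("core_id")),
                                {"online": [], "offline": []})
        # A missing attribute is treated as 'online', matching what
        # tools assume about the current XML.
        core[cpu.get("state", "online")].append(int(cpu.get("id")))
    return cores


cores = group_by_core(XML)
for core_id, cpus in sorted(cores.items()):
    print(core_id, cpus)
```

Each subcore ends up with one online primary thread and three offline
secondary threads, which is exactly the information the capacity
attribute was trying to summarize.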

IIRC doing something like this was proposed at some point in
the past, but was rejected on the grounds that existing tools
assume that 1) CPUs listed in the NUMA topology are online
and 2) you can pin vCPUs to them. So they would try to pin
vCPUs to e.g. CPU 1 and that would fail.

Additionally, this doesn't tell us whether a given host CPU
can run a guest vCPU: given the above configuration, on ppc64
we know that CPU 1 can run guest threads even though it's
offline, because CPU 0 is online, but the same isn't true
on x86.

So we would end up needing three new boolean properties:

  - online - whether the CPU is online
  - can_run_vcpus - whether the CPU can run vCPUs
  - can_pin_vcpus - whether vCPUs can be pinned to the CPU

and all higher level tools would have to adapt to use them.
Existing logic would not work with newer libvirt versions,
and x86 would be affected by these changes as well.
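
As a sketch of what that would mean for a management tool, the three
properties could be modelled roughly like this. The HostCPU class and
both helper functions are hypothetical names made up for this example,
and the sample hosts simply encode the ppc64 and x86 cases described
above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HostCPU:
    # Hypothetical model of the three properties listed above.
    cpu_id: int
    online: bool
    can_run_vcpus: bool
    can_pin_vcpus: bool


def pinning_targets(cpus):
    """CPU ids a tool could safely use in a cpuset."""
    return [c.cpu_id for c in cpus if c.can_pin_vcpus]


def vcpu_capacity(cpus):
    """Number of host threads that can actually run a vCPU."""
    return sum(1 for c in cpus if c.can_run_vcpus)


# ppc64, subcores_per_core=2: only CPUs 0 and 4 are online and can be
# pinned to, but every thread can run a vCPU.
ppc64 = [HostCPU(i, online=(i % 4 == 0), can_run_vcpus=True,
                 can_pin_vcpus=(i % 4 == 0)) for i in range(8)]

# x86 with CPU 1 offline: it can neither run nor be pinned to.
x86 = [HostCPU(0, True, True, True), HostCPU(1, False, False, False)]

print(pinning_targets(ppc64), vcpu_capacity(ppc64))
print(pinning_targets(x86), vcpu_capacity(x86))
```

A tool that only looks at online would get the x86 case right but
would underestimate the ppc64 host, which is why all three flags
would be needed.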

One more thing: since the kernel doesn't expose information
about offline CPUs, we'll have to figure those out ourselves:
we're already doing that, to some degree, on ppc64, but there
are some cases where it's just impossible to do so reliably.
When that happens, we throw our hands up in the air and return
a completely bogus topology. That would suddenly be the case
on x86 as well.

So yeah... Tricky stuff. But thanks for providing some input,
and please keep it coming! :)

> The domain capabilities meanwhile is where you'd express any usage
> constraint for cores/threads required by QEMU.
> 
> Regards,
> Daniel
-- 
Andrea Bolognani / Red Hat / Virtualization



