[libvirt] [RFC PATCH 0/2] nodeinfo: PPC64: Fix topology and siblings info on capabilities and nodeinfo

Daniel P. Berrange berrange at redhat.com
Wed Jul 20 12:49:06 UTC 2016


On Wed, Jul 20, 2016 at 02:33:33PM +0200, Andrea Bolognani wrote:
> On Tue, 2016-07-19 at 15:35 +0100, Daniel P. Berrange wrote:
> > On Thu, May 05, 2016 at 08:48:05PM +0200, Andrea Bolognani wrote:
> > > 
> > > On Fri, 2016-01-29 at 01:32 -0500, Shivaprasad G Bhat wrote:
> > > > 
> > > > The nodeinfo output was fixed earlier to reflect the actual CPUs available in
> > > > KVM mode on PPC64. The earlier fixes covered the aspect of not making a host
> > > > look overcommitted when it's not. The current fixes are aimed at helping
> > > > users make better decisions about the kind of guest CPU topology that can be
> > > > supported on the given subcores_per_core setting of the KVM host, and also
> > > > hint at the way to pin the guest vCPUs efficiently.
> > > >  
> > > > I am planning to add some test cases once the approach is accepted.
> > > >  
> > > > With respect to Patch 2:
> > > > The second patch adds a new element to the cpus tag and I need your input on
> > > > whether that is okay, or if there is a better way. I am not sure if the
> > > > existing clients have RNG checks that might fail with this approach, or if
> > > > the checks are not enforced on the elements but only on the tags.
> > > >  
> > > > With my approach, if the RNG checks pass, the new element "capacity" would
> > > > have no impact except on PPC64, even if ignored by many clients.
> > > >  
> > > > To the extent I looked at the code, the siblings changes don't affect
> > > > existing libvirt functionality. Please do let me know otherwise.
> > > 
> > > So, I've been going through this old thread trying to figure out
> > > a way to improve the status quo. I'd like to collect as much
> > > feedback as possible, especially from people who have worked in
> > > this area of libvirt before or have written tools based on it.
> > > 
> > > As hinted above, this series is really trying to address two
> > > different issues, and I think it's helpful to reason about them
> > > separately.
> > > 
> > > 
> > > ** Guest threads limit **
> > > 
> > > My dual-core laptop will happily run a guest configured with
> > > 
> > >   <cpu>
> > >     <topology sockets='1' cores='1' threads='128'/>
> > >   </cpu>
> > > 
> > > but POWER guests are limited to 8/subcores_per_core threads.
> > > 
> > > We need to report this information to the user somehow, and
> > > I can't see an existing place where it would fit nicely. We
> > > definitely don't want to overload the meaning of an existing
> > > element/attribute with this. It should also only appear in
> > > the (dom)capabilities XML of ppc64 hosts.
> > > 
> > > I don't think this is too problematic or controversial, we
> > > just need to pick a nice place to display this information.
> > > 
> > > 
> > > ** Efficient guest topology **
> > > 
> > > To achieve optimal performance, you want to match guest
> > > threads with host threads.
> > > 
> > > On x86, you can choose suitable host threads by looking at
> > > the capabilities XML: the presence of elements like
> > > 
> > >   <cpu id='2' socket_id='0' core_id='1' siblings='2-3'/>
> > >   <cpu id='3' socket_id='0' core_id='1' siblings='2-3'/>
> > > 
> > > means you should configure your guest to use
> > > 
> > >   <vcpu placement='static' cpuset='2-3'>2</vcpu>
> > >   <cpu>
> > >     <topology sockets='1' cores='1' threads='2'/>
> > >   </cpu>
> > > 
> > > Notice how siblings can be found either looking at the
> > > attribute with the same name, or by matching them using the
> > > value of the core_id attribute. Also notice how you are
> > > supposed to pin as many vCPUs as the number of elements in
> > > the cpuset - one guest thread per host thread.
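(As an aside, the core_id-matching approach described above can be sketched
roughly as follows; this is only an illustration, with a hypothetical <cpus>
wrapper element added so the fragment parses on its own:)

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Fragment like the capabilities XML quoted above; the <cpus> wrapper
# is assumed here just to make the snippet parseable.
caps = """
<cpus>
  <cpu id='2' socket_id='0' core_id='1' siblings='2-3'/>
  <cpu id='3' socket_id='0' core_id='1' siblings='2-3'/>
</cpus>
"""

# Group host CPUs by core_id: CPUs sharing a core_id are siblings.
cores = defaultdict(list)
for cpu in ET.fromstring(caps).findall('cpu'):
    cores[cpu.get('core_id')].append(int(cpu.get('id')))

# Each group yields the cpuset to pin to and the guest thread count:
# here core 1 gives cpuset '2,3' and threads=2.
for core_id, ids in sorted(cores.items()):
    cpuset = ','.join(str(i) for i in sorted(ids))
    print(core_id, cpuset, len(ids))
```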
> > > 
> > > On POWER, this gets much trickier: only the *primary* thread
> > > of each (sub)core appears to be online in the host, but all
> > > threads can actually have a vCPU running on them. So
> > > 
> > >   <cpu id='0' socket_id='0' core_id='32' siblings='0,4'/>
> > >   <cpu id='4' socket_id='0' core_id='32' siblings='0,4'/>
> > > 
> > > which is what you'd get with subcores_per_core=2, is very
> > > confusing.
> > > 
> > > The optimal guest topology in this case would be
> > > 
> > >   <vcpu placement='static' cpuset='4'>4</vcpu>
> > >   <cpu>
> > >     <topology sockets='1' cores='1' threads='4'/>
> > >   </cpu>
> > > 
> > > but neither approach mentioned above works to figure out the
> > > correct value for the cpuset attribute.
> > > 
> > > In this case, a possible solution would be to alter the values
> > > of the core_id and siblings attribute such that both would be
> > > the same as the id attribute, which would naturally make both
> > > approaches described above work.
> > > 
> > > Additionally, a new attribute would be introduced to serve as
> > > a multiplier for the "one guest thread per host thread" rule
> > > mentioned earlier: the resulting XML would look like
> > > 
> > >   <cpu id='0' socket_id='0' core_id='0' siblings='0' capacity='4'/>
> > >   <cpu id='4' socket_id='0' core_id='4' siblings='4' capacity='4'/>
> > > 
> > > which contains all the information needed to build the right
> > > guest topology. The capacity attribute would have value 1 on
> > > all architectures except for ppc64.
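(For illustration, a client consuming the proposed attribute might look
roughly like this; note the 'capacity' attribute is only a proposal in this
thread, not something released libvirt emits, and the <cpus> wrapper is
assumed so the fragment parses:)

```python
import xml.etree.ElementTree as ET

# Hypothetical capabilities fragment using the proposed 'capacity'
# attribute; the <cpus> wrapper is assumed for parseability.
caps = """
<cpus>
  <cpu id='0' socket_id='0' core_id='0' siblings='0' capacity='4'/>
  <cpu id='4' socket_id='0' core_id='4' siblings='4' capacity='4'/>
</cpus>
"""

# Each visible (primary) thread can back 'capacity' guest threads,
# so pin 'capacity' vCPUs to each listed CPU.
pins = []
for cpu in ET.fromstring(caps).findall('cpu'):
    pins.append((cpu.get('siblings'), int(cpu.get('capacity', '1'))))

for cpuset, threads in pins:
    print(f"<vcpu placement='static' cpuset='{cpuset}'>{threads}</vcpu>")
```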
> 
> > I don't really like the fact that with this design, we effectively
> > have a bunch of <cpu> which are invisible, whose existence is just
> > implied by the 'capacity=4' attribute.
> 
> > I also don't like tailoring output of capabilities XML for one
> > specific use case.
> 
> > IOW, I think we should explicitly represent all the CPUs in the
> > node capabilities, even if they are offline in the host. We could
> > introduce a new attribute to indicate the status of CPUs. So
> > instead of
> 
> >   <cpu id='0' socket_id='0' core_id='0' siblings='0' capacity='4'/>
> >   <cpu id='4' socket_id='0' core_id='4' siblings='4' capacity='4'/>
> 
> > I'd like to see
> 
> >   <cpu id='0' socket_id='0' core_id='0' siblings='0-3' state="online"/>
> >   <cpu id='0' socket_id='0' core_id='0' siblings='0-3' state="offline"/>
> >   <cpu id='0' socket_id='0' core_id='0' siblings='0-3' state="offline"/>
> >   <cpu id='0' socket_id='0' core_id='0' siblings='0-3' state="offline"/>
> >   <cpu id='4' socket_id='0' core_id='4' siblings='4-7' state="online"/>
> >   <cpu id='4' socket_id='0' core_id='4' siblings='4-7' state="offline"/>
> >   <cpu id='4' socket_id='0' core_id='4' siblings='4-7' state="offline"/>
> >   <cpu id='4' socket_id='0' core_id='4' siblings='4-7' state="offline"/>
> 
> I assume you meant
> 
>   <cpu id='0' socket_id='0' core_id='0' siblings='0-3' state="online"/>
>   <cpu id='1' socket_id='0' core_id='0' siblings='0-3' state="offline"/>
>   <cpu id='2' socket_id='0' core_id='0' siblings='0-3' state="offline"/>
>   <cpu id='3' socket_id='0' core_id='0' siblings='0-3' state="offline"/>
>   <cpu id='4' socket_id='0' core_id='4' siblings='4-7' state="online"/>
>   <cpu id='5' socket_id='0' core_id='4' siblings='4-7' state="offline"/>
>   <cpu id='6' socket_id='0' core_id='4' siblings='4-7' state="offline"/>
>   <cpu id='7' socket_id='0' core_id='4' siblings='4-7' state="offline"/>

Sigh, yes, of course it needs unique 'id' values :-)

> and that this would represent a configuration where
> subcores-per-core=2, eg. CPUs 0-7 belong to the same physical
> core but to different logical cores (subcores).
> 
> IIRC doing something like this was proposed at some point in
> the past, but was rejected on the ground that existing tools
> assume that 1) CPUs listed in the NUMA topology are online
> and 2) you can pin vCPUs to them. So they would try to pin
> vCPUs to eg. CPU 1 and that will fail.

We could easily deal with this by adding a flag to the
virConnectGetCapabilities() API, eg. VIR_CONNECT_CAPS_ALL_CPUS,
so we only show the expanded full CPU list if an app opts
in to it.

> Additionally, this doesn't tell us anything about whether any
> host CPU can run a guest CPU: given the above configuration,
> on ppc64, we know that CPU 1 can run guest threads even though
> it's offline because CPU 0 is online, but the same isn't true
> on x86.
> 
> So we would end up needing three new boolean properties:
> 
>   - online - whether the CPU is online
>   - can_run_vcpus - whether the CPU can run vCPUs
>   - can_pin_vcpus - whether vCPUs can be pinned to the CPU

These two can_*_vcpus props aren't telling the whole story
because they are assuming use of KVM - TCG or LXC guests
would not follow the same rules for runnability here.

This is why I think the host capabilities XML should focus
on describing what hardware actually exists and its state,
and not say anything about guest runnability.

Historically we've let apps figure out the runnability
of pCPUs vs guest vCPUs, but even on x86 it isn't as simple
as you/we make it out to be.

For example, nothing here reflects the fact that the host
OS could have /sys/fs/cgroups/cpuset/cpuset.cpus configured
to a subset of CPUs.  So already today on x86, just because
a CPU is listed in the capabilities XML, does not imply that
you can run a guest on it.
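For instance, a rough sketch of expanding such a cpuset-style range string
and checking a CPU against it (the example value is made up):

```python
# Expand a cpuset-style range string, like the contents of cgroup's
# cpuset.cpus file, into the set of CPU numbers it covers.
def expand_cpuset(spec):
    cpus = set()
    for part in spec.split(','):
        if '-' in part:
            lo, hi = part.split('-')
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus

allowed = expand_cpuset('0-3,8-11')
# CPU 5 may appear in the capabilities XML yet not be usable for guests.
print(5 in allowed)   # False
```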

So I think there's a gap for exposing information about
runnability of guests vs host CPUs, that does not really
fit in the host capabilities. Possibly it could be in the
guest capabilities, but more likely it would need to be
a new API entirely.

> and all higher level tools would have to adapt to use them.
> Existing logic would not work with newer libvirt versions,
> and x86 would be affected by these changes as well.

Yes, as mentioned above, apps would have to opt-in to using
the new information.

> One more thing: since the kernel doesn't expose information
> about offline CPUs, we'll have to figure out those ourselves:
> we're already doing that, to some degree, on ppc64, but there
> are some cases where it's just impossible to do so reliably.
> When that happens, we throw our hands up in the air and return
> a completely bogus topology. That would suddenly be the case
> on x86 as well.
> 
> So yeah... Tricky stuff. But thanks for providing some input,
> and please keep it coming! :)
> 
> > The domain capabilities meanwhile is where you'd express any usage
> > constraint for cores/threads required by QEMU.

Regards,
Daniel
-- 
|: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org              -o-             http://virt-manager.org :|
|: http://autobuild.org       -o-         http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org       -o-       http://live.gnome.org/gtk-vnc :|



