[libvirt] Fw: [RFC PATCH 0/2] nodeinfo: PPC64: Fix topology and siblings info on capabilities and nodeinfo

Mon Feb 22 23:46:58 UTC 2016

Sorry, accidentally sent to just Andrea the first time around,
resending...

Begin forwarded message:

Date: Mon, 22 Feb 2016 16:53:15 +1100
From: David Gibson <dgibson at redhat.com>
To: Andrea Bolognani <abologna at redhat.com>
Subject: Re: [libvirt] [RFC PATCH 0/2] nodeinfo: PPC64: Fix topology
and siblings info on capabilities and nodeinfo

On Fri, 19 Feb 2016 17:40:31 +0100
Andrea Bolognani <abologna at redhat.com> wrote:

> On Tue, 2016-02-16 at 13:04 +1100, David Gibson wrote:  
> > > So the information don't seem to add up: we claim that the
> > > host has 160 online CPUs, but the NUMA topology only contains
> > > information about 5 of them per each node, so 20 in total.
> > > 
> > > That's of course because Linux only provides us with topology
> > > information for online CPUs, but KVM on ppc64 wants secondary
> > > threads to be offline in the host. The secondary threads are
> > > still used to schedule guest threads, so reporting 160 online
> > > CPUs is correct for the purpose of planning the number of
> > > guests based on the capacity of the host; the problem is that
> > > the detailed NUMA topology doesn't match with that.    
> > 
> > Yeah, that's rather unfortunate.  We do want to list all the threads in
> > the capabilities, I think, since they're capable of running vcpus.    
> 
> There's a problem with that, though, that I didn't think about
> when writing the original message. See below.
>   
> > > The new information is more complete than it was before, and
> > > this series certainly would help users make better guest
> > > allocation choices. On the other hand, it doesn't really solve
> > > the problem of nodeinfo and capabilities disagreeing with each
> > > other, and pushes the NUMA topology reported by libvirt a
> > > little farther from the one reported by the kernel.    
> > 
> > Uh.. I don't really see how nodeinfo and capabilities disagree here.    
> 
> Because nodeinfo tell us that the host has 160 online CPUs, while
> capabilities tell us that only 40 CPUs are actually online.  

Hmm, ok.  I guess the question is what does "online" mean in the
context of a libvirt client.  It used to be that "online in host" and
"able to run vcpus" were the same thing, but now they're not.

> Just to clarify, since the naming is not very explicit: nodeinfo
> reports *online* CPUs only, and each <cpu> element in capabilities
> is supposed to represent an *online* CPU. Offline CPUs are not
> listed or counted anywhere.  

Ok.

> > Now how the topology differs from the kernel.    
> 
> Because the kernel reports physical information, so online CPUs
> that are in the same physical core (as evidenced by the core_id
> value) are counted as siblings regardless of the subcore
> configuration.  

Yeah.. I'll have a chat to Ben and/or Paul and see if they think that's
the right choice for the kernel.

> > > It may also break some assumptions, eg. CPU 0 and 4 both have
> > > the same value for 'core_id', so I'd expect them to be among
> > > each other's 'siblings', but they no longer are.    
> > 
> > Ah, yes, that's wrong.  With this setup the core_id should be set to
> > the id of the subcore's first thread, rather than the physical core's
> > first thread.    
> 
> This would be a pretty wild departure from what the kernel is
> exposing to userspace.  

If that's more meaningful for libvirt's clients than the kernel
information, I don't think that's necessarily a problem.

> > > I have a different proposal: since we're already altering the
> > > NUMA topology information reported by the kernel by counting
> > > secondary threads as online, we might as well go all the way
> > > and rebuild the entire NUMA topology as if they were.
> > > 
> > > So the capabilities XML when subcores-per-core=1 would look
> > > like:
> > > 
> > >   <cells num='4'>
> > >     <cell id='0'>
> > >       <memory unit='KiB'>67108864</memory>
> > >       <pages unit='KiB' size='64'>1048576</pages>
> > >       <pages unit='KiB' size='16384'>0</pages>
> > >       <distances>
> > >         <sibling id='0' value='10'/>
> > >         <sibling id='1' value='20'/>
> > >         <sibling id='16' value='40'/>
> > >         <sibling id='17' value='40'/>
> > >       </distances>
> > >       <cpus num='10'>
> > >         <cpu id='0' socket_id='0' core_id='32' subcore_id='0' siblings='0-7'/>    
> > 
> > I don't think adding a subcore_id is a good idea.  Because it's only
> > ever likely to mean something on ppc64, tools are qlikely to just
> > ignore it and use the core_id.    
> 
> Fair enough, and something we definitely need to take into account.
>   
> > Most of the time that will be wrong:
> > behaviorally subcores act like cores in almost all regards.
> > 
> > That could be worked around by instead having core_id give the subcore
> > address and adding a new "supercore_id" or "core_group_id" or something.    
> 
> Well, those would be just as likely to be ignored by tools as
> subcore_id and capacity, wouldn't they? :)  

Yes, but it means when they ignore those and look at "core_id" they'll
get the subcore id, not the physical core id, which is probably what
they actually wanted.  Those few (if any) tools that really, truly,
need the physical core information get to look at supercore or whatever.

> > But frankly, I don't think there's actually much point exposing the
> > physical topology in addition to the logical (subcore) topology.  Yes,
> > subcores will have different performance characteristics to real cores
> > which will be relevant in some situations.  But if you're manually
> > setting the host's subcore mode then you're already into the realm of
> > manually tweaking parameters based on knowledge of your system and
> > workload.  Basically I don't see anything upper layers would do with
> > the subcore vs. core information that isn't likely to be overriden by
> > manual tweaks in any case where you're setting the subcore mode at all.
> > 
> > Bear in mind that now that we have dynamic split core merged, I'm not
> > sure there are *any* realistic use cases for using manual subcore
> > splitting.    
> 
> As far as I know, and of course correct me if I'm wrong, dynamic split
> cores are a way to avoid heavy performance drops when the host is  

Not so much heavy performance drops per se.  What it's really about is
mitigating the case where the system is overcommited in terms of number
of virtual cores, but not in terms of number of virtual threads.
Without dynamic split core the system will make more use of the host's
threading capabilities, with it, it will do a much better job.

> overcommitted, but it has some overhead of its own because of the need
> to split the core when entering KVM and restoring the previous
> configuration when exiting it.  

Hmm.. I guess there would be a little extra latency on entering and
exiting the guest.  I don't know if it's at all significant compared to
the already substantial cost of entering/leaving the guest.

> Splitting cores manually, on the other hand, is less flexible but
> gives the admin complete control over resource allocations and has
> no overhead, which makes it critical for fine performance tuning.  

Well.. except manual splitting can only be done system wide an only do,
whereas dynamic split-core can do it per-core which might allow it to
better match the workload.

> So unless my understanding is off, I believe we need to expose the
> topology in a way that enables both use cases.  

Well, that would be ideal, certainly.

> > So, as noted, I actually prefer Shivaprasad's original approach of
> > treating subcores as cores.  The implementation of that does need some
> > adjustment as noted above, basically treating subcores as cores even
> > more universally than the current patches do.
> > 
> > Basically I see manually setting the subcores-per-core as telling the
> > system you want it treated as having more cores.  So libvirt shouldn't
> > decide it knows better and report physical cores instead.    
> 
> It's not really a case of libvirt thinking it knows better, it's more
> a case of the kernel always reporting information about the physical
> topology and x86 not having AFAIK any configuration where physical
> and logical topology disagree, whereas that's the case on ppc64 as
> soon as you split a core.  

Hmm.  I'm not entirely sure what point you're making there.

> > Tangentially related is the question of how to deal with threads which
> > are offline in the host but can be used for vcpus, which appears with
> > or without subcores.    
> 
> This is exactly the issue I was thinking about at the beginning of
> the message: if I see a <cpu> element that means the CPU is online,
> and I expect to be able to pin a vCPU to it, but that would no
> longer be the case if we started adding <cpu> elements for CPUs
> that are offline in the host - any attempt to pin vCPUs to such
> CPUs would fail.  

Ah, good point.  That sounds like a case for the "capacity" property
then.

> > I have no very strong opinion between the options of (1) adding <cpu>
> > entries for the offline threads (whose siblings should be set based on
> > the subcore) or (2) adding a capacity tag to the online threads
> > indicating that you can put more vcpus on them that the host online
> > thread count indicates.
> > 
> > (1) is more likely to do the right thing with existing tools, but (2)
> > is more accurately expressive.    
> 
> I kinda think it would be the other way around, as I expect existing
> tools to make some of the assumptions described above.
> 
> So after discussing for a while with Shivaprasad on IRC, reading
> your reply and thinking about it some more, I'm starting to think
> that we should report the same information as the kernel whenever
> possible, and enhance it with hints such as capacity for tools
> and admins, but carefully avoid being too clever for our own good.  

Well, I'm tempted to say that ship has sailed for libvirt.. :p.

But I don't disagree.

> We can paper over most of the differences between x86 and ppc64,
> but not all of them, I believe. Documentation is key here, and
> there's definitely room for improvement both when it comes to
> reasonable assumptions and ppc64 specific quirks - this is
> something I'm planning to address.
> 
> So my proposal is to drop patch 1/2 entirely and rework patch 2/2
> to just add the capacity attribute (maybe with a more descriptive
> name?), without touching the list of siblings or any other
> information retrieved from the kernel.
> 
> Cheers.
> 
> -- 
> Andrea Bolognani
> Software Engineer - Virtualization Team  

-- 
David Gibson <dgibson at redhat.com>
Senior Software Engineer, Virtualization, Red Hat

-- 
David Gibson <dgibson at redhat.com>
Senior Software Engineer, Virtualization, Red Hat
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 819 bytes
Desc: OpenPGP digital signature
URL: <http://listman.redhat.com/archives/libvir-list/attachments/20160223/d82256b3/attachment-0001.sig>