[libvirt] [PATCH 1/1] nodeinfo: Increase the num of CPU thread siblings to a larger value

Don Dutile ddutile at redhat.com
Thu Mar 26 16:12:42 UTC 2015


On 03/26/2015 12:08 PM, Wei Huang wrote:
>
>
> On 03/26/2015 10:49 AM, Don Dutile wrote:
>> On 03/26/2015 07:03 AM, Ján Tomko wrote:
>>> On Thu, Mar 26, 2015 at 12:48:13AM -0400, Wei Huang wrote:
>>>> Current libvirt can only handle up to 1024 thread siblings when it
>>>> reads Linux sysfs topology/thread_siblings. This isn't enough for
>>>> Linux distributions that support a large value. This patch fixes
>>>> the problem by using VIR_ALLOC()/VIR_FREE(), instead of using a
>>>> fixed-size (1024) local char array. At the same time,
>>>> SYSFS_THREAD_SIBLINGS_LIST_LENGTH_MAX is increased to 8192, which
>>>> should be large enough for the foreseeable future.
>>>>
>>>> Signed-off-by: Wei Huang <wei at redhat.com>
>>>> ---
>>>>    src/nodeinfo.c | 10 +++++++---
>>>>    1 file changed, 7 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/src/nodeinfo.c b/src/nodeinfo.c
>>>> index 34d27a6..66dc7ef 100644
>>>> --- a/src/nodeinfo.c
>>>> +++ b/src/nodeinfo.c
>>>> @@ -287,7 +287,7 @@ freebsdNodeGetMemoryStats(virNodeMemoryStatsPtr
>>>> params,
>>>>    # define PROCSTAT_PATH "/proc/stat"
>>>>    # define MEMINFO_PATH "/proc/meminfo"
>>>>    # define SYSFS_MEMORY_SHARED_PATH "/sys/kernel/mm/ksm"
>>>> -# define SYSFS_THREAD_SIBLINGS_LIST_LENGTH_MAX 1024
>>>> +# define SYSFS_THREAD_SIBLINGS_LIST_LENGTH_MAX 8192
>>>
>>> There is thread_siblings_list, which contains a range:
>>> 22-23
>>> and thread_siblings, which has the corresponding bits set:
>>> 00c00000
>>>
>>> For the second one, the 1024-byte buffer should be enough for 16368
>>> possible siblings.
>>>
>> A 4096-sibling system will generate a (cpumask_t-based) output of:
>> 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,
>> 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,
>> 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,
>> 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,
>> 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,
>> 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,
>> 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,
>> 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,
>> 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,
>> 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,
>> 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,
>> 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,
>> 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,
>> 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,
>> 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,
>> 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000080
>> 9 (characters per 32-bit mask, including the comma) * 8 (masks/row) * 16 (rows)
>> - 1 (the last entry has no comma) + 1 (trailing newline) = 1152 bytes
>>
>> Other releases/architectures avoid this issue by using cpumask_var_t
>> instead of cpumask_t for siblings, so the output reflects the actual CPU
>> count the system (not the operating system) could provide/support.
> Don, could ARM kernel use cpumask_var_t as well? Or this will require
> lots of change on top of existing code?
>
Yes. Working on that (kernel) patch now.
It was simple/fast to use cpumask_t because historically
the counts (& the kernel's NR_CPUS value) were low.
On x86 they were ACPI-driven.  On arm64 we need an ACPI- & DT-based solution,
and arm64-acpi looks like it was based more on ia64 than x86, so we need
to create/support some new globals on arm64 that cpumask_var_t depends on,
and have to do the same for DT.

>> cpumask_t objects are NR_CPUS -sized.
>> In the not-so-distant future, though, real systems will have 1024 CPUs,
>> so we might as well accommodate a couple of years beyond that.
>>
> So we agree that such a fix is necessary, because: i) parsing will fail
> on a cpumask_t-based kernel (like Red Hat's ARM kernel); ii) eventually we
> might need to revisit this issue when a currently working system reaches
> the tipping point of CPU count (>1000).
>
Yes.

>>> For the first one, the results depend on the topology - if the sibling
>>> ranges are contiguous, even a million CPUs should fit there.
>> The _list files (core_siblings_list, thread_siblings_list) have ranges;
>> the non-_list files (core_siblings, thread_siblings) have a mask like the above.
>>
>>> For the worst case, when every other cpu is a sibling, the second file
>>> is more space-efficient.
>>>
>>>
>>> I'm OK with using the same limit for both (8k seems sufficiently large),
>>> but I would like to know:
>>>
>>> Which one is the file that failed to parse in your case?
>>>
>> /sys/devices/system/cpu/cpu*/topology/thread_siblings
>>
>>> I think both virNodeCountThreadSiblings and virNodeGetSiblingsList could
>>> be rewritten to share some code and only look at one of the sysfs files.
>>> The question is - which one?
>>>
>>> Jan
>>>
>>



