[Crash-utility] [PATCH] Display online cpus value in preference to kt->cpus
Luciano Chavez
lnx1138 at linux.vnet.ibm.com
Mon Mar 8 17:28:14 UTC 2010
On Mon, 2010-03-08 at 09:49 -0500, Dave Anderson wrote:
> ----- "Dave Anderson" <anderson at redhat.com> wrote:
>
> > ----- "Luciano Chavez" <lnx1138 at linux.vnet.ibm.com> wrote:
> >
> > > Hi Dave,
> > >
> > > Thinking about backward compatibility, would displaying "ONLINE CPUS"
> > > still seem OK for the case where kernel_init() finds the smp_num_cpus
> > > symbol (as for a 2.4 kernel)? Before there were the various cpu maps, I
> > > think smp_num_cpus was analogous to the possible cpus as opposed to
> > > online. I can see this requiring some thought as to what CPUS in the
> > > output means when you have various different maps now (online, possible,
> > > and present). That being said, it would be good to leave no doubt and
> > > explicitly state whether the count is for present or online CPUS; I
> > > would suggest the latter.
> > >
> > > I forgot to mention that I suspect the problem I mentioned before would
> > > get stranger on POWER7, which offers 4 threads per core. I didn't have
> > > access to a POWER7 machine to see what it would do if we tried
> > > disabling SMT as before, but if it follows the same pattern, the count
> > > displayed would be way off from the online count.
> >
> > I just ran through a bunch of stashed dumpfiles I have on hand, and
> > it gets even murkier when dealing with Xen or KVM kernels, because
> > as part of the post-crash shutdown (or forced dump), all but one of
> > the cpus may be taken "offline". So even though there may be 4 vcpus,
> > and crash correctly shows 4 "CPUS", the cpu_online_map shows only one
> > cpu bit. So if we went ahead and displayed a number based upon the
> > cpu_online_map, it would be completely misleading. Incorrect,
> > actually...
>
> You can always dump the possible/present/online map information with
> the "help -k" debug option.
>
> So for example, taking a 2.6.9-era (RHEL4) xen kernel that crashed
> on vcpu 3 due to a NULL reference, the hypervisor made a callback to
> the other vcpus to shut them down prior to the core dumping procedure:
>
> crash> help -k
> ...
> cpu_possible_map: (does not exist)
> cpu_present_map: 0 1 2 3
> cpu_online_map: 3
> ...
>
> So the online map cannot be used for the cpu count, and for that
> matter, it wouldn't make any sense to even display the online map
> count.
>
> In any case, for now I prefer not to change things, at least for the
> other architectures.
>
> That being said, I defer machine-specific items for ppc64, s390
> and s390x to the IBM maintainers, and to HP for ia64. (The ppc
> and alpha architectures have no active "maintainers" any more,
> so those arches are pretty much withering on the vine.)
>
> So if you want to do something specifically for ppc64, please
> re-post a patch for just that architecture.
>
> Dave
>
Dave,
Thanks for taking a good look at all the cases that would make a
general solution based on the online cpu count messy. I originally did want
to make this change only for ppc64. The problem was that
ppc64_display_machine_stats() was the only place I could change for ppc64
alone, and to keep the displayed value consistent I also had to change
display_sys_stats() and dump_kernel_table().
So, rethinking this as a ppc64-specific change, where CPUS is displayed
as the online count when possible and every other architecture keeps
doing what it does now (displaying kt->cpus), I suggest the following:
1. Add get_cpus_to_display() as a machdep function.
2. For ppc64, initialize machdep->get_cpus_to_display to
ppc64_get_cpus_to_display(), which will attempt to use get_cpus_online()
and fall back to kt->cpus.
3. For all other architectures, initialize
machdep->get_cpus_to_display to generic_get_cpus_to_display(), which
returns kt->cpus, preserving the current behavior.
4. In kernel.c, replace kt->cpus in display_sys_stats() and
dump_kernel_table() with a call to machdep->get_cpus_to_display() when
displaying CPUS.
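A minimal C sketch of the four steps above. It assumes heavily simplified, hypothetical stand-ins for crash's kt and machdep tables: the struct layouts, the single-word bitmask for the online map, the online_map_valid flag and display_cpus() are illustrative only, and get_cpus_online() is reimplemented here as a plain bit count rather than crash's own version.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical, simplified stand-ins for crash's kernel_table and
 * machdep_table; the real structures carry many more fields. */
struct kernel_table {
    int cpus;                  /* the count crash displays today (kt->cpus) */
    unsigned long online_map;  /* assumption: cpu_online_map as one bitmask */
    int online_map_valid;      /* assumption: 0 if the map could not be read */
};

struct machdep_table {
    int (*get_cpus_to_display)(void);  /* step 1: new machdep hook */
};

static struct kernel_table kernel_table;
static struct kernel_table *kt = &kernel_table;
static struct machdep_table machdep_table;
static struct machdep_table *machdep = &machdep_table;

/* Illustrative stand-in for get_cpus_online(): count the set bits in
 * the online map, or return 0 when the map is unavailable. */
static int get_cpus_online(void)
{
    unsigned long map;
    int count = 0;

    if (!kt->online_map_valid)
        return 0;
    for (map = kt->online_map; map; map &= map - 1)  /* clear lowest set bit */
        count++;
    return count;
}

/* Step 3: every non-ppc64 architecture keeps showing kt->cpus. */
static int generic_get_cpus_to_display(void)
{
    return kt->cpus;
}

/* Step 2: ppc64 prefers the online count and falls back to kt->cpus. */
static int ppc64_get_cpus_to_display(void)
{
    int online = get_cpus_online();

    return online > 0 ? online : kt->cpus;
}

/* Step 4: display_sys_stats() and dump_kernel_table() would call the
 * hook here instead of reading kt->cpus directly. */
static int display_cpus(void)
{
    int cpus;

    assert(machdep->get_cpus_to_display != NULL);  /* hook must be set */
    cpus = machdep->get_cpus_to_display();
    printf("        CPUS: %d\n", cpus);
    return cpus;
}
```

For example, with kt->cpus = 8 and only CPUs 0, 2, 4 and 6 online (online_map = 0x55, as after disabling SMT), the ppc64 hook would report 4 while the generic hook still reports 8. Dave's Xen case (cpu_present_map 0-3, cpu_online_map 3) would make an online-based count show 1, which is why every other architecture keeps the generic hook.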
Let me know what you think. I think this solution also gives other
architectures the flexibility to individually change what they display
for the cpu count if they need to in the future.
regards,
--
Luciano Chavez <lnx1138 at linux.vnet.ibm.com>
IBM Linux Technology Center