[Crash-utility] Question on online/present/possible CPUS

Hagen, Jeffrey Jeffrey.Hagen at Teradata.com
Thu Sep 23 20:29:48 UTC 2010


Hi Dave,

	Attached is our suggested patch for the CPU count issue in a
coredump induced by the NMI switch.  Basically, the change uses
cpu_present_mask instead of cpu_online_mask in x86_64_per_cpu_init
and x86_64_get_smp_cpus.
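
	For illustration only (this is not the attached patch), here is a
minimal sketch of why the choice of mask matters for the CPU count.  The
cpu_mask_t type and the mask_* helpers are hypothetical stand-ins, not
crash or kernel symbols, and a single 64-bit word is assumed to be enough
to hold the mask.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a kernel cpumask: bit N set means CPU N
 * is a member of the mask.  Assumes at most 64 CPUs. */
typedef uint64_t cpu_mask_t;

#define MAX_CPUS 64

/* Number of CPUs set in the mask (the "count the members" convention). */
int mask_weight(cpu_mask_t mask)
{
    int n = 0;

    /* Clear the lowest set bit on each pass. */
    for (; mask; mask &= mask - 1)
        n++;
    return n;
}

/* Highest CPU number in the mask plus one, or 0 for an empty mask
 * (the "highest-cpu-number-plus-one" convention). */
int mask_highest_plus_one(cpu_mask_t mask)
{
    int n = 0;

    while (mask) {
        n++;
        mask >>= 1;
    }
    return n;
}

/* Nonzero if the given CPU is a member of the mask. */
int mask_test_cpu(cpu_mask_t mask, int cpu)
{
    assert(cpu >= 0 && cpu < MAX_CPUS);
    return (int)((mask >> cpu) & 1);
}
```

As a made-up example, take a 4-CPU system where the NMI handler has left
only CPU0 online: present = 0xF and online = 0x1.  mask_highest_plus_one()
then gives 4 for the present mask but 1 for the online mask, so a count
derived from the online mask hides CPUs 1-3.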

	In answer to your question below: "Are you saying that the NMI
switch shutdown handler takes the other cpus offline?" --- Yes!!

Thanks,

Jeff


-----Original Message-----
From: crash-utility-bounces at redhat.com
[mailto:crash-utility-bounces at redhat.com] On Behalf Of Dave Anderson
Sent: Thursday, August 12, 2010 6:22 AM
To: Discussion list for crash utility usage,maintenance and development
Subject: Re: [Crash-utility] Question on online/present/possible CPUS


----- "Jeffrey Hagen" <Jeffrey.Hagen at teradata.com> wrote:

> Hi Petr and Dave,
> 
> I have a couple of comments on Petr's email regarding CPU count.
> 
> When the dump is the result of an NMI (NMI switch pressed) due to a hung
> system, one often needs to analyze the state and backtrace for all the
> CPUs.  Since the kernel halts all but CPU0, the crash utility cannot
> see the other "offline" CPUs.

I've never seen that behavior before.  Probably because I've never seen
an x86_64 dumpfile that was created as a result of the NMI switch being
pressed?  Anyway, are you saying that the NMI switch shutdown handler 
takes the other cpus offline?
 
> This behavior has changed for the x86_64 architecture somewhere between
> 2.6.16 (SLES10) and 2.6.32 (SLES11) due to the removal of the x8664_pda
> structure.
> The function x86_64_init (in x86_64.c) now calls x86_64_per_cpu_init,
> which doesn't count the offline CPUs when calculating the number of
> CPUs.  Previously, x86_64_cpu_pda_init (called if x8664_pda exists)
> didn't check for online/offline status.

Again -- I've never seen this behaviour before.

In any case, I'll look at any patch suggestions you guys have in mind.

Thanks,
  Dave

 
> Regarding #3 in Petr's email.  It appears that the set command won't
> accept a value >= kt_cpus (number of CPUs).  It doesn't check if the CPU
> is offline or not.
> 
> Thanks,
> 
> Jeff Hagen
> 
> 
> 
> >
> > Hi all,
> >
> > before making a larger cleanup, I want to ask here for your opinion. It
> > seems that there is quite a bit of confusion about the meaning of CPU
> > count printed out by the crash utility.
> >
> > 1. Number of CPUs
> >
> > Some people think that crash should always output the number of CPUs in
> > the system (i.e. a quad-core server should always output 'CPUS: 4'),
> > while other people think that only online CPUs should be counted.
> >
> > 2. CPU numbering
> >
> > For example, if there are 4 CPUs in the system, but some of them are
> > taken offline (e.g. CPU 1 and CPU 3), _and_ crash outputs the number of
> > online CPUs, it would print out 'CPUS: 2'. It's not easy to find out
> > that valid CPU numbers are 0 and 2 in this case.
> 
> Hi Petr,
> 
> For all but ppc64, the number shown by the initial banner and the
> "sys" command is essentially "the-highest-cpu-number-plus-one".
> For ppc64 (as requested and implemented by the IBM/ppc64
> maintainers),
> it shows the number of online cpus.  There are reasons for doing it
> either of the two ways, but I'm on vacation now, and you can research
> the list archives for the various arguments for-and-against doing it
> either way.  Check the changelog.html for when it was changed for
> ppc64, and then cross-reference the revision date with the list
> archives.
> 
> > 3. Examining offline CPU
> >
> > Sometimes, it may be useful to examine the state of an offline CPU. Now,
> > I know that the saved state is most likely stale, but it can be useful
> > in some cases (e.g. a crash after dropping to kdb). The crash utility
> > currently refuses to select an offline CPU with 'set -c #'. Are there
> > any concerns about allowing it?
> 
> I tend to agree with you, but the only thing that's useful and
> available from an offline cpu is the swapper task for that cpu
> and the runqueue for that cpu.  And both of those entities are
> readily accessible if you really need them.  Although I don't know
> anything about kdb status, so maybe there's something of per-cpu
> interest, but I don't know why it would be necessary to "set"
> that cpu?
> 
> In any case, like I said before, I'm just temporarily online while
> on vacation, and will be back to work on the 9th.
> 
> Thanks,
>   Dave
> 
> --
> Crash-utility mailing list
> Crash-utility at redhat.com
> https://www.redhat.com/mailman/listinfo/crash-utility

-------------- next part --------------
A non-text attachment was scrubbed...
Name: crash-5.0.7-cpu_count.patch
Type: application/octet-stream
Size: 1838 bytes
Desc: crash-5.0.7-cpu_count.patch
URL: <http://listman.redhat.com/archives/crash-utility/attachments/20100923/751a2c0d/attachment.obj>

