[Crash-utility] Handle the NT_PRSTATUS lost for the "bt" command
Dave Anderson
anderson at redhat.com
Tue Jun 19 16:20:21 UTC 2012
----- Original Message -----
> Hi Dave,
>
> Thanks for your comments.
>
> (2012/06/19 1:01), Dave Anderson wrote:
>
> > I don't want to add any new initialization-time code -- especially if
> > it's related to the NT_PRSTATUS notes -- that can abort a crash session
> > unnecessarily. In your new crash_was_lost_crash_note() function, there
> > are these two FAULT_ON_ERROR readmem() calls:
> >
> > readmem(symbol_value("crash_notes"), KVADDR,&crash_notes_ptr,
> > sizeof(ulong), "crash_notes", FAULT_ON_ERROR);
> > and
> >
> > readmem(crash_notes_ptr, KVADDR, buf, SIZE(note_buf),
> > "cpu crash_notes", FAULT_ON_ERROR);
> >
> > Although they are highly unlikely to fail, can you please make
> > both of them RETURN_ON_ERROR, and if the readmem() fails, have
> > it bail out and return FALSE?
>
> I see, I'll make consideration about initialization-time rule from now,
> to keep crash session, should use readmem() with QUIET|RETURN_ON_ERROR.
>
> > And then, if necessary, make any adjustments to map_cpus_to_prstatus_kdump_cmprs()
> > to handle that remote possibility. You should be able to test it with your
> > patched kernel.
>
> I'm not able to test for unexpected, deliberate readmem() failure
> because my core file contains "crash_notes" memory field...
>
> Although I've probably misunderstood your advice,
> I had better add more debugging messages into
> map_cpus_to_prstatus_kdump_cmprs() to handle conditions
> where imply "not saved NT_PRSTATUS" or "nt_prstatus_percpu[] is
> remapped".
>
> > Also, I don't understand the wording of this error message
> > at the end of crash_was_lost_crash_note():
> >
> > error(WARNING, "catch lost crash_notes at cpu#%d\n", cpu);
> >
> > Can you re-word that? The notes were not "lost", but rather were
> > "not saved" by the crashing system.
>
> I re-word all "lost" keywords, so too function name with "saved".
> And also make this warning valid only when CRASHDEBUG() is enable
> because we can check it later by using "help -D".
>
> > Lastly, in __diskdump_memory_dump(), you just skip the "lost"
> > notes sections:
> >
> > for (i = 0, j = 0; j< dd->num_prstatus_notes; i++) {
> > if (dd->nt_prstatus_percpu[i] == NULL)
> > continue;
> > fprintf(fp, " notes[%d]: %lx\n",
> > i, (ulong)dd->nt_prstatus_percpu[i]);
> > j++;
> > }
> >
> > Can you make it more obvious, say, by displaying something like:
> >
> > notes[6]: (not saved)
>
> Looks better, thanks for giving good hint.
>
> [updates by attached patch]
>
> - display messages only if CRASHDEBUG() >= 1
> crash -d 1
> => display about NT_PRSTATUS mapping messages as below.
> ----------------
> WARNING: cpu#0: not saved crash_notes
> WARNING: cpu#1: not saved crash_notes
> crash: NT_PRSTATUS: cpu#2 = 107a34a8 (remapped from cpu#0)
> WARNING: cpu#3: not saved crash_notes
> WARNING: cpu#4: not saved crash_notes
> WARNING: cpu#5: not saved crash_notes
> WARNING: cpu#6: not saved crash_notes
> WARNING: cpu#7: not saved crash_notes
> ----------------
>
> - help message is enhanced
> crash> help -D | grep notes
> num_prstatus_notes: 1
> notes_buf: 107a34a8
> notes[0]: (not saved)
> notes[1]: (not saved)
> notes[2]: 107a34a8
> notes[3]: (not saved)
> notes[4]: (not saved)
> notes[5]: (not saved)
> notes[6]: (not saved)
> notes[7]: (not saved)
>
> Thanks,
> Toshi
OK, now I'm getting confused...
I note that currently __diskdump_memory_dump() does this:
for (i = 0; i < dd->num_prstatus_notes; i++) {
fprintf(fp, " notes[%d]: %lx\n",
i, (ulong)dd->nt_prstatus_percpu[i]);
}
But your patch does this:
nrcpus = kt->kernel_NR_CPUS ? kt->kernel_NR_CPUS : NR_CPUS;
for (i = 0; i < nrcpus; i++) {
if (dd->nt_prstatus_percpu[i] == NULL) {
fprintf(fp,
" notes[%d]: (not saved)\n",
i);
continue;
}
fprintf(fp, " notes[%d]: %lx\n",
i, (ulong)dd->nt_prstatus_percpu[i]);
}
Now surely we don't want to dump "NR_CPUS" notes pointers,
unless there actually are that many cpus in the system.
For example, in RHEL6 sets CONFIG_NR_CPUS to 4096 for x86_64,
but rarely do we see a dumpfile with that many cpus.
So I would prefer that the for loop continue to be bounded
by "dd->num_prstatus_notes".
However, given that process_elf32_notes() and process_elf64_notes()
do a cursory test of each note by checking for the NT_PRSTATUS
type, I'm presuming that in your dumpfile, dd->num_prstatus_notes
must be equal to 1, correct?
void
process_elf64_notes(void *note_buf, unsigned long size_note)
{
Elf64_Nhdr *nt;
size_t index, len = 0;
int num = 0;
for (index = 0; index < size_note; index += len) {
nt = note_buf + index;
if(nt->n_type == NT_PRSTATUS) {
dd->nt_prstatus_percpu[num] = nt;
num++;
}
len = sizeof(Elf64_Nhdr);
len = roundup(len + nt->n_namesz, 4);
len = roundup(len + nt->n_descsz, 4);
}
if (num > 0) {
pc->flags2 |= ELF_NOTES;
dd->num_prstatus_notes = num;
}
return;
}
But then map_cpus_to_prstatus_kdump_cmprs() remaps the
notes without modifying the dd->num_prstatus_notes counter.
So then there's this:
void *
diskdump_get_prstatus_percpu(int cpu)
{
if ((cpu < 0) || (cpu >= dd->num_prstatus_notes))
return NULL;
return dd->nt_prstatus_percpu[cpu];
}
So in a case such as your example, where cpu 2 was the only cpu that saved
its notes, the function above would incorrectly return NULL when called
with diskdump_get_prstatus_percpu(2).
What am I missing?
Dave
More information about the Crash-utility
mailing list