[Crash-utility] Question for LKCD maintainers - How about adding a debug flag to crash and only calling abort() if crash is started with '-d' flag provided?

Wed Jan 2 15:29:24 UTC 2008

Piet Delaney wrote:
> Dave Anderson wrote:
> 
>>Long after I stopped tinkering with the LKCD code in crash,
>>changes were contributed to support physical memory zones
>>in the LKCD dumpfile format.
> 
> 
> Hi Dave:
> 
> That could easily have been me.    I added zone support to the
> LKCD kernel and lcrash code and then updated your crash code
> to support zones. I kinda recall LKCD not dumping in monotonically
> increasing order and my modifying your crash code to live with this
> new feature in the LKCD dumps. I was trying to get the LKCD folks into
> supporting crash in addition to lcrash but failed to get any support from
> Tom Morano or Matt Robinson. I didn't realize that I had broken crash
> with the zone changes and felt responsible to fix crash to deal with this
> change that I had made. I also like the crash interface over the lcrash
> interface. I proposed to Tom using the elf format like KEXEC uses but
> he didn't go for it. I don't know why we can't hid additional crash info
> into ELF files and maintain as much compatibility as possible.
> 
> 
> 
>>  Specifically there is this
>>piece of save_offset() in lkcd_common.c:
>>
>>        /* find the zone */
>>        for (ii=0; ii < lkcd->num_zones; ii++) {
>>                if (lkcd->zones[ii].start == zone) {
>>                        if (lkcd->zones[ii].pages[page].offset != 0) {
>>                           if (lkcd->zones[ii].pages[page].offset !=
>>off) {
>>                                error(INFO, "conflicting page: zone
>>%lld, "
>>                                        "page %lld: %lld, %lld !=
>>%lld\n",
>>                                        (unsigned long long)zone,
>>                                        (unsigned long long)page,
>>                                        (unsigned long long)paddr,
>>                                        (unsigned long long)off,
>>                                        (unsigned long long) \
>>                                           
>>lkcd->zones[ii].pages[page].offset);
>>                                abort();
>>                           }
>>                           ret = 0;
>>                        } else {
>>                           lkcd->zones[ii].pages[page].offset = off;
>>                           ret = 1;
>>                        }
>>                        break;
>>                }
>>        }
> 
> The printf looks a bit like my coding style, though I don't know
> why (I ?)  decided to abort() in this case. I suppose the idea is
> to look at the situation with gdb on the resulting core file.
> 
> 
> 
>>The call to abort() above kills the crash session, which is both
>>annoying and unnecessary.
> 
> Isn't it worth while to look at the core file to understand the reason
> for the abort() being called for?

I would think so, but not by me -- the developers of this LKCD
off-shoot can debug their own stuff.

> 
> 
>>I am seeing it in a customer dumpfile, who have their own dumping scheme
>>that is based upon LKCD version 7.  I understand that this may be a
>>problem with their LKCD port, but nonetheless, it's the only place in
>>the crash utility that doesn't recover gracefully from dumpfile access
>>errors.
>>
>>Anyway, I would like to either:
>>
>> 1. change the error(INFO...) to error(FATAL...) so that run-time
>>    commands encountering this error will just fail, and the session
>>    will return to the crash> prompt, or
>> 2. return 0, so that a "seek error" can be subsequently displayed
>>    by the readmem() command.
>>
>>Number 2 is preferable, because it yields more clues as to where the
>>readmem() came from, but since I don't know much about the LKCD
>>physical memory zones stuff, is there any reason that shouldn't
>>be done?
> 
> 
> How about having a crash debug flag and only calling abort if the
> debug flag is set. You might print in the error message that the
> user can force a core dump by adding a '-d' flag on invocation of
> crash and sending you the core file.

Regardless of the reason behind it, the whole point is that there
was no need to abort the crash session.  If the "missing" page was
crucial to the crash session being able to run, then crash would
die on its own terms.  There are no other abort() calls in the
crash sources.

But in this case, the page was unnecessary for analysis of
the problem.  But when some commands (I forget which -- certainly
"search" for example) bumped into the page, the session would
abort() and had to be started up again.

Anyway, the abort() call was removed in version 4.0-4.9:

   - Fix for LKCD dumpfile access failures that abort() the crash session
     after displaying an error message indicating a problem with physical
     memory zones in the dumpfile.  Without the patch, the crash session
     would end immediately after displaying an error message of the sort:
     "conflicting page: zone 0, page 0: 0, 177160130 != 65536".  That
     error message will now only be displayed if the crash debug mode is 1
     or more, a readmem() "seek error" will be displayed instead, and the
     session will return to the "crash>" prompt.  (anderson at redhat.com)

This was the patch:

--- lkcd_common.c       15 Nov 2007 15:44:38 -0000      1.29
+++ lkcd_common.c       19 Nov 2007 15:48:18 -0000      1.30
@@ -708,14 +708,15 @@
                 if (lkcd->zones[ii].start == zone) {
                         if (lkcd->zones[ii].pages[page].offset != 0) {
                            if (lkcd->zones[ii].pages[page].offset != off) {
-                               error(INFO, "conflicting page: zone %lld, "
+                               if (CRASHDEBUG(1))
+                                   error(INFO, "LKCD: conflicting page: zone 
%lld, "
                                         "page %lld: %lld, %lld != %lld\n",
                                         (unsigned long long)zone,
                                         (unsigned long long)page,
                                         (unsigned long long)paddr,
                                         (unsigned long long)off,
                                         (unsigned long 
long)lkcd->zones[ii].pages[page].offset);
-                               abort();
+                               return -1;
                            }
                            ret = 0;
                         } else {

With respect to the -d flag suggestion, if you want to drop core
then you can set the internal crash "core" variable to "on", which
which will force a segmentation violation after printing the next
error message:

   crash> set core
   core: off (do NOT drop core on error message)
   crash> set core on
   core: on (drop core on error message)
   crash>

And then run the command that generates the error, say for
example, reading a non-existent physical address:

   crash> rd -p deadbeef
   [./crash] error trace: 8095503 => 8095799 => 8096ab4 => 808879c
   rd: read error: physical address: deadbeef  type: "32-bit PHYSADDR"

     808879c: __error+108
     8096ab4: readmem+1328
     8095799: display_memory+657
     8095503: cmd_rd+1558

   DROP_CORE flag set: forcing a segmentation fault
   Segmentation fault (core dumped)
   $

> 
> 
> While I've got your attention. I'm upgrading our 2.6.12-stable kernel to
> 2.6.16-stable and want to start supporting core dumps. Ideally I'd like to
> have core dumps that are compatible with gdb and crash. Can crash
> handle the elf core files generated by KEXEC/KCORE. Last I thought
> about this I recall there being incompatibilities and it getting worse
> with kernels being compiled to be relocatable and kgdb having a problem
> because it wasn't aware of the relocation.

By "KEXEC/KCORE" I'm presuming you mean "kexec/kdump", but I'm
not sure what incompatibility you're referring to?

Maybe the workaround for x86 kernels whose CONFIG_PHYSICAL_START
contains a value that is greater then CONFIG_PHYSICAL_ALIGN:

   http://people.redhat.com/anderson/crash.changelog.html#4_0_4_5

Or maybe you're talking about 32-bit gdb not being able to handle
kdump-generated 64-bit ELF core files for 32-bit kernels?

Dave