[Crash-utility] determining a "valid" vmcore

Thu Feb 7 19:38:32 UTC 2008

Hi Andrew,

Dave Anderson wrote:
> Andrew Hecox wrote:
>> On Thu, 2008-02-07 at 10:32 -0500, Dave Anderson wrote:
>>> Andrew Hecox wrote:
>>>> hello,
>>>>
>>>> I'm looking at a customer issue where diskdumpmsg is unable to read a
>>>> vmcore file. It is not clear if this is a problem with the vmcore file or
>>>> diskdumpmsg. I can load the vmcore with crash and in my naive usage of
>>>> it, can see no problems. However, I'm new to the tool so that doesn't
>>>> give me a lot of confidence.
>>>> Does anyone have any suggestions on how or if I can use crash to help
>>>> determine if there's corruption in the vmcore file? Or any other way of
>>>> approaching the problem?
>>>> Thanks much,
>>>>
>>>> Andrew
>>>>
>>> I'm not sure what you expect the crash utility to do -- if it comes
>>> up to a prompt with no error or warning messages, it means that the
>>> ELF header contains what appears to be valid usable information,
>>> and that the minimum kernel memory contents required to set up the
>>> crash utility's notion of the running system are all in place.  That's
>>> not to say that there is no chance that the vmcore contains some
>>> corruption that was not recognized.
>>>
>>
>> Thanks. Any other suggestions on how to determine if a vmcore is "valid"
>> or is that not even a reasonable question to try and ask? The problem
>> I'm trying to solve is described better below:
>>
>>> With respect to diskdumpmsg, as I understand it, it was fairly recently
>>> changed from a perl script to a C file so that it could be run
>>> earlier in time so as to be able to use the swap partition.  Looking
>>> at main() in the diskdumpmsg.c file (version 1.4.1-2), there are numerous
>>> error types and associated error messages.  What do you mean when you
>>> say that "diskdumpmsg is unable to read a vmcore file"?
>>
>> Specifically:
>>  - user reported a floating point exception from diskdump on startup
>>  - the result was reproducible locally but only with their vmcore file
>>  - fpe occurred in get_logbuf:
>>                 log_end %= log_buf_len;
>>  - log_buf_len had been set to 0 in read_buffer
>>           if (!page_is_dumpable(pfn, dump->device)) {
>>               memset(buf, 0, copy_len);
>>           } else {
>>  - I don't know enough to say if the page really wasn't dumpable. 
>> static inline bool page_is_dumpable(unsigned int nr, DumpDevice *device)
>> {
>>   return device->dumpable_bitmap[nr>>3] & (1 << (nr & 7));
>> }
>>  - I wrote a patch with one way to avoid the FPE (attached) and sent it
>> to SEG.
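
The failure described above can be shown in isolation: an integer modulo
by zero typically raises SIGFPE on x86 Linux. Below is a minimal sketch of
one possible guard. It is not the attached patch; the helper name
wrap_log_end(), the unsigned long types and the return convention are all
assumptions made for illustration.

```c
#include <stdio.h>

/* Hypothetical helper (not from diskdumpmsg.c): wrap log_end into the
 * ring buffer, refusing a zero length instead of dividing by it.
 * Returns 0 on success, -1 if log_buf_len is unusable. */
static int wrap_log_end(unsigned long *log_end, unsigned long log_buf_len)
{
    if (log_buf_len == 0) {
        /* A zero here may mean the page holding log_buf_len was
         * zero-filled because it was not found in the dump. */
        fprintf(stderr, "log_buf_len is 0; cannot wrap log_end\n");
        return -1;
    }
    *log_end %= log_buf_len;
    return 0;
}
```

A guard like this only avoids the crash. It does not explain why the page
was treated as not dumpable.
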
>>
>> Now I'm trying to determine if the vmcore file should be readable by
>> diskdumpmsg. In other words, is this a problem in diskdumpmsg post-crash
>> or a problem with the vmcore file prior to it getting to diskdumpmsg?
>> Unfortunately, I don't understand the problem domain very well at all,
>> hence the probably naive questions :)
>>
>> Any suggestions are appreciated.
>>
>> -Andrew
> 
> So it appears that the page containing the log_buf_len symbol is not
> readable or contained in the dumpfile.  BTW, is this a compressed
> dumpfile or an ELF formatted dumpfile?  And what "dump_level" did
> they configure?
> 
> Anyway, back to the log_buf_len symbol read, what happens when you
> enter the "log" command while in a crash session?  It attempts to
> read that symbol immediately.

The virtual address of log_buf_len may be converted to the wrong pfn.
Could you check the pfn value passed to "page_is_dumpable"?
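
One way to check it might be a traced variant of the bitmap test, as in
the rough sketch below. The DumpDevice here is a reduced stand-in, not the
real structure from diskdumpmsg.c; the max_pfn field and the function
name page_is_dumpable_traced() are assumptions made for illustration.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for diskdumpmsg's DumpDevice; only the bitmap
 * and an assumed page count are modelled. */
typedef struct {
    const unsigned char *dumpable_bitmap; /* one bit per pfn */
    unsigned int max_pfn;                 /* assumption: pfns covered by the bitmap */
} DumpDevice;

/* Same bit test as page_is_dumpable(), plus a range check and a
 * trace line on stderr showing the pfn that was actually looked up. */
static bool page_is_dumpable_traced(unsigned int pfn, const DumpDevice *device)
{
    if (pfn >= device->max_pfn) {
        fprintf(stderr, "pfn %u (0x%x) is outside the bitmap (max_pfn %u)\n",
                pfn, pfn, device->max_pfn);
        return false;
    }
    bool dumpable =
        (device->dumpable_bitmap[pfn >> 3] & (1u << (pfn & 7))) != 0;
    fprintf(stderr, "pfn %u (0x%x): %s\n", pfn, pfn,
            dumpable ? "dumpable" : "not dumpable");
    return dumpable;
}
```

If the printed pfn does not match the physical page that holds
log_buf_len, the address conversion is the likely suspect rather than
the dumpfile itself.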

Thanks,
Takao Indoh