[Crash-utility] kmem -[sS] segfault on 2.6.25.17

Mike Snitzer snitzer at gmail.com
Thu Oct 16 20:37:44 UTC 2008


On Thu, Oct 16, 2008 at 3:54 PM, Dave Anderson <anderson at redhat.com> wrote:
>
> ----- "Mike Snitzer" <snitzer at gmail.com> wrote:
>
>> Frame 0 of crash's core shows:
>> (gdb) bt
>> #0  0x0000003b708773e0 in memset () from /lib64/libc.so.6
>>
>> I'm not sure how to get the faulting address, though.  Is it just
>> 0x0000003b708773e0?
>
> No, that's the text address in memset().  If you "disass memset",
> I believe that you'll see that the address above is dereferencing
> the rcx register/pointer.  So then, if you enter "info registers",
> you'll get a register dump, and rcx would be the failing address.

OK.

0x0000003b708773e0 <memset+192>:        movnti %r8,(%rcx)

(gdb) info registers
...
rcx            0xa7b000 10989568

(gdb) x/x 0xa7b000
0xa7b000:       Cannot access memory at address 0xa7b000

>> I've not rebooted the system at all either... now when I run 'kmem -s'
>> in live crash I see:
>>
>> CACHE            NAME                 OBJSIZE  ALLOCATED     TOTAL  SLABS  SSIZE
>> ...
>> kmem: nfs_direct_cache: full list: slab: ffff810073503000  bad inuse counter: 5
>> kmem: nfs_direct_cache: full list: slab: ffff810073503000  bad inuse counter: 5
>> kmem: nfs_direct_cache: partial list: bad slab pointer: 88
>> kmem: nfs_direct_cache: full list: bad slab pointer: 98
>> kmem: nfs_direct_cache: free list: bad slab pointer: a8
>> kmem: nfs_direct_cache: partial list: bad slab pointer: 9f911029d74e35b
>> kmem: nfs_direct_cache: full list: bad slab pointer: 6b6b6b6b6b6b6b6b
>> kmem: nfs_direct_cache: free list: bad slab pointer: 6b6b6b6b6b6b6b6b
>> kmem: nfs_direct_cache: partial list: bad slab pointer: 100000001
>> kmem: nfs_direct_cache: full list: bad slab pointer: 100000011
>> kmem: nfs_direct_cache: free list: bad slab pointer: 100000021
>> ffff810073501600 nfs_direct_cache         192          2        40      2     4k
>> ...

> Are those warnings happening on *every* slab type?  When you run on a
> live system, the "shifting sands" of the kernel underneath the crash
> utility can cause errors like the above.  But at least some/most of
> the other slabs' infrastructure should remain stable while the command
> runs.

Ah, that makes sense; yes, many of them do remain stable:

kmem: request_sock_TCPv6: full list: bad slab pointer: 79730070756b6f7f
kmem: request_sock_TCPv6: free list: bad slab pointer: 79730070756b6f8f
ffff810079199240 request_sock_TCPv6       160          0         0      0     4k
ffff81007919a200 TCPv6                   1896          3         4      2     4k
ffff81007dcb41c0 dm_mpath_io               64          0         0      0     4k
...
ffff81007d9ce580 sgpool-8                 280          2        42      3     4k
ffff81007d9cf540 scsi_bidi_sdb             48          0         0      0     4k
ffff81007d98b500 scsi_io_context          136          0         0      0     4k
ffff81007d95e4c0 ext3_inode_cache         992      38553     38712   9678     4k
ffff81007d960480 ext3_xattr               112         68       102      3     4k

etc
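
For what it's worth, the live-system failure mode is basically a
time-of-check/time-of-use problem: each read of a slab is a separate
snapshot, and the kernel can free or recycle the slab in between.
Something along these lines (a simplified, hypothetical sketch;
readmem(), walk_slab_list() and the struct layout are made up for
illustration and are not crash's or the kernel's actual code):

/*
 * Hypothetical sketch of the live-system race: every read of a slab is
 * a separate snapshot, so the kernel can free or reuse the slab between
 * reads.  readmem() stands in for crash's target-memory accessor; the
 * struct layout is simplified and is not the real 2.6.25 slab layout.
 */
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

struct slab_snapshot {
    uint64_t list_next;   /* next slab on the partial/full/free list */
    uint32_t inuse;       /* object-in-use count, as of this read    */
};

/* copy 'size' bytes from kernel virtual address 'kvaddr' into 'buf' */
extern int readmem(uint64_t kvaddr, void *buf, size_t size);

int walk_slab_list(uint64_t head, uint32_t objs_per_slab)
{
    struct slab_snapshot s;
    uint64_t addr;

    if (readmem(head, &s, sizeof(s)))        /* read the list head */
        return -1;

    for (addr = s.list_next; addr && addr != head; addr = s.list_next) {
        if (readmem(addr, &s, sizeof(s)))
            return -1;                       /* slab vanished underneath us */

        /*
         * Between this read and the previous one the kernel may have
         * recycled the slab, so on a live system a counter can
         * legitimately look bad, as in the warnings above.
         */
        if (s.inuse > objs_per_slab)
            fprintf(stderr, "bad inuse counter: %u\n", s.inuse);
    }
    return 0;
}

On a vmcore the data cannot move, of course, which is why the immediate
segfault there looks like a different problem.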

>> But if I run crash against the vmcore I do get the segfault...
>>
>
> When you run it on the vmcore, do you get the segfault immediately?
> Or do some slabs display their stats OK, but then when it deals with
> one particular slab it generates the segfault?
>
> I mean that it's possible that the target slab was in transition
> at the time of the crash, in which case you might see some error
> messages like you see on the live system.  But it is difficult to
> explain why it's dying specifically where it is, even if the slab
> was in transition.
>
> That all being said, even if the slab was in transition, obviously
> the crash utility should be able to handle it more gracefully...

None of the slabs display their stats OK; crash segfaults immediately.
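
Given that the fault is memset() writing through %rcx into an unmapped
page, it looks (to me at least) like some count read from the vmcore is
being trusted to size or fill a buffer.  A bounds check before the
zeroing pass would turn that into an error message rather than a
segfault.  Purely as an illustration, with made-up names
(alloc_obj_counters() and MAX_REASONABLE_OBJS are hypothetical; this is
not crash's actual slab code):

/*
 * Hypothetical hardening sketch, not crash's actual code: if a counter
 * read from the dump is used to size a scratch array, checking it
 * before the memset() turns a garbage value into an error instead of a
 * segfault like the one in the backtrace above.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_REASONABLE_OBJS (64 * 1024)   /* arbitrary sanity limit */

static int *alloc_obj_counters(unsigned long num_objs)
{
    int *counters;

    if (num_objs == 0 || num_objs > MAX_REASONABLE_OBJS) {
        fprintf(stderr, "kmem: implausible object count: %lu\n", num_objs);
        return NULL;                      /* fail the command gracefully */
    }

    counters = malloc(num_objs * sizeof(*counters));
    if (!counters)
        return NULL;

    /* with num_objs bounded, this zeroing pass cannot run far off the
     * end of the heap the way the faulting memset() above did */
    memset(counters, 0, num_objs * sizeof(*counters));
    return counters;
}

The exact limit doesn't matter much; the point is just to refuse
obviously garbage counters the same way kmem already rejects the bad
slab pointers shown above.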

>> > BTW, if need be, would you be able to make the vmlinux/vmcore pair
>> > available for download somewhere?  (You can contact me off-list
>> with
>> > the particulars...)
>>
>> I can work to make that happen if needed...
>
> FYI, I did try our RHEL5 "debug" kernel (2.6.18 + hellofalotofpatches),
> which has both CONFIG_DEBUG_SLAB and CONFIG_DEBUG_SLAB_LEAK turned on,
> but I don't see the problem.  So unless something obvious can be
> determined, that may be the only way I can help.

Interesting.  OK, I'll work to upload them somewhere and I'll send you
a pointer off-list.

Thanks!
Mike



