[Crash-utility] Infinite loop during gathering of kmem slab cache data

Dave Anderson anderson at redhat.com
Tue Mar 15 13:40:06 UTC 2011



----- Original Message -----
> I am seeing an issue when opening a crash file from RHEL AS 2.1 with
> the last 2.1 kernel released where crash will sit at "please wait...
> (gathering kmem slab cache data)" forever taking 100% of the CPU.
> Turning on debugging at level 5 continuously prints out the following
> sequence over and over without stopping:
>
> <readmem: f79d52cc, KVADDR, "kmem_cache buffer", 244, (ROE), 8585500>
> <readmem: f79d5340, KVADDR, "cpudata array", 128, (ROE), ffffcbd0>
> <readmem: f79d5344, KVADDR, "cpucache limit", 4, (ROE), ffffcbcc>
> <readmem: f79d5344, KVADDR, "cpucache limit", 4, (ROE), ffffcbcc>
> <readmem: f79d534c, KVADDR, "cpucache limit", 4, (ROE), ffffcbcc>
> <readmem: f79d534c, KVADDR, "cpucache limit", 4, (ROE), ffffcbcc>
> <readmem: f79d5354, KVADDR, "cpucache limit", 4, (ROE), ffffcbcc>
> ... [ cut ] ...
> <readmem: f79d53a4, KVADDR, "cpucache limit", 4, (ROE), ffffcbcc>
> <readmem: f79d53ac, KVADDR, "cpucache limit", 4, (ROE), ffffcbcc>
> <readmem: f79d53ac, KVADDR, "cpucache limit", 4, (ROE), ffffcbcc>
> <readmem: f79d53b4, KVADDR, "cpucache limit", 4, (ROE), ffffcbcc>
> <readmem: f79d53b4, KVADDR, "cpucache limit", 4, (ROE), ffffcbcc>
> <readmem: f79d53bc, KVADDR, "cpucache limit", 4, (ROE), ffffcbcc>
> <readmem: f79d52cc, KVADDR, "kmem_cache buffer", 244, (ROE), 8585500>
> 
> ....

If you run crash on the live system running that kernel, or on any
other dumps from that kernel version, does crash initialize?
(I'm sorry -- that kernel is so old I don't even have any sample
dumpfiles left hanging around...)  Somehow kmem_cache_init() is
looping back on itself when walking the linked list of kmem_cache
structures.  But that's such a fundamental flaw that I'd really like
to know whether something has changed over the years such that the
mechanism used for following the linked list no longer works for a
kernel of that vintage.

Anyway, there's no way of working around that other than what you have
done by using --no_kmem_cache.  The only functional effect of that
option is to set an internal KMEM_CACHE_UNAVAIL flag, which is used
to disallow command options that need to access the kmem slab
substructure.

Other than that, there's nothing else that can be done unless you
want to start tinkering with the crash code.  It would be a matter
of determining why the linked list cannot be followed.  If you want
to pursue that, I'd suggest playing around with the kmem_cache_list()
function in memory.c.  You can bring the crash session up with
--no_kmem_cache, and then run the "kmem -s list" command, which
calls kmem_cache_list().  That function just walks the kmem_cache
linked list, dumping out the name and address of each slab cache
in the same way that kmem_cache_init() does -- but you would first
have to comment out the check for the KMEM_CACHE_UNAVAIL flag at
the top of the function:

static void
kmem_cache_list(void)
{
        ulong cache, cache_cache, name;
        long next_offset, name_offset;
        char *cache_buf;
        int has_cache_chain;
        ulong cache_chain;
        char buf[BUFSIZE];

        if (vt->flags & KMEM_CACHE_UNAVAIL) {
                error(INFO, "kmem cache slab subsystem not available\n");
                return;
        }

Anyway, with that check commented out, you can see how and where the
linked list goes back on itself.

But it's really important to first determine whether this is always
going to fail with that kernel, or if it's specific to that particular
dumpfile.

Dave
  



