[Crash-utility] loop in crash

Wed Apr 25 14:42:25 UTC 2012

----- Original Message -----
> 
> Hi Dave,
> 
> I have a corrupt vmcore file (for ARM) that makes crash loop forever.
> The problem is in memory.c, function max_cpudata_limit. The last
> part of that function:
> 
>         if (VALID_MEMBER(kmem_list3_shared) &&
>             VALID_MEMBER(kmem_cache_s_lists) &&
>             readmem(kmem_cache_nodelists(cache), KVADDR, &start_address[0],
>             sizeof(ulong) * vt->kmem_cache_len_nodes, "array nodelist array",
>             RETURN_ON_ERROR)) {
>                 for (i = 0; i < vt->kmem_cache_len_nodes; i++) {
>                         if (start_address[i] == 0)
>                                 continue;
>                         if (readmem(start_address[i] + OFFSET(kmem_list3_shared),
>                             KVADDR, &shared, sizeof(void *),
>                             "kmem_list3 shared", RETURN_ON_ERROR|QUIET)) {
>                                 if (!shared)
>                                         break;
>                         }
>                         if (readmem(shared + OFFSET(array_cache_limit),
>                             KVADDR, &limit, sizeof(int), "shared array_cache limit",
>                             RETURN_ON_ERROR|QUIET)) {
>                                 if (limit > max_limit)
>                                         max_limit = limit;
>                                 break;
>                         }
>                 }
>         }
>         FREEBUF(start_address);
>         return max_limit;
>
> bail_out:
>         vt->flags |= KMEM_CACHE_UNAVAIL;
>         error(INFO, "unable to initialize kmem slab cache subsystem\n\n");
>         *cpus = 0;
>         return 0;
> 
> 
> The problem is that the readmem statement “if
> (readmem(start_address[i] + OFFSET(kmem_list3_shared), …..” fails,
> and then the function max_cpudata_limit is called over and over
> again. I did a patch adding “else goto bail_out;” if the readmem
> fails and then crash managed to continue. I do not know if this is
> really a good idea.
> 
> As this seems only to be a problem for corrupt vmcore files I do not
> know if you want to do anything about it.

Maybe -- maybe not...

In the case of corrupted vmcores, it's preferable to avoid a cover-up,
and in fact, the crash utility is often "doing its job" by failing,
i.e., its failure points to the problem at hand.

However, kmem_cache initialization specifically has been a problem
area in the past, either when the subsystem itself is corrupted or,
as perhaps in your case, when the vmcore is corrupted.  That's why
the "crash --no_kmem_cache" or "crash --kmem_cache_delay" options
were put in place.

Now in your case, I'm guessing that the crash session may have
quietly "hung" during initialization?  And with debug turned on you
may have seen the readmem failures?

I tried to reproduce this by injecting a readmem() failure for
that particular readmem(), but it does not result in a loop.
In my test, the readmem() fails, max_cpudata_limit() eventually returns,
and kmem_cache_init() just goes on to the next kmem_cache in the chain.
Also, because that readmem() is explicitly set RETURN_ON_ERROR|QUIET, it can
conceivably fail without max_cpudata_limit() having to set KMEM_CACHE_UNAVAIL.

Anyway, if max_cpudata_limit() returns without setting KMEM_CACHE_UNAVAIL,
kmem_cache_init() should just continue to walk through the kmem_cache
chain:

        [ initialize "cache" and "cache_end" ]

        do {
                ... [ cut ] ...

                if ((tmp = max_cpudata_limit(cache, &tmp2)) > max_limit)
                        max_limit = tmp;

                /*
                 *  Recognize and bail out on any max_cpudata_limit() failures.
                 */
                if (vt->flags & KMEM_CACHE_UNAVAIL) {
                        FREEBUF(cache_buf);
                        return;
                }

                ... [ cut ] ...

                cache = ULONG(cache_buf + next_offset);

                switch (vt->flags & (PERCPU_KMALLOC_V1|PERCPU_KMALLOC_V2))
                {
                case PERCPU_KMALLOC_V1:
                        cache -= next_offset;
                        break;
                case PERCPU_KMALLOC_V2:
                        if (cache != cache_end)
                                cache -= next_offset;
                        break;
                }

        } while (cache != cache_end);

So I don't understand how you got into a loop unless the kmem_cache list
walk-through is the real problem.  If you were to print out the "cache"
address each time through the do-while loop, does the list start repeating
itself? 

And if that's true, perhaps kmem_cache_init() should use the
hq_open()/hq_enter()/hq_close() facility on each cache address to
catch a duplicate (false) entry.
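
To illustrate the kind of duplicate check I mean, here is a minimal
standalone sketch.  It is not crash code and does not use the hq_*
functions: seen_enter() and walk_list() are hypothetical names, the
"seen" set is a plain array standing in for the hash queue, and the
list is simulated with an array of next-indexes instead of kernel
addresses.  The do-while has the same shape as the kmem_cache walk,
but gives up as soon as an address shows up a second time.

```c
#include <stddef.h>

#define MAX_SEEN 64

/* Hypothetical stand-in for a hash queue: a tiny set of visited addresses. */
struct seen_set {
        unsigned long addr[MAX_SEEN];
        size_t count;
};

/*
 * Record addr in the set.  Returns 1 if it was newly recorded,
 * 0 if it is a duplicate (or if the set is full).
 */
static int seen_enter(struct seen_set *s, unsigned long addr)
{
        size_t i;

        for (i = 0; i < s->count; i++)
                if (s->addr[i] == addr)
                        return 0;
        if (s->count == MAX_SEEN)
                return 0;
        s->addr[s->count++] = addr;
        return 1;
}

/*
 * Walk a simulated list in which next[i] is the entry that follows
 * entry i.  Returns the number of entries visited, or -1 if an entry
 * repeats before "end" is reached, i.e. the chain is corrupt.
 */
static int walk_list(const unsigned long *next, unsigned long start,
                     unsigned long end)
{
        struct seen_set seen = { .count = 0 };
        unsigned long cache = start;
        int visited = 0;

        do {
                if (!seen_enter(&seen, cache))
                        return -1;      /* duplicate: stop instead of looping */
                visited++;
                cache = next[cache];
        } while (cache != end);

        return visited;
}
```

With a healthy chain 1 -> 2 -> 3 -> 0 and an end marker of 0,
walk_list() returns 3.  If entry 3 points back at entry 2 instead,
it returns -1 rather than spinning forever.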

Dave