[dm-devel] Serial console is causing system lock-up

Mikulas Patocka mpatocka at redhat.com
Thu Mar 7 12:54:44 UTC 2019



On Thu, 7 Mar 2019, Sergey Senozhatsky wrote:

> On (03/07/19 11:37), John Ogness wrote:
> > > Sorry John, the reasoning is that I'm trying to understand
> > > why this does not look like soft or hard lock-up or RCU stall
> > > scenario.
> > 
> > The reason is that you are seeing data being printed on the console. The
> > watchdogs (soft, hard, rcu, nmi) are all touched with each emergency
> > message.
> 
> Correct. Please see below.
> 
> > > The CPU which spins on prb_lock() can have preemption disabled and,
> > > additionally, can have local IRQs disabled, or be under RCU read
> > > side lock. If consoles are busy, then there are CPUs which printk()
> > > data and keep prb_lock contended; prb_lock() does not seem to be
> > > fair. What am I missing?
> > 
> > You are correct. Making prb_lock fair might be something we want to look
> > into. Perhaps also based on the loglevel of what needs to be
> > printed. (For example, KERN_ALERT always wins over KERN_CRIT.)
> 
> Good.
> 
> I'm not insisting, but I have a feeling that touching watchdogs after
> call_console_drivers() might be too late, sometimes. When we spin in

There are still NMI stacktraces - see here 
http://people.redhat.com/~mpatocka/testcases/console-lockup/

> prb_lock() we wait for all CPUs which are before/ahead of us to
> finish their call_console_drivers(), one by one. So if CPUZ is very
> unlucky and is in atomic context, then prb_lock() for that CPUZ can
> last for  N * call_console_drivers().  And depending on N (which also
> includes unfairness) and call_console_drivers() timings NMI watchdog
> may pay CPUZ a visit before it gets its chance to touch watchdogs.
> 
> *May be* sometimes we might want to touch watchdogs in prb_lock().
> 
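Something like this, perhaps? A rough sketch only - the loop below is not the 
real prb_lock(), it just illustrates keeping the lockup detectors quiet while 
spinning on a contended console lock:

	/* Illustrative only: spin on an "owner" word and touch the watchdogs
	 * while waiting for the current owner to finish call_console_drivers(). */
	static void prb_lock_sketch(atomic_t *owner, int this_cpu)
	{
		while (atomic_cmpxchg(owner, -1, this_cpu) != -1) {
			touch_nmi_watchdog();	/* also touches the softlockup watchdog */
			cpu_relax();
		}
	}

That would at least keep the NMI watchdog from firing purely because of the 
wait itself.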
> So, given the design of new printk(), I can't help thinking about the
> fact that current
> 	"the winner takes it all"
> may become
> 	"the winner waits for all".
> 
> Mikulas mentioned that he observes "** X messages dropped" warnings.
> And this suggests that, _most likely_, we had significantly more than
> 2 CPUs calling printk() concurrently.

When I observe these messages (usually with a small log buffer size), it 
doesn't lock up.

The lockups happen because the messages are stuffed into a 2MiB buffer and 
then printed over a 115200-baud serial line.

You can see this: 
http://people.redhat.com/~mpatocka/testcases/console-lockup/5.0-total-lockup.txt

Here it attempted to write 1355277 bytes over the slow serial line, which 
takes a few minutes.
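
For reference, a rough back-of-the-envelope estimate (assuming 8N1 framing, 
i.e. about 10 bits on the wire per byte, and no flow control; the helper 
below is purely illustrative, not kernel code):

	/* Estimate how long it takes to drain a given number of bytes over
	 * a serial console at the given baud rate (8N1 framing assumed). */
	static unsigned long serial_drain_secs(unsigned long bytes,
					       unsigned long baud)
	{
		unsigned long bytes_per_sec = baud / 10;	/* 115200 -> 11520 B/s */

		return bytes / bytes_per_sec;			/* 1355277 / 11520 ~= 117 s */
	}

So that single dump needs roughly two minutes of pure transmission time, 
before counting any other console traffic.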

> - A single source - a single CPU calling printk() - would not lose messages,
>   because it would print its own message before printk()-ing another one (we
>   still could have another CPU rescheduling under console_sem, but I don't
>   think this is the case).
> 
> - Two CPUs would also probably not lose messages; Steven's console_owner
>   would throttle them down.
> 
> So I think what we had was a spike of WARN/ERR printk-s coming from
> N CPUs concurrently.

Losing messages is, in my opinion, reasonable (if they are produced faster 
than they can be printed). Another possibility is to always write the 
message synchronously and exit printk() only after it has been written.
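
A minimal sketch of what I mean by "synchronous" (purely illustrative - the 
lock and emit_to_consoles() below are placeholders, not the real printk 
internals):

	#include <linux/spinlock.h>

	static DEFINE_SPINLOCK(sync_console_lock);

	/* Hypothetical: the caller does not return until its own message has
	 * actually been handed to the console driver. */
	void printk_sync(const char *text, size_t len)
	{
		unsigned long flags;

		spin_lock_irqsave(&sync_console_lock, flags);
		emit_to_consoles(text, len);	/* placeholder; blocks until the line drains */
		spin_unlock_irqrestore(&sync_console_lock, flags);
	}

The obvious tradeoff is that no messages are dropped, but every printk() 
caller then pays the full serial-line latency for its own message.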

> And this brings us to another pessimistic scenario: a very unlucky
> CPUZ has to spin in prb_lock() waiting for other CPUs to print out
> the very same 2 million chars. Which in terms of printk() latency
> looks to me just like current printk.
> 
> John, sorry to ask this, does the new printk() design always provide
> latency guarantees good enough for PREEMPT_RT?
> 
> I'm surely missing something. Where am I wrong?
> 
> 	-ss

Mikulas



