[dm-devel] Serial console is causing system lock-up

Thu Mar 7 14:08:39 UTC 2019

>>>>> "John" == John Ogness <john.ogness at linutronix.de> writes:

John> On 2019-03-06, Steven Rostedt <rostedt at goodmis.org> wrote:
>>> This bug only happens if we select large logbuffer (millions of
>>> characters). With smaller log buffer, there are messages "** X printk
>>> messages dropped", but there's no lockup.
>>> 
>>> The kernel apparently puts 2 million characters into a console log
>>> buffer, then takes some lock and then tries to write all of them to a
>>> slow serial line.
>>> 
>>> [...]
>>> 
>>> The MD-RAID is supposed to recalculate data for the corrupted device
>>> and bring it back to life. However, scrubbing the MD-RAID device
>>> resulted in a lot of reads from the device with bad checksums, these
>>> were reported to the log and killed the machine.
>>> 
>>> I made a patch to dm-integrity to rate-limit the error messages. But
>>> anyway - killing the machine in case of too many log messages seems
>>> bad.  If the log messages are produced faster than the kernel can
>>> write them, the kernel should discard some of them, not kill itself.
>> 
>> Sounds like another argument for the new printk design.

John> Assuming the bad checksum messages are considered an emergency
John> (for example, at least loglevel KERN_WARNING), then the new printk
John> design would print those messages synchronously to the slow
John> serial line in the context of the driver as the driver is
John> producing them.

John> There wouldn't be a lock-up, but it would definitely slow down
John> the driver. The situation of "messages being produced faster
John> than the kernel can write them" would never exist because the
John> printk() call will only return after the writing is completed. I
John> am curious if that would be acceptable here?

The real problem is the mismatch between the throughput (in bits/sec)
of a serial console and that of the regular console.  Serial,
especially at 9600 baud, is a slow and limited resource which needs to
be handled differently than a graphical console.
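
To put a rough number on that, here is a back-of-the-envelope sketch
of how long it takes to drain a full log buffer over a serial line.
The 2 million characters and 9600 baud come from this thread; the
10 bits per character (8N1 framing: start bit, 8 data bits, stop bit)
and the 115200 baud comparison are my assumptions, and the function
name is made up for illustration.

```python
def drain_seconds(n_chars, baud, bits_per_char=10):
    """Seconds needed to push n_chars over a serial line.

    bits_per_char=10 assumes 8N1 framing (1 start + 8 data + 1 stop bit);
    that framing is an assumption, not something stated in the thread.
    """
    return n_chars * bits_per_char / baud


if __name__ == "__main__":
    # 2 million characters is the buffer size mentioned in the report.
    for baud in (9600, 115200):
        secs = drain_seconds(2_000_000, baud)
        print(f"{baud} baud: {secs:.0f} s ({secs / 60:.1f} min)")
```

Under those assumptions the buffer takes roughly 35 minutes to flush
at 9600 baud and close to 3 minutes at 115200 baud, which is a long
time to be holding a lock.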

I'm also big on ratelimiting messages, even critical warning
messages.  Too much redundant info doesn't help anyone.  And what a
subsystem thinks is critical may not be critical to the system as a
whole.

In this case, if these checksum messages are telling us that there's
corruption, why isn't dm-integrity going read-only, so that the
filesystem on top of the block device also goes read-only and the
damage stops right away?

If the warning is merely informational, then please rate limit the
messages, or summarize them in some more useful way.  Or even log
them somewhere other than the console once the problem has been noted.
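
As a rough illustration of "rate limit and summarize", here is a
minimal userspace sketch: allow a small burst of messages per time
window, count the rest, and emit a single summary line when the next
window opens.  The kernel has its own helpers for this (for example
printk_ratelimited()); the class, function, parameter values, and
message texts below are all hypothetical and only meant to show the
idea, not how dm-integrity does or should do it.

```python
class RateLimiter:
    """Allow at most `burst` messages per `interval` (same time unit as `now`)."""

    def __init__(self, interval, burst):
        self.interval = interval
        self.burst = burst
        self.window_start = None
        self.printed = 0
        self.suppressed = 0

    def allow(self, now):
        """Return (ok, summary); summary is a string when a window with
        suppressed messages has just ended, otherwise None."""
        summary = None
        if self.window_start is None or now - self.window_start >= self.interval:
            if self.suppressed:
                summary = f"{self.suppressed} similar messages suppressed"
            # Start a fresh window.
            self.window_start = now
            self.printed = 0
            self.suppressed = 0
        if self.printed < self.burst:
            self.printed += 1
            return True, summary
        self.suppressed += 1
        return False, summary


def simulate(n_errors, spacing_ms, interval_ms=5000, burst=10):
    """Feed n_errors evenly spaced errors through the limiter and
    return the lines that would actually reach the console."""
    limiter = RateLimiter(interval_ms, burst)
    lines = []
    for i in range(n_errors):
        ok, summary = limiter.allow(i * spacing_ms)
        if summary:
            lines.append(summary)
        if ok:
            lines.append(f"checksum error on sector {i}")
    return lines


if __name__ == "__main__":
    for line in simulate(10_000, 1):
        print(line)
```

With 10,000 errors arriving one millisecond apart, only 21 lines reach
the console instead of 10,000: two bursts of ten plus one summary.
The sketch does not flush a final summary for the last window; real
code would want to do that when the storm ends.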

John