[dm-devel] Kernel BUG at dm-cache-policy-mq.c

Mike Snitzer snitzer at redhat.com
Tue Mar 21 16:26:31 UTC 2017


On Tue, Mar 21 2017 at  9:02am -0400,
Stanislas Oger <stanislas.oger at gmail.com> wrote:

> Hi,
> 
> We are currently hitting a critical issue on a Proxmox cluster we
> operate, which seems to be triggered by a bug in dm-cache ("kernel
> BUG at drivers/md/dm-cache-policy-mq.c:1079!", see syslog below).
> 
> 
> 1/ Context
> 
> The Proxmox cluster runs a 4.4 kernel; the VM storage is a DRBD9
> cluster on top of LVM with SSD caching. The underlying disks are on
> a MegaRAID hardware RAID.
> The problem started after we installed a VM (a mail server) that
> performs many disk reads on many small files (~ 1 million), taking
> a read lock with flock on each read. With the VM fully running,
> the IO wait of the system is less than 1%.
> 
> 
> 2/ The problem
> 
> At random, with no warning signs, syslog reports a bug in
> dm-cache-policy-mq.c (see below). A few minutes later all write
> operations block indefinitely. A few minutes after the node stops
> performing write operations, the other DRBD9 nodes stop writing too.
> At this point the whole cluster is down. Reads can be done as usual,
> but write operations block indefinitely.
> 
> The only way we have found to recover from this situation is to
> hard reboot the failing node. As soon as the failing node is
> down, the other nodes resume normal activity. When the failing
> node is up again, DRBD9 performs disk resynchronization and the
> cluster resumes normal activity, as if nothing had happened.
> 
> The bug occurred with both 4.4.35 and 4.4.40 kernels, with a
> frequency of about once every 10 days.

How large is your cache? (size of slow and fast device?)

Have you tried the smq policy?  mq is no longer maintained (has been
removed and made an alias of smq, see commit 9ed84698fdda ("dm cache:
make the 'mq' policy an alias for 'smq'")).
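
To check which policy a cache device is actually using, the policy name
can be read from the device's status line (e.g. from "dmsetup status").
Below is a minimal sketch, assuming the status field layout described in
the kernel's dm-cache documentation. The function name cache_policy and
the SAMPLE line are made up for illustration; the numbers in SAMPLE are
not taken from the system discussed in this thread.

```python
# Illustrative status line only; the values are invented for this example.
SAMPLE = (
    "0 209715200 cache 8 1024/262144 128 5000/163840 "
    "1200 340 560 78 0 12 3 1 writeback 2 migration_threshold 2048 "
    "mq 0 rw -"
)


def cache_policy(status_line):
    """Return the policy name from one dm-cache status line.

    Assumed layout after the "cache" target type:
      metadata block size, used/total metadata blocks, cache block size,
      used/total cache blocks, read hits, read misses, write hits,
      write misses, demotions, promotions, dirty,
      #features <features>*, #core args <core args>*, policy name, ...
    """
    fields = status_line.split()
    # Everything after the target type follows the dm-cache layout.
    rest = fields[fields.index("cache") + 1:]

    pos = 11                      # eleven fixed fields precede #features
    pos += 1 + int(rest[pos])     # skip the feature count and the features
    pos += 1 + int(rest[pos])     # skip the core-arg count and the core args
    return rest[pos]              # the next field is the policy name


if __name__ == "__main__":
    print(cache_policy(SAMPLE))
```

If the layout on a given kernel differs from the one assumed here, the
field offsets would need adjusting.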

It should be noted that dm-cache is changing significantly in 4.12
(already staged in linux-next), see:
https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/log/?h=dm-4.12

The new smq code doesn't have the BUG_ON() in question.



