[dm-devel] [PATCHES] new solution for dm_any_congested crash

Mikulas Patocka mpatocka at redhat.com
Fri Nov 14 01:55:27 UTC 2008


Hi

Chandra's patch was correct, but the problem is more serious (the same 
crash could happen in dm_merge_bvec, dm_unplug_all or at other places 
in dm), so I had to rework the reference counting.

There are three patches:
1. reverts Chandra's changes
2. just a small swap of two calls, to prepare for the third
3. the reference counting rework

Chandra, please test the patches on your system (without your original 
patch) and verify that they avoid the crashes as well as your patch does.

Mikulas
-------------- next part --------------
Revert dm-avoid-destroying-table-in-dm_any_congested.patch

The patch is correct, however the problem is more serious.

The same lazy destructor calls can happen in dm_merge_bvec,
dm_unplug_all and in many other device mapper places, so a more
general solution needs to be developed.

Signed-off-by: Mikulas Patocka <mpatocka at redhat.com>

---
 drivers/md/dm.c |   24 ++++++++----------------
 1 file changed, 8 insertions(+), 16 deletions(-)

Index: linux-2.6.28-rc4-devel/drivers/md/dm.c
===================================================================
--- linux-2.6.28-rc4-devel.orig/drivers/md/dm.c	2008-11-14 02:39:36.000000000 +0100
+++ linux-2.6.28-rc4-devel/drivers/md/dm.c	2008-11-14 02:39:09.000000000 +0100
@@ -941,24 +941,16 @@ static void dm_unplug_all(struct request
 
 static int dm_any_congested(void *congested_data, int bdi_bits)
 {
-	int r = bdi_bits;
-	struct mapped_device *md = congested_data;
-	struct dm_table *map;
-
-	atomic_inc(&md->pending);
-
-	if (!test_bit(DMF_BLOCK_IO, &md->flags)) {
-		map = dm_get_table(md);
-		if (map) {
-			r = dm_table_any_congested(map, bdi_bits);
-			dm_table_put(map);
-		}
-	}
+	int r;
+	struct mapped_device *md = (struct mapped_device *) congested_data;
+	struct dm_table *map = dm_get_table(md);
 
-	if (!atomic_dec_return(&md->pending))
-		/* nudge anyone waiting on suspend queue */
-		wake_up(&md->wait);
+	if (!map || test_bit(DMF_BLOCK_IO, &md->flags))
+		r = bdi_bits;
+	else
+		r = dm_table_any_congested(map, bdi_bits);
 
+	dm_table_put(map);
 	return r;
 }
 
-------------- next part --------------
Swap these two reference decrements so that the last reference to
the table is dropped in __unbind.

This is required for the following patch,
dm-table-rework-reference-counting.patch, which will change the logic in
such a way that the table destructor is called only at specific points in
the code.

Signed-off-by: Mikulas Patocka <mpatocka at redhat.com>

---
 drivers/md/dm.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-2.6.28-rc4-devel/drivers/md/dm.c
===================================================================
--- linux-2.6.28-rc4-devel.orig/drivers/md/dm.c	2008-11-14 02:39:09.000000000 +0100
+++ linux-2.6.28-rc4-devel/drivers/md/dm.c	2008-11-14 02:43:10.000000000 +0100
@@ -1322,8 +1322,8 @@ void dm_put(struct mapped_device *md)
 			dm_table_presuspend_targets(map);
 			dm_table_postsuspend_targets(map);
 		}
-		__unbind(md);
 		dm_table_put(map);
+		__unbind(md);
 		free_dev(md);
 	}
 }
-------------- next part --------------
Rework table reference counting.

The code before the patch uses a reference counter on the table. When the
last reference is dropped and the counter reaches zero, the table
destructor is called.
Table references are acquired/released from upcalls from other
kernel code (dm_any_congested, dm_merge_bvec, dm_unplug_all).
If the reference counter reaches zero in one of these upcalls, the table
destructor is called from almost random kernel code.

This leads to various problems:
* dm_any_congested being called under a spinlock, which calls the
  destructor, which calls some sleeping function.
* the destructor attempting to take a lock that is already taken by the
  same process.
* a stale reference from some other kernel code keeping the table
  constructed, which keeps some devices open, even after a successful
  return from "dmsetup remove". This can confuse lvm and prevent closing
  of underlying devices or reusing device minor numbers.
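
To make the failure mode concrete, here is a minimal user-space sketch of
the old scheme (the struct and function names are made up for illustration
and <stdatomic.h> stands in for the kernel atomics): whoever happens to
drop the last reference runs the destructor, so the destructor inherits
that caller's context, spinlocks and all.

#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>

/* old scheme: every user, including the owner, is counted in holders */
struct old_table {
	atomic_int holders;
};

static void old_table_destroy(struct old_table *t)
{
	/* in the kernel this path may sleep (vfree, closing devices, ...) */
	printf("destructor runs in the context of whoever did the last put\n");
	free(t);
}

static void old_table_put(struct old_table *t)
{
	/*
	 * The caller that takes the count to zero runs the destructor.
	 * If that caller is an upcall like dm_any_congested executing
	 * under a spinlock, a sleeping destructor is a bug.
	 */
	if (atomic_fetch_sub(&t->holders, 1) == 1)
		old_table_destroy(t);
}

int main(void)
{
	struct old_table *t = malloc(sizeof(*t));

	atomic_init(&t->holders, 1);	/* creator's reference */
	old_table_put(t);		/* last put -> destructor runs here */
	return 0;
}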

The patch changes the reference counting so that the table destructor
can be called only at predetermined places.

The table has always exactly one reference from either mapped_device->map
or hash_cell->new_map. This reference is not counted in table->holders.
A pair of dm_table_create/dm_table_destroy functions is used for table
creation/destruction.

Temporary references from the other code increase table->holders. A pair
of dm_table_get/dm_table_put functions is used to manipulate it.

When the table is about to be destroyed, we wait for table->holders to
reach 0. Then, we call the table destructor.

This way, the destructor is called only at a specific point
(the dm_table_destroy function) and the above problems associated with
lazy destruction cannot happen.
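
As a rough illustration of the new rules, here is a simplified user-space
sketch with made-up names (the real code below uses yield() and smp_mb(),
and the owner reference lives in mapped_device->map or hash_cell->new_map):

#include <sched.h>
#include <stdatomic.h>
#include <stdlib.h>

/* new scheme: holders counts only temporary references; the single
 * structural owner (md->map or hc->new_map) is not counted at all */
struct new_table {
	atomic_int holders;
};

static struct new_table *new_table_create(void)
{
	struct new_table *t = calloc(1, sizeof(*t));

	atomic_init(&t->holders, 0);	/* no temporary references yet */
	return t;
}

static void new_table_get(struct new_table *t)
{
	atomic_fetch_add(&t->holders, 1);
}

static void new_table_put(struct new_table *t)
{
	/* only drops a temporary reference; never destroys the table */
	atomic_fetch_sub(&t->holders, 1);
}

static void new_table_destroy(struct new_table *t)
{
	/*
	 * Called only by the owner at a well-defined point; wait for
	 * the temporary holders to drain, then free. No upcall can
	 * ever end up running this.
	 */
	while (atomic_load(&t->holders))
		sched_yield();
	free(t);
}

int main(void)
{
	struct new_table *t = new_table_create();

	new_table_get(t);	/* e.g. an upcall such as dm_any_congested */
	new_table_put(t);	/* done with the temporary reference */
	new_table_destroy(t);	/* the owner destroys at a known point */
	return 0;
}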

Signed-off-by: Mikulas Patocka <mpatocka at redhat.com>

---
 drivers/md/dm-ioctl.c |   10 ++++------
 drivers/md/dm-table.c |   27 +++++++++++++++++++++++----
 drivers/md/dm.c       |    7 ++++---
 drivers/md/dm.h       |    1 +
 4 files changed, 32 insertions(+), 13 deletions(-)

Index: linux-2.6.28-rc4-devel/drivers/md/dm-ioctl.c
===================================================================
--- linux-2.6.28-rc4-devel.orig/drivers/md/dm-ioctl.c	2008-11-14 02:36:45.000000000 +0100
+++ linux-2.6.28-rc4-devel/drivers/md/dm-ioctl.c	2008-11-14 02:43:14.000000000 +0100
@@ -233,7 +233,7 @@ static void __hash_remove(struct hash_ce
 	}
 
 	if (hc->new_map)
-		dm_table_put(hc->new_map);
+		dm_table_destroy(hc->new_map);
 	dm_put(hc->md);
 	free_cell(hc);
 }
@@ -827,8 +827,8 @@ static int do_resume(struct dm_ioctl *pa
 
 		r = dm_swap_table(md, new_map);
 		if (r) {
+			dm_table_destroy(new_map);
 			dm_put(md);
-			dm_table_put(new_map);
 			return r;
 		}
 
@@ -836,8 +836,6 @@ static int do_resume(struct dm_ioctl *pa
 			set_disk_ro(dm_disk(md), 0);
 		else
 			set_disk_ro(dm_disk(md), 1);
-
-		dm_table_put(new_map);
 	}
 
 	if (dm_suspended(md))
@@ -1080,7 +1078,7 @@ static int table_load(struct dm_ioctl *p
 	}
 
 	if (hc->new_map)
-		dm_table_put(hc->new_map);
+		dm_table_destroy(hc->new_map);
 	hc->new_map = t;
 	up_write(&_hash_lock);
 
@@ -1109,7 +1107,7 @@ static int table_clear(struct dm_ioctl *
 	}
 
 	if (hc->new_map) {
-		dm_table_put(hc->new_map);
+		dm_table_destroy(hc->new_map);
 		hc->new_map = NULL;
 	}
 
Index: linux-2.6.28-rc4-devel/drivers/md/dm-table.c
===================================================================
--- linux-2.6.28-rc4-devel.orig/drivers/md/dm-table.c	2008-11-14 02:36:45.000000000 +0100
+++ linux-2.6.28-rc4-devel/drivers/md/dm-table.c	2008-11-14 02:43:14.000000000 +0100
@@ -60,6 +60,21 @@ struct dm_table {
 };
 
 /*
+ * New table reference rules:
+ *
+ * The table has always exactly one reference from either mapped_device->map
+ * or hash_cell->new_map. This reference is not counted in table->holders.
+ * A pair of dm_table_create/dm_table_destroy functions is used for table
+ * creation/destruction.
+ *
+ * Temporary references from the other code increase table->holders. A pair
+ * of dm_table_get/dm_table_put functions is used to manipulate it.
+ *
+ * When the table is about to be destroyed, we wait for table->holders to
+ * reach 0. Then, we call the table destructor.
+ */
+
+/*
  * Similar to ceiling(log_size(n))
  */
 static unsigned int int_log(unsigned int n, unsigned int base)
@@ -223,7 +238,7 @@ int dm_table_create(struct dm_table **re
 		return -ENOMEM;
 
 	INIT_LIST_HEAD(&t->devices);
-	atomic_set(&t->holders, 1);
+	atomic_set(&t->holders, 0);
 
 	if (!num_targets)
 		num_targets = KEYS_PER_NODE;
@@ -253,10 +268,14 @@ static void free_devices(struct list_hea
 	}
 }
 
-static void table_destroy(struct dm_table *t)
+void dm_table_destroy(struct dm_table *t)
 {
 	unsigned int i;
 
+	while (atomic_read(&t->holders))
+		yield();
+	smp_mb();
+
 	/* free the indexes (see dm_table_complete) */
 	if (t->depth >= 2)
 		vfree(t->index[t->depth - 2]);
@@ -294,8 +313,8 @@ void dm_table_put(struct dm_table *t)
 	if (!t)
 		return;
 
-	if (atomic_dec_and_test(&t->holders))
-		table_destroy(t);
+	smp_mb__before_atomic_dec();
+	atomic_dec(&t->holders);
 }
 
 /*
Index: linux-2.6.28-rc4-devel/drivers/md/dm.c
===================================================================
--- linux-2.6.28-rc4-devel.orig/drivers/md/dm.c	2008-11-14 02:43:10.000000000 +0100
+++ linux-2.6.28-rc4-devel/drivers/md/dm.c	2008-11-14 02:43:14.000000000 +0100
@@ -1208,10 +1208,11 @@ static int __bind(struct mapped_device *
 
 	__set_size(md, size);
 
-	if (size == 0)
+	if (size == 0) {
+		dm_table_destroy(t);
 		return 0;
+	}
 
-	dm_table_get(t);
 	dm_table_event_callback(t, event_callback, md);
 
 	write_lock(&md->map_lock);
@@ -1233,7 +1234,7 @@ static void __unbind(struct mapped_devic
 	write_lock(&md->map_lock);
 	md->map = NULL;
 	write_unlock(&md->map_lock);
-	dm_table_put(map);
+	dm_table_destroy(map);
 }
 
 /*
Index: linux-2.6.28-rc4-devel/drivers/md/dm.h
===================================================================
--- linux-2.6.28-rc4-devel.orig/drivers/md/dm.h	2008-11-14 02:36:45.000000000 +0100
+++ linux-2.6.28-rc4-devel/drivers/md/dm.h	2008-11-14 02:43:14.000000000 +0100
@@ -36,6 +36,7 @@ struct dm_table;
 /*-----------------------------------------------------------------
  * Internal table functions.
  *---------------------------------------------------------------*/
+void dm_table_destroy(struct dm_table *t);
 void dm_table_event_callback(struct dm_table *t,
 			     void (*fn)(void *), void *context);
 struct dm_target *dm_table_get_target(struct dm_table *t, unsigned int index);

