[dm-devel] [PATCH] updated dm-switch

Mikulas Patocka mpatocka at redhat.com
Wed May 22 21:14:18 UTC 2013


Hi

This is the updated dm-switch. It has better documentation and better error 
reporting in the message function.

It also fixes bugs introduced when someone reworked my original code. 
The code placed here 
http://people.redhat.com/agk/patches/linux/editing/dm-add-switch-target.patch 
is very buggy - it has argument parsing bugs that prevent it from being 
loaded at all and an uninitialized sctx->nr_paths variable that results in 
memory corruption.

Mikulas

---

From: Jim Ramsay <jim_ramsay at dell.com>

dm-switch is a new target that maps IO to underlying block devices
efficiently when there are a large number of fixed-sized address regions
but there is no simple pattern to allow for a compact mapping
representation such as dm-stripe.

Motivation
----------

Dell EqualLogic and some other iSCSI storage arrays use a distributed
frameless architecture.  In this architecture, the storage group
consists of a number of distinct storage arrays ("members"), each having
independent controllers, disk storage and network adapters.  When a LUN
is created it is spread across multiple members.  The details of the
spreading are hidden from initiators connected to this storage system.
The storage group exposes a single target discovery portal, no matter
how many members are being used.  When iSCSI sessions are created, each
session is connected to an eth port on a single member.  Data to a LUN
can be sent on any iSCSI session, and if the blocks being accessed are
stored on another member the IO will be forwarded as required.  This
forwarding is invisible to the initiator.  The storage layout is also
dynamic, and the blocks stored on disk may be moved from member to
member as needed to balance the load.

This architecture simplifies the management and configuration of both
the storage group and initiators.  In a multipathing configuration, it
is possible to set up multiple iSCSI sessions to use multiple network
interfaces on both the host and target to take advantage of the
increased network bandwidth.  An initiator could use a simple round robin
algorithm to send IO across all paths and let the storage array members
forward it as necessary, but there is a performance advantage to
sending data directly to the correct member.  

The Device Mapper table architecture already supports designating different
address regions with different targets, but in our architecture the LUN
is spread with an address region size on the order of 10s of MBs, which
means the resulting DM table could have more than a million entries
and consume far too much memory.
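The memory argument can be made concrete with a back-of-the-envelope sketch
(the volume size, region size, and per-entry cost below are illustrative
assumptions, not measurements from either implementation):

```python
# Compare a packed 4-bit lookup table against one DM table entry per region.
LUN_SIZE = 10 * 2**40          # assume a 10 TiB volume
REGION_SIZE = 16 * 2**20       # 16 MiB regions ("on the order of 10s of MBs")
DM_TABLE_ENTRY = 40            # rough bytes per DM table entry (assumption)

nr_regions = LUN_SIZE // REGION_SIZE         # 655360 regions
bitmap_bytes = nr_regions * 4 // 8           # 4 bits per region (16 members)
table_bytes = nr_regions * DM_TABLE_ENTRY    # conventional DM table cost

print(nr_regions)    # 655360
print(bitmap_bytes)  # 327680 bytes, i.e. 320 KiB
print(table_bytes)   # 26214400 bytes, i.e. ~25 MiB
```

So even before reaching the million-entry mark, a per-region DM table costs
tens of megabytes where the packed representation needs a few hundred
kilobytes.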

Solution
--------
Based on earlier discussion with the dm-devel contributors, we have
solved this problem by using Device Mapper to build a two-layer device
hierarchy:

    Upper Tier – Determine which array member the IO should be sent to.
    Lower Tier – Load balance amongst paths to a particular member.

The lower tier consists of a single dm multipath device for each member.
Each of these multipath devices contains the set of paths directly to
the array member in one priority group, and leverages existing path
selectors to load balance amongst these paths.  We also build a
non-preferred priority group containing paths to other array members for
failover reasons.
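A lower-tier device of this shape might be assembled by hand as follows (a
sketch only; the /dev/sd* names and the use of two direct and two failover
paths are hypothetical, and in practice multipath-tools would generate the
table):

```shell
# Two priority groups: the first (preferred) group round-robins across the
# two paths directly to this member; the second holds paths via other
# members and is used only on failover.
dmsetup create mpath-member0 --table "0 $(blockdev --getsize /dev/sdb) multipath \
    0 0 2 1 \
    round-robin 0 2 1 /dev/sdb 1000 /dev/sdc 1000 \
    round-robin 0 2 1 /dev/sdd 1000 /dev/sde 1000"
```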

The upper tier consists of a single dm switch device, using the new DM
target module.  This device uses a bitmap to look up the location of the
IO and choose the appropriate lower tier device to route the IO.  By
using a bitmap we are able to use 4 bits for each address range in a 16
member group (which is very large for us).  This is a much denser
representation than the DM table B-tree can achieve.
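The packed-bitmap scheme described above can be sketched in plain Python
(this mirrors the idea, not the kernel code; the names are ours):

```python
# 4-bit region entries packed into 64-bit words: 16 regions per word,
# each entry naming one of up to 16 group members.
ENTRY_BITS = 4
ENTRIES_PER_SLOT = 64 // ENTRY_BITS

def table_write(table, region_nr, member):
    slot, idx = divmod(region_nr, ENTRIES_PER_SLOT)
    shift = idx * ENTRY_BITS
    table[slot] &= ~(0xF << shift)   # clear the old 4-bit entry
    table[slot] |= member << shift   # store the new member number

def table_read(table, region_nr):
    slot, idx = divmod(region_nr, ENTRIES_PER_SLOT)
    return (table[slot] >> (idx * ENTRY_BITS)) & 0xF

table = [0] * 4                      # room for 64 regions
table_write(table, 37, 11)
print(table_read(table, 37))         # 11
```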

Though we have developed this target for a specific storage device, we
have made an effort to keep it as general purpose as possible in the hope
that others may benefit.

Originally developed by Jim Ramsay. Simplified by Mikulas Patocka.

Signed-off-by: Jim Ramsay <jim_ramsay at dell.com>
Signed-off-by: Mikulas Patocka <mpatocka at redhat.com>

FIXME reformatted new-style documentation file, complete code review (table storage/message parsing), finish variable/fn naming review

---
 Documentation/device-mapper/switch.txt |   83 +++++
 drivers/md/Kconfig                     |   14 
 drivers/md/Makefile                    |    1 
 drivers/md/dm-switch.c                 |  539 +++++++++++++++++++++++++++++++++
 4 files changed, 637 insertions(+)

Index: linux-3.9.3-fast/Documentation/device-mapper/switch.txt
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-3.9.3-fast/Documentation/device-mapper/switch.txt	2013-05-22 22:42:57.000000000 +0200
@@ -0,0 +1,83 @@
+dm-switch
+=========
+
+The dm-switch target is suitable for the Dell EqualLogic storage system.
+
+The EqualLogic storage consists of several nodes. Each host is connected
+to each node. The host may send I/O requests to any node; the node that
+receives the request forwards it to the node where the data is stored.
+
+However, there is a performance advantage to sending I/O requests directly
+to the node where the data is stored, avoiding forwarding. The dm-switch
+target was created to exploit this performance advantage.
+
+The dm-switch target splits the device to fixed-size regions. It maintains
+a redirection table that maps regions to storage nodes. Every request is
+forwarded to the corresponding storage node specified in the redirection
+table. The table may be changed with messages while the dm-switch target
+is running.
+
+
+Construction Parameters
+=======================
+
+    <num_nodes> <region_size> <num_optional_arguments>
+    for each node: <device_name> <offset>
+
+<num_nodes>
+    The number of nodes in the EqualLogic storage.
+
+<region_size>
+    The number of 512-byte sectors in a region. Each region can be
+    redirected to a specified node.
+
+<num_optional_arguments>
+    The number of optional arguments. Currently no optional arguments are
+    supported, so this must be zero.
+
+<device_name>
+    The block device that is used to access the specific node.
+
+<offset>
+    The offset of the data start on the specific node (in units of
+    512-byte sectors). This number is added to the sector number when
+    forwarding the request to the specific node. Typically, it is zero.
+
+
+Messages
+========
+
+set_region_mappings <index>:<node> [<index>]:<node> [<index>]:<node>...
+
+Modify the region table - set which regions are redirected to which nodes.
+
+<index>
+    Hexadecimal argument.
+    The region number (the region size was specified in the constructor
+    parameters). If the index is omitted, the next region (previous index + 1) is used.
+
+<node>
+    Hexadecimal argument.
+    The node number (in the range 0 ... <num_nodes> - 1).
+
+
+Status
+======
+
+There is no status line.
+
+
+Example
+=======
+
+Assume that you have volumes vg1/switch0 vg1/switch1 vg1/switch2 with
+the same size.
+
+Create a switch device with 64kB region size:
+    dmsetup create switch --table "0 `blockdev --getsize /dev/vg1/switch0`
+	switch 3 128 0 /dev/vg1/switch0 0 /dev/vg1/switch1 0 /dev/vg1/switch2 0"
+
+Set mappings for the first 7 entries to point to devices switch0, switch1,
+switch2, switch0, switch1, switch2, switch1:
+    dmsetup message switch 0 set_region_mappings 0:1 :2 :0 :1 :2 :0 :1
+
Index: linux-3.9.3-fast/drivers/md/Kconfig
===================================================================
--- linux-3.9.3-fast.orig/drivers/md/Kconfig	2013-05-22 22:27:40.000000000 +0200
+++ linux-3.9.3-fast/drivers/md/Kconfig	2013-05-22 22:28:28.000000000 +0200
@@ -428,4 +428,18 @@ config DM_VERITY
 
 	  If unsure, say N.
 
+config DM_SWITCH
+	tristate "Switch target support (EXPERIMENTAL)"
+	depends on BLK_DEV_DM
+	---help---
+	  This device-mapper target creates a device that supports an arbitrary
+	  mapping of fixed-size regions of I/O across a fixed set of paths.
+	  The path used for any specific region can be switched dynamically
+	  by sending the target a message.
+
+	  To compile this code as a module, choose M here: the module will
+	  be called dm-switch.
+
+	  If unsure, say N.
+
 endif # MD
Index: linux-3.9.3-fast/drivers/md/Makefile
===================================================================
--- linux-3.9.3-fast.orig/drivers/md/Makefile	2013-05-22 22:27:40.000000000 +0200
+++ linux-3.9.3-fast/drivers/md/Makefile	2013-05-22 22:28:28.000000000 +0200
@@ -53,6 +53,7 @@ obj-$(CONFIG_DM_CACHE)		+= dm-cache.o
 obj-$(CONFIG_DM_CACHE_MQ)	+= dm-cache-mq.o
 obj-$(CONFIG_DM_CACHE_CLEANER)	+= dm-cache-cleaner.o
 obj-$(CONFIG_DM_ZEROED)		+= dm-zeroed.o
+obj-$(CONFIG_DM_SWITCH)		+= dm-switch.o
 
 ifeq ($(CONFIG_DM_UEVENT),y)
 dm-mod-objs			+= dm-uevent.o
Index: linux-3.9.3-fast/drivers/md/dm-switch.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-3.9.3-fast/drivers/md/dm-switch.c	2013-05-22 23:04:00.000000000 +0200
@@ -0,0 +1,539 @@
+/*
+ * Copyright (C) 2010-2012 by Dell Inc.  All rights reserved.
+ * Copyright (C) 2011-2012 Red Hat, Inc.
+ *
+ * This file is released under the GPL.
+ *
+ * dm-switch is a device-mapper target that maps IO to underlying block devices
+ * efficiently when there are a large number of fixed-sized address regions
+ * but there is no simple pattern to allow for a compact mapping
+ * representation such as dm-stripe.
+ */
+
+#include <linux/device-mapper.h>
+
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/vmalloc.h>
+
+#define DM_MSG_PREFIX "switch"
+
+/*
+ * One region_table_slot holds <region_entries_per_slot> region table
+ * entries each of which is <region_table_entry_bits> in size.
+ */
+typedef unsigned long region_table_slot;
+
+/*
+ * A device with the offset to its start sector.
+ */
+struct switch_path {
+	struct dm_dev *dmdev;
+	sector_t start;
+};
+
+/*
+ * Context block for a dm switch device.
+ */
+struct switch_ctx {
+	struct dm_target *ti;
+
+	unsigned nr_paths;		/* Number of paths in path_list. */
+
+	unsigned region_size;		/* Region size in 512-byte sectors */
+	unsigned long nr_regions;	/* Number of regions making up the device */
+	signed char region_size_bits;	/* log2 of region_size or -1 */
+
+	unsigned char region_table_entry_bits;	/* Number of bits in one region table entry */
+	unsigned char region_entries_per_slot;	/* Number of entries in one region table slot */
+	signed char region_entries_per_slot_bits;	/* log2 of region_entries_per_slot or -1 */
+
+	region_table_slot *region_table;	/* Region table */
+
+	/*
+	 * Array of dm devices to switch between.
+	 */
+	struct switch_path path_list[0];
+};
+
+static struct switch_ctx *alloc_switch_ctx(struct dm_target *ti, unsigned nr_paths,
+					   unsigned region_size)
+{
+	struct switch_ctx *sctx;
+
+	sctx = kzalloc(sizeof(struct switch_ctx) + nr_paths * sizeof(struct switch_path),
+		       GFP_KERNEL);
+	if (!sctx)
+		return NULL;
+
+	sctx->ti = ti;
+	sctx->region_size = region_size;
+
+	ti->private = sctx;
+
+	return sctx;
+}
+
+static int alloc_region_table(struct dm_target *ti, unsigned nr_paths)
+{
+	struct switch_ctx *sctx = ti->private;
+	sector_t nr_regions = ti->len;
+	sector_t nr_slots;
+
+/* FIXME: still to review table construction/usage */
+	if (!(sctx->region_size & (sctx->region_size - 1)))
+		sctx->region_size_bits = __ffs(sctx->region_size);
+	else
+		sctx->region_size_bits = -1;
+
+	sctx->region_table_entry_bits = 1;
+	while (sctx->region_table_entry_bits < sizeof(region_table_slot) * 8 &&
+	       (region_table_slot)1 << sctx->region_table_entry_bits < nr_paths)
+		sctx->region_table_entry_bits++;
+
+	sctx->region_entries_per_slot = (sizeof(region_table_slot) * 8) / sctx->region_table_entry_bits;
+	if (!(sctx->region_entries_per_slot & (sctx->region_entries_per_slot - 1)))
+		sctx->region_entries_per_slot_bits = __ffs(sctx->region_entries_per_slot);
+	else
+		sctx->region_entries_per_slot_bits = -1;
+
+	if (sector_div(nr_regions, sctx->region_size))
+		nr_regions++;
+
+	sctx->nr_regions = nr_regions;
+	if (sctx->nr_regions != nr_regions || sctx->nr_regions >= ULONG_MAX) {
+		ti->error = "Region table too large";
+		return -EINVAL;
+	}
+
+	nr_slots = nr_regions;
+	if (sector_div(nr_slots, sctx->region_entries_per_slot))
+		nr_slots++;
+
+	if (nr_slots > ULONG_MAX / sizeof(region_table_slot)) {
+		ti->error = "Region table too large";
+		return -EINVAL;
+	}
+
+	sctx->region_table = vmalloc(nr_slots * sizeof(region_table_slot));
+	if (!sctx->region_table) {
+		ti->error = "Cannot allocate region table";
+		return -ENOMEM;
+	}
+
+	return 0;
+}
+
+static void switch_get_position(struct switch_ctx *sctx, unsigned long region_nr,
+				unsigned long *region_index, unsigned *bit)
+{
+	if (sctx->region_entries_per_slot_bits >= 0) {
+		*region_index = region_nr >> sctx->region_entries_per_slot_bits;
+		*bit = region_nr & (sctx->region_entries_per_slot - 1);
+	} else {
+		*region_index = region_nr / sctx->region_entries_per_slot;
+		*bit = region_nr % sctx->region_entries_per_slot;
+	}
+
+	*bit *= sctx->region_table_entry_bits;
+}
+
+/*
+ * Find which path to use at given offset.
+ */
+static unsigned switch_get_path_nr(struct switch_ctx *sctx, sector_t offset)
+{
+	unsigned long region_index;
+	unsigned bit, path_nr;
+	sector_t p;
+
+	p = offset;
+	if (sctx->region_size_bits >= 0)
+		p >>= sctx->region_size_bits;
+	else
+		sector_div(p, sctx->region_size);
+
+	switch_get_position(sctx, p, &region_index, &bit);
+	path_nr = (ACCESS_ONCE(sctx->region_table[region_index]) >> bit) &
+	       ((1 << sctx->region_table_entry_bits) - 1);
+
+	/* This can only happen if the processor uses non-atomic stores. */
+	if (unlikely(path_nr >= sctx->nr_paths))
+		path_nr = 0;
+
+	return path_nr;
+}
+
+static void switch_region_table_write(struct switch_ctx *sctx, unsigned long region_nr,
+				      unsigned value)
+{
+	unsigned long region_index;
+	unsigned bit;
+	region_table_slot pte;
+
+	switch_get_position(sctx, region_nr, &region_index, &bit);
+
+	pte = sctx->region_table[region_index];
+	pte &= ~((((region_table_slot)1 << sctx->region_table_entry_bits) - 1) << bit);
+	pte |= (region_table_slot)value << bit;
+	sctx->region_table[region_index] = pte;
+}
+
+/*
+ * Fill the region table with an initial round robin pattern.
+ */
+static void initialise_region_table(struct switch_ctx *sctx)
+{
+	unsigned path_nr = 0;
+	unsigned long region_nr;
+
+	for (region_nr = 0; region_nr < sctx->nr_regions; region_nr++) {
+		switch_region_table_write(sctx, region_nr, path_nr);
+		if (++path_nr >= sctx->nr_paths)
+			path_nr = 0;
+	}
+}
+
+static int parse_path(struct dm_arg_set *as, struct dm_target *ti)
+{
+	struct switch_ctx *sctx = ti->private;
+	unsigned long long start;
+	int r;
+
+	r = dm_get_device(ti, dm_shift_arg(as), dm_table_get_mode(ti->table),
+			  &sctx->path_list[sctx->nr_paths].dmdev);
+	if (r) {
+		ti->error = "Device lookup failed";
+		return r;
+	}
+
+	if (kstrtoull(dm_shift_arg(as), 10, &start) || start != (sector_t)start) {
+		ti->error = "Invalid device starting offset";
+		dm_put_device(ti, sctx->path_list[sctx->nr_paths].dmdev);
+		return -EINVAL;
+	}
+
+	sctx->path_list[sctx->nr_paths].start = start;
+
+	sctx->nr_paths++;
+
+	return 0;
+}
+
+/*
+ * Destructor: Don't free the dm_target, just the ti->private data (if any).
+ */
+static void switch_dtr(struct dm_target *ti)
+{
+	struct switch_ctx *sctx = ti->private;
+
+	while (sctx->nr_paths--)
+		dm_put_device(ti, sctx->path_list[sctx->nr_paths].dmdev);
+
+	vfree(sctx->region_table);
+	kfree(sctx);
+}
+
+/*
+ * Constructor arguments:
+ *   <num paths> <region size> <num optional args> [<optional args>...]
+ *   [<dev_path> <offset>]+
+ *
+ * Optional args are to allow for future extension: currently this
+ * parameter must be 0.
+ */
+static int switch_ctr(struct dm_target *ti, unsigned argc, char **argv)
+{
+	static struct dm_arg _args[] = {
+		{1, (KMALLOC_MAX_SIZE - sizeof(struct switch_ctx)) / sizeof(struct switch_path), "Invalid number of paths"},
+		{1, UINT_MAX, "Invalid region size"},
+		{0, 0, "Invalid number of optional args"},
+	};
+
+	struct switch_ctx *sctx;
+	struct dm_arg_set as;
+	unsigned nr_paths, region_size, nr_optional_args;
+	int r;
+
+	as.argc = argc;
+	as.argv = argv;
+
+	r = dm_read_arg(_args, &as, &nr_paths, &ti->error);
+	if (r)
+		return -EINVAL;
+
+	r = dm_read_arg(_args + 1, &as, &region_size, &ti->error);
+	if (r)
+		return r;
+
+	r = dm_read_arg_group(_args + 2, &as, &nr_optional_args, &ti->error);
+	if (r)
+		return r;
+	/* parse optional arguments here, if we add any */
+
+	if (as.argc != nr_paths * 2) {
+		ti->error = "Incorrect number of path arguments";
+		return -EINVAL;
+	}
+
+	sctx = alloc_switch_ctx(ti, nr_paths, region_size);
+	if (!sctx) {
+		ti->error = "Cannot allocate redirection context";
+		return -ENOMEM;
+	}
+
+	r = dm_set_target_max_io_len(ti, region_size);
+	if (r)
+		goto error;
+
+	while (as.argc) {
+		r = parse_path(&as, ti);
+		if (r)
+			goto error;
+	}
+
+	r = alloc_region_table(ti, nr_paths);
+	if (r)
+		goto error;
+
+	initialise_region_table(sctx);
+
+	/* For UNMAP, sending the request down any path is sufficient */
+	ti->num_discard_bios = 1;
+
+	return 0;
+
+error:
+	switch_dtr(ti);
+
+	return r;
+}
+
+static int switch_map(struct dm_target *ti, struct bio *bio)
+{
+	struct switch_ctx *sctx = ti->private;
+	sector_t offset = dm_target_offset(ti, bio->bi_sector);
+	unsigned path_nr = switch_get_path_nr(sctx, offset);
+
+	bio->bi_bdev = sctx->path_list[path_nr].dmdev->bdev;
+	bio->bi_sector = sctx->path_list[path_nr].start + offset;
+
+	return DM_MAPIO_REMAPPED;
+}
+
+/*
+ * We need to parse hex numbers in the message as quickly as possible.
+ *
+ * This table-based hex parser improves performance.
+ * It improves a time to load 1000000 entries compared to the condition-based
+ * parser.
+ *		table-based parser	condition-based parser
+ * PA-RISC	0.29s			0.31s
+ * Opteron	0.0495s			0.0498s
+ */
+static const unsigned char hex_table[256] = {
+255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
+255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
+255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
+0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 255, 255, 255, 255, 255, 255,
+255, 10, 11, 12, 13, 14, 15, 255, 255, 255, 255, 255, 255, 255, 255, 255,
+255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
+255, 10, 11, 12, 13, 14, 15, 255, 255, 255, 255, 255, 255, 255, 255, 255,
+255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
+255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
+255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
+255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
+255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
+255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
+255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
+255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
+255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255
+};
+
+static __always_inline unsigned long parse_hex(const char **string)
+{
+	unsigned char d;
+	unsigned long r = 0;
+
+	while ((d = hex_table[(unsigned char)**string]) < 16) {
+		r = (r << 4) | d;
+		(*string)++;
+	}
+
+	return r;
+}
+
+static int process_set_region_mappings(struct switch_ctx *sctx,
+			     unsigned argc, char **argv)
+{
+	unsigned i;
+	unsigned long region_index = 0;
+
+	for (i = 1; i < argc; i++) {
+		unsigned long path_nr;
+		const char *string = argv[i];
+
+		if (*string == ':')
+			region_index++;
+		else {
+			region_index = parse_hex(&string);
+			if (unlikely(*string != ':')) {
+				DMWARN("invalid set_region_mappings argument: '%s'", argv[i]);
+				return -EINVAL;
+			}
+		}
+
+		string++;
+		if (unlikely(!*string)) {
+			DMWARN("invalid set_region_mappings argument: '%s'", argv[i]);
+			return -EINVAL;
+		}
+
+		path_nr = parse_hex(&string);
+		if (unlikely(*string)) {
+			DMWARN("invalid set_region_mappings argument: '%s'", argv[i]);
+			return -EINVAL;
+		}
+		if (unlikely(region_index >= sctx->nr_regions)) {
+			DMWARN("invalid set_region_mappings region number: %lu >= %lu", region_index, sctx->nr_regions);
+			return -EINVAL;
+		}
+		if (unlikely(path_nr >= sctx->nr_paths)) {
+			DMWARN("invalid set_region_mappings device: %lu >= %u", path_nr, sctx->nr_paths);
+			return -EINVAL;
+		}
+
+		switch_region_table_write(sctx, region_index, path_nr);
+	}
+
+	return 0;
+}
+
+/*
+ * Messages are processed one-at-a-time.
+ *
+ * Only set_region_mappings is supported.
+ */
+static int switch_message(struct dm_target *ti, unsigned argc, char **argv)
+{
+	static DEFINE_MUTEX(message_mutex);
+
+	struct switch_ctx *sctx = ti->private;
+	int r = -EINVAL;
+
+	mutex_lock(&message_mutex);
+
+	if (!strcasecmp(argv[0], "set_region_mappings"))
+		r = process_set_region_mappings(sctx, argc, argv);
+	else
+		DMWARN("Unrecognised message received.");
+
+	mutex_unlock(&message_mutex);
+
+	return r;
+}
+
+static void switch_status(struct dm_target *ti, status_type_t type,
+			  unsigned status_flags, char *result, unsigned maxlen)
+{
+	struct switch_ctx *sctx = ti->private;
+	unsigned sz = 0;
+	int path_nr;
+
+	switch (type) {
+	case STATUSTYPE_INFO:
+		result[0] = '\0';
+		break;
+
+	case STATUSTYPE_TABLE:
+		DMEMIT("%u %u 0", sctx->nr_paths, sctx->region_size);
+		for (path_nr = 0; path_nr < sctx->nr_paths; path_nr++)
+			DMEMIT(" %s %llu", sctx->path_list[path_nr].dmdev->name,
+			       (unsigned long long)sctx->path_list[path_nr].start);
+		break;
+	}
+}
+
+/*
+ * Switch ioctl:
+ *
+ * Passthrough all ioctls to the path for sector 0
+ */
+static int switch_ioctl(struct dm_target *ti, unsigned cmd,
+			unsigned long arg)
+{
+	struct switch_ctx *sctx = ti->private;
+	struct block_device *bdev;
+	fmode_t mode;
+	unsigned path_nr;
+	int r = 0;
+
+	path_nr = switch_get_path_nr(sctx, 0);
+
+	bdev = sctx->path_list[path_nr].dmdev->bdev;
+	mode = sctx->path_list[path_nr].dmdev->mode;
+
+	/*
+	 * Only pass ioctls through if the device sizes match exactly.
+	 */
+	if (ti->len + sctx->path_list[path_nr].start != i_size_read(bdev->bd_inode) >> SECTOR_SHIFT)
+		r = scsi_verify_blk_ioctl(NULL, cmd);
+
+	return r ? : __blkdev_driver_ioctl(bdev, mode, cmd, arg);
+}
+
+static int switch_iterate_devices(struct dm_target *ti,
+				  iterate_devices_callout_fn fn, void *data)
+{
+	struct switch_ctx *sctx = ti->private;
+	int path_nr;
+	int r;
+
+	for (path_nr = 0; path_nr < sctx->nr_paths; path_nr++) {
+		r = fn(ti, sctx->path_list[path_nr].dmdev,
+			 sctx->path_list[path_nr].start, ti->len, data);
+		if (r)
+			return r;
+	}
+
+	return 0;
+}
+
+static struct target_type switch_target = {
+	.name = "switch",
+	.version = {1, 0, 0},
+	.module = THIS_MODULE,
+	.ctr = switch_ctr,
+	.dtr = switch_dtr,
+	.map = switch_map,
+	.message = switch_message,
+	.status = switch_status,
+	.ioctl = switch_ioctl,
+	.iterate_devices = switch_iterate_devices,
+};
+
+static int __init dm_switch_init(void)
+{
+	int r;
+
+	r = dm_register_target(&switch_target);
+	if (r < 0)
+		DMERR("dm_register_target() failed %d", r);
+
+	return r;
+}
+
+static void __exit dm_switch_exit(void)
+{
+	dm_unregister_target(&switch_target);
+}
+
+module_init(dm_switch_init);
+module_exit(dm_switch_exit);
+
+MODULE_DESCRIPTION(DM_NAME " fixed-size address-region-mapping throughput-oriented path selector");
+MODULE_AUTHOR("Kevin D. O'Kelley <Kevin_OKelley at dell.com>");
+MODULE_AUTHOR("Narendran Ganapathy <Narendran_Ganapathy at dell.com>");
+MODULE_AUTHOR("Jim Ramsay <Jim_Ramsay at dell.com>");
+MODULE_AUTHOR("Mikulas Patocka <mpatocka at redhat.com>");
+MODULE_LICENSE("GPL");

