<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
<HTML>
<HEAD>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1">
<META NAME="Generator" CONTENT="MS Exchange Server version 6.5.7654.12">
<TITLE>RFC: dm-switch target</TITLE>
</HEAD>
<BODY>
<P><FONT SIZE=2>We propose a new DM target, dm-switch, which can be used to efficiently<BR>
implement a mapping of IOs to underlying block devices in scenarios where there<BR>
are: (1) a large number of address regions, (2) a fixed size of these address<BR>
regions, (3) no pattern that allows for a compact description with something<BR>
like the dm-stripe target.<BR>
<BR>
Motivation:<BR>
<BR>
Dell EqualLogic and some other iSCSI storage arrays use a distributed frameless<BR>
architecture. In this architecture, the storage group consists of a number of<BR>
distinct storage arrays ("members"), each having independent controllers, disk<BR>
storage and network adapters. When a LUN is created it is spread across<BR>
multiple members. The details of the spreading are hidden from initiators<BR>
connected to this storage system. The storage group exposes a single target<BR>
discovery portal, no matter how many members are being used. When iSCSI<BR>
sessions are created, each session is connected to an eth port on a single<BR>
member. Data to a LUN can be sent on any iSCSI session, and if the blocks being<BR>
accessed are stored on another member the IO will be forwarded as required.<BR>
This forwarding is invisible to the initiator. The storage layout is also<BR>
dynamic, and the blocks stored on disk may be moved from member to member as<BR>
needed to balance the load.<BR>
<BR>
This architecture simplifies the management and configuration of both the<BR>
storage group and initiators. In a multipathing configuration, it is possible<BR>
to set up multiple iSCSI sessions to use multiple network interfaces on both the<BR>
host and target to take advantage of the increased network bandwidth. An<BR>
initiator can use a simple round robin algorithm to send IO on all paths and let<BR>
the storage array members forward it as necessary. However, there is a<BR>
performance advantage to sending data directly to the correct member. The<BR>
Device Mapper table architecture supports designating different address regions<BR>
with different targets. However, in our architecture the LUN is spread with a<BR>
chunk size on the order of tens of megabytes, so the resulting DM table could<BR>
have more than a million entries, which consumes too much memory.<BR>
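<BR>
As a rough illustration of the scale involved, the sketch below counts the<BR>
fixed-size regions needed to cover a LUN. The 16 TiB LUN and 15 MiB chunk size<BR>
used in the note that follows are assumed values chosen only for illustration,<BR>
and region_count() is a hypothetical helper that is not part of the proposed target.<BR>
<BR>
```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: number of fixed-size regions needed to cover a LUN.
 * Rounds up so that a partial final region is still counted.
 */
uint64_t region_count(uint64_t lun_bytes, uint64_t chunk_bytes)
{
	assert(chunk_bytes != 0);
	return lun_bytes / chunk_bytes + (lun_bytes % chunk_bytes != 0);
}
```
<BR>
Under those assumptions, region_count() for a 16 TiB LUN with 15 MiB chunks is<BR>
1118482, so a DM table with one entry per region would exceed a million entries.<BR>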
<BR>
Solution:<BR>
<BR>
Based on earlier discussion with the dm-devel contributors, we have solved this<BR>
problem by using Device Mapper to build a two-layer device hierarchy:<BR>
<BR>
Upper Tier: Determine which array member the IO should be sent to.<BR>
Lower Tier: Load balance amongst paths to a particular member.<BR>
<BR>
The lower tier consists of a single multipath device for each member. Each of<BR>
these multipath devices contains the set of paths directly to the array member<BR>
in one priority group, and leverages existing path selectors to load balance<BR>
amongst these paths. We also build a non-preferred priority group containing<BR>
paths to other array members for failover reasons.<BR>
<BR>
The upper tier consists of a single switch device, using the new DM target<BR>
module proposed here. This device uses a bitmap to look up the location of the<BR>
IO and choose the appropriate lower tier device to route the IO. By using a<BR>
bitmap we are able to use 4 bits for each address range in a 16 member group<BR>
(which is very large for us). This is a much denser representation than the DM<BR>
table B-tree can achieve.<BR>
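<BR>
The packed lookup described above can be sketched in user-space C as follows.<BR>
This is an illustration only, not the module code: struct packed_table and the<BR>
packed_table_*() and member_for_sector() names are hypothetical. The packing<BR>
order (lowest-order bits first within each 32-bit word) is chosen to match the<BR>
shift-and-mask lookup in switch_map() in the code that follows.<BR>
<BR>
```c
#include <assert.h>
#include <stdint.h>

/* Illustrative packed table: each 32-bit word holds 32 / pte_bits entries. */
struct packed_table {
	uint32_t pte_bits;   /* bits per entry, e.g. 4 for up to 16 members */
	uint32_t pte_fields; /* entries per 32-bit word */
	uint32_t pte_mask;   /* mask selecting one entry */
	uint32_t *words;     /* caller-provided, zero-initialized storage */
};

void packed_table_init(struct packed_table *t, uint32_t pte_bits,
		       uint32_t *words)
{
	assert(pte_bits >= 1 && pte_bits <= 32);
	t->pte_bits = pte_bits;
	t->pte_fields = 32 / pte_bits;
	/* Avoid an undefined 32-bit shift when an entry fills the whole word. */
	t->pte_mask = (pte_bits == 32) ? UINT32_MAX : ((1u << pte_bits) - 1);
	t->words = words;
}

/* Store the member index for one region. */
void packed_table_set(struct packed_table *t, uint64_t region, uint32_t member)
{
	uint64_t word = region / t->pte_fields;
	uint32_t shift = (uint32_t)(region % t->pte_fields) * t->pte_bits;

	t->words[word] &= ~(t->pte_mask << shift);
	t->words[word] |= (member & t->pte_mask) << shift;
}

/* Fetch the member index for one region. */
uint32_t packed_table_get(const struct packed_table *t, uint64_t region)
{
	uint64_t word = region / t->pte_fields;
	uint32_t shift = (uint32_t)(region % t->pte_fields) * t->pte_bits;

	return (t->words[word] >> shift) & t->pte_mask;
}

/* Map a sector offset to a member index; region_sectors is the region size. */
uint32_t member_for_sector(const struct packed_table *t, uint64_t sector,
			   uint32_t region_sectors)
{
	assert(region_sectors != 0);
	return packed_table_get(t, sector / region_sectors);
}
```
<BR>
With 4-bit entries, eight regions fit in each 32-bit word, so one million<BR>
regions occupy 125,000 words, or 500,000 bytes.<BR>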
<BR>
Though we have developed this target for a specific storage device, we have made<BR>
an effort to keep it as general purpose as possible in hopes that others may<BR>
benefit. We welcome any feedback on the design or implementation.<BR>
<BR>
/*<BR>
********************************************************************************<BR>
*<BR>
* Copyright (c) 2010 by Dell, Inc.<BR>
*<BR>
* All rights reserved. This software may not be copied, disclosed,<BR>
* transferred, or used except in accordance with a license granted<BR>
* by Dell, Inc. This software embodies proprietary information<BR>
* and trade secrets of Dell, Inc.<BR>
*<BR>
* Description:<BR>
*<BR>
* file: dm-switch.h<BR>
* authors: Kevin_OKelley@dell.com and Narendran_Ganapathy@dell.com<BR>
*<BR>
* This file contains the definitions for the "switch" target - particularly<BR>
* the netlink messages.<BR>
*<BR>
********************************************************************************<BR>
*/<BR>
/*<BR>
* Copyright (C) 2001-2003 Sistina Software (UK) Limited.<BR>
* Copyright (C) 2004-2008 Red Hat Inc. All rights reserved.<BR>
*<BR>
* This file is released under the GPL.<BR>
*/<BR>
<BR>
#ifndef __DM_SWITCH_H<BR>
#define __DM_SWITCH_H<BR>
<BR>
#define MAX_IPC_MSG_LEN 65480 // dictated by netlink socket<BR>
#define MAX_ERR_STR_LEN 255 // maximum length of the error string<BR>
<BR>
typedef enum Opcode_Enum<BR>
{<BR>
OPCODE_PAGE_TABLE_UPLOAD = 1,<BR>
}<BR>
Opcode_t;<BR>
<BR>
/*<BR>
* IPC Page Table message<BR>
*/<BR>
typedef struct IpcPgTable_Struct<BR>
{<BR>
uint32_t total_len; // Total length of this IPC message<BR>
Opcode_t opcode;<BR>
uint32_t userland[2]; // Userland optional data (dmsetup status)<BR>
uint32_t dev_major; // DM device major<BR>
uint32_t dev_minor; // DM device minor<BR>
uint32_t page_total; // Total pages in the volume<BR>
uint32_t page_offset; // Starting page offset for this IPC<BR>
uint32_t page_count; // Number of page table entries in this IPC<BR>
uint32_t page_size; // Page size in 512B sectors<BR>
uint16_t dev_count; // Number of devices<BR>
uint8_t pte_bits; // Page Table Entry field size in bits<BR>
uint8_t reserved; // Integer alignment<BR>
uint8_t ptbl_buff[1]; // Page table entries (variable length)<BR>
}<BR>
IpcPgTable_t;<BR>
<BR>
/*<BR>
* IPC Response message<BR>
*/<BR>
typedef struct IpcResponse_Struct<BR>
{<BR>
uint32_t total_len; // total length of the IPC<BR>
Opcode_t opcode;<BR>
uint32_t userland[2]; // Userland optional data<BR>
uint32_t dev_major; // DM device major<BR>
uint32_t dev_minor; // DM device minor<BR>
uint32_t status;<BR>
char err_str[MAX_ERR_STR_LEN+1];<BR>
}<BR>
IpcResponse_t;<BR>
<BR>
// Generic Netlink family attributes: used to define the family<BR>
enum<BR>
{<BR>
NETLINK_ATTR_UNSPEC,<BR>
NETLINK_ATTR_MSG,<BR>
NETLINK_ATTR__MAX,<BR>
};<BR>
#define NETLINK_ATTR_MAX (NETLINK_ATTR__MAX - 1)<BR>
<BR>
// Netlink commands (operations)<BR>
enum<BR>
{<BR>
NETLINK_CMD_UNSPEC,<BR>
NETLINK_CMD_GET_PAGE_TBL,<BR>
NETLINK_CMD__MAX,<BR>
};<BR>
#define NETLINK_CMD_MAX (NETLINK_CMD__MAX - 1)<BR>
<BR>
#endif /* __DM_SWITCH_H */<BR>
<BR>
/*<BR>
********************************************************************************<BR>
*<BR>
* Copyright (c) 2010-2011 by Dell, Inc.<BR>
*<BR>
* All rights reserved. This software may not be copied, disclosed,<BR>
* transferred, or used except in accordance with a license granted<BR>
* by Dell, Inc. This software embodies proprietary information<BR>
* and trade secrets of Dell, Inc.<BR>
*<BR>
* Description:<BR>
*<BR>
* file: dm-switch.c<BR>
* authors: Kevin_OKelley@dell.com and Narendran_Ganapathy@dell.com<BR>
*<BR>
* This file contains all the functions to create a "switch" target to<BR>
* separate the MPIO to the preferred block mode devices.<BR>
*<BR>
********************************************************************************<BR>
*/<BR>
/*<BR>
* Copyright (C) 2001-2003 Sistina Software (UK) Limited.<BR>
* Copyright (C) 2004-2008 Red Hat Inc. All rights reserved.<BR>
*<BR>
* This file is released under the GPL.<BR>
*/<BR>
<BR>
#include <linux/module.h><BR>
#include <linux/init.h><BR>
#include <linux/blkdev.h><BR>
#include <linux/bio.h><BR>
#include <linux/slab.h><BR>
#include <linux/device.h><BR>
#include <linux/version.h><BR>
#include <linux/dm-ioctl.h><BR>
#include <linux/device-mapper.h><BR>
#include <net/genetlink.h><BR>
#include <asm/div64.h><BR>
<BR>
#include "dm-switch.h"<BR>
<BR>
#define DM_MSG_PREFIX "switch"<BR>
MODULE_DESCRIPTION(DM_NAME " switch target for routing I/O by fixed-size region");<BR>
MODULE_AUTHOR("Kevin D. O'Kelley <Kevin_OKelley@dell.com>");<BR>
MODULE_LICENSE("GPL");<BR>
<BR>
/*<BR>
* Switch context block: A new one is created for each dm device. Contains an array of devices<BR>
* on which we have taken references.<BR>
*/<BR>
struct switch_dev {<BR>
struct dm_dev *dmdev;<BR>
sector_t start;<BR>
atomic_t error_count;<BR>
};<BR>
<BR>
struct switch_ptbl {<BR>
uint32_t pte_bits; // Page Table Entry field size in bits<BR>
uint32_t pte_mask; // Page Table Entry field mask<BR>
uint32_t pte_fields; // Number of Page Table Entries per uint32_t<BR>
uint32_t ptbl_bytes; // Page table size in bytes<BR>
uint32_t ptbl_num; // Page table size in entries<BR>
uint32_t ptbl_max; // Page table maximum size in entries;<BR>
uint32_t ptbl_buff[0]; // Address of page table<BR>
};<BR>
<BR>
struct switch_ctx {<BR>
struct list_head list;<BR>
dev_t dev_this; // Device serviced by this target<BR>
uint32_t dev_count; // Number of devices<BR>
uint32_t page_size; // Page size in 512B sectors<BR>
uint32_t userland[2]; // Userland optional data (dmsetup status)<BR>
uint64_t ios_remapped, ios_unmapped; // I/Os remapped, I/Os not remapped<BR>
spinlock_t spinlock; // Control access to counters<BR>
<BR>
struct switch_ptbl *ptbl; // Page table (if loaded)<BR>
struct switch_dev dev_list[0]; // Array of dm devices to switch between<BR>
};<BR>
<BR>
/*<BR>
* Global variables<BR>
*/<BR>
LIST_HEAD(__g_context_list); // Linked list of context blocks<BR>
static spinlock_t __g_spinlock; // Control access to list of context blocks<BR>
<BR>
static int switch_ctr_limits(struct dm_target *ti, struct dm_dev *dm)<BR>
{<BR>
struct block_device *sd = dm->bdev;<BR>
struct hd_struct *hd = sd->bd_part;<BR>
<BR>
if (hd != NULL) {<BR>
if (ti->len <= hd->nr_sects)<BR>
return true;<BR>
ti->error = "Device too small for target";<BR>
return false;<BR>
}<BR>
<BR>
ti->error = "Missing device limits";<BR>
printk("%s %s\n", __FUNCTION__, ti->error);<BR>
return true;<BR>
}<BR>
<BR>
/*<BR>
* Constructor: Called each time a dmsetup command creates a dm device. The target parameter will already<BR>
* have the table, type, begin and len fields filled in. Arguments are <dev_count> <page_size>, followed by<BR>
* one <dev_path> <offset> pair per device. Because the constructor can be called multiple times, we keep a<BR>
* list of switch_ctx blocks so that the page table information gets matched to the correct device.<BR>
*/<BR>
static int switch_ctr(struct dm_target *ti, unsigned int argc, char **argv)<BR>
{<BR>
int n;<BR>
unsigned int dev_count;<BR>
unsigned long flags, major, minor;<BR>
unsigned long long start;<BR>
struct switch_ctx *pctx;<BR>
struct mapped_device *md = NULL;<BR>
struct dm_dev *dm;<BR>
char *chp;<BR>
<BR>
if (argc < 4) {<BR>
ti->error = "Insufficient arguments";<BR>
return -EINVAL;<BR>
}<BR>
dev_count = simple_strtoul(argv[0], &chp, 10);<BR>
if (*chp) {<BR>
ti->error = "Invalid device count";<BR>
return -EINVAL;<BR>
}<BR>
if ((((argc - 2) % 2) != 0) || (dev_count != (argc - 2) / 2)) {<BR>
ti->error = "Invalid argument count";<BR>
return -EINVAL;<BR>
}<BR>
pctx = kmalloc(sizeof(*pctx) + (dev_count * sizeof(struct switch_dev)), GFP_KERNEL);<BR>
if (pctx == NULL) {<BR>
ti->error = "Cannot allocate redirect context";<BR>
return -ENOMEM;<BR>
}<BR>
pctx->dev_count = dev_count;<BR>
pctx->page_size = simple_strtoul(argv[1], &chp, 10);<BR>
if ((*chp) || (pctx->page_size == 0)) {<BR>
ti->error = "Invalid page size";<BR>
goto failed_kfree;<BR>
}<BR>
pctx->ptbl = NULL;<BR>
pctx->userland[0] = pctx->userland[1] = 0;<BR>
pctx->ios_remapped = pctx->ios_unmapped =0;<BR>
spin_lock_init(&pctx->spinlock);<BR>
<BR>
/*<BR>
* Find the device major and minor for the device that is being served by this target.<BR>
*/<BR>
md = dm_table_get_md(ti->table);<BR>
if (md == NULL) {<BR>
ti->error = "Cannot locate dm device";<BR>
goto failed_kfree;<BR>
}<BR>
chp = (char *) dm_device_name(md);<BR>
if (chp == NULL) {<BR>
ti->error = "Cannot acquire dm device name";<BR>
goto failed_kfree;<BR>
}<BR>
major = simple_strtoul(chp, &chp, 10);<BR>
if (*chp++ != ':') {<BR>
ti->error = "Invalid dm device name (major)";<BR>
goto failed_kfree;<BR>
}<BR>
minor = simple_strtoul(chp, &chp, 10);<BR>
if (*chp) {<BR>
ti->error = "Invalid dm device name (minor)";<BR>
goto failed_kfree;<BR>
}<BR>
pctx->dev_this = MKDEV(major, minor);<BR>
<BR>
/*<BR>
* Check each device beneath the target to ensure that the limits are consistent.<BR>
*/<BR>
for (n = 0, argc = 2; n < pctx->dev_count; n++, argc += 2) {<BR>
if (sscanf(argv[argc + 1], "%llu", &start) != 1) {<BR>
ti->error = "Invalid device starting offset";<BR>
goto failed_dev_list_prev;<BR>
}<BR>
if (dm_get_device(ti, argv[argc], dm_table_get_mode(ti->table), &dm)) {<BR>
ti->error = "Device lookup failed";<BR>
goto failed_dev_list_prev;<BR>
}<BR>
pctx->dev_list[n].dmdev = dm;<BR>
pctx->dev_list[n].start = start;<BR>
atomic_set(&(pctx->dev_list[n].error_count), 0);<BR>
if (!switch_ctr_limits(ti, dm))<BR>
goto failed_dev_list_all;<BR>
}<BR>
<BR>
spin_lock_irqsave(&__g_spinlock, flags);<BR>
list_add_tail(&pctx->list, &__g_context_list);<BR>
spin_unlock_irqrestore(&__g_spinlock, flags);<BR>
ti->private = pctx;<BR>
return 0;<BR>
<BR>
failed_dev_list_prev: // De-reference previous devices<BR>
n--; // (i.e. don't include this one)<BR>
failed_dev_list_all: // De-reference all devices<BR>
printk("%s device=%s, start=%s\n", __FUNCTION__, argv[argc], argv[argc + 1]);<BR>
for (; n >= 0; n--) {<BR>
dm_put_device(ti, pctx->dev_list[n].dmdev);<BR>
}<BR>
<BR>
failed_kfree:<BR>
printk(KERN_WARNING "%s %s\n", __FUNCTION__, ti->error);<BR>
kfree(pctx);<BR>
return -EINVAL;<BR>
}<BR>
<BR>
/*<BR>
* Destructor: Don't free the dm_target, just the ti->private data (if any).<BR>
*/<BR>
static void switch_dtr(struct dm_target *ti)<BR>
{<BR>
int n;<BR>
unsigned long flags;<BR>
struct switch_ctx *pctx = (struct switch_ctx *) ti->private;<BR>
void *ptbl;<BR>
<BR>
spin_lock_irqsave(&__g_spinlock, flags);<BR>
ptbl = pctx->ptbl;<BR>
rcu_assign_pointer(pctx->ptbl, NULL);<BR>
list_del(&pctx->list);<BR>
spin_unlock_irqrestore(&__g_spinlock, flags);<BR>
for (n = 0; n < pctx->dev_count; n++) {<BR>
dm_put_device(ti, pctx->dev_list[n].dmdev);<BR>
}<BR>
synchronize_rcu();<BR>
if (ptbl)<BR>
kfree(ptbl);<BR>
kfree(pctx);<BR>
}<BR>
<BR>
/*<BR>
* NOTE: If CONFIG_LBD is disabled, sector_t types are uint32_t. Therefore, in this routine, we<BR>
* convert the offset into a uint64_t instead of a sector_t so that all of the remaining arithmetic<BR>
* is correct, including the do_div() calls.<BR>
*/<BR>
static int switch_map(struct dm_target *ti, struct bio *bio,<BR>
union map_info *map_context)<BR>
{<BR>
struct switch_ctx *pctx = (struct switch_ctx *) ti->private;<BR>
struct switch_ptbl *ptbl;<BR>
unsigned long flags;<BR>
uint64_t itbl, offset = bio->bi_sector - ti->begin;<BR>
uint32_t idev = 0, irem;<BR>
uint64_t *pinc = &pctx->ios_unmapped;<BR>
<BR>
rcu_read_lock();<BR>
ptbl = rcu_dereference(pctx->ptbl);<BR>
if (ptbl != NULL)<BR>
{<BR>
itbl = offset;<BR>
do_div(itbl, pctx->page_size);<BR>
if (itbl < ptbl->ptbl_num) {<BR>
irem = do_div(itbl, ptbl->pte_fields);<BR>
idev = (ptbl->ptbl_buff[itbl] >> (irem * ptbl->pte_bits))<BR>
& ptbl->pte_mask;<BR>
if (idev < pctx->dev_count) {<BR>
pinc = &pctx->ios_remapped;<BR>
}<BR>
else {<BR>
printk(KERN_WARNING "%s dev=%u, offset=%llu\n", __FUNCTION__, idev, (unsigned long long) offset);<BR>
idev = 0;<BR>
}<BR>
}<BR>
else {<BR>
printk(KERN_WARNING "%s Page Table Entry %llu >= %u\n", __FUNCTION__,<BR>
(unsigned long long) itbl, ptbl->ptbl_num);<BR>
}<BR>
}<BR>
rcu_read_unlock();<BR>
spin_lock_irqsave(&pctx->spinlock, flags);<BR>
(*pinc)++;<BR>
spin_unlock_irqrestore(&pctx->spinlock, flags);<BR>
bio->bi_bdev = pctx->dev_list[idev].dmdev->bdev;<BR>
bio->bi_sector = pctx->dev_list[idev].start + offset;<BR>
return DM_MAPIO_REMAPPED;<BR>
}<BR>
<BR>
/*<BR>
* Switch status:<BR>
*<BR>
* INFO: #dev_count device [device] 5 'A'['A' ...] userland[0] userland[1] #remapped #unmapped<BR>
* where:<BR>
* "'A'['A']" is a single word with an 'A' (active) or 'D' for each device<BR>
* The userland values are set by the last userland message to load the page table<BR>
* "#remapped" is the number of remapped I/Os<BR>
* "#unmapped" is the number of I/Os that could not be remapped<BR>
*<BR>
* TABLE: #dev_count #page_size device start [device start ...]<BR>
*/<BR>
static int switch_status(struct dm_target *ti, status_type_t type, char *result,<BR>
unsigned int maxlen)<BR>
{<BR>
struct switch_ctx *pctx = (struct switch_ctx *) ti->private;<BR>
char buffer[pctx->dev_count + 1];<BR>
unsigned int sz = 0;<BR>
int n;<BR>
uint64_t remapped, unmapped;<BR>
unsigned long flags;<BR>
<BR>
result[0] = '\0';<BR>
switch (type) {<BR>
case STATUSTYPE_INFO:<BR>
DMEMIT("%d", pctx->dev_count);<BR>
for (n = 0; n < pctx->dev_count; n++) {<BR>
DMEMIT(" %s", pctx->dev_list[n].dmdev->name);<BR>
buffer[n] = 'A';<BR>
}<BR>
buffer[n] = '\0';<BR>
spin_lock_irqsave(&pctx->spinlock, flags);<BR>
remapped = pctx->ios_remapped;<BR>
unmapped = pctx->ios_unmapped;<BR>
spin_unlock_irqrestore(&pctx->spinlock, flags);<BR>
DMEMIT(" 5 %s %08x %08x %llu %llu", buffer, pctx->userland[0], pctx->userland[1],<BR>
(unsigned long long) remapped, (unsigned long long) unmapped);<BR>
break;<BR>
<BR>
case STATUSTYPE_TABLE:<BR>
DMEMIT("%d %d", pctx->dev_count, pctx->page_size);<BR>
for (n = 0; n < pctx->dev_count; n++) {<BR>
DMEMIT(" %s %llu", pctx->dev_list[n].dmdev->name,<BR>
(unsigned long long) pctx->dev_list[n].start);<BR>
}<BR>
break;<BR>
<BR>
default:<BR>
return 0;<BR>
}<BR>
return 0;<BR>
}<BR>
<BR>
/*<BR>
* Switch ioctl:<BR>
*<BR>
* Passthrough all ioctls to the first path.<BR>
*/<BR>
static int switch_ioctl(struct dm_target *ti, unsigned int cmd,<BR>
unsigned long arg)<BR>
{<BR>
struct switch_ctx *pctx = (struct switch_ctx *) ti->private;<BR>
struct block_device *bdev;<BR>
fmode_t mode = 0;<BR>
<BR>
/* Sanity check */<BR>
if (unlikely(!pctx || !pctx->dev_list[0].dmdev ||<BR>
!pctx->dev_list[0].dmdev->bdev))<BR>
return -EIO;<BR>
<BR>
bdev = pctx->dev_list[0].dmdev->bdev;<BR>
mode = pctx->dev_list[0].dmdev->mode;<BR>
return __blkdev_driver_ioctl(bdev, mode, cmd, arg);<BR>
}<BR>
<BR>
static struct target_type __g_switch_target = {<BR>
.name = "switch",<BR>
.version= {1, 0, 0},<BR>
.module = THIS_MODULE,<BR>
.ctr = switch_ctr,<BR>
.dtr = switch_dtr,<BR>
.map = switch_map,<BR>
.status = switch_status,<BR>
.ioctl = switch_ioctl,<BR>
};<BR>
<BR>
// Generic Netlink attribute policy (single attribute, NETLINK_ATTR_MSG)<BR>
static struct nla_policy __g_attr_policy[NETLINK_ATTR_MAX + 1] =<BR>
{<BR>
[NETLINK_ATTR_MSG] = { .type = NLA_BINARY, .len = MAX_IPC_MSG_LEN },<BR>
};<BR>
<BR>
// Define the Generic Netlink family<BR>
static struct genl_family __g_family =<BR>
{<BR>
.id = GENL_ID_GENERATE, // Assign channel when family is registered<BR>
.hdrsize = 0,<BR>
.name = "DM_SWITCH",<BR>
.version = 1,<BR>
.maxattr = NETLINK_ATTR_MAX,<BR>
};<BR>
<BR>
/*<BR>
* Generic Netlink socket read function that handles communication from the userland<BR>
* for downloading the page table.<BR>
*/<BR>
static int get_page_tbl(struct sk_buff *skb_2, struct genl_info *info)<BR>
{<BR>
uint32_t rc, pte_mask, pte_fields, ptbl_bytes, offset, size;<BR>
uint32_t status = 0;<BR>
unsigned long flags;<BR>
char *mydata;<BR>
void *msg_head;<BR>
struct nlattr *na;<BR>
struct sk_buff *skb;<BR>
struct switch_ctx *pctx, *next;<BR>
struct switch_ptbl *ptbl, *pnew;<BR>
IpcPgTable_t *pgp;<BR>
IpcResponse_t resp;<BR>
dev_t dev;<BR>
static const char *invmsg = "Invalid Page Table message";<BR>
<BR>
/*<BR>
* For each attribute there is an index in info->attrs which points to a nlattr structure<BR>
* in this structure the data is given<BR>
*/<BR>
if (info == NULL) {<BR>
printk(KERN_ERR "%s missing genl_info parameter\n", __FUNCTION__);<BR>
return 0;<BR>
} <BR>
na = info->attrs[NETLINK_ATTR_MSG];<BR>
if (na == NULL) {<BR>
printk(KERN_ERR "%s no info->attrs %i\n", __FUNCTION__, NETLINK_ATTR_MSG);<BR>
return 0;<BR>
}<BR>
mydata = (char *) nla_data(na);<BR>
if (mydata == NULL) {<BR>
printk(KERN_ERR "%s error while receiving data\n", __FUNCTION__);<BR>
return 0;<BR>
}<BR>
<BR>
/*<BR>
* Format the reply message. Return positive error codes to userland.<BR>
*/<BR>
skb = nlmsg_new(NLMSG_GOODSIZE, GFP_KERNEL);<BR>
if (skb == NULL) {<BR>
printk(KERN_ERR "%s cannot allocate reply message\n", __FUNCTION__);<BR>
return 0;<BR>
}<BR>
msg_head = genlmsg_put(skb, 0, info->snd_seq, &__g_family, 0, NETLINK_CMD_GET_PAGE_TBL);<BR>
if (msg_head == NULL) {<BR>
printk(KERN_ERR "%s cannot format reply message header\n", __FUNCTION__);<BR>
nlmsg_free(skb);<BR>
return 0;<BR>
}<BR>
pgp = (IpcPgTable_t *) mydata;<BR>
if (na->nla_len < sizeof(IpcPgTable_t)) {<BR>
snprintf(resp.err_str, sizeof(resp.err_str), "%s: too short (%d)", invmsg, na->nla_len);<BR>
status = EINVAL;<BR>
goto failed_respond;<BR>
}<BR>
if ((pgp->page_offset + pgp->page_count) > pgp->page_total) {<BR>
snprintf(resp.err_str, sizeof(resp.err_str), "%s: too many page table entries (%d > %d)",<BR>
invmsg, (pgp->page_offset + pgp->page_count), pgp->page_total);<BR>
status = EINVAL;<BR>
goto failed_respond;<BR>
}<BR>
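if ((pgp->pte_bits == 0) || (pgp->pte_bits > 31)) { // Guards the shift and the division below<BR>
snprintf(resp.err_str, sizeof(resp.err_str), "%s: invalid entry size (%d bits)", invmsg, pgp->pte_bits);<BR>
status = EINVAL;<BR>
goto failed_respond;<BR>
}<BR>
pte_mask = (1U << pgp->pte_bits) - 1;<BR>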
if (((pgp->dev_count - 1) & (~pte_mask)) != 0) {<BR>
snprintf(resp.err_str, sizeof(resp.err_str), "%s: invalid mask 0x%x for %d devices",<BR>
invmsg, pte_mask, pgp->dev_count);<BR>
status = EINVAL;<BR>
goto failed_respond;<BR>
}<BR>
pte_fields = 32 / pgp->pte_bits;<BR>
size = ((pgp->page_count + pte_fields - 1) / pte_fields) * sizeof(uint32_t);<BR>
if ((sizeof(*pgp) - 1 + size) > na->nla_len) {<BR>
snprintf(resp.err_str, sizeof(resp.err_str), "Invalid Page Table message: incomplete message");<BR>
status = EINVAL;<BR>
goto failed_respond;<BR>
}<BR>
<BR>
// Look for the corresponding switch context block to create or update the page table.<BR>
rc = 0;<BR>
dev = MKDEV(pgp->dev_major, pgp->dev_minor);<BR>
spin_lock_irqsave(&__g_spinlock, flags);<BR>
list_for_each_entry_safe(pctx, next, &__g_context_list, list) {<BR>
if (dev == pctx->dev_this) {<BR>
rc = 1;<BR>
break;<BR>
}<BR>
}<BR>
if (rc == 0) {<BR>
snprintf(resp.err_str, sizeof(resp.err_str), "%s: invalid target device %d:%d",<BR>
invmsg, pgp->dev_major, pgp->dev_minor);<BR>
status = EINVAL;<BR>
goto failed_unlock;<BR>
}<BR>
<BR>
ptbl = pctx->ptbl;<BR>
if ( ( (ptbl != NULL) && (pgp->page_offset > (ptbl->ptbl_num + 1)) ) ||<BR>
( (ptbl == NULL) && (pgp->page_offset != 0) ) ) {<BR>
snprintf(resp.err_str, sizeof(resp.err_str), "%s: missing entries", invmsg);<BR>
status = EINVAL;<BR>
goto failed_unlock;<BR>
}<BR>
// Don't allow userland to change context parameters unless the page table is being rebuilt.<BR>
if (pgp->page_offset != 0) {<BR>
if ((pgp->dev_count) != pctx->dev_count) {<BR>
snprintf(resp.err_str, sizeof(resp.err_str), "%s: invalid device count %d",<BR>
invmsg, pgp->dev_count);<BR>
status = EINVAL;<BR>
goto failed_unlock; // __g_spinlock is held here and must be released<BR>
}<BR>
if (ptbl != NULL) {<BR>
if (pgp->pte_bits != ptbl->pte_bits) {<BR>
snprintf(resp.err_str, sizeof(resp.err_str), "%s: number of bits changed", invmsg);<BR>
status = EINVAL;<BR>
goto failed_unlock;<BR>
}<BR>
if (pgp->page_total != ptbl->ptbl_max) {<BR>
snprintf(resp.err_str, sizeof(resp.err_str), "%s: total number of entries changed", invmsg);<BR>
status = EINVAL;<BR>
goto failed_unlock;<BR>
}<BR>
}<BR>
}<BR>
<BR>
// Create a Page Table if needed. Most of the time, the size of the table<BR>
// doesn't change. In that case, re-use the existing table.<BR>
ptbl_bytes = ((pgp->page_total + pte_fields - 1) / pte_fields) * sizeof(uint32_t);<BR>
if ((ptbl != NULL) && (ptbl_bytes == ptbl->ptbl_bytes)) {<BR>
pnew = ptbl;<BR>
}<BR>
else {<BR>
pnew = kmalloc((sizeof(*pnew) + ptbl_bytes), GFP_ATOMIC); // Cannot sleep: __g_spinlock is held<BR>
if (pnew == NULL) {<BR>
snprintf(resp.err_str, sizeof(resp.err_str), "Cannot allocate Page Table");<BR>
status = ENOMEM;<BR>
goto failed_unlock;<BR>
}<BR>
pnew->ptbl_bytes = ptbl_bytes;<BR>
}<BR>
pnew->pte_bits = pgp->pte_bits;<BR>
pnew->pte_mask = pte_mask;<BR>
pnew->pte_fields = pte_fields;<BR>
pnew->ptbl_max = pgp->page_total;<BR>
pnew->ptbl_num = pgp->page_offset + pgp->page_count;<BR>
offset = (pgp->page_offset + pte_fields - 1) / pte_fields;<BR>
memcpy(&pnew->ptbl_buff[offset], pgp->ptbl_buff, size);<BR>
pctx->userland[0] = pgp->userland[0];<BR>
pctx->userland[1] = pgp->userland[1];<BR>
<BR>
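if (pnew != ptbl) {<BR>
rcu_assign_pointer(pctx->ptbl, pnew);<BR>
spin_unlock_irqrestore(&__g_spinlock, flags);<BR>
if (ptbl != NULL) {<BR>
synchronize_rcu(); // Wait for switch_map() readers before freeing the old table<BR>
kfree(ptbl);<BR>
}<BR>
goto failed_respond; // Lock already released; send the successful response<BR>
}<BR>
<BR>
failed_unlock:<BR>
spin_unlock_irqrestore(&__g_spinlock, flags);<BR>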
<BR>
failed_respond:<BR>
if (status != 0)<BR>
printk("%s WARNING: %s\n", __FUNCTION__, resp.err_str);<BR>
<BR>
// Format the response message<BR>
resp.total_len = sizeof(IpcResponse_t);<BR>
resp.opcode = OPCODE_PAGE_TABLE_UPLOAD;<BR>
resp.userland[0] = pgp->userland[0];<BR>
resp.userland[1] = pgp->userland[1];<BR>
resp.dev_major = pgp->dev_major;<BR>
resp.dev_minor = pgp->dev_minor;<BR>
resp.status = status;<BR>
rc = nla_put(skb, NLA_BINARY, sizeof(IpcResponse_t), &resp);<BR>
if( rc != 0 ) {<BR>
printk("%s WARNING: Cannot format reply message\n", __FUNCTION__);<BR>
nlmsg_free(skb); // Reply was never sent, so release it here<BR>
return 0;<BR>
}<BR>
genlmsg_end(skb, msg_head);<BR>
rc = genlmsg_unicast(&init_net, skb, info->snd_pid); <BR>
if( rc != 0 ) {<BR>
printk("%s WARNING: Cannot send reply message\n", __FUNCTION__);<BR>
return 0;<BR>
}<BR>
return 0;<BR>
}<BR>
<BR>
// Operation for getting the page table<BR>
static struct genl_ops __g_op_get_page_tbl =<BR>
{<BR>
.cmd = NETLINK_CMD_GET_PAGE_TBL,<BR>
.flags = 0,<BR>
.policy = __g_attr_policy,<BR>
.doit = get_page_tbl,<BR>
.dumpit = NULL,<BR>
};<BR>
<BR>
/*<BR>
* Use the sysfs interface to inform the userland process of the family id to be used<BR>
* by the Generic Netlink socket.<BR>
*/<BR>
static ssize_t sysfs_familyid_show(struct kobject *kobj, struct attribute *attr, char *buff)<BR>
{<BR>
return snprintf(buff, PAGE_SIZE, "%d", __g_family.id);<BR>
}<BR>
<BR>
static ssize_t sysfs_familyid_store(struct kobject *kobj, struct attribute *attr,<BR>
const char *buff, size_t size)<BR>
{<BR>
return size;<BR>
}<BR>
<BR>
static struct {<BR>
struct attribute attr;<BR>
struct sysfs_ops ops;<BR>
}<BR>
__g_sysfs_familyid = {<BR>
{ "familyid", 0644 },<BR>
{ &sysfs_familyid_show, &sysfs_familyid_store },<BR>
};<BR>
<BR>
int __init dm_switch_init(void)<BR>
{<BR>
int r;<BR>
<BR>
spin_lock_init(&__g_spinlock);<BR>
r = dm_register_target(&__g_switch_target);<BR>
if (r) {<BR>
DMERR("dm_register_target() failed %d", r);<BR>
return r;<BR>
}<BR>
<BR>
// Initialize Generic Netlink communications<BR>
r = genl_register_family(&__g_family);<BR>
if (r) {<BR>
DMERR("genl_register_family() failed %d", r);<BR>
goto failed_target;<BR>
}<BR>
r = genl_register_ops(&__g_family, &__g_op_get_page_tbl);<BR>
if (r) {<BR>
DMERR("genl_register_ops(get_page_tbl) failed %d", r);<BR>
goto failed_family;<BR>
}<BR>
r = sysfs_create_file(&__g_switch_target.module->mkobj.kobj, &__g_sysfs_familyid.attr);<BR>
if (r) {<BR>
DMERR("/sys/module/familyid create failed %d", r);<BR>
goto failed_family;<BR>
}<BR>
return 0;<BR>
<BR>
failed_family:<BR>
genl_unregister_family(&__g_family);<BR>
failed_target:<BR>
dm_unregister_target(&__g_switch_target);<BR>
return r;<BR>
}<BR>
<BR>
void dm_switch_exit(void)<BR>
{<BR>
int r;<BR>
<BR>
dm_unregister_target(&__g_switch_target);<BR>
r = genl_unregister_family(&__g_family);<BR>
if (r)<BR>
DMWARN("genl_unregister_family() failed %d", r);<BR>
return;<BR>
}<BR>
<BR>
module_init(dm_switch_init);<BR>
module_exit(dm_switch_exit);<BR>
</FONT>
</P>
</BODY>
</HTML>