On Mon, Nov 19, 2012 at 05:30:11PM +0800, Osier Yang wrote:
> This proposal is trying to figure out a solution for migration
> of domain which uses LUN behind vHBA as disk device (QEMU
> emulated disk only at this stage). And other related NPIV
> improvements which are not related with migration. I'm not
> lucky enough to have an environment to test whether these ideas are
> workable, but I'd like to see if anyone has good ideas/suggestions early.
> 1) Persistent vHBA support
>   This is the useful stuff missed for long time. Assuming
> that one created a vHBA, did masking/zoning, everything works
> as expected. However, after a system rebooting, everything is
> just lost. If the user wants to get things back, he has to
> find out the previous WWNN & WWPN, and create the vHBA again.
>   On the other hand, Persistent vHBA support is actually required
> for domain which uses LUN behind a vHBA. Otherwise the domain
> could fail to start after a system rebooting.
>   To support the persistent vHBA, new APIs like virNodeDeviceDefineXML,
> virNodeDeviceUndefine are required. Also it's useful to introduce
> "autostart" for vHBA, so that the vHBA could be started automatically
> after system rebooting.
>   Proposed APIs:
>   virNodeDevicePtr
>   virNodeDeviceDefineXML(virConnectPtr conn,
>                          const char *xml,
>                          unsigned int flags);
>   int
>   virNodeDeviceUndefine(virConnectPtr conn,
>                         virNodeDevicePtr dev,
>                         unsigned int flags);
>   int
>   virNodeDeviceSetAutostart(virNodeDevicePtr dev,
>                             int autostart,
>                             unsigned int flags);
>   int
>   virNodeDeviceGetAutostart(virNodeDevicePtr dev,
>                             int *autostart,
>                             unsigned int flags);

I don't much like this approach. IMHO, this should
all be done via the virStoragePool APIs instead. Adding
define/undefine/autostart to virNodeDevice is really just
duplicating the storage pool functionality.
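
To make the overlap concrete, here is a minimal Python sketch of how a
vHBA might be described as a storage pool definition. The 'fc_host'
adapter type and the wwnn/wwpn attributes on <adapter> are assumptions
for illustration, not an agreed schema; vhba_pool_xml(), the pool name
and the UUID are hypothetical placeholders.

```python
import xml.etree.ElementTree as ET


def vhba_pool_xml(name, uuid, wwnn, wwpn):
    """Build a SCSI storage pool definition backed by a vHBA.

    Assumption: <adapter> accepts type/wwnn/wwpn attributes.
    """
    pool = ET.Element("pool", {"type": "scsi"})
    ET.SubElement(pool, "name").text = name
    ET.SubElement(pool, "uuid").text = uuid
    source = ET.SubElement(pool, "source")
    # The WWNN/WWPN pair identifies the vHBA the pool would manage.
    ET.SubElement(source, "adapter",
                  {"type": "fc_host", "wwnn": wwnn, "wwpn": wwpn})
    target = ET.SubElement(pool, "target")
    ET.SubElement(target, "path").text = "/dev/disk/by-path"
    return ET.tostring(pool, encoding="unicode")


print(vhba_pool_xml("npiv0",
                    "11111111-2222-3333-4444-555555555555",  # placeholder
                    "2001001b32a9da4e", "2101001b32a90004"))
```

XML of this shape would then go through the existing
virStoragePoolDefineXML, virStoragePoolSetAutostart and
virStoragePoolCreate calls, which already cover the define, autostart
and start steps.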

> 2) Associate vHBA with domain XML
>   There are two ways to attach a LUN to a domain: as an QEMU emulated
> device; or passthrough. Since passthrough a LUN is not supported in
> libvirt yet, let's focus on the emulated LUN at this stage.
>   New attributes "wwnn" and "wwpn" are introduced to indicate the
> LUN behind the vHBA. E.g.
>    <disk type='block' device='disk'>
>      <driver name='qemu' type='raw'/>
>      <source wwnn="2001001b32a9da4e" wwpn="2101001b32a90004"/>

If you change the schema of the <source> element, then you must
also create a new type='XXX' attribute value to identify it, not
just re-use type='block'.

>      <target dev='vda' bus='virtio'/>
>      <address type='pci' domain='0x0000' bus='0x00' slot='0x07'
> function='0x0'/>
>    </disk>
>   Before the domain starting, we have to check if there is LUN
> assigned to the vHBA, error out if not.
>   Using the stable path of LUN also works, e.g.
>   <source dev="/dev/disk/by-path/pci-0000\:00\:07.0-scsi-0\:0\:0\:0"/>
>   But the disadvantage is the user have to figure out the stable
> path himself; And we have to do checking of every stable path to
> see if it's behind a vHBA in migration "Begin" stage. Or an new
> XML tag for element "source" to indicate that it's behind a vHBA?
> such as:
>   <source dev="disk-by-path" model="vport"/>

I don't much like the idea of mapping vHBA to <disk> elements,
because you have a cardinality mismatch. A <disk> is the equivalent
of a single LUN, but a vHBA is something that provides multiple
LUNs.

If you want to directly associate a vHBA with a virtual guest,
then this is really in the realm of SCSI HBA passthrough, not
<disk> devices.

If you want something mapped to the <disk> device, then the
approach should be to map to a storage pool volume - something
we've long talked about as broadly useful for all storage types,
not just NPIV.
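
As a rough sketch of what mapping a <disk> to a storage pool volume
could look like, the Python snippet below builds such an element. The
type='volume' value and the pool/volume attributes on <source> are
assumptions for illustration; disk_from_volume(), the pool name and the
volume name are hypothetical.

```python
import xml.etree.ElementTree as ET


def disk_from_volume(pool, volume, target_dev):
    """Build a <disk> that names a pool volume instead of a host path.

    Assumption: a dedicated type='volume' identifies the new <source>
    schema rather than re-using type='block'.
    """
    disk = ET.Element("disk", {"type": "volume", "device": "disk"})
    ET.SubElement(disk, "driver", {"name": "qemu", "type": "raw"})
    # The guest config refers to pool + volume only; no WWNN/WWPN here.
    ET.SubElement(disk, "source", {"pool": pool, "volume": volume})
    ET.SubElement(disk, "target", {"dev": target_dev, "bus": "virtio"})
    return ET.tostring(disk, encoding="unicode")


print(disk_from_volume("npiv0", "unit:0:0:0", "vda"))
```

Because the guest XML only names the pool and volume, the details of
which vHBA backs the pool stay out of the domain configuration.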

> 3) Migration with vHBA
>   One possible solution for migration with vHBA is to use one pair
> of WWNN & WWPN on source host, one is using for domain, one is
> reserved for migration purpose. It requires the storage admin maps
> the same LUN to the two vHBAs when doing the masking and zoning.
> One of the two vHBA is called "Primary vHBA", another is called
> "secondary vHBA". To maintain the relationship between these two
> vHBAs, we have to introduce new XMLs to vHBA. E.g.
>    In XML of primary vHBA:
>    <secondary wwpn="2101001b32a90004"/>
>    In XML of secondary vHBA:
>    <primary wwpn="2101001b32a90002"/>
> Primary vHBA is going to be guaranteed not used by any domain which
> is driven by libvirt (we do some checking earlier before the domain
> starting). And it's also guaranteed that the LUN can't be used by
> other domain with sVirt or Sanlock. So it's safe to have two vHBAs
> on source host too.
> To prevent one using the LUN by creating vHBA using the same WWNN &
> WWPN on another host, we must create the secondary vHBA on source
> host, even it's not being used.
> Both primary and secondary vHBA must be defined and marked as
> "autostart" so that the domain could be started after system
> rebooting.
> When do migration, we have to bake a bigger cookie with secondary
> vHBA's info (basically it's WWNN and WWPN) in migration "Begin"
> stage, and eat that in migration "Prepare" stage on target host.
> In "Begin" stage, the XMLs represents the secondary vHBA is
> constructed. And the secondary vHBA is destroyed on source host,
> not undefined though.
> In "Prepare" stage, a new vHBA is created (define and start)
> on target host with the same WWNN & WWPN as secondary vHBA on
> source host. The LUN then should be visible to target host
> automatically? and thus migration can be performed. After migration
> is finished on target host, the primary vHBA on source host is
> destroyed, not undefined.
> If migration fails, the new vHBA created on target host will
> be destroyed and undefined. And both primary and secondary
> vHBA on source host will be started, so that the domain could
> be resumed.
> Finally if migration succeeds, primary vHBA on source host
> will be transferred to target host as secondary vHBA (defined).
> And both primary and secondary vHBA on source host will be
> undefined.

If we do the mapping of HBAs to guest domains using storage
pools, then at a guest level, migration requires zero work.

It is simply up to the management app to create the storage
pool on the destination host with the same Name + UUID, but
with the secondary WWNN/WWPN. The nice thing about this is
that you don't need to hardcode details of a secondary
WWNN/WWPN up-front. The management app can just decide on
those at the time it performs the migration, so 99% of the
time there will only need to be a single vHBA set up on the
SAN. During migration the mgmt app can set up a second
vHBA for the target host, and once complete, delete the
original vHBA entirely.
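
The management-app side of this can be sketched in a few lines of
Python. The snippet takes the source host's pool definition and derives
the one to create on the destination: same name and UUID, different
WWNN/WWPN. It assumes a pool XML whose <adapter> carries wwnn/wwpn
attributes (a hypothetical schema); pool_for_destination() is an
illustrative helper, not a libvirt API, and the UUID is a placeholder.

```python
import xml.etree.ElementTree as ET

# Pool as defined on the source host (hypothetical schema).
SOURCE_POOL = """
<pool type='scsi'>
  <name>npiv0</name>
  <uuid>11111111-2222-3333-4444-555555555555</uuid>
  <source>
    <adapter type='fc_host' wwnn='2001001b32a9da4e' wwpn='2101001b32a90002'/>
  </source>
</pool>
"""


def pool_for_destination(source_xml, wwnn, wwpn):
    """Return pool XML for the target host.

    Name and UUID are copied unchanged so the guest config still
    resolves; only the adapter's WWNN/WWPN are swapped.
    """
    root = ET.fromstring(source_xml)
    adapter = root.find("source/adapter")
    if adapter is None:
        raise ValueError("pool has no <source>/<adapter> element")
    adapter.set("wwnn", wwnn)
    adapter.set("wwpn", wwpn)
    return ET.tostring(root, encoding="unicode")


# The second WWNN/WWPN pair is chosen at migration time, not up-front.
print(pool_for_destination(SOURCE_POOL,
                           "2001001b32a9da4e", "2101001b32a90004"))
```

The app would define and start the returned pool on the destination
before migrating, and remove the original vHBA once migration has
completed.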

> 4) Enrich HBA's XML
>   It's hard to know the vHBAs created from a HBA with current
> implementation. One have to dump XML of each (v)HBAs and find
> out the clue with element "parent" of vHBAs. It's good to introduce
> new element for HBA like "vports", so that one can easily know
> what (how many) vHBAs are created from the HBA?
>   And also it's good to have the maximum vports the HBA supports.
>   Except these, other useful information should be exposed too,
> such as the vendor name, the HBA state, PCI address, etc.
>   The new XMLs should be like:
>   <vports num='2' max='64'>
>     <vport name="scsi_host40" wwpn="2101001b32a90004"/>
>     <vport name="scsi_host40" wwpn="2101001b32a90005"/>
>   </vports>
>   <online/>
>   <vendor>QLogic</vendor>
>   <address type="pci" domain="0" bus="0" slot="5" function="0"/>
>   "online", "vendor", "address" make sense to vHBA too.

I'm trying to remember how we modelled the parent/child relationship
for SR-IOV PCI cards. NPIV is a very similar concept, so we should
ideally seek to model the parent/child relationship in the same
way.

