[libvirt] RFC: Migration with NPIV
jyang at redhat.com
Wed Nov 21 04:10:25 UTC 2012
On 2012-11-20 18:17, Daniel P. Berrange wrote:
> On Mon, Nov 19, 2012 at 05:30:11PM +0800, Osier Yang wrote:
>> This proposal tries to work out a solution for migrating a
>> domain that uses a LUN behind a vHBA as a disk device (QEMU
>> emulated disks only at this stage), plus other NPIV
>> improvements that are not related to migration. I haven't been
>> lucky enough to get an environment to test whether these ideas
>> work, but I'd like to hear ideas/suggestions early.
>> 1) Persistent vHBA support
>> This is useful functionality that has been missing for a long
>> time. Assume one created a vHBA, did the masking/zoning, and
>> everything works as expected. After a system reboot, however,
>> everything is lost. If the user wants to get things back, he has
>> to find out the previous WWNN & WWPN and create the vHBA again.
>> On the other hand, persistent vHBA support is actually required
>> for a domain which uses a LUN behind a vHBA. Otherwise the domain
>> could fail to start after a system reboot.
>> To support persistent vHBAs, new APIs like virNodeDeviceDefineXML
>> and virNodeDeviceUndefine are required. It's also useful to
>> introduce "autostart" for vHBAs, so that a vHBA can be started
>> automatically after a system reboot.
>> Proposed APIs:
>> virNodeDeviceDefineXML(virConnectPtr conn,
>>                        const char *xml,
>>                        unsigned int flags);
>> virNodeDeviceUndefine(virConnectPtr conn,
>>                       virNodeDevicePtr dev,
>>                       unsigned int flags);
>> virNodeDeviceSetAutostart(virNodeDevicePtr dev,
>>                           int autostart,
>>                           unsigned int flags);
>> virNodeDeviceGetAutostart(virNodeDevicePtr dev,
>>                           int *autostart,
>>                           unsigned int flags);
> I don't really much like this approach. IMHO, this should
> all be done via the virStoragePool APIs instead. Adding
> define/undefine/autostart to virNodeDevice is really just
> duplicating the storage pool functionality.
Agreed, though it means I have to drop the nearly finished
patches. Actually I'm not comfortable with that approach either,
given the conflicts between the device configuration probed by
udev or HAL and the persistent configuration we would be
trying to support.
So the remaining work is to improve the storage pool's XML so that
the vHBA it refers to is stable, and to manage the lifecycle of
the vHBA along with the pool's lifecycle.
As for making sure the pool is not destroyed while one of its
volumes is in use by a domain, IMO it's time to integrate the
storage pool with the domain? I.e. map storage volumes to domain
disks, and ref/unref the storage volume along with the domain's
lifecycle.
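To make the ref/unref idea concrete, here is a minimal sketch of the bookkeeping it implies. It is not libvirt code: the class, method, pool, volume and domain names are all hypothetical, and it only illustrates refusing to destroy a pool while a domain still references one of its volumes.

```python
class PoolInUseError(Exception):
    """Raised when a pool would be destroyed while a volume is still in use."""


class VolumeRefTracker:
    """Hypothetical bookkeeping: which domains use which pool volumes."""

    def __init__(self):
        # (pool name, volume name) -> set of domain names using that volume
        self._users = {}

    def ref(self, pool, volume, domain):
        """Record that `domain` started using `volume` from `pool`."""
        self._users.setdefault((pool, volume), set()).add(domain)

    def unref(self, pool, volume, domain):
        """Record that `domain` stopped using the volume (e.g. on shutdown)."""
        users = self._users.get((pool, volume))
        if users is None:
            return
        users.discard(domain)
        if not users:
            del self._users[(pool, volume)]

    def check_destroy(self, pool):
        """Raise PoolInUseError if any volume of `pool` is still referenced."""
        busy = sorted(vol for (p, vol) in self._users if p == pool)
        if busy:
            raise PoolInUseError(f"pool {pool!r} has volumes in use: {busy}")


tracker = VolumeRefTracker()
tracker.ref("npiv-pool", "unit:0:0:0", "guest1")      # domain starts
try:
    tracker.check_destroy("npiv-pool")
    destroy_allowed = True
except PoolInUseError:
    destroy_allowed = False                            # refused while in use
tracker.unref("npiv-pool", "unit:0:0:0", "guest1")    # domain stops
tracker.check_destroy("npiv-pool")                     # no longer raises
```

Keying the table by (pool, volume) rather than by pool alone keeps the door open for reporting exactly which volume blocks the destroy.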
>> 2) Associate vHBA with domain XML
>> There are two ways to attach a LUN to a domain: as a QEMU-emulated
>> device, or via passthrough. Since passing through a LUN is not
>> supported in libvirt yet, let's focus on the emulated LUN at this
>> stage. New attributes "wwnn" and "wwpn" are introduced to indicate
>> the LUN behind the vHBA. E.g.
>> <disk type='block' device='disk'>
>> <driver name='qemu' type='raw'/>
>> <source wwnn="2001001b32a9da4e" wwpn="2101001b32a90004"/>
> If you change the schema of the <source> element, then you must
> also create a new type='XXX' attribute to identify it, not just
> re-use type='block'
>> <target dev='vda' bus='virtio'/>
>> <address type='pci' domain='0x0000' bus='0x00' slot='0x07'
>> Before the domain starts, we have to check whether a LUN is
>> assigned to the vHBA, and error out if not.
>> Using the stable path of the LUN also works, e.g.
>> <source dev="/dev/disk/by-path/pci-0000\:00\:07.0-scsi-0\:0\:0\:0"/>
>> But the disadvantage is that the user has to figure out the stable
>> path himself, and we have to check every stable path to see
>> whether it's behind a vHBA in the migration "Begin" stage. Or a new
>> XML attribute on the "source" element to indicate that it's behind
>> a vHBA? Such as:
>> <source dev="disk-by-path" model="vport"/>
> I don't much like the idea of mapping vHBA to <disk> elements,
> because you have a cardinality mis-match. A <disk> is the equivalent
> of a single LUN, but a vHBA is something that provides multiple
> LUNs.
> If you want to directly associate a vHBA with a virtual guest,
> then this is really in the realm of SCSI HBA passthrough, not
> <disk> devices.
Agreed, I missed that multiple LUNs can be mapped to one HBA.
> If you want something mapped to the <disk> device, then the
> approach should be to map to a storage pool volume - something
> we've long talked about as broadly useful for all storage types,
> not just NPIV.
Okay, so we are finally at the point of integrating storage with the domain.
>> 3) Migration with vHBA
>> One possible solution for migration with vHBA is to use a pair
>> of vHBAs, each with its own WWNN & WWPN, on the source host: one
>> is used by the domain, the other is reserved for migration. It
>> requires the storage admin to map the same LUN to both vHBAs when
>> doing the masking and zoning.
>> One of the two vHBAs is called the "primary vHBA", the other the
>> "secondary vHBA". To maintain the relationship between these two
>> vHBAs, we have to introduce new XML for the vHBA. E.g.
>> In XML of primary vHBA:
>> <secondary wwpn="2101001b32a90004"/>
>> In XML of secondary vHBA:
>> <primary wwpn="2101001b32a90002"/>
>> The primary vHBA is guaranteed not to be used by any domain which
>> is driven by libvirt (we do some checking earlier, before the
>> domain starts). And it's also guaranteed that the LUN can't be
>> used by another domain, thanks to sVirt or Sanlock. So it's safe
>> to have two vHBAs on the source host too.
>> To prevent someone from using the LUN by creating a vHBA with the
>> same WWNN & WWPN on another host, we must create the secondary
>> vHBA on the source host, even if it's not being used.
>> Both primary and secondary vHBAs must be defined and marked as
>> "autostart" so that the domain can be started after a system
>> reboot.
>> When migrating, we have to bake a bigger cookie with the secondary
>> vHBA's info (basically its WWNN and WWPN) in the migration "Begin"
>> stage, and eat it in the migration "Prepare" stage on the target
>> host. In the "Begin" stage, the XML representing the secondary
>> vHBA is constructed, and the secondary vHBA is destroyed on the
>> source host, though not undefined.
>> In the "Prepare" stage, a new vHBA is created (defined and started)
>> on the target host with the same WWNN & WWPN as the secondary vHBA
>> on the source host. The LUN should then be visible to the target
>> host automatically? And thus migration can be performed. After
>> migration is finished on the target host, the primary vHBA on the
>> source host is destroyed, not undefined.
>> If migration fails, the new vHBA created on the target host will
>> be destroyed and undefined. And both primary and secondary
>> vHBAs on the source host will be started, so that the domain can
>> be resumed.
>> Finally if migration succeeds, the primary vHBA on the source host
>> will be transferred to the target host as the secondary vHBA (defined).
>> And both primary and secondary vHBA on source host will be
> If we do the mapping of HBAs to guest domains using storage
> pools, then at a guest level, migration requires zero work.
> It is simply up to the management app to create the storage
> pool on the destination host with the same Name + UUID, but
> with the secondary WWNN/WWPN. The nice thing about this, is
> that you don't need to hardcode details of a secondary
> WWNN/WWPN up-front. The management app can just decide on
> those at the time it performs the migration, so 99% of the
> time there will only need to be a single vHBA set up on the
> SAN. During migration the mgmt app can set up a second
> vHBA for the target host, and once complete, delete the
> original vHBA entirely.
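The management-app step described above (same pool Name + UUID on the destination, different WWNN/WWPN) can be sketched roughly as follows. This is only an illustration under assumptions: the pool XML layout, with an fc_host adapter carrying wwnn/wwpn attributes, is not something agreed in this thread, the UUID is made up, and retarget_pool is a hypothetical helper, not a libvirt API.

```python
import xml.etree.ElementTree as ET

# Assumed pool XML shape for the source host; the adapter element layout
# and the UUID are illustrative, not taken from this thread.
SOURCE_POOL_XML = """
<pool type='scsi'>
  <name>npiv-pool</name>
  <uuid>5f2a9c0e-6f0b-4c62-9a53-2f1d8e3b7a10</uuid>
  <source>
    <adapter type='fc_host' wwnn='2001001b32a9da4e' wwpn='2101001b32a90002'/>
  </source>
  <target>
    <path>/dev/disk/by-path</path>
  </target>
</pool>
"""


def retarget_pool(pool_xml, wwnn, wwpn):
    """Return pool XML with the same name and UUID but a different WWNN/WWPN."""
    root = ET.fromstring(pool_xml)
    adapter = root.find("./source/adapter")
    if adapter is None:
        raise ValueError("pool XML has no <source>/<adapter> element")
    adapter.set("wwnn", wwnn)
    adapter.set("wwpn", wwpn)
    return ET.tostring(root, encoding="unicode")


# Pool definition the management app would create on the destination host,
# using the WWPN chosen for the second vHBA at migration time.
dest_pool_xml = retarget_pool(SOURCE_POOL_XML,
                              "2001001b32a9da4e", "2101001b32a90004")
```

Because only the adapter attributes change, anything in the guest configuration that refers to the pool by name or UUID would stay valid on the destination.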
Agreed. And it shows again that it's good to integrate the storage
pool with the domain? Otherwise, the management app has to iterate
over the domain XML and look up the pool by the volume paths
used by the domain disks before setting up the pools on
the target host.
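For illustration, the iteration a management app would otherwise have to do looks roughly like this. It is a sketch only: the domain XML is a cut-down example, and the path-to-pool table is a hypothetical stand-in for real lookups (in libvirt these would go through virStorageVolLookupByPath and virStoragePoolLookupByVolume).

```python
import xml.etree.ElementTree as ET

# Cut-down example domain XML; the guest name and file path are illustrative.
DOMAIN_XML = """
<domain type='kvm'>
  <name>guest1</name>
  <devices>
    <disk type='block' device='disk'>
      <driver name='qemu' type='raw'/>
      <source dev='/dev/disk/by-path/pci-0000:00:07.0-scsi-0:0:0:0'/>
      <target dev='vda' bus='virtio'/>
    </disk>
    <disk type='file' device='cdrom'>
      <source file='/var/lib/libvirt/images/boot.iso'/>
      <target dev='hdc' bus='ide'/>
    </disk>
  </devices>
</domain>
"""


def block_disk_paths(domain_xml):
    """Collect the source paths of all type='block' disks in a domain XML."""
    root = ET.fromstring(domain_xml)
    paths = []
    for disk in root.findall("./devices/disk[@type='block']"):
        source = disk.find("source")
        if source is not None and source.get("dev"):
            paths.append(source.get("dev"))
    return paths


def pools_for_paths(paths, volume_to_pool):
    """Map each disk path to its pool name, or None if no pool owns it.

    `volume_to_pool` is a hypothetical lookup table standing in for
    per-path queries against the storage driver.
    """
    return {path: volume_to_pool.get(path) for path in paths}


paths = block_disk_paths(DOMAIN_XML)
pools = pools_for_paths(
    paths,
    {"/dev/disk/by-path/pci-0000:00:07.0-scsi-0:0:0:0": "npiv-pool"},
)
```

With disks mapped to pool volumes in the domain XML itself, this reverse lookup would not be needed.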
>> 4) Enrich HBA's XML
>> It's hard to know which vHBAs were created from an HBA with the
>> current implementation. One has to dump the XML of each (v)HBA and
>> find the clue in the "parent" element of the vHBAs. It would be
>> good to introduce a new element for the HBA, like "vports", so that
>> one can easily know which (and how many) vHBAs are created from
>> the HBA.
>> It would also be good to expose the maximum number of vports the
>> HBA supports. Besides these, other useful information should be
>> exposed too, such as the vendor name, the HBA state, PCI address, etc.
>> The new XML should look like:
>> <vports num='2' max='64'>
>> <vport name="scsi_host40" wwpn="2101001b32a90004"/>
>> <vport name="scsi_host40" wwpn="2101001b32a90005"/>
>> <address type="pci" domain="0" bus="0" slot="5" function="0"/>
>> "online", "vendor", "address" make sense to vHBA too.
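To show what the current "dump every (v)HBA and follow the parent element" approach amounts to, here is a small sketch. The node device XML snippets are cut-down illustrations, the host names scsi_host5, scsi_host40 and scsi_host41 are hypothetical, and vports_by_hba is not a libvirt function.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Cut-down node device XML for one HBA and two vHBAs (illustrative values).
NODE_DEVICE_XMLS = [
    """<device>
  <name>scsi_host5</name>
  <parent>pci_0000_00_05_0</parent>
  <capability type='scsi_host'>
    <host>5</host>
    <capability type='fc_host'>
      <wwnn>2001001b32a9da4e</wwnn>
      <wwpn>2101001b32a90002</wwpn>
    </capability>
  </capability>
</device>""",
    """<device>
  <name>scsi_host40</name>
  <parent>scsi_host5</parent>
  <capability type='scsi_host'>
    <host>40</host>
    <capability type='fc_host'>
      <wwnn>2001001b32a9da4e</wwnn>
      <wwpn>2101001b32a90004</wwpn>
    </capability>
  </capability>
</device>""",
    """<device>
  <name>scsi_host41</name>
  <parent>scsi_host5</parent>
  <capability type='scsi_host'>
    <host>41</host>
    <capability type='fc_host'>
      <wwnn>2001001b32a9da4e</wwnn>
      <wwpn>2101001b32a90005</wwpn>
    </capability>
  </capability>
</device>""",
]


def vports_by_hba(device_xmls):
    """Group vHBAs under the HBA named in their <parent> element."""
    fc_hosts = {}
    for text in device_xmls:
        dev = ET.fromstring(text)
        wwpn = dev.findtext(".//capability[@type='fc_host']/wwpn")
        if wwpn is not None:  # only Fibre Channel hosts are of interest
            fc_hosts[dev.findtext("name")] = (dev.findtext("parent"), wwpn)
    vports = defaultdict(list)
    for name, (parent, wwpn) in sorted(fc_hosts.items()):
        # A vHBA is an fc_host whose parent is itself an fc_host.
        if parent in fc_hosts:
            vports[parent].append((name, wwpn))
    return dict(vports)


vports = vports_by_hba(NODE_DEVICE_XMLS)
```

A "vports" element on the HBA would expose the same grouping directly, without dumping and cross-referencing every device.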
> I'm trying to remember how we modelled the parent/child relationship
> for SR-IOV PCI cards. NPIV is a very similar concept, so we should
> ideally seek to model the parent/child relationship in the same