[libvirt] RFC: Migration with NPIV

Dave Allan dallan at redhat.com
Tue Nov 20 16:26:53 UTC 2012


On Tue, Nov 20, 2012 at 10:17:11AM +0000, Daniel P. Berrange wrote:
> On Mon, Nov 19, 2012 at 05:30:11PM +0800, Osier Yang wrote:
> > Hi,
> > 
> > This proposal tries to figure out a solution for migration of a
> > domain which uses a LUN behind a vHBA as a disk device (QEMU
> > emulated disk only at this stage), along with other NPIV
> > improvements which are not related to migration. I haven't been
> > lucky enough to get an environment to test whether these thoughts
> > are workable, but I'd like to get ideas/suggestions early.
> > 
> > 1) Persistent vHBA support
> > 
> >   This is useful functionality that has been missing for a long
> > time. Assume that one created a vHBA, did the masking/zoning, and
> > everything works as expected. However, after a system reboot,
> > everything is just lost. If the user wants to get things back, he
> > has to find out the previous WWNN & WWPN and create the vHBA again.
> > 
> >   On the other hand, persistent vHBA support is actually required
> > for a domain which uses a LUN behind a vHBA. Otherwise the domain
> > could fail to start after a system reboot.
> > 
> >   To support persistent vHBAs, new APIs like virNodeDeviceDefineXML
> > and virNodeDeviceUndefine are required. It's also useful to introduce
> > "autostart" for vHBAs, so that a vHBA can be started automatically
> > after a system reboot.
> > 
> >   Proposed APIs:
> > 
> >   virNodeDevicePtr
> >   virNodeDeviceDefineXML(virConnectPtr conn,
> >                          const char *xml,
> >                          unsigned int flags);
> > 
> >   int
> >   virNodeDeviceUndefine(virConnectPtr conn,
> >                         virNodeDevicePtr dev,
> >                         unsigned int flags);
> > 
> >   int
> >   virNodeDeviceSetAutostart(virNodeDevicePtr dev,
> >                             int autostart,
> >                             unsigned int flags);
> > 
> >   int
> >   virNodeDeviceGetAutostart(virNodeDevicePtr dev,
> >                             int *autostart,
> >                             unsigned int flags);
> 
> I don't really much like this approach. IMHO, this should
> all be done via the virStoragePool APIs instead. Adding
> define/undefine/autostart to virNodeDevice is really just
> duplicating the storage pool functionality.

I like the idea of making vHBAs persist as part of pools; how do you
envision it should work?  Extend the scsi pools to take a vHBA
descriptor and then instantiate the vHBA as part of starting the
pool, or something else?
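
For the sake of discussion, the pool source might grow a vHBA
descriptor roughly like this (just a sketch; the adapter element and
its attributes are hypothetical, and the WWNs are only illustrative):

  <pool type='scsi'>
    <name>npiv-pool</name>
    <source>
      <adapter type='fc_host' parent='scsi_host5'
               wwnn='2001001b32a9da4e' wwpn='2101001b32a90004'/>
    </source>
    <target>
      <path>/dev/disk/by-path</path>
    </target>
  </pool>

Starting such a pool would then create the vHBA if it doesn't already
exist, and marking the pool as autostart would give us persistence
across reboots for free.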

> > 2) Associate vHBA with domain XML
> > 
> >   There are two ways to attach a LUN to a domain: as a QEMU emulated
> > device, or as a passthrough device. Since passing a LUN through is not
> > supported in libvirt yet, let's focus on the emulated LUN at this stage.
> > 
> >   New attributes "wwnn" and "wwpn" are introduced to indicate the
> > LUN behind the vHBA. E.g.
> > 
> >    <disk type='block' device='disk'>
> >      <driver name='qemu' type='raw'/>
> >      <source wwnn="2001001b32a9da4e" wwpn="2101001b32a90004"/>
> 
> If you change the schema of the <source> element, then you must
> also create a new type='XXX' attribute to identify it, not just
> re-use type='block'
> 
> >      <target dev='vda' bus='virtio'/>
> >      <address type='pci' domain='0x0000' bus='0x00' slot='0x07'
> > function='0x0'/>
> >    </disk>
> > 
> >   Before the domain starts, we have to check whether there is a LUN
> > assigned to the vHBA, and error out if not.
> > 
> >   Using the stable path of the LUN also works, e.g.
> > 
> >   <source dev="/dev/disk/by-path/pci-0000\:00\:07.0-scsi-0\:0\:0\:0"/>
> > 
> >   But the disadvantage is that the user has to figure out the
> > stable path himself, and we have to check every stable path to see
> > whether it's behind a vHBA in the migration "Begin" stage. Or
> > should a new attribute be added to the "source" element to indicate
> > that it's behind a vHBA? Such as:
> > 
> >   <source dev="disk-by-path" model="vport"/>
> 
> I don't much like the idea of mapping a vHBA to <disk> elements,
> because you have a cardinality mismatch. A <disk> is the equivalent
> of a single LUN, but a vHBA is something that provides multiple
> LUNs.
> 
> If you want to directly associate a vHBA with a virtual guest,
> then this is really in the realm of SCSI HBA passthrough, not
> <disk> devices.
> 
> 
> If you want something mapped to the <disk> device, then the
> approach should be to map to a storage pool volume - something
> we've long talked about as broadly useful for all storage types,
> not just NPIV.

+1, we really should take this as an opportunity to add storage
volumes as <disk> devices.
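
To make it concrete, the <disk> element might then reference a pool
volume along these lines (only a sketch; the type='volume' value and
the pool/volume attributes are hypothetical names, not existing
syntax):

   <disk type='volume' device='disk'>
     <driver name='qemu' type='raw'/>
     <source pool='npiv-pool' volume='unit:0:4:0'/>
     <target dev='vda' bus='virtio'/>
   </disk>

That keeps the one-LUN-per-<disk> cardinality right, and leaves all
the vHBA details in the pool definition.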

> > 3) Migration with vHBA
> > 
> >   One possible solution for migration with a vHBA is to use two
> > vHBAs (two WWNN & WWPN pairs) on the source host: one is used by the
> > domain, the other is reserved for migration purposes. It requires
> > that the storage admin maps the same LUN to both vHBAs when doing
> > the masking and zoning.
> > 
> > One of the two vHBAs is called the "primary vHBA", the other the
> > "secondary vHBA". To maintain the relationship between these two
> > vHBAs, we have to introduce new XML elements for the vHBA. E.g.
> > 
> >    In XML of primary vHBA:
> > 
> >    <secondary wwpn="2101001b32a90004"/>
> > 
> >    In XML of secondary vHBA:
> > 
> >    <primary wwpn="2101001b32a90002"/>
> > 
> > The primary vHBA is guaranteed not to be used by any domain driven
> > by libvirt (we do some checking earlier, before the domain starts).
> > It's also guaranteed, via sVirt or Sanlock, that the LUN can't be
> > used by another domain. So it's safe to have two vHBAs on the source
> > host too.
> > 
> > To prevent someone from using the LUN by creating a vHBA with the
> > same WWNN & WWPN on another host, we must create the secondary vHBA
> > on the source host, even though it's not being used.
> > 
> > Both the primary and secondary vHBA must be defined and marked as
> > "autostart", so that the domain can be started after a system
> > reboot.
> > 
> > When doing migration, we have to bake a bigger cookie with the
> > secondary vHBA's info (basically its WWNN and WWPN) in the migration
> > "Begin" stage, and eat that in the migration "Prepare" stage on the
> > target host.
> > 
> > In "Begin" stage, the XMLs represents the secondary vHBA is
> > constructed. And the secondary vHBA is destoyed on source host,
> > not undefined though.
> > 
> > In "Prepare" stage, a new vHBA is created (define and start)
> > on target host with the same WWNN & WWPN as secondary vHBA on
> > source host. The LUN then should be visible to target host
> > automatically? and thus migration can be performed. After migration
> > is finished on target host, the primary vHBA on source host is
> > destroyed, not undefined.
> > 
> > If migration fails, the new vHBA created on target host will
> > be destroyed and undefined. And both primary and secondary
> > vHBA on source host will be started, so that the domain could
> > be resumed.
> > 
> > Finally if migration succeeds, primary vHBA on source host
> > will be transtered to target host as secondary vHBA (defined).
> > And both primary and secondary vHBA on source host will be
> > undefined.
> 
> If we do the mapping of HBAs to guest domains using storage
> pools, then at a guest level, migration requires zero work.
> 
> It is simply up to the management app to create the storage
> pool on the destination host with the same Name + UUID, but
> with the secondary WWNN/WWPN. The nice thing about this is
> that you don't need to hardcode details of a secondary
> WWNN/WWPN up-front. The management app can just decide on
> those at the time it performs the migration, so 99% of the
> time there will only need to be a single vHBA setup on the
> SAN. During migration the mgmt app can set up a second
> vHBA for the target host, and once complete, delete the
> original vHBA entirely.

Agreed, although there will of course need to be some degree of
up-front coordination between the management app and the SAN
administrators to avoid having to involve them to migrate a VM.
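
If I understand the flow correctly, the management app would simply
define the same pool on the destination, same name and UUID, but
pointing at the second vHBA, e.g. (again only a sketch reusing the
hypothetical adapter element from above; the UUID, parent and WWNs
are illustrative):

  <pool type='scsi'>
    <name>npiv-pool</name>
    <uuid>70a4e95c-0f4c-4f16-86e1-7f0b3fb29e8c</uuid>
    <source>
      <adapter type='fc_host' parent='scsi_host7'
               wwnn='2001001b32a9da4e' wwpn='2101001b32a90005'/>
    </source>
    <target>
      <path>/dev/disk/by-path</path>
    </target>
  </pool>

and the guest XML wouldn't need to change at all, since its <disk>
elements would refer only to the pool and volume names.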

> > 4) Enrich HBA's XML
> > 
> >   It's hard to know which vHBAs were created from an HBA with the
> > current implementation. One has to dump the XML of each (v)HBA and
> > work it out from the "parent" element of the vHBAs. It would be
> > good to introduce a new element for the HBA, such as "vports", so
> > that one can easily know which (and how many) vHBAs are created
> > from the HBA.
> > 
> >   It would also be good to expose the maximum number of vports the
> > HBA supports.
> > 
> >   Besides these, other useful information should be exposed too,
> > such as the vendor name, the HBA state, the PCI address, etc.
> > 
> >   The new XMLs should be like:
> > 
> >   <vports num='2' max='64'>
> >     <vport name="scsi_host40" wwpn="2101001b32a90004"/>
> >     <vport name="scsi_host41" wwpn="2101001b32a90005"/>
> >   </vports>
> >   <online/>
> >   <vendor>QLogic</vendor>
> >   <address type="pci" domain="0" bus="0" slot="5" function="0"/>
> > 
> >   "online", "vendor", "address" make sense to vHBA too.
> 
> I'm trying to remember how we modelled the parent/child relationship
> for SR-IOV PCI cards. NPIV is a very similar concept, so we should
> ideally seek to model the parent/child relationship in the same
> manner.

Physical function:

<device>
  <name>pci_0000_01_00_0</name>
  <parent>pci_0000_00_01_0</parent>
  <driver>
    <name>igb</name>
  </driver>
  <capability type='pci'>
    <domain>0</domain>
    <bus>1</bus>
    <slot>0</slot>
    <function>0</function>
    <product id='0x10c9'>82576 Gigabit Network Connection</product>
    <vendor id='0x8086'>Intel Corporation</vendor>
    <capability type='virt_functions'>
      <address domain='0x0000' bus='0x01' slot='0x10' function='0x0'/>
      <address domain='0x0000' bus='0x01' slot='0x10' function='0x2'/>
      <address domain='0x0000' bus='0x01' slot='0x10' function='0x4'/>
      <address domain='0x0000' bus='0x01' slot='0x10' function='0x6'/>
      <address domain='0x0000' bus='0x01' slot='0x11' function='0x0'/>
      <address domain='0x0000' bus='0x01' slot='0x11' function='0x2'/>
      <address domain='0x0000' bus='0x01' slot='0x11' function='0x4'/>
    </capability>
  </capability>
</device> 

Virtual function:

<device>
  <name>pci_0000_01_10_0</name>
  <parent>pci_0000_00_01_0</parent>
  <driver>
    <name>igbvf</name>
  </driver>
  <capability type='pci'>
    <domain>0</domain>
    <bus>1</bus>
    <slot>16</slot>
    <function>0</function>
    <product id='0x10ca'>82576 Virtual Function</product>
    <vendor id='0x8086'>Intel Corporation</vendor>
    <capability type='phys_function'>
      <address domain='0x0000' bus='0x01' slot='0x00' function='0x0'/>
    </capability>
    <capability type='virt_functions'>
    </capability>
  </capability>
</device>

Interestingly, I think there's a bug there; the VF should not be
showing <capability type='virt_functions'> but that's unrelated to the
present discussion.
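
If we follow the same pattern for NPIV, the HBA could expose its
vports the way a PF exposes its VFs, and each vHBA could point back
at its parent HBA, something like (purely a sketch):

  <capability type='vports'>
    <vport name='scsi_host40' wwpn='2101001b32a90004'/>
    <vport name='scsi_host41' wwpn='2101001b32a90005'/>
  </capability>

with a corresponding capability on the vHBA side pointing back at its
parent HBA, mirroring the phys_function case above.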

Dave



> Daniel
> -- 
> |: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
> |: http://libvirt.org              -o-             http://virt-manager.org :|
> |: http://autobuild.org       -o-         http://search.cpan.org/~danberr/ :|
> |: http://entangle-photo.org       -o-       http://live.gnome.org/gtk-vnc :|
> 



