[libvirt] migration of vnlink VMs

Wed Jun 8 18:30:35 UTC 2011

I'll send a separate email (in a new thread, so it doesn't get lost! ;-) 
with a new draft of what the network XML should look like, but wanted to 
respond to Dan's comments inline...

On 05/24/2011 10:21 AM, Daniel P. Berrange wrote:
> On Fri, Apr 29, 2011 at 04:12:55PM -0400, Laine Stump wrote:
>> Okay, here's a brief description of what I *think* will work. I'll
>> build up the RNG based on this pseudo-xml:
>>
>>
>> For the <interface> definition in the guest XML, the main change
>> will be that <source .. mode='something'> will be valid (but
>> optional) when interface type='network' - in this case, it will just
>> be used to match against the source mode of the network on the host.
>> <virtualport> will also become valid for type='network', and will
>> serve two purposes:
>>
>> 1) if there is a mismatch with the virtualport on the host network,
>> the migrate/start will fail.
>> 2) It will be ORed with <virtualport> on the host network to arrive
>> at the virtualport settings actually used.
>>
>> For example:
>>
>> <interface type='network'>
>> <source network='red-network' mode='vepa'/>
> IMHO having a 'mode' here is throwing away the main reason for
> using type=network in the first place - namely independence
> from this host config element.

I agree, but was being accommodating :-) Since then, Dave has pointed 
out that the same functionality can be achieved by having the management 
application grab the XML for the network on the targeted host, and 
check for matches of any important parameters before deciding to migrate 
to that host. This has 2 advantages:

1) It is more flexible. The management application can check for more 
than just mode='vepa', but also any number of other attributes of the 
network on the target.

2) The result of a host's network not matching the desired mode will be 
"management app looks elsewhere", rather than "migration fails".

The management application will need to do this anyway (even if just to 
check that the given network is present at all) or, again, face the 
prospect of the migration failing.

So I'll withdraw this piece from the next draft.
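
A minimal sketch of the management-side check described above, assuming the draft <network> layout from this thread (a <source mode='...'/> subelement and a <virtualport type='...'/> subelement), which is only a proposal. The function names are made up for illustration; a real application would presumably fetch each host's network XML through libvirt (e.g. virNetworkGetXMLDesc) rather than pass strings around.

```python
import xml.etree.ElementTree as ET


def network_matches(network_xml, required_mode=None, required_vport_type=None):
    """Return True if the network XML satisfies every requested parameter.

    Hypothetical helper: the element and attribute names follow the draft
    schema discussed in this thread, not a finalized libvirt format.
    """
    root = ET.fromstring(network_xml)
    if root.tag != "network":
        return False
    if required_mode is not None:
        source = root.find("source")
        if source is None or source.get("mode") != required_mode:
            return False
    if required_vport_type is not None:
        vport = root.find("virtualport")
        if vport is None or vport.get("type") != required_vport_type:
            return False
    return True


def pick_target(candidates, **requirements):
    """Return the first host whose network matches, or None.

    candidates maps host name -> network XML text, or None when the
    network is not defined on that host at all.
    """
    for host, xml_text in candidates.items():
        if xml_text is not None and network_matches(xml_text, **requirements):
            return host
    return None
```

A host with no match is simply skipped, so the outcome is "management app looks elsewhere" rather than a failed migration.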

>> <virtualport type='802.1Qbg'>
>> <parameters instanceid='09b11c53-8b5c-4eeb-8f00-d84eaa0aaa4f'/>
>> </virtualport>
>> <mac address='xx:xx:.....'/>
>> </interface>
>>
>> (NB: if "mode" isn't specified, and the host network is actually a
>> bridge or virtual network, the contents of virtualport will be
>> ignored.)
>>
>>
>> <network> will be expanded by giving it an optional "type" attribute
>> (which will default to 'virtual'), <source> subelement, and
>> <virtualport> subelement. When type='bridge', you can specify source
>> exactly as you would in a domain <interface> definition:
>>
>> <network type='bridge'>
>> <name>red-network</name>
>> <source bridge='br0'/>
>> </network>
>>
>> When type='direct', again you can specify source and virtualport
>> pretty much as you would in an interface definition:
>>
>> <network type='direct'>
>> <name>red-network</name>
>> <source dev='eth0' mode='vepa'/>
>> <virtualport type='802.1Qbg'>
>> <parameters managerid="11" typeid="1193047" typeidversion="2"
>>         instanceid='09b11c53-8b5c-4eeb-8f00-d84eaa0aaa4f'/>
>> </virtualport>
>> </network>
> None of this really feels right to me. With this proposed
> schema, there is basically nothing in common between the
> existing functionality for <network> and this new functionality
> except for the <name> and <uuid> elements.
>
> Apps which know how to deal with existing <network> schema
> will have no ability to interpret this new data at all.
> Quite probably they will misinterpret it as providing an
> isolated virtual network, with no IP addr set, since this
> design isn't actually changing any attribute value that
> they currently look for.
>
> Either we need to make this align with the existing schema,
> or we need to put this under a completely separate set of
> APIs. I think we can likely do better with the schema design
> and achieve the former.

So the problem is that the new uses are so orthogonal to the current 
usage that existing management apps encountering this new XML will 
mistakenly believe that it's "old" XML with a bit of extra stuff that 
can be ignored (thus leading to mayhem).

I think the most important thing is to make sure that a config for one 
of these new types will have at least one change to an *existing* 
element/attribute (mine just added a *new* attribute specifying type) 
that causes existing apps to realize this isn't just an old school 
network definition that happens to have a few kinks on the side. Your 
suggestion of using new values for <forward mode="..."> seems like as 
good an idea as any (actually I can't think of anything else that works 
as well :-) )
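
To make the point concrete, here is a rough sketch of how an existing application might classify a network, assuming it only knows the 'nat' and 'route' forward modes and treats a missing <forward> element as an isolated network. The function name is invented for illustration, and treating a missing mode attribute as NAT is an assumption about the default.

```python
import xml.etree.ElementTree as ET

# Forward modes an "old" application is assumed to understand.
KNOWN_MODES = {"nat", "route"}


def classify_network(network_xml):
    """Classify a network the way a pre-existing app plausibly would.

    Returns 'isolated', 'nat' or 'route'; raises ValueError on a forward
    mode the app has never heard of.
    """
    root = ET.fromstring(network_xml)
    forward = root.find("forward")
    if forward is None:
        # No <forward> element: looks like an isolated virtual network.
        return "isolated"
    # Assumption: a <forward> element without mode= means NAT.
    mode = forward.get("mode", "nat")
    if mode in KNOWN_MODES:
        return mode
    raise ValueError("unsupported forward mode: %r" % mode)
```

With a network that only adds a new type='direct' attribute, this returns 'isolated' (the misinterpretation Dan describes), whereas a new value in <forward mode="..."> makes the same code fail loudly.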

>> However, dev would be optional - if not specified, we would expect a
>> pool of interfaces to be defined within source, eg:
>>
>> <network type='direct'>
>> <name>red-network</name>
>> <source mode='vepa'>
>> <pool>
>> <interface name='eth10' maxConnect='1'/>
>> <interface name='eth11' maxConnect='1'/>
>> <interface name='eth12' maxConnect='1'/>
>> <interface name='eth13' maxConnect='1'/>
>> <interface name='eth14' maxConnect='1'/>
>> <interface name='eth25' maxConnect='5'/>
>> </pool>
>> </source>
>> <virtualport ...... />
>> </network>
> I don't really like the fact that this design has special
> cased the num(interfaces) == 1 case to have a completely
> different XML schema. eg we have this:
>
>    <source dev='eth0' mode='vepa'/>
>
> And this
>
>    <source mode='vepa'>
>    <pool>
>    <interface name='eth10' maxConnect='1'/>
>    </pool>
>
> both meaning the same thing. There should only be one
> representation in the schema for this kind of thing.
>> BTW, for all the people asking about sectunnel, openvswitch, and vde
>> - can you see how those would fit in with this? In particular, do
>> you see any conflicts? (It's easy to add more stuff on later if
>> something is just missing, but much more problematic if I put
>> something in that is just plain wrong).
> As mentioned above, I think this design is wrong, because it is not
> taking any account of the current schema for <network> which defines
> the various routed modes.
>
> Currently <network> supports 3 connectivity modes
>
>   - Non-routed network, separate subnet        (no <forward> element present)
>   - Routed network, separate subnet with NAT   (<forward mode='nat'/>)
>   - Routed network, separate subnet            (<forward mode='route'/>)
>
> Following on from this, I can see another couple of routed modes
>
>   - Routed network, IP subnetting
>   - Routed network, separate subnet with VPN
>
> And the core goal here is to replace type=bridge and type=direct from the
> domain XML, which means we're adding several bridging modes
>
>   - Bridged network, eth + bridge + tap        (akin to type=bridge)
>   - Bridged network, eth + macvtap             (akin to type=direct)
>   - Bridged network, sriov eth + bridge + tap  (akin to type=bridge)
>   - Bridged network, sriov eth + macvtap       (akin to type=direct)
>
> The macvtap can be in 4 modes, so it is probably better to
> consider them separately
>
>   - Bridged network, eth + bridge + tap
>   - Bridged network, eth + macvtap + vepa
>   - Bridged network, eth + macvtap + private
>   - Bridged network, eth + macvtap + passthrough
>   - Bridged network, eth + macvtap + bridge
>   - Bridged network, sriov eth + bridge + tap
>   - Bridged network, sriov eth + macvtap + vepa
>   - Bridged network, sriov eth + macvtap + private
>   - Bridged network, sriov eth + macvtap + passthrough
>   - Bridged network, sriov eth + macvtap + bridge
>
> I can also perhaps imagine another VPN mode:
>
>   - Bridged network, with VPN
>
> The current routed modes can route to anywhere, or be restricted to
> a particular network interface eg with <forward dev='eth0'/>. It
> only allows for a single interface, though even for routed modes it
> could be desirable to list multiple devs.
>
> The other big distinction is that the <network> modes which do routing,
> include interface configuration data (ie the IP addrs & bridge name)
> which is configured on the fly. It looks like with the bridged modes,
> you're assuming the app has statically configured the interfaces via
> the virInterface APIs already, and this just points to an existing
> configured interface. This isn't necessarily a bad thing, just an
> observation of a significant difference.

Right. Perhaps later it can be expanded (at least in some of the modes) 
to set up these devices when the network is started, but right now the 
network definition is just used to point to something that already 
exists and is functioning.

> So if we ignore the <ip> and <domain> elements from the current <network>
> schema, then there are a handful of others which we need to have a plan
> for
>
>    <forward mode='nat|route'/>    (omitted completely for isolated networks)
>    <bridge name="virbr0" />       (auto-generated/filled if omitted)
>    <mac address='....'/>          (auto-generated/filled if omitted)
>
> The <forward> element can have an optional dev= attribute.
>
> I think the key attribute is the <forward> mode= attribute. I think we
> should be adding further values to that attribute for the new network
> modes we want to support. We should also make use of the dev= attribute
> on <forward> where practical, and/or extend it.
>
> We could expand the list of <forward> mode values in a flat list
>
>    - route
>    - nat
>    - bridge (brctl)
>    - vepa
>    - private
>    - passthru
>    - bridge (macvtap)
>
> NB: really need to avoid using 'bridge' in terminology, since all
> 5 of the last options are really 'bridge'.
>
> Or we could introduce an extra attribute, and have a 2 level list
>
>    - <forward layer='link'/>    (for all ethernet layer bridging)

Does that gain us anything, though? While it's correct information, it 
seems redundant (the layer can always be implied from the mode).

>    - <forward layer='network'/>  (for all IP layer bridging aka routing)
>
> So the current modes would be
>
>     <forward layer='network' mode='route|nat'/>
>
> And new bridging modes would be
>
>     <forward layer='link' mode='bridge-brctl|vepa|private|passthru|bridge-macvtap'/>
>
> For the brctl/macvtap modes, the dev= attribute on <forward> could point to
> the NIC being used, while with brctl modes, <bridge> would also be present.

Are you saying that in the case of a brctl mode, it would be required to 
fill in both of these?

<forward mode="bridge-brctl" dev="br0" .../>
<bridge name="br0" .../>

I think I would prefer to only use the one in <forward>. Are you 
suggesting putting it there to help older management apps cope with the 
new modes? I don't really think it would help; it's really just an 
accident of implementation that the device in "bridge-brctl" mode 
happens to be a bridge device.

> In the SRIOV case, we potentially need a list of interfaces. For this we
> probably want to use

BTW, just to clarify, when you say "SRIOV", what you really mean is "any 
situation where there are multiple network interface devices connected 
to the same physical network, and identical connectivity to the guest 
could be provided by any one of these devices". In other words, it 
doesn't need to be an SRIOV ethernet card with multiple virtual 
functions, it could also be an older style setup with multiple physical 
cards, or multiple complete devices on a single card.

>     <forward dev='eth0'>
>       <interface dev='eth0'/>
>       <interface dev='eth1'/>
>       <interface dev='eth2'/>
>       ...
>     </forward>
>
> NB, the first interface is always to be listed both as a dev= attribute
> (for compat with existing apps) *and* as a child <interface> element (for
> apps knowing the new schema).

But since the pool of devices would only ever be used in one of the new 
forward modes, which an existing app wouldn't understand anyway, would 
that really buy us anything?

> The maxConnect= attribute from your examples above is an interesting
> thing. I'm not sure whether that is necessarily a good idea. It feels
> similar to VMware's "port group" idea, but I don't think having a
> simple 'maxConnect=' attribute is sufficient to let us represent the
> VMware port group idea. I think we might need a more explicit
> element eg
>
>     <portgroup count='5'>
>        <interface dev='eth2'/>
>     </portgroup>
>
> eg, so this associates a port group which allows 5 clients (VM NICs)
> with the uplink provided by eth2 (which is assumed to be listed
> under <forward>).

I've thought about this a bit, and I think portgroup is a good idea, but 
I don't think the name of the device being used fits there. portgroup is 
a good place to put information about the characteristics of a set of 
connections, but which device to use is a backend implementation detail, 
and there isn't necessarily a 1:1 correspondence between the two. 
portgroup would be used, for example, to configure bandwidth (that's 
pretty much all VMware uses it for, plus a blob of "vendor-specific" 
data), and the guest interface XML would specify which portgroup a guest 
was going to belong to - if you also set which physical device to use 
based on portgroup, that would leave the guest XML specifying which 
physical device to use, which is what we're trying to get away from. 
(It would also mean that each physical device would need its own 
portgroup, which I don't think we want.)

Thinking more about the maxConnect thing, it seems like it might be 
overkill for now. The case where there must be a limitation of 1 guest 
per NIC is macvtap passthrough mode, but that's already implied by the 
fact that it's passthrough. Other than that, libvirt can just attempt to 
load-balance as best as possible by keeping track of how many 
connections there are on each device, but not force any artificial 
limit. We may need to provide some method of reporting the number of 
connections to any particular network, to be used by a management 
application for load balancing decisions (although the amount of traffic 
is probably more important, and that can already be learned).
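
As a rough illustration of that bookkeeping (not how libvirt does or will do it), the selection could be as simple as "pick the device with the fewest connections", with passthrough mode treated as an implicit one-guest-per-device limit. The function and the connection-count table are hypothetical.

```python
def pick_device(connections, mode):
    """Choose a device from the pool and record the new connection.

    connections maps device name -> current number of guest connections.
    In 'passthrough' mode only unused devices are eligible, since that
    mode implies one guest per device; other modes have no hard limit.
    Returns the chosen device name, or None if nothing is available.
    """
    if mode == "passthrough":
        candidates = [dev for dev, count in connections.items() if count == 0]
    else:
        candidates = list(connections)
    if not candidates:
        return None
    # Least-loaded device first; ties broken by name for determinism.
    chosen = min(candidates, key=lambda dev: (connections[dev], dev))
    connections[chosen] += 1
    return chosen
```

The same per-device counts are what a management application would want reported if it is to make its own load-balancing decisions.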

Conclusion on portgroup - a good idea, but not for this, probably for 
configuration of bandwidth limiting.

> So a complete SRIOV example might be
>
>    <network>
>      <name>Foo</name>
>      <forward dev='eth0' layer='link' mode='vepa'>
>        <interface dev='eth0'/>
>        <interface dev='eth1'/>
>        <interface dev='eth2'/>
>        ...
>      </forward>
>      <portgroup count='10'>
>        <interface dev='eth0'/>
>      </portgroup>
>      <portgroup count='5'>
>        <interface dev='eth1'/>
>      </portgroup>
>      <portgroup count='5'>
>        <interface dev='eth2'/>
>      </portgroup>
>    </network>
>
>
> The <virtualport> parameters for VEPA/VNLink could either be stored at
> the top level under <network>, or inside <portgroup> or both.

Ah, now *there's* something that fits in portgroup (since that's likely 
exactly what it's used for on the vepa/vnlink capable switch).

I think it's reasonable to put it in both places, at the top-level 
(which would apply to all connections) and in portgroup (which would 
override the global setting for connections using that portgroup). (I 
think the bandwidth config could be done in the same way.)
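
A small sketch of the override behaviour I have in mind, assuming the virtualport parameters are held as simple name/value pairs (managerid, typeid, etc., as in the examples earlier in this thread). The function name is made up for illustration.

```python
def effective_virtualport(network_params, portgroup_params=None):
    """Merge network-wide virtualport parameters with a portgroup's.

    Parameters set in the portgroup override the top-level ones; any
    parameter the portgroup leaves out is inherited from the network.
    Neither input dict is modified.
    """
    merged = dict(network_params)
    if portgroup_params:
        merged.update(portgroup_params)
    return merged
```

For example, a network-wide typeid of "1193047" combined with a portgroup that sets typeid "42" would give connections in that portgroup typeid "42", while still inheriting the network's managerid.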