Live migration support for Cloud-Hypervisor VMs

Wed Aug 3 15:38:28 UTC 2022

Thanks for the details and recommendations, Daniel!

On 8/2/2022 11:19 AM, Daniel P. Berrangé wrote:
> On Mon, Aug 01, 2022 at 11:03:49AM -0500, Praveen K Paladugu wrote:
>> Folks,
>>
>> We are implementing Live Migration support in the "ch" driver of libvirt. I'd
>> like to confirm if the approach we have chosen would be accepted upstream
>> once implemented.
>>
>>
>> Our immediate goal is to implement "Hypervisor Native" + "Managed Direct"
>> mode of migration. "Hypervisor Native" here refers to the VMM (ch) being
>> responsible for data flow. This is in contrast to TUNNELED migration, where
>> data is sent over libvirt RPC.
> 
> Avoiding TUNNELLED migration is a very good idea. This was a short term
> hack to work around the lack of TLS support in QEMU. It is more efficient
> to have TLS natively integrated in the hypervisor layer than libvirt.
> 
> IOW, "Hypervisor native" is a good choice.
> 
>>
>> "Managed Direct" refers to the virsh client being responsible for control flow
>> between source and dest hosts. The libvirtd daemons on source and
>> destination do not have to communicate with each other. These modes are
>> described further at
>> https://libvirt.org/migration.html#network-data-transports.
> 
> I'd caution that I think 'managed direct' migration leaves you with
> fewer nice options for ensuring resilience of the migration.
> 
> IOW, if the client application goes away, I think it'll be harder
> for the libvirt CH driver to recover from that scenario.
> 
> Also if a client app is using the DigitalOcean 'go-libvirt' API
> instead of our 'libvirt-go-module' API, things are even more
> limited since the 'go-libvirt' API directly speaks to the RPC
> protocol, bypassing libvirt.so logic related to migration
> process steps.
> 
> With the peer-to-peer mode, migration can carry on even if the
> client app goes away, since the client app isn't a part of the
> control loop.
> 
> So overall, I'd encourage peer-to-peer migration as the preferable
> option, unless you can hand off absolutely everything to the CH
> code and not have libvirt involved in orchestrating the migration
> steps at all?
It makes sense to prioritize peer-to-peer migration. Our current project is
an internship and has strict time constraints. As we are well under way
with "Managed Direct" mode, we will finish this and focus on peer-to-peer
migration mode right after.
>   
>> At the moment, Cloud-Hypervisor supports receiving migration data only on
>> Unix Domain Sockets. Also, Cloud-Hypervisor does not encrypt the VM data
>> while sending.
> 
> Hmm, that's quite limiting.
> 
>>
>> We are considering forking "socat" processes as documented at https://github.com/cloud-hypervisor/cloud-hypervisor/blob/main/docs/live_migration.md.
>> The socat processes will be forked in "Prepare" and "Perform" phases on
>> Destination and Source hosts respectively.
>>
>> I couldn't find any existing implementation in libvirt to connect Domain
>> Sockets on different hosts. Please let me know if you'd recommend a
>> different approach from forking socat processes to connect Domain Sockets on
>> source and dest hosts to allow Live VM Migration.
> 
> I think building something around socat will get you going quickly, but
> ultimately be harmful over the long term.
Makes sense. We were also concerned about long-term maintenance, so we
wanted to check on this mailing list. As there isn't a better mechanism to
connect domain sockets on source and dest hosts, we will finish up the
"socat" based implementation and get it to work end-to-end.
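
For reference, the job each socat process does here can be sketched roughly
as below. This is only an illustration in Python, not libvirt or
Cloud-Hypervisor code. The function names (pump, relay, source_side,
dest_side) and all paths, hosts and ports are made up for the example. The
assumed layout is that the source-side relay listens on a local Unix socket
and connects out over TCP, while the destination-side relay listens on TCP
and connects to the Unix socket on which cloud-hypervisor receives.

```python
import socket
import threading

CHUNK = 64 * 1024  # arbitrary buffer size chosen for this sketch


def pump(src, dst):
    """Copy bytes from src to dst until src reports EOF; return bytes copied."""
    total = 0
    while True:
        data = src.recv(CHUNK)   # kernel -> relay process
        if not data:
            break
        dst.sendall(data)        # relay process -> kernel
        total += len(data)
    try:
        dst.shutdown(socket.SHUT_WR)  # propagate EOF to the other side
    except OSError:
        pass  # peer already gone
    return total


def relay(a, b):
    """Forward traffic in both directions between two connected sockets."""
    back = threading.Thread(target=pump, args=(b, a), daemon=True)
    back.start()
    pump(a, b)
    back.join()


def source_side(unix_path, dest_host, dest_port):
    """Hypothetical source-host relay: local Unix socket -> TCP to destination."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as srv:
        srv.bind(unix_path)
        srv.listen(1)
        local, _ = srv.accept()          # the sending VMM connects here
    remote = socket.create_connection((dest_host, dest_port))
    with local, remote:
        relay(local, remote)


def dest_side(listen_port, unix_path):
    """Hypothetical destination-host relay: TCP listener -> local Unix socket."""
    with socket.create_server(("", listen_port)) as srv:
        remote, _ = srv.accept()         # the source-side relay connects here
    local = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    local.connect(unix_path)             # the receiving VMM listens here
    with remote, local:
        relay(remote, local)
```

Every chunk of VM state crosses user space once in each relay process, in
addition to whatever the VMM itself does, and nothing in this path encrypts
the data.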
> 
> Our experience with QEMU has been that to maximise performance you need
> the lowest level in full control. These days QEMU can open multiple TCP
> connections concurrently from multiple threads, so that throughput isn't
> limited by data copy performance of a single CPU. It also has the ability
> to take advantage of kernel features like zerocopy. Use of a socat proxy is
> going to add many data copies to the transport which can only harm your
> performance.
> 
> So my recommendation would be to invest time in first extending CH so
> that it natively supports opening TCP connections, and then take advantage
> of that in libvirt from the start. You then have the basic foundation
> right on which to add stuff like TLS, zerocopy, multi-connection, and more.
> 
> 
Again, thanks for the details and the recommendation. Enabling TCP
connections and other low-level features in cloud-hypervisor isn't
something we can tackle within our current time constraints. But we will
follow up with the cloud-hypervisor community and open a tracking issue for
this work.

> With regards,
> Daniel

-- 
Regards,
Praveen K Paladugu