[libvirt] Secure migration performance numbers

Fri Apr 17 13:17:35 UTC 2009

All,
     I've finally gotten the third implementation of secure migration basically
working, and have some performance numbers to share.  Just to refresh everyone's
memory, I'm looking both at host CPU usage and total time to migrate a guest,
using 4 different implementations:

1)  Qemu->qemu migration.  This is today's unencrypted migration.
2)  GnuTLS->GnuTLS migration.  This implementation opens a side-band gnutls
channel, and then dumps the data over that side channel.
3)  RPC migration.  This version of the RPC migration uses the standard RPC
mechanisms to dump all of the migration over the wire.
4)  "Fire and forget" RPC migration.  This version of the RPC migration uses the
in-place RPC mechanism, but doesn't wait for an explicit acknowledgement from the
other side.

In all cases, I took a RHEL-5 KVM guest running on an F-11 host, booted up the
guest with 3GB of memory, ran a program that touched all of memory in the guest
until it started to swap, and then live-migrated the guest.  This isn't *quite*
worst case (worst case would be a program that allocates all of memory and
continually changes it as fast as possible), but it's a good baseline.
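
For anyone who wants to reproduce a similar load, here is a rough sketch of a
"touch all of memory" program.  This is not the program used for the numbers in
this post; the function name, the 4096-byte page size, and the sizes used are
assumptions for illustration only.

```python
# Hypothetical memory toucher: NOT the program used for the tests in this
# post, just a sketch of the same idea.

def touch_memory(total_bytes, page_size=4096):
    """Allocate total_bytes and write one byte in every page.

    Writing to each page forces the guest kernel to back it with real
    memory, so the pages have to be transferred during migration.
    page_size=4096 is an assumption (typical for x86).
    Returns the buffer (to keep it alive) and the number of pages touched.
    """
    buf = bytearray(total_bytes)
    pages = 0
    for offset in range(0, total_bytes, page_size):
        buf[offset] = 1  # dirty this page
        pages += 1
    return buf, pages


if __name__ == "__main__":
    # Small demo size; in a guest you would raise this until it starts to swap.
    buf, pages = touch_memory(64 * 1024 * 1024)
    print("touched", pages, "pages")
```

Note that this touches memory once and then stops, which matches the
"not quite worst case" setup described above; a worst-case variant would keep
rewriting the pages in a loop.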

For case 1, the migrations took an average of 29 seconds to complete.  During
the migration, idle CPU percentage on the host varied between 33 and 97 percent.
This is our baseline.

For case 2, the migrations took an average of 44 seconds to complete.  During
the migration, idle CPU percentage on the host was pretty uniformly 0 (all
processor time was approximately evenly split between user and system).

For case 3, the migrations took an average of 42 seconds to complete.  During
the migration, idle CPU percentage on the host varied between 12 and 48 percent.

For case 4, the migrations took an average of 112 seconds to complete.  During
the migration, idle CPU percentage on the host varied between 18 and 50 percent.
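
To make the comparison against the baseline easier to read, the averages above
can be turned into slowdown factors.  A small sketch, using only the migration
times reported in this post (the labels and function names are just for
illustration):

```python
# Average migration times in seconds, taken from the results above.
BASELINE_SECONDS = 29  # case 1: plain qemu->qemu

avg_seconds = {
    "1) qemu->qemu": 29,
    "2) GnuTLS side channel": 44,
    "3) RPC": 42,
    '4) "fire and forget" RPC': 112,
}


def slowdown(seconds, baseline=BASELINE_SECONDS):
    """Return how many times longer a migration took than the baseline."""
    return seconds / baseline


def report(times=avg_seconds):
    """Build one summary line per case: time, slowdown factor, extra seconds."""
    lines = []
    for name, secs in times.items():
        extra = secs - BASELINE_SECONDS
        lines.append(f"{name:28s} {secs:4d}s  {slowdown(secs):.2f}x  (+{extra}s)")
    return lines


if __name__ == "__main__":
    print("\n".join(report()))
```

Cases 2 and 3 land at roughly 1.5x the baseline time, while case 4 is close to
4x.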

One thing of note: I'm least confident in the implementation of case 4.  It
still has a race condition that hits sometimes (which effectively kills the
test), and it's the most invasive in terms of the areas of code it touches.  The
patch is really ugly at the moment, but if people want their eyes to bleed, I
can post it.

Based on the above, it actually looks like case 3 might be the best to pursue.
While it may be possible to get better results out of case 4, I have some
concerns with the approach, particularly around error reporting.  The "fire and
forget" nature of the implementation, combined with the departure from the
normal RPC mechanism, makes it less attractive in my opinion.  Thoughts?

-- 
Chris Lalancette