[vfio-users] An epic wall of text with questions and comments from a Xen user

Okky Hendriansyah okky at nostratech.com
Mon Nov 23 12:11:16 UTC 2015


Hi Zir,

On November 22, 2015 at 20:41:59, Zir Blazer (zir_blazer at hotmail.com) wrote:
During the last two years I have been a happy Xen user with a decently working VGA Passthrough setup. ... 
My first experience with GPU passthrough was also on Xen. At that time I was still using a HIS Radeon R9 270X IceQ X2, and it worked flawlessly with a Ubuntu 12.04 LTS Dom0 (if I recall correctly). I didn’t really like having a Type-1 hypervisor like Xen on my primary machine (my desktop), but since Xen seemed like the only option for VGA passthrough at that time, I had no other choice.

So far, these are the reasons why I'm interested in testing KVM at this point: 

1) KVM-VFIO is usually ahead of Xen in features since it can use them straight from standalone QEMU, while Xen has to adapt the features for use with its toolstack. There is a specific niche where KVM-VFIO is very ahead: PCI/VGA Passthrough, where Xen has a critical feature that it does not support: GeForce Passthrough… 
That was true until I saw the legendary thread nbhs started on the Arch Linux forums [1]. I was eager to try an NVIDIA card since, for most of my games, it offered more features than AMD, but as you said, Xen lacks NVIDIA passthrough capabilities. So I dared myself, traded in my card for a Gigabyte GTX 770 Windforce, and went the KVM VFIO way. To my surprise, the passthrough worked (thanks to nbhs for opening the thread, and to aw *Alex* for developing VFIO). A lot of quirks came along the way as the battle against the green company started. 

2) … It would be even more wonderful if I can switch the Radeon from a VM with W10 to WXP x64 or vice versa without having to reboot the computer (At least once it worked, but results are totally inconsistent and BSOD is what happens most often). This would make comparing OSes far easier. 
I haven’t had the chance to try AMD again on my system, but switching between Windows 10 and OS X without rebooting works flawlessly for me. Currently my machine is a Haswell (Intel Core i7-4770, ASRock Z87 Extreme6) with an NVIDIA card (Gigabyte GTX 980 G1 Gaming).

3) … The only Juniper based cards that I'm aware that has UEFI GOP support are two models coming from Apple itself, but I hear some comments about the Juniper Flash ROM capacity being too small to fit the required code. 
Yes, QEMU supports both split and combined OVMF images, and with the split image the firmware settings are persistent rather than volatile. I still haven’t succeeded in supplying the split image to my Mac OS X guest, though. As for UEFI GOP, the card needs enough flash ROM capacity for the manufacturer to be able to include it; I ran into the same limitation when I was hunting for a UEFI ROM for my friend’s MSI GTX 560 Ti Lightning. 
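
For reference, supplying the split image boils down to two pflash drives, something like this (a sketch; the firmware paths and the per-guest VARS copy are examples from a typical Arch install, adjust them to wherever your distribution puts OVMF):

    # Keep a private copy of the variable store per guest so its settings persist
    cp /usr/share/ovmf/x64/OVMF_VARS.fd /home/okky/vm/win10_OVMF_VARS.fd

    # Firmware part of the guest command line (the rest of the options are unchanged):
    -drive if=pflash,format=raw,readonly,file=/usr/share/ovmf/x64/OVMF_CODE.fd \
    -drive if=pflash,format=raw,file=/home/okky/vm/win10_OVMF_VARS.fd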

4) … I see this as Xen biggest drawback for a consumer setup, as you can't get out-of-the-box sound from multiple VMs with the host as the sound mixer. 
I couldn’t agree with you more. I personally use ALSA as the sound backend for my KVM guests (except, again, Mac OS X refuses to use it; I worked around that with the NoMachine audio adapter). With this setup I can play back game walkthrough videos on my Linux host whenever I get stuck in a game. :P
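
For anyone wondering, on the QEMU versions I have used the audio backend is still selected through environment variables, so the relevant bits of the script look roughly like this (a sketch; the device names are just the usual defaults):

    # Route the guest's emulated HDA codec to the host's ALSA device
    export QEMU_AUDIO_DRV=alsa
    export QEMU_ALSA_DAC_DEV=default   # playback device, e.g. default or hw:0,0
    export QEMU_ALSA_ADC_DEV=default   # capture device

    # and on the QEMU command line, an emulated HDA codec for the guest:
    -device intel-hda -device hda-duplex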

5) … Performance benchmarks of the same setup under different VMMs would also be very interesting. ... 
A benchmark of native vs. Xen SeaBIOS vs. KVM SeaBIOS vs. KVM OVMF would probably be interesting.

6) ... Not only that, since Arch Linux is a rolling release, you usually have a lot of new libraries and stuff that can break using or building older things. ...
I’m interested in trying out Gentoo Linux because of this [2]. Arch Linux is great, but it needs to be updated really often. Since I haven’t had any significant issues, though, I’m still sticking with Arch Linux. I’ll probably try to familiarize myself with Gentoo on my Atom machine.

… since I'm bend that Windows must not have direct Hardware access ...
My motivation is to have ZFS storage to hold all of my data, but I also want to be able to play Windows games on the NVIDIA card. Those two reasons alone are enough to make me want to do GPU passthrough from my Linux host, which also serves as the ZFS storage.

DEALING WITH MULTIPLE MONITORS AND THE PHYSICAL DESKTOP 
...
Most of the time I only launch one Windows 10 guest, which has the passed-through GPU and boots with OVMF, since almost everything else I need is available from the host. I serve a Samba share on the host and the guest maps it as a network drive. Nothing fancy.
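
The host side is nothing more than a stock Samba share, something along these lines (path and user are made up for the example):

    # /etc/samba/smb.conf (excerpt)
    [storage]
        path = /tank/share
        valid users = okky
        read only = no

The guest then maps \\<host address>\storage as a network drive.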

I have 2 monitors, and both of them are connected to both the GTX 980 (DisplayPort + DVI) and the Intel HD 4600 (DVI + HDMI). At boot time, both monitors display my Linux host desktop. When I start the guest, I issue an xrandr command inside my QEMU script [4] to disable the Intel HD 4600’s DVI output and switch that monitor to the GTX 980’s DisplayPort input, shrinking the Linux desktop to about half of its original screen real estate and giving my Windows 10 guest a dedicated monitor. When I’m done playing games and want to shut down the guest, another xrandr line restores the display to its original state, so the Linux host occupies both monitors again.
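
The script [4] has the full command line, but the monitor juggling itself boils down to something like this (a sketch; the output names are examples, check xrandr -q for the real ones on your system):

    #!/bin/bash
    # Hand the second monitor to the guest: turning off the Intel output makes the
    # monitor fall back to its DisplayPort input, which the GTX 980 will drive.
    xrandr --output HDMI2 --off

    qemu-system-x86_64 "$@"   # the actual guest command line runs here

    # Guest has shut down: light the Intel output up again and extend the desktop.
    xrandr --output HDMI2 --auto --right-of HDMI1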

I have 2 sets of keyboard+mouse: one set for the host, and the other, better set connected to a USB switch [3] that I can point at either the host or the guest with a button press. If I want the full hardware features of my mouse and keyboard, I just press the switch to route the USB connection to the guest’s USB card. But I also set up Synergy on both guest and host, so for things like daily updates I don’t have to press the switch at all. Synergy used to be laggy (I was actually one of the early adopters and was lucky to get Synergy Pro for just US$1.00), but these days the performance is very good. I tried playing one of the Call of Duty games with the Synergy mouse and keyboard and the response was fine.

USING PERIPHERALS 
...
I think it’s better if you pass through a whole USB controller expansion card. Then you can attach a keyboard, mouse, external drives, a USB sound card, etc. to it.
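
A rough sketch of how that looks on the KVM side (the PCI address and vendor:device ID below are only examples, taken from a Renesas uPD720202 card; check yours with lspci -nn):

    # Detach the controller from xhci_hcd and bind it to vfio-pci
    modprobe vfio-pci
    echo 0000:03:00.0 > /sys/bus/pci/devices/0000:03:00.0/driver/unbind
    echo "1912 0015" > /sys/bus/pci/drivers/vfio-pci/new_id

    # Then hand the whole controller to the guest:
    -device vfio-pci,host=03:00.0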

PLANNING STORAGE 
...
As host, I like a minimalistic one with nothing but the bare Arch Linux install to start X.org (Or better, Wayland in some future) with OpenBox and launch VMs, as its sole purpose is to be an administrative domain, not a full OS - the less it does, the more secure it will be. ...
This was my approach too when I used Xen VGA passthrough. But personally I came to dislike the idea, since if I need ZFS storage I have to set up and run yet another dedicated storage guest (like the all-in-one concept on the VMware ESXi platform). Instead of spending that memory on a dedicated storage VM, I run the host as both hypervisor and ZFS storage. This way I can give the memory to ZFS directly, store all my guest images on ZFS, and let any guest that needs ISO images, installers, etc. access the file server on that same ZFS pool.

Currently, I have a 4 TB HD, which is GPT formatted and has three physical partitions: ...
Hmm, a single disk for both the host OS and multiple guests? That sounds like possible I/O starvation to me. 

...
- Using ZFS for the storage partition, with ZVols as LVM volumes replacement. While I love all ZFS and BTRFS next-generation features for data reliability, I dropped the ZFS idea since it seems to be a pain in the arse to set up. … Basically, I have a hard time trying to think if all the extra complexity and resources to get ZFS running is worth it or not in a small setup. 
Actually ZFS is not that complex. I’ve been using ZFS since the dawn of OpenSolaris, back when ZFS on Linux started to show up as a “competitor” to ZFS-FUSE. At that time ZFS on Linux wasn’t even ready for production, but I took the plunge, migrated all of my EXT4-based storage to ZFS, and have no regrets. I like the concept of a storage pool: single, centralized, flexible, self-healing, etc. One feature that I really like, and that I think is a real benefit in a hypervisor environment, is compression. It uses slightly more CPU cycles to compress/decompress, but it improves throughput since less I/O is needed.
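
Turning compression on is a one-liner, for example (the dataset name is just from my layout):

    zfs set compression=lz4 tank/guests              # lz4 is cheap on the CPU
    zfs get compression,compressratio tank/guests    # see how much it actually saves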

My setup is as follows: my Arch Linux host OS (the hypervisor) on an Intel 530 120 GB SSD (EXT4), a 500 GB Hitachi for Steam/Origin/Uplay/GOG games (NTFS), a 2 TB Hitachi for re-downloadable content (my iTunes music, Steam/Origin/Uplay/GOG backups, my game recordings, etc.), a 160 GB Hitachi holding a clone of my guest OS (in case I want to boot it natively), and 8 x 1 TB WD Caviar Blue configured as a single ZFS pool of striped 2-way mirrors (RAID10), which gives me up to 8 times the read performance and up to 4 times the write performance of a single WD Caviar Blue.
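
For anyone curious, a pool like that is just a stripe of mirror vdevs; it is created roughly like this (the disk IDs are placeholders, use the real /dev/disk/by-id/ names):

    zpool create tank \
        mirror /dev/disk/by-id/ata-WD_DISK1 /dev/disk/by-id/ata-WD_DISK2 \
        mirror /dev/disk/by-id/ata-WD_DISK3 /dev/disk/by-id/ata-WD_DISK4 \
        mirror /dev/disk/by-id/ata-WD_DISK5 /dev/disk/by-id/ata-WD_DISK6 \
        mirror /dev/disk/by-id/ata-WD_DISK7 /dev/disk/by-id/ata-WD_DISK8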

Yes, I know my example is a multi-disk ZFS array, but I think the real benefit of ZFS shows when it is configured as a multi-disk array. A single-disk ZFS system is not as beneficial: you still get compression and multiple copies of your data, but the I/O performance is bounded by that one drive.

My guests reside on a ZFS dataset, which is roughly a filesystem in the modern sense. I haven’t tried putting my guests on a ZVOL (ZFS volume) myself, but I think migration is simpler with a ZFS dataset for now. I’m intrigued to try ZVOLs, but haven’t had the chance yet. If I need to try something that could break my guests, I usually snapshot the guest dataset first, and if something bad happens I just roll back to that snapshot. I can also clone that snapshot to create a copy of a guest without additional storage (merely a few kB to start).
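
The whole workflow is only a few commands, for example (dataset names are made up):

    zfs snapshot tank/guests/win10@before-update    # instant, nearly free checkpoint
    zfs rollback tank/guests/win10@before-update    # throw away everything since the snapshot
    zfs clone tank/guests/win10@before-update tank/guests/win10-test   # writable copy, costs a few kB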

As for the resources I allocate to the ZFS workload: out of my 32 GB of system memory I currently reserve only 2 GB for the ZFS ARC, and I have experienced no issues so far.
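
On ZFS on Linux that cap is a module parameter, set in bytes, something like:

    # /etc/modprobe.d/zfs.conf
    options zfs zfs_arc_max=2147483648    # limit the ARC to 2 GiB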

...
- … I recall some people having a VM with a SATA Controller with multiple HDs specifically for a NAS type distribution, but I have no idea how they get the Passthrough working since the Intel Chipsets from consumer platforms usually have the SATA Controller with ugly companions. 
I think you need a dedicated HBA/SATA/RAID expansion card like these [5], which is also what I use for my ZFS storage array.

… As I wouldn't need the HD to be bootable (Which means a partition for the ESP), I was thinking that it could be made a whole-disk ZFS instead of merely a ZFS partition. I don't know if it is worth for me to do so, all the previous ZFS statements applies. Regardless, doing whole-disk ZFS is easier said that done, since to repartition the HD I have to move a lot of data to several smallers HDs among my family members computers then back to it after formatting. I would do it if someone can convince me that going ZFS is a must, otherwise, the HD would stay as it is (Including my working host with Xen, in case I can't get something running with KVM). 
As I understand it, ZFS prefers to be given whole physical disks, not partitions. ZFS also prefers to have full control of its storage pool, so if you plan to build a dedicated ZFS storage guest, you should pass through the SATA expansion card(s) to it, so that from its point of view the disks look exactly the same as when it sees them natively. Otherwise, corruption is possible if you hand the disks over as virtual disks.

...
VM LAG DURING FULL LOAD 
...
The config at that time was that the host sees all 4 physical Cores (I had Hyper Threading disabled at that moment) with 2 GB RAM, my main VM had 4 vCPUs pinned to each of the 4 physical Cores with 20 or 24 GB RAM, and the guest VM had just 2 vCPUs pinned to Cores 2 and 3. When the guest VM was doing renders or artificial loads like OCCT CPU test, there was an infernal lag on my main VM that made it totally unusable, mainly due to the Mouse issue. ...
I think you have to dedicate at least one core exclusively to Dom0 itself. So you could probably try Dom0 on 1 physical core, the main VM on 2 physical cores, and your family’s VM on 1 physical core. Though I imagine it would still hurt, since your family’s VM was running a CPU-intensive application.
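
On the Xen side that would look roughly like this (a sketch from memory, so double-check the option names against the Xen documentation for your version):

    # Xen hypervisor command line, e.g. in /etc/default/grub:
    GRUB_CMDLINE_XEN_DEFAULT="dom0_max_vcpus=1 dom0_vcpus_pin"

    # Main VM's xl config, keeping its vCPUs off core 0:
    vcpus = 2
    cpus = "1-2"

    # Family VM's xl config:
    vcpus = 1
    cpus = "3"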

USING QEMU WITH -NODEFAULTS 
...
Anyways, I failed to find details or working examples of hand-made trees, or what FX440 and Q35 are equivalent to, and how much they can be reduced if you remove unused emulated or legacy stuff, yet still have a complete system worth of using. If anyone has at hand links about the possibilites that -nodefaults brings, drop them. 
Hmm, I only just learned about this from your mail; I’m interested in finding out more about it too.

...
Congratulations, you reached the end of my mail. This pretty much sums up all my two years of experience, and all the things that I think that needs to be improved before a system centered around a Hypervisor or VMM can overtake native systems in ease of use. 
I’ve managed to read your complete email. Hooray for me I guess. :) 

If I copy-paste it all into a word processor, here are the statistics:

24 pages
11,364 words
53,076 characters (without spaces)
64,808 characters (with spaces)
219 paragraphs
1,059 lines


[1] https://bbs.archlinux.org/viewtopic.php?id=162768

[2] http://blog.lottspot.com/blog/2014/01/17/a-farewell-to-arch-linux-how-gentoo-conquered-my-desktop/

[3] http://www.amazon.com/Plugable-One-Button-Swapping-Between-Computers/dp/B006Z0Q2SI

[4] http://pastebin.com/fpfiQg46

[5] http://www.ebay.com/bhp/ibm-serveraid



Best regards,

-- 
Okky Hendriansyah