Libvirt NVME support

Mon Nov 23 15:19:51 UTC 2020

On Mon, Nov 23, 2020 at 15:01:31 +0000, Daniel Berrange wrote:
> On Mon, Nov 23, 2020 at 03:36:42PM +0100, Peter Krempa wrote:
> > On Mon, Nov 23, 2020 at 15:32:20 +0100, Michal Privoznik wrote:

[...]

> > No, the NVMe controller lives on PCIe. Here we are trying to emulate an
> > NVMe controller (as <controller>, if you look elsewhere in the other
> > subthread). The <disk> element here maps to individual emulated
> > namespaces for the emulated NVMe controller.
> > 
> > If we tried to map one <disk> per PCIe device, that would prevent us from
> > emulating multiple namespaces.
> 
> The odd thing here is that we're trying to expose a different host backing
> store for each namespace, hence the need to expose multiple <disk>.
> 
> Does it even make sense if you expose a namespace "2" without first
> exposing a namespace "1" ?

[1]

> 
> It makes me a little uneasy, as it feels like trying to export a
> regular disk, where we have a different host backing store for each
> partition. The difference I guess is that partition tables are a purely
> software construct, whereas namespaces are a hardware construct.

For this purpose I viewed the namespace to be akin to a LUN on a
SCSI bus. For now controllers usually have just one namespace
and the storage is directly connected to it.

In the other subthread I've specifically asked whether the NVMe standard
has a notion of namespace hotplug. Since it does, it seems to be very
similar to how we deal with SCSI disks.

Ad [1]. That can be a limitation here. I wonder actually if you can have
0 namespaces. If that's possible, then the model still holds. Obviously,
if we can't have 0 namespaces, hotplug would be impossible.

> Exposing individual partitions to a disk was done in Xen, but most
> people think it was kind of a mistake, as you could get a partition
> without any containing disk. At least in this case we do have a
> NVME controller present so the namespace isn't orphaned, like the
> old Xen partitions.

Well, the difference is that the NVMe device node in Linux actually
consists of 3 separate parts:

/dev/nvme0n1p1:

/dev/nvme0       - controller
          n1     - namespace
            p1   - partition

In this case we end up at the namespace component, so we don't really
deal with partitions in any way. It's actually more similar to SCSI,
although the SCSI naming in Linux does not include the controller at all,
which actually creates a mess.
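
To illustrate the three-part naming above, here is a minimal Python sketch
that splits such a node name into its components. The names NVME_NODE and
split_nvme_node are made up for this example, and it assumes only the
simple nvme<C>n<N>p<P> form shown above; other naming variants are not
handled.

```python
import re

# Hypothetical pattern for the simple naming shown above:
#   nvme<controller>[n<namespace>[p<partition>]]
# An optional "/dev/" prefix is accepted.
NVME_NODE = re.compile(
    r"^(?:/dev/)?nvme(?P<ctrl>\d+)(?:n(?P<ns>\d+)(?:p(?P<part>\d+))?)?$"
)


def split_nvme_node(path):
    """Split an NVMe node name into controller, namespace and partition.

    Components that are absent from the name are returned as None.
    """
    m = NVME_NODE.match(path)
    if m is None:
        raise ValueError("not a recognised NVMe node name: %r" % path)
    return {k: (int(v) if v is not None else None)
            for k, v in m.groupdict().items()}


if __name__ == "__main__":
    for name in ("/dev/nvme0", "/dev/nvme0n1", "/dev/nvme0n1p1"):
        print(name, split_nvme_node(name))
```

A <disk> in the model discussed here would correspond to the "ns" level,
with "part" always left to the guest.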

> The alternative is to say only one host backing store, and then either
> let the guest dynamically carve it up into namespaces, or have some
> data format in the host backing store to represent the namespaces, or
> have an XML element to specify the regions of host backing that
> correspond to namespaces, eg
> 
>   <disk type="file" device="nvme">
>      <source file="/some/file.qcow"/>
>      <target bus="nvme"/>
>      <namespaces>
>         <region offset="0" size="1024000"/>
>         <region offset="1024000" size="2024000"/>
>         <region offset="2024000" size="4024000"/>
>      </namespaces>
>      <address type="pci" .../>
>   </disk>
> 
> this is of course less flexible, and I'm not entirely serious about
> suggesting this, but it's an option that exists nonetheless.

Eww. This is disgusting and borderline useless if you ever want to
modify the backing image, but it certainly can be achieved with multiple
'raw' format drivers.
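
Any layout carved out of a single image this way would need its regions
to be disjoint. As a rough Python sketch (the helper name
overlapping_regions is invented here, and each region is assumed to cover
the byte range [offset, offset + size)), a sanity check could look like
this:

```python
def overlapping_regions(regions):
    """Return index pairs (i, j) of regions whose byte ranges overlap.

    Each region is an (offset, size) tuple covering [offset, offset + size).
    This interpretation of the attributes is an assumption of the sketch.
    """
    # Visit regions in order of increasing offset.
    order = sorted(range(len(regions)), key=lambda i: regions[i][0])
    clashes = []
    for pos, i in enumerate(order):
        end_i = regions[i][0] + regions[i][1]
        for j in order[pos + 1:]:
            # Later regions start even further right, so stop early.
            if regions[j][0] >= end_i:
                break
            clashes.append((i, j))
    return clashes


if __name__ == "__main__":
    # Values taken literally from the quoted XML example.
    quoted = [(0, 1024000), (1024000, 2024000), (2024000, 4024000)]
    print(overlapping_regions(quoted))
```

Read literally as offset plus size, the second and third regions of the
quoted example overlap, which is one more way such a hand-maintained
layout can go wrong.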

I don't think the NVMe standard mandates that the memory backing the
namespace must be the same for all namespaces.

For a less disgusting and more usable setup, the namespace element can
be a collection of <source> elements.

The above will also require the use of virDomainUpdateDevice if you want
to change the backing store in any way, since changing it is possible.