Libvirt NVME support

Thanos Makatos thanos.makatos at nutanix.com
Mon Nov 23 17:19:22 UTC 2020



> -----Original Message-----
> From: Peter Krempa <pkrempa at redhat.com>
> Sent: 23 November 2020 15:20
> To: Daniel P. Berrangé <berrange at redhat.com>
> Cc: Michal Prívozník <mprivozn at redhat.com>; Thanos Makatos
> <thanos.makatos at nutanix.com>; Suraj Kasi <suraj.kasi at nutanix.com>;
> libvirt-list at redhat.com; John Levon <john.levon at nutanix.com>
> Subject: Re: Libvirt NVME support
> 
> On Mon, Nov 23, 2020 at 15:01:31 +0000, Daniel Berrange wrote:
> > On Mon, Nov 23, 2020 at 03:36:42PM +0100, Peter Krempa wrote:
> > > On Mon, Nov 23, 2020 at 15:32:20 +0100, Michal Privoznik wrote:
> 
> [...]
> 
> > > No, the NVMe controller lives on PCIe. Here we are trying to emulate an
> > > NVMe controller (as a <controller>, if you look elsewhere in the other
> > > subthread). The <disk> element here maps to individual emulated
> > > namespaces for the emulated NVMe controller.
> > >
> > > If we'd try to map one <disk> per PCIe device, you'd prevent us from
> > > emulating multiple namespaces.
> >
> > The odd thing here is that we're trying to expose different host backing
> > store for each namespace, hence the need to expose multiple <disk>.
> >
> > Does it even make sense if you expose a namespace "2" without first
> > exposing a namespace "1" ?
> 
> [1]
> 
> >
> > It makes me a little uneasy, as it feels like trying to export a
> > regular disk where we have a different host backing store for each
> > partition. The difference, I guess, is that partition tables are a purely
> > software construct, whereas namespaces are a hardware construct.
> 
> For this purpose I viewed the namespace as akin to a LUN on a
> SCSI bus. For now, controllers usually have just one namespace
> and the storage is directly connected to it.
> 
> In the other subthread I specifically asked whether the NVMe standard
> has a notion of namespace hotplug. Since it does, it seems to be very
> similar to how we deal with SCSI disks.
> 
> Ad [1]: that can be a limitation here. I actually wonder whether you can
> have 0 namespaces. If that's possible, the model still holds. Obviously,
> if we can't have 0 namespaces, hotplug would be impossible.

It is possible to have a controller with no namespaces at all, or to have gaps
in the namespace IDs; there is no requirement to start from 1. In practice,
controllers start from 1 since that's the sensible thing to do, but we can end
up with arbitrary namespace IDs simply by adding and deleting namespaces.
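
As a purely hypothetical sketch of what that could look like with the
<controller>/<disk> split discussed in the other subthread (none of this is
accepted libvirt syntax; the 'nvme' controller type, the bus and the address
form are all assumptions), a controller could be defined with no namespaces
and a namespace hotplugged later at a non-contiguous ID:

  <!-- hypothetical: emulated NVMe controller with no namespaces yet -->
  <controller type='nvme' index='0'/>

  <!-- hypothetical: namespace attached later with ID 3; IDs 1 and 2
       need not exist -->
  <disk type='file' device='disk'>
    <driver name='qemu' type='qcow2'/>
    <source file='/var/lib/libvirt/images/ns3.qcow2'/>
    <target dev='nvme0n3' bus='nvme'/>
    <address type='drive' controller='0' bus='0' target='0' unit='3'/>
  </disk>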

> 
> > Exposing individual partitions of a disk was done in Xen, but most
> > people think it was kind of a mistake, as you could get a partition
> > without any containing disk. At least in this case we do have an
> > NVMe controller present, so the namespace isn't orphaned like the
> > old Xen partitions.
> 
> Well, the difference is that the nvme device node in linux actually
> consists of 3 separate parts:
> 
> /dev/nvme0n1p1:
> 
>   /dev/nvme0      - controller
>             n1    - namespace
>               p1  - partition
> 
> In this case we end up at the namespace component, so we don't really
> deal with partitions at all. It's actually more similar to SCSI,
> although the SCSI naming in Linux does not include the controller,
> which actually creates a mess.

Agreed, the partition exists solely within the host, so this isn't a problem.
Also, I think the analogy of SCSI controller == NVMe controller and
SCSI LUN == NVMe namespace is pretty accurate for all practical purposes.
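
For comparison, this is how the SCSI side of that analogy looks in existing
libvirt domain XML (the image path is just an example): the controller is
declared separately and each <disk> is addressed by LUN through the unit
attribute. An emulated-NVMe design would presumably mirror this, with unit
(or an equivalent attribute) carrying the namespace ID:

  <controller type='scsi' index='0' model='virtio-scsi'/>

  <disk type='file' device='disk'>
    <driver name='qemu' type='qcow2'/>
    <source file='/var/lib/libvirt/images/lun4.qcow2'/>
    <target dev='sdb' bus='scsi'/>
    <!-- unit selects the LUN on SCSI controller 0 -->
    <address type='drive' controller='0' bus='0' target='0' unit='4'/>
  </disk>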

> 
> > The alternative is to say only one host backing store, and then either
> > let the guest dynamically carve it up into namespaces, or have some
> > data format in the host backing store to represent the namespaces, or
> > have an XML element to specify the regions of host backing that
> > correspond to namespaces, eg
> >
> >   <disk type="file" device="nvme">
> >      <source file="/some/file.qcow"/>
> >      <target bus="nvme"/>
> >      <namespaces>
> >         <region offset="0" size="1024000"/>
> >         <region offset="1024000" size="2024000"/>
> >         <region offset="2024000" size="4024000"/>
> >      </namespaces>
> >      <address type="pci" .../>
> >   </disk>
> >
> > this is of course less flexible, and I'm not entirely serious about
> > suggesting this, but it's an option that exists nonetheless.
> 
> Eww. This is disgusting and borderline useless if you ever want to
> modify the backing image, but it certainly can be achieved with multiple
> 'raw' format drivers.

I agree that this is too limiting.
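
For reference, the "multiple 'raw' format drivers" approach Peter mentions can
already be expressed in newer libvirt with the <slices> element, so each
namespace could in principle be a separate <disk> exposing a different slice
of one shared image (the path below is made up, and the eventual NVMe
target/bus would still be the new part):

  <disk type='file' device='disk'>
    <driver name='qemu' type='raw'/>
    <source file='/var/lib/libvirt/images/shared-backing.raw'>
      <!-- expose only bytes [0, 1024000) of the shared file as this disk -->
      <slices>
        <slice type='storage' offset='0' size='1024000'/>
      </slices>
    </source>
    <target dev='vdb' bus='virtio'/>
  </disk>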

> 
> I don't think the NVMe standard mandates that the memory backing the
> namespace must be the same for all namespaces.

The NVMe spec says:

"A namespace is a quantity of non-volatile memory that may be formatted into
logical blocks." (v1.4)

So we can pretty much do whatever we want. Having a single NVMe controller
through which we can pass all disks to a VM can be useful because it simplifies
management and reduces resource consumption in both the guest and the host. But
we can definitely add as many controllers as we want, should we need to.
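
A hypothetical sketch of the multi-controller case, again using the
not-yet-existing 'nvme' controller type and bus discussed in this thread,
with the controller attribute of the drive address selecting which emulated
controller a namespace belongs to:

  <controller type='nvme' index='0'/>
  <controller type='nvme' index='1'/>

  <disk type='file' device='disk'>
    <driver name='qemu' type='qcow2'/>
    <source file='/var/lib/libvirt/images/ctrl0-ns1.qcow2'/>
    <target dev='nvme0n1' bus='nvme'/>
    <address type='drive' controller='0' bus='0' target='0' unit='1'/>
  </disk>

  <disk type='file' device='disk'>
    <driver name='qemu' type='qcow2'/>
    <source file='/var/lib/libvirt/images/ctrl1-ns1.qcow2'/>
    <target dev='nvme1n1' bus='nvme'/>
    <address type='drive' controller='1' bus='0' target='0' unit='1'/>
  </disk>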

> 
> For a less disgusting and more usable setup, the <namespaces> element can
> be a collection of <source> elements.
> 
> The above will also require use of virDomainUpdateDevice if you'd want
> to change the backing store in any way, since that's possible.
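
One possible reading of that suggestion, purely as an illustrative sketch
(this is not proposed or supported syntax), would keep Daniel's single <disk>
shape but replace the offset/size regions with one <source> per namespace:

  <disk type="file" device="nvme">
     <target bus="nvme"/>
     <namespaces>
        <source file="/some/ns1.qcow"/>
        <source file="/some/ns2.qcow"/>
     </namespaces>
     <address type="pci" .../>
  </disk>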




