[Libguestfs] nbdkit / exposing disk images in containers

Sun Jul 12 20:16:01 UTC 2020

On Sat, Jul 11, 2020 at 11:18 AM Richard W.M. Jones <rjones at redhat.com> wrote:
>
> KubeVirt is a custom resource (a kind of plugin) for Kubernetes which
> adds support for running virtual machines.  As part of this they have
> the same problems as everyone else of how to import large disk images
> into the system for pets, templates, etc.
>
> As part of the project they've defined a format for embedding a disk
> image into a container (unclear why?  perhaps so these can be
> distributed using the existing container registry systems?):
>
>   https://github.com/kubevirt/containerized-data-importer/blob/master/doc/image-from-registry.md
>
> An example of such a disk-in-a-container is here:
>
>   https://hub.docker.com/r/kubevirt/fedora-cloud-container-disk-demo
>
> We've been asked if we can help with tools to efficiently import these
> disk images, and I have suggested a few things with nbdkit and have
> written a couple of filters (tar, gzip) to support this.

I don't think the gzip filter fits nbdkit very well. Having to decompress the
entire disk before you can serve it does not sound right.

> This email is my thoughts on further development work in this area.
>
> ----------------------------------------------------------------------
>
> (1) Accessing the disk image directly from the Docker Hub.
>
> When you get down to it, what this actually is:
>
>   * There is a disk image in qcow2 format.
>
>   * It is embedded as "./disk/downloaded" in a gzip-compressed tar
>     file.  (This is a container with a single layer).
>
>   * This tarball is uploaded to (in this case) the Docker Hub and can
>     be accessed over a URL.  The URL can be constructed using a few
>     json requests.
>
>   * The URL is served by nginx and this supports HTTP range requests.
>
> I encapsulated all of this in the attached script.  This is an
> existence proof that it is possible to access the image with nbdkit.
>
> One problem is that the auth token only lasts for a limited time
> (seems to be 5 minutes in my test), and it doesn't automatically renew
> as you download the layer, so if the download takes longer than 5
> minutes you'll suddenly get unrecoverable authorization failures.
>
> There seem to be two possible ways to solve this:
>
>   (a) Write a new nbdkit-container-plugin which does the authorization
>       (essentially hiding most of the details in the attached script
>       from the user).  It could deal with renewing the key as
>       required.
>
>   (b) Modify nbdkit-curl-plugin so the user could provide a script for
>       renewing authorization.  This would expose the full gory details
>       to the end user, but on the other hand might be useful in other
>       situations that require authorization.

docker/podman already solved this, why should nbdkit solve it again?

Do you get timeouts while you download the image with a single request?

> (2) nbdkit-tar-filter exportname and listing files.
>
> This has already been covered by an email from Nir Soffer, so I'll
> simply link to that:
>
> https://lists.gnu.org/archive/html/qemu-discuss/2020-06/msg00058.html
>
> It basically requires a fairly simple change to nbdkit-tar-filter to
> map the tar filenames into export names, and a deeper change to nbdkit
> core server to allow listing all export names.  The end result would
> be that an NBD client could query the list of files [ie exports] in
> the tarball and choose one to download.

We know the tar member name upfront, so why do we need to list the contents?

> (3) gzip & tar require full downloads - why not “docker/podman save/export”?

This looks like a better direction.

The nice thing about embedding the disk in the container image is being able
to use existing infrastructure (docker, quay) to host the images and to
transfer them to the hosts. We don't need to write any code for this.

Even better, we have automatic caching on the host by docker/podman, so we
have to pull the image from the registry only once on every host. Then we can
access the local cache.

> Stepping back to get the bigger picture: Because the OCI standard uses
> gzip for compression (https://stackoverflow.com/a/9213826), and
> because the tar index is interspersed with the tar data, you always
> need to download the whole container layer before you can access the
> disk image inside.

You need to download most of the tar, but you don't need to keep the tar
in a temporary file. For example, in Python you can open a tarfile over the
HTTP response object in streaming mode with transparent decompression
("r|*"), and stream the disk contents from the tar without a temporary file.

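    import shutil
    import sys
    import tarfile

    # response: file-like HTTP response for the layer blob (read() only).
    with tarfile.open(mode="r|*", fileobj=response) as tar:
        for member in tar:
            if member.name == "./disk/downloaded":
                with tar.extractfile(member) as f:
                    shutil.copyfileobj(f, sys.stdout.buffer)
                    sys.exit(0)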

I think this is what cdi import code does, and is the most efficient way
to copy the disk directly from the registry with the current format.
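
To check that streaming mode really works without seeking, here is a minimal,
self-contained sketch. The names copy_member_from_stream, ReadOnlyStream and
make_demo_layer are hypothetical helpers for illustration, not the CDI code;
in practice fileobj would be the HTTP response for the layer blob, and the
demo layer here is a tiny in-memory stand-in.

```python
import io
import shutil
import tarfile


def copy_member_from_stream(fileobj, member_name, out):
    """Copy one tar member from a non-seekable stream to `out`.

    Returns True if the member was found, False otherwise.
    Hypothetical helper; `fileobj` only needs a read() method.
    """
    # "r|*" = streaming mode with transparent decompression, no seeking.
    with tarfile.open(mode="r|*", fileobj=fileobj) as tar:
        for member in tar:
            if member.name == member_name:
                # In streaming mode the member has to be read before
                # moving on to the next one.
                with tar.extractfile(member) as f:
                    shutil.copyfileobj(f, out)
                return True
    return False


class ReadOnlyStream:
    """Wrap bytes in an object without seek()/tell(), like a socket."""

    def __init__(self, data):
        self._buf = io.BytesIO(data)

    def read(self, size=-1):
        return self._buf.read(size)

    def close(self):
        pass


def make_demo_layer(payload):
    """Build a small gzip-compressed tar holding ./disk/downloaded."""
    raw = io.BytesIO()
    with tarfile.open(mode="w:gz", fileobj=raw) as tar:
        info = tarfile.TarInfo("./disk/downloaded")
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    return raw.getvalue()


if __name__ == "__main__":
    layer = make_demo_layer(b"fake qcow2 bytes")
    out = io.BytesIO()
    found = copy_member_from_stream(
        ReadOnlyStream(layer), "./disk/downloaded", out)
    print(found, len(out.getvalue()))
```

Passing the output as a file object means the same helper can write to stdout,
a pipe, or a block device, depending on what the caller opens.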

> Currently nbdkit-gzip-filter hides this from the
> end user, but it's still downloading the whole thing to a temporary
> file.  There's no way round that unless OCI can be persuaded to use a
> better format.

The way is to use the container image downloaded by podman/docker.

> But docker/podman already has a way to export container layers,
> ie. the save and export commands.  These also have the advantage that
> it will cache the downloaded layers between runs.  So why aren't we
> using that?
>
> In this world, nbdkit-container-plugin would simply use docker/podman
> save (or export?) to grab the container as a tar file, and we would
> use the tar filter as above to expose the contents as an NBD endpoint
> for further consumption.  IOW:
>
>   nbdkit container docker.io/kubevirt/fedora-cloud-container-disk-demo \
>          --filter=tar tar-entry=./downloaded/disk

This will work but there are 2 issues:

1. podman save/export copies the tar locally. This is pretty fast for the
example image, but copying the tar and then deleting it seems wasteful.

2. If we have the tar locally, why not use qemu-img directly? We can find the
offset of the disk inside the tar and use:

$ time podman save --format oci-dir -o demo-oci \
    docker.io/kubevirt/fedora-cloud-container-disk-demo

real 0m2.795s
user 0m2.011s
sys 0m0.878s

$ time qemu-img convert -O raw \
    'json:{"file": {"driver": "raw", "offset": 1536, "file": {"driver": "file", "filename": "demo-oci/8162f3eda33d5a87df56e969dcd9777523bd53278a0701b2e53b93c33c01853e"}}}' \
    out.raw

real 0m1.036s
user 0m3.237s
sys 0m1.326s
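
One way to find the offset is with Python's tarfile module. This is only a
sketch: find_member_offset and qemu_json_filename are hypothetical helper
names, and it assumes the layer file is an uncompressed tar, since an offset
into a compressed tar would be of no use to qemu-img.

```python
import json
import tarfile


def find_member_offset(tar_path, member_name):
    """Return (offset, size) in bytes of a member's data in a plain tar.

    Assumes the tar is uncompressed ("r:" refuses compressed input).
    Raises KeyError if the member does not exist.
    """
    with tarfile.open(tar_path, mode="r:") as tar:
        member = tar.getmember(member_name)
        # offset_data is where the file contents start, just past the
        # member's header block(s).
        return member.offset_data, member.size


def qemu_json_filename(tar_path, offset):
    """Build a json: pseudo-filename like the one passed to qemu-img."""
    spec = {
        "file": {
            "driver": "raw",
            "offset": offset,
            "file": {"driver": "file", "filename": tar_path},
        }
    }
    return "json:" + json.dumps(spec)
```

The resulting string can be passed to qemu-img convert as the source
filename, in place of the hand-written json: argument.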

But I think we have a better way: a self-extracting disk container. Start a
container that holds the disk image, and run qemu-img inside this container
to convert the disk to the target PV.

It can work like this:

1. We create a base image - this will be used for all disk container images.

$ cat Dockerfile.kubevirt-img
FROM alpine
RUN apk add qemu-img

$ podman build -t kubevirt-img -f Dockerfile.kubevirt-img .
...

You can pull this from quay.io/nirsof/kubevirt-img.

2. Create a disk container image based on the base image.

$ cat Dockerfile.kubevirt-fedora-cloud-disk
FROM quay.io/nirsof/kubevirt-img
COPY disk.qcow2 /disk.qcow2
CMD ["qemu-img", "convert", "-p", "-f", "qcow2", "-O", "raw", "/disk.qcow2", "/target/disk.img"]

$ podman build -t kubevirt-fedora-cloud-disk \
    -f Dockerfile.kubevirt-fedora-cloud-disk .
...

This container is a little larger, but the common layer with qemu-img and
its dependencies is shared between all disk container images. In this
example it adds only 25 MiB.

You can pull this from quay.io/nirsof/kubevirt-fedora-cloud-disk.

With this we can create a copy of the disk using:

$ time podman run --volume ./:/target:Z --rm -it \
    quay.io/nirsof/kubevirt-fedora-cloud-disk
Trying to pull quay.io/nirsof/kubevirt-fedora-cloud-disk...
Getting image source signatures
Copying blob 0d9094d70e9c skipped: already exists
Copying blob a3ed95caeb02 done
Copying blob a3ed95caeb02 done
Copying blob 18717781bd09 done
Copying blob fe5cd0d8bf32 done
Writing manifest to image destination
Storing signatures
    (100.00/100%)

real 0m59.800s
user 0m8.988s
sys 0m7.437s

$ ls -lhs disk.img
728M -rw-r--r--. 1 nsoffer nsoffer 4.0G Jul 12 21:45 disk.img

$ podman images | grep fedora-cloud
quay.io/nirsof/kubevirt-fedora-cloud-disk             latest   097ef06b6d71   About an hour ago   326 MB
docker.io/kubevirt/fedora-cloud-container-disk-demo   latest   6494830c6dc7   50 years ago        303 MB

The next time we run this we get the container from the cache:

$ time podman run --volume ./:/target:Z --rm -it \
    quay.io/nirsof/kubevirt-fedora-cloud-disk
    (100.00/100%)

real 0m2.244s
user 0m0.070s
sys 0m0.253s

Nir