[Libguestfs] ZFS-on-NBD

Shaun McDowell shaunjmcdowell at gmail.com
Wed Sep 14 18:49:06 UTC 2022


I wanted to share a personal journey, since a partner and I went down a
similar route on a past project, using NBD + a userland loopback + S3 for
block devices.

Five years ago we forked nbdkit and nbd-client to create async, optimized
versions for private use as a userland loopback block device driver, and
then built support for various cloud object stores (including S3) as
backends.

It has been a few years since I've worked on it, but there were a number of
gotchas we had to overcome when running block devices in userspace, mounted
on the same system. The largest is that the userland process needs to be
extremely careful with memory allocations (ideally all memory is allocated
at startup of the userland driver and memlocked), because you can quite
easily deadlock the system if your userland block device driver needs to
allocate memory at any point during an operation, the kernel decides it
first needs to reclaim memory to satisfy the malloc, and it chooses to
clean out dirty filesystem buffer pages to do so. The kernel then re-enters
the filesystem and block device drivers to flush those dirty pages, and
that deadlocks in the kernel because locks are already held by the
in-flight operation.
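
To make that concrete, the startup discipline ends up looking something
like the sketch below (not our actual driver; the buffer count and size are
made up). Everything the I/O path will ever need is allocated and locked
before the device is attached, so servicing a request never calls malloc:

/* Sketch only: preallocate and lock all buffers at startup so the
 * request-handling path never triggers a malloc (and therefore never
 * triggers kernel memory reclaim) while servicing block I/O. */
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

#define NUM_BUFS  256
#define BUF_SIZE  (1 << 20)   /* 1 MiB per request buffer (illustrative) */

static char *request_bufs[NUM_BUFS];

int driver_startup(void)
{
    /* Lock current and future pages into RAM; no page faults later. */
    if (mlockall(MCL_CURRENT | MCL_FUTURE) == -1)
        return -1;

    for (int i = 0; i < NUM_BUFS; i++) {
        request_bufs[i] = malloc(BUF_SIZE);
        if (request_bufs[i] == NULL)
            return -1;
        /* Touch every page now so nothing is left to fault in later. */
        memset(request_bufs[i], 0, BUF_SIZE);
    }
    return 0;
}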

There are also a number of places where you might not expect the kernel to
allocate memory on your behalf: for example, your userland process makes a
network call to S3 and the kernel decides to allocate additional socket
buffer memory in the kernel, which can likewise drop into the same
reclaim-memory-by-flushing-dirty-fs-buffers path.
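
One of the mitigations (part of the socket buffer tuning I mention below)
is to pin the socket buffer sizes on the S3 connections up front so TCP
autotuning won't grow them mid-request. A rough sketch with illustrative
sizes; it reduces the surprise allocations but doesn't remove kernel-side
allocation entirely:

/* Sketch: fix the S3 connection's socket buffer sizes at setup time.
 * The 4 MiB values are illustrative, not a recommendation. */
#include <sys/socket.h>

int pin_socket_buffers(int s3_sock)
{
    int sndbuf = 4 * 1024 * 1024;
    int rcvbuf = 4 * 1024 * 1024;

    /* Setting SO_SNDBUF/SO_RCVBUF explicitly disables TCP autotuning
     * for this socket, so the kernel won't keep growing the buffers
     * later, under load. */
    if (setsockopt(s3_sock, SOL_SOCKET, SO_SNDBUF, &sndbuf, sizeof(sndbuf)) == -1)
        return -1;
    if (setsockopt(s3_sock, SOL_SOCKET, SO_RCVBUF, &rcvbuf, sizeof(rcvbuf)) == -1)
        return -1;
    return 0;
}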

Drivers in kernel space can avoid this issue by flagging their memory
allocation requests so that the kernel doesn't re-enter those subsystems to
reclaim memory pages. With userland drivers we didn't have a mechanism for
that at the time (I'm not sure whether one exists now), so it was critical
to avoid any memory allocations in the userland driver process.
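
For comparison, the in-kernel mechanism is the GFP allocation flags; a
generic illustration (not code from any particular driver):

/* Illustration of the in-kernel mechanism: GFP flags restrict what the
 * page allocator may do to satisfy the request. */
#include <linux/slab.h>
#include <linux/gfp.h>

void example_allocations(size_t len)
{
    /* May sleep, may write back dirty pages, may re-enter the fs and
     * block layers -- exactly what a storage driver must avoid while
     * servicing a request. */
    void *a = kmalloc(len, GFP_KERNEL);

    /* Reclaim is allowed, but the allocator must not start new I/O
     * (GFP_NOFS similarly forbids filesystem callbacks), so it cannot
     * recurse back into the driver that is asking for memory. */
    void *b = kmalloc(len, GFP_NOIO);

    kfree(a);
    kfree(b);
}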

That said, once we had worked through lots of these system-stress,
malloc-related issues that could wedge the system, by carefully
pre-allocating and memlocking everything and fine-tuning all of our socket
buffer settings, things worked pretty well honestly. The sequential write
speed of our NBD / userland / S3 block device driver was able to max out
AWS c5n instances with 25 gigabit networking (between 2 and 2.5 GB/s
read/write throughput). We also tested layering a number of other Linux
block device systems such as LVM and crypto (dm-crypt) on top of it and
everything worked well; we even had support for TRIM and snapshotting. With
a filesystem such as btrfs on top of the block device, you could create a
thin 4 exabyte disk backed by S3, format it, and mount it on your system in
under a minute.
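
For anyone curious what the thin side looks like from the NBD end, here is
a hypothetical nbdkit-style plugin skeleton (API v2; not our fork, and the
name and no-op bodies are placeholders). The plugin just reports an
enormous virtual size and a trim callback; backing objects only need to
exist where data has actually been written:

/* Hypothetical nbdkit plugin skeleton (not our actual fork) showing how
 * a thin, object-store-backed device looks from the NBD side. */
#define NBDKIT_API_VERSION 2
#include <nbdkit-plugin.h>
#include <stdint.h>
#include <string.h>

#define THREAD_MODEL NBDKIT_THREAD_MODEL_PARALLEL

static void *
thin_open (int readonly)
{
  static int handle;            /* no per-connection state in this sketch */
  return &handle;
}

static int64_t
thin_get_size (void *handle)
{
  return INT64_C(4) << 60;      /* 4 EiB of virtual, thin-provisioned space */
}

static int
thin_pread (void *handle, void *buf, uint32_t count, uint64_t offset,
            uint32_t flags)
{
  /* Unwritten regions read back as zeroes; a real plugin would GET the
   * corresponding object from S3 if one exists. */
  memset (buf, 0, count);
  return 0;
}

static int
thin_pwrite (void *handle, const void *buf, uint32_t count, uint64_t offset,
             uint32_t flags)
{
  /* Sketch only: a real plugin would PUT the affected object(s). */
  return 0;
}

static int
thin_trim (void *handle, uint32_t count, uint64_t offset, uint32_t flags)
{
  /* Sketch only: a real plugin would delete the backing object(s). */
  return 0;
}

static struct nbdkit_plugin plugin = {
  .name     = "thin-s3-sketch",
  .version  = "0.1",
  .open     = thin_open,
  .get_size = thin_get_size,
  .pread    = thin_pread,
  .pwrite   = thin_pwrite,
  .trim     = thin_trim,
};

NBDKIT_REGISTER_PLUGIN (plugin)

Put a filesystem like btrfs on top of that and you get the thin
multi-exabyte disk described above.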

A follow-up project that we prototyped but did not release was a GlusterFS
cluster built on top of our S3-backed disks with LVM. We were able to get
extremely scalable filesystem throughput. We tested 3, 6, 9, and 12 node
clusters in AWS and were able to achieve 10+ GB/s (yes, bytes not bits)
filesystem read and write throughput. We were also able to leverage
Gluster's dispersed (erasure-coded) volumes to increase durability and
availability in case of loss of nodes, and had point-in-time snapshot
capabilities through LVM and Gluster. The stack consisted of 3-node
groupings, where each node was a c5n.2xl instance with a device stack of
[our NBD userland S3-backed disk -> LVM -> ext4 -> GlusterFS] in a
distributed disperse 3, redundancy 1 configuration. Testing was done using
distributed fio with 10 client servers and 5-10 streams per client writing
to the GlusterFS volume.

If your primary use case is larger files (10+ MB) with parallel streams of
sequential reads and writes, object-storage-backed block devices can be
highly performant and extremely durable, but it was approximately a 2-3
year effort (2 developers) to get these products built and highly stable.

I think I would personally recommend most folks reconsider whether their
architecture could just leverage object storage directly rather than going
through the indirection of a filesystem -> S3 or a block device -> S3. If
not, I would strongly consider building direct plugins to S3 for userland
filesystem backends like GlusterFS or NFS-Ganesha, which avoid the kernel +
fs + blockdev layers entirely and with them most of the risky deadlock
scenarios. If that still doesn't work and you are going down the NBD ->
userland path, I would at the very least avoid mounting the NBD disk on the
same system where the NBD userland process is running, so that you don't
have malloc requests needing to flush dirty fs cache recursively back
through the NBD driver (unless this problem has since been solved).

Shaun

On Tue, Sep 13, 2022 at 10:54 AM Richard W.M. Jones <rjones at redhat.com>
wrote:

>
> As an aside, we'll soon be adding the feature to use nbdkit plugins as
> Linux ublk (userspace block) devices.  The API is nearly the same so
> there's just a bit of code needed to let nbdkit plugins be loaded by
> ubdsrv.  Watch this space.
>
> Of course it may not (probably will not) fix other problems you mentioned.
>
> Rich.
>
> --
> Richard Jones, Virtualization Group, Red Hat
> http://people.redhat.com/~rjones
> Read my programming and virtualization blog: http://rwmj.wordpress.com
> virt-top is 'top' for virtual machines.  Tiny program with many
> powerful monitoring features, net stats, disk stats, logging, etc.
> http://people.redhat.com/~rjones/virt-top
> _______________________________________________
> Libguestfs mailing list
> Libguestfs at redhat.com
> https://listman.redhat.com/mailman/listinfo/libguestfs
>
>