[rhelv6-list] Home-brew SAN for virtualization

Bryan J Smith b.j.smith at ieee.org
Wed Feb 26 03:57:28 UTC 2014


Chris Adams <linux at cmadams.net> wrote:
> It has two controllers with two SAS ports each; it is set up where each
> storage server has a SAS connection to each MD3000 controller (multipath
> for redundancy).

So it's possible for the two servers to share the same storage, correct?

If so, you don't need Clustering.  The oVirt stack handles it for you.
LVs are enabled, disabled, snapshotted, etc... for VMs, and only
enabled on one host -- i.e., the host where the VM is running -- at a
time.  That's how it works, and it works very well -- as long, of
course, as you can directly attach every "compute" (HyperVisor) node
to the storage.
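
Under the hood, that per-host enable/disable is just plain LVM.  A
rough sketch of the kind of thing the oVirt agent (VDSM) does for you
(the VG/LV names below are made up):

  # on the host where the VM is about to run
  lvchange -ay vg_vmstore/lv_vm01
  # on every other host sharing that storage
  lvchange -an vg_vmstore/lv_vm01

You never run these by hand -- the point is just that activation is
per-host, so only one node touches a given VM's LV at a time.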

For those without hardware-based shared storage, Gluster is ideal as
well.  Gluster does handle file-based locking for true concurrency.
But in the case of QEMU-KVM leveraging libgfapi (the Gluster API) for
direct block access by the VM, all managed by the oVirt stack's POSIX
file system module, locking isn't even really needed.  And it's easy
to manage now that it's been integrated into oVirt's GUI and agents
(e.g., VDSM et al.).
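
Just to illustrate what "direct block access" looks like at the QEMU
level -- a sketch only, with made-up host/volume/image names, and in
practice oVirt/VDSM builds the command line for you:

  qemu-system-x86_64 \
    -drive file=gluster://gluster1/vmvol/vm01.img,if=virtio,cache=none \
    ...

The VM talks to the Gluster volume over libgfapi directly, with no
FUSE mount or NFS re-export in the data path.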

Fully replicated volumes offer the best local, read-heavy
performance, although distributed-replicated also works well,
depending on your distribution and usage.  It's much more reliable
than a lot of other software-based storage solutions, especially
those without something like the QEMU-KVM layer.  I've seen a lot of
people hack various things together just because they refuse to try
Gluster ... when it's designed _exactly_ for this scenario -- Red Hat
has had this use case in mind for years, with Gluster meant to solve
a lot of the software-based storage issues out there.
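
For reference, that replicated v. distributed-replicated difference
is just a matter of how you lay out the Bricks at volume-creation
time (host names and paths below are made up; the two commands are
alternatives):

  # pure Replicated: 2 bricks, 2 copies -- every node holds all data
  gluster volume create vmvol replica 2 \
      server1:/bricks/vmvol server2:/bricks/vmvol

  # Distributed-Replicated: 4 bricks, still 2 copies, but files are
  # hashed across the two replica pairs
  gluster volume create vmvol replica 2 \
      server1:/bricks/b1 server2:/bricks/b1 \
      server3:/bricks/b2 server4:/bricks/b2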

> While the long-term goal is oVirt (or maybe RHEV, but the budget is
> tight for this setup),

The oVirt stack is the upstream that contains the GUI manager for
both Red Hat Enterprise Virtualization Manager (RHEV-M) and Red Hat
Storage Console (RHS-C).  The oVirt agents on each node merely talk
to the stack.

So there is _nothing_ stopping you from doing this using Upstream code.

Although commercial RHEV has the nice option of pushing down
pre-built images for the "HyperVisor".  We'll see if a downstream
project does something similar in the future.  E.g., what comes of
the Red Hat-CentOS partnership will be interesting.  I.e., the CentOS
team had not been building most of the RHEL add-ons like RHEV, RHSS,
et al. in the past.

But one doesn't have to use pre-built images, and can manage the
components on any RHEL platform, with the oVirt agents configured for
the manager.  In fact, debugging can sometimes be better if you have a
full platform underneath.

> and possibly Gluster,

Software-defined storage is really for when you have no hardware
multi-targeting.  I'm biased towards Gluster, but in the oVirt stack,
it becomes a no-brainer with heavy Red Hat focus end-to-end --
management to platform to agent.

One _can_ run software-defined storage atop hardware
multi-targeting, but it's almost redundant in many ways.  You might as
well just use the hardware directly, especially with a solution like
the oVirt stack.

> unfortunately that isn't an option at the moment.

I'm just pointing it out.  When it comes to VM farms, I see people
building "houses of cards" with NFS exports, various iSCSI targets,
etc... all over the place, along with other software-defined storage
solutions, often without using their hardware effectively.  And it's
all unmanaged, when it doesn't need to be, especially with something
like oVirt out there.

The oVirt stack itself, beyond just the libvirt and other
foundations, was designed from inception to manage not just KVM, but
Xen and ESXi as well.  Unfortunately, that support became stagnant
without buy-in from the other vendors, for a variety of reasons.
I.e., an open source framework with a full GUI stack could be seen as
a "competitor" to existing, commercial frameworks and GUIs.  So oVirt
today is heavily KVM-only, but not by design.

> I have the stack of Xen servers with full hard drives

And that's exactly what I was pointing out, for "future consideration."  ;)

Servers full of hard drives are an _ideal_ application for directly
presented, QEMU-level block storage to VMs.  E.g., with libgfapi
(Gluster API) support on your "compute" (HyperVisor) nodes, the farm
can use _any_ Gluster volume _anywhere_ in your Global Namespace.

I.e., _all_ of your nodes with _any_ storage.  ;)
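
E.g., a native mount can point at _any_ peer in the Trusted Storage
Pool -- the client learns the full volume layout from whichever node
it contacts, whether or not that node holds a single Brick (names
below are made up):

  mount -t glusterfs anynode:/vmvol /mnt/vmvol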

> and no centralized management,

In this day'n age, a lot of sysadmins almost refuse to work without
one.  Beyond just libvirt and other foundations, oVirt was Red Hat's
project to provide a true, cross-vendor, open source framework and GUI
for KVM, Xen and ESXi.  It's succeeded brilliantly, and is even used
for Gluster management now, even if the other vendors -- at least the
HyperVisor ones -- have not supported it.

But that aside, you do _not_ need to use oVirt to get the underlying
platform "capability."  The oVirt agents for HyperVisor and Storage
(Gluster) management are separate from the "capability."

> two storage servers,

While "storage" nodes can certainly be segmented from "compute" nodes,
and that's even the requirement in the commercial Gluster flavor right
now (Red Hat Storage Server 2.x -- it's more of a channel-distribution
difference than anything), there's nothing stopping anyone in the
upstream from running both together on the same nodes.

It's one thing if you can present direct storage to every HyperVisor.
In that solution, you don't have to re-export anything.  Every single
node has direct access to the shared storage blocks.  It's an ideal
situation for a VM farm.  Take advantage of that direct, hardware
path.

But once you find yourself in the segmented "storage" v. "compute"
solution, software-defined storage starts looking better and better.
Your VM instances write every single block to every Replica Brick in
the Volume across your entire Storage Pool.  Given that VMs are
usually heavy on reads and lighter on writes -- otherwise I/O quickly
becomes an issue for the farm in general, regardless of solution --
it's an ideal fit.

Again ...

Any time you need any more "compute" _or_ "storage," you just "add
another node" and get _both_.  ;)

> and the SAS shelf, and I need more disk ASAP (might have a couple of
> weeks to get it running but that's about it).

No reason you couldn't do this tomorrow with Gluster.  It's designed
to solve your problem, almost specifically.  You're eventually going
to run out of shelf space, and be looking at a lot of costly
procurement -- beyond the JBOD already sitting in your "compute"
(HyperVisor) nodes.

Once you go Gluster in applications like this, you never want to go
back to NFS or other solutions.  Even NFS, iSCSI, etc... exported
"storage" nodes lose their benefits at some point.

Now I've never tried Xen with directly presented, QEMU-level block
storage to VMs via the libgfapi support.  There may be other details
at work, although ultimately, it is QEMU in control of the block
storage.  I'd Google to see if anyone else is doing it.

> Now, getting something up and running to handle today's needs, with a
> path to oVirt/Gluster later, would be ideal.  I haven't actually used
> Gluster myself though (just read some about it in the past).

Whether it's Gluster or something else, if "cost" is a consideration,
you _will_ be looking at software-defined storage at some point.

In the 100% open source VM farm world ... there's really only one
end-to-end option that is heavily developed and stabilized to Red
Hat's standards.  And if you're using a Red Hat, or downstream,
distribution, that basically means "designed for it with 5-10+ years
in mind" v. "have to hack other things in, and it might not be
sustainable."  ;)

So that brings us back to RHEL KVM + Gluster libgfapi, whether you use
oVirt management, agents, etc... or not with it.

> In this kind of setup, what would the storage look like locally (on the
> directly-attached servers)?

Doesn't matter.  You can use whatever you want.  You can carve up
hard drives across any nodes into Bricks and assign them to
Replicated or Distributed-Replicated Volumes in the same Trusted
Storage Pool.  That's the power of Gluster's Global Namespace.  It
doesn't matter which system a Brick actually lives on -- let alone
whether a system has any local storage at all (it could boot off a
USB key) -- every node in the Trusted Storage Pool sees _all_ of the
volumes, transparently.
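
Building that Trusted Storage Pool is a one-time step from any single
node (host names made up):

  gluster peer probe server2
  gluster peer probe server3
  gluster peer status    # confirm all peers show "Connected"

From there, any Brick on any probed node can go into any Volume.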

Now if you do more careful planning, you can ensure select oVirt
Compute Clusters have nodes with local Bricks that are in a
Replicated Volume across all of those nodes.  I.e., when I/O
performance is critical, make a Compute Cluster of nodes with local
Bricks where every Brick is a Replica Brick in a Replicated Volume
(or every node holds both a Distribute and a Replica Brick in a
Distributed-Replicated Volume), so every node has a copy of every VM.
But that's only when performance is critical, and it will allow you
to best a lot of iSCSI solutions in performance (let alone NFS).
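
A sketch of that "performance" layout -- three HyperVisor nodes, each
with a local Brick, and a replica count equal to the node count so
every node holds a full local copy (names made up):

  gluster volume create vmfast replica 3 \
      hv1:/bricks/fast hv2:/bricks/fast hv3:/bricks/fast
  gluster volume start vmfast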

Remember, you could have more than one Compute Cluster -- one of
"adhoc" nodes and one for "performance."  The VMs that need to be high
performing are on the latter.  Various, lower priority VMs can be on
the former and take advantage of every piece of hardware -- both
compute and storage -- you have, with local storage of varying sizes,
or not at all.

> Would I be able to create a regular LVM setup,

Yes.  Absolutely.  You can carve your storage as you see fit.

> with an FS (ext4/xfs) to share out some space via NFS today,
> and then later use the rest of the space for Gluster?

Yes.  Absolutely.  You can carve your storage as you see fit.

You just format one or more LVs with XFS (512B inode size highly
recommended), and each one is a Brick ready to be used in a new or
existing Volume.  Once the Volume is created and started (if it
didn't already exist), the storage is available to use.  If you
expanded an existing Volume instead, a rebalance is highly
recommended when Distribute bricks are involved -- i.e., Distributed
or Distributed-Replicated Volumes, but not pure Replicated Volumes.
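
Concretely, turning spare LVM space into a Brick looks roughly like
this (VG/LV names, sizes and paths are made up):

  lvcreate -L 500G -n brick1 vg_data
  mkfs.xfs -i size=512 /dev/vg_data/brick1   # 512B inodes leave room
                                             # for Gluster's xattrs
  mkdir -p /bricks/brick1
  mount /dev/vg_data/brick1 /bricks/brick1
  # then reference /bricks/brick1 in "gluster volume create" or
  # "gluster volume add-brick"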

Once you start using Gluster, it will really surprise you.  It is
designed for direct access: no meta-data server, no downtime when you
expand, and so on.  The caveat to Gluster -- not a negative, but a
consequence of that design (always up, always fast, always direct, so
there is no meta-data server bottleneck and no downtime or "pseudo
up, but really degraded" modes) -- is that it places files with a
hash algorithm.

E.g., if you expand a Distributed or Distributed-Replicated Volume,
after adding another Distribute Brick (for more net storage), you need
to Rebalance the Bricks.
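
E.g., something like (volume/host names made up):

  # bricks must be added in multiples of the replica count
  gluster volume add-brick vmvol \
      server3:/bricks/vmvol server4:/bricks/vmvol
  gluster volume rebalance vmvol start
  gluster volume rebalance vmvol status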

GlusterFS is its own file system.  Its file-based storage is its
advantage, especially when it comes to locking coherency --
multi-protocol and multi-node.  That's something not easily done even
with Red Hat Cluster Suite and hardware-based shared storage.

Although if you are using a GlusterFS volume as QEMU-libgfapi block
storage for a VM farm, I highly recommend you do not access that
Volume from other nodes with other mounts (native, NFS, other API
calls, etc...).  And this is really a case where you want to be using
the oVirt stack, its agents (e.g., VDSM), etc... to manage that
volume.
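
If you do go that route, a couple of volume options are worth a look
-- Gluster ships a "virt" option group for exactly this VM-image use
case, and oVirt/VDSM expects the volume to be owned by its own
uid/gid (36:36 on RHEL-based installs, if memory serves):

  gluster volume set vmvol group virt
  gluster volume set vmvol storage.owner-uid 36
  gluster volume set vmvol storage.owner-gid 36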

-- bjs

P.S.  It's a bit dated, but still covers most of the specifics of the
newer capability, and its resulting performance.
- http://www.gluster.org/community/documentation/images/9/9d/QEMU_GlusterFS.pdf


--
Bryan J Smith - UCF '97 Engr - http://www.linkedin.com/in/bjsmith
-----------------------------------------------------------------
"In a way, Bortles is the personification of the UCF football
program.  Each has many of the elements that everyone claims to
want, and yet they are nobody's first choice.  Coming out of high
school, Bortles had the size and the arm to play at a more
prestigious program.  UCF likewise has the market size and the
talent base to play in a more prestigious conference than the
American Athletic.  But timing and circumstances conspired to put
both where they are now." -- Andy Staples, CNN-Sports Illustrated



