[rhos-list] clustering Cinder with Gluster

Perry Myers pmyers at redhat.com
Fri Feb 15 11:21:32 UTC 2013


On 02/14/2013 08:19 PM, Paul Robert Marino wrote:
> Hello
> 
> I've been thinking of ways to make cinder redundant without shared
> storage, and I think I have two possible quick answers using Gluster,
> but before I go down this rabbit hole in my test environment I wanted
> to see if anyone has tried this before or if anyone could point out
> any obvious problems.
> Now I know that support for exporting iSCSI block devices natively is
> on the Gluster roadmap, but it doesn't look like it will happen soon.
> Here is what I'm thinking.

Using iSCSI from Gluster or even falling back to NFS in Gluster isn't
strictly necessary.

The right thing to do is to have a native gluster driver for Cinder,
which our engineers are busy working on.  I've cc'd Eric Harney from my
team who has been pushing forward on that effort with help from the Red
Hat Storage (Gluster) team.

We're tracking inclusion of this Cinder Gluster driver here for RHOS 2.1:
https://bugzilla.redhat.com/show_bug.cgi?id=892686

It's not guaranteed it'll get into 2.1, because it still needs to land
upstream by G-3 (the Grizzly-3 milestone) in order for that to happen.
But that's the target.

In the future, this Cinder Gluster Driver can be updated to utilize the
qemu native support for Gluster, which should make things perform
better.  But that functionality is not yet in RHEL 6, so we wouldn't be
able to use it quite yet.  Hopefully by RHEL 6.5 it will be there.
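
For anyone who wants to experiment once it lands, I'd expect the setup
to look roughly like the other share-based Cinder drivers.  A minimal
sketch, with the caveat that the option names and paths below are
assumptions that may change before the driver merges:

    # cinder.conf (sketch only; option names may differ in the final driver)
    volume_driver = cinder.volume.drivers.glusterfs.GlusterfsDriver
    glusterfs_shares_config = /etc/cinder/glusterfs_shares

    # /etc/cinder/glusterfs_shares -- one Gluster volume per line
    # (hostname and volume name are placeholders)
    gluster1:/cinder-volumes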

> Scenario 1
> Similar to the examples in the guide, I'm thinking of creating a disk
> image with the truncate command.
> The big difference is that I'm planning to create it on a Gluster share,
> create a clustered LVM volume on it, and manage it with the HA add-on.
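
If I'm reading that right, the mechanics would be roughly the following
(the paths, sizes, and volume names are just placeholders I picked for
illustration):

    # on a node with the Gluster share mounted (placeholder path/size)
    truncate -s 500G /mnt/gluster/cinder-backing.img

    # attach the image to the first free loop device
    LOOPDEV=$(losetup -f --show /mnt/gluster/cinder-backing.img)

    # turn it into a clustered VG managed by clvmd / the HA add-on
    pvcreate "$LOOPDEV"
    vgcreate --clustered y cinder-volumes "$LOOPDEV"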

What does clvm provide here?

Ideally you'd use the Cinder driver mentioned above, but in the absence
of that, if you had a Gluster storage environment set up and had that
glusterfs volume mounted on every compute node, I don't see why clvm
would be necessary; in fact, clvm would add a lot of architectural
complexity here.

The redundancy would come from the Gluster cluster itself (i.e. once you
point your compute node at one of the gluster bricks to mount the fs, my
understanding is that if a single brick fails, as long as the data you
need is replicated on another brick, things will just fail over without
you needing additional HA software like RHEL HA/CLVM).

I've cc'd a Gluster expert (Vijay) who can correct me if I have that
horribly wrong :)
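
In practical terms, I'd expect the client side to be no more than
something like this on each compute node (the hostnames, target path,
and backup-volfile option are illustrative, and the option name may
vary with the Gluster client version):

    # mount the Gluster volume with the native client; gluster1/gluster2
    # and the target path are placeholders
    mount -t glusterfs -o backupvolfile-server=gluster2 \
        gluster1:/cinder-volumes /var/lib/cinder/volumes

    # gluster1 is only contacted to fetch the volfile; after that the
    # client talks to all bricks directly, so I/O should survive the
    # loss of a replicated brick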

> It should be fairly simple for me to create an init script to create
> and remove loop devices via losetup.
> In this scenario the thing that concerns me is the possibility of a
> system getting fenced on boot before the Gluster volume is ready.
> 
> Scenario 2
> This one is a little simpler since I'm very familiar with keepalived:
> I could create a VRRP instance with a floating VIP.
> When a node becomes primary, it could run a script to start the
> loop device and then start the Cinder service.
> On fault, or if the node becomes backup, I could have it ensure Cinder
> has been stopped and then remove the loop device.
> There are two things that worry me with this scenario:
> 1) Since keepalived doesn't understand the concept of a quorum, if the
> nodes went into split-brain mode it could cause a significant problem.
> I can mitigate this risk by connecting the Cinder nodes with a pair of
> dedicated crossover cables (preferably run via separate cable trays),
> but that can never absolutely eliminate the possibility.
> I can also add a secondary check script that does a file-based
> heartbeat, but that would be a little more complicated and wouldn't
> help if Gluster were split-brained as well.
> 2) When a fault happens in keepalived, there is a lag before the backup
> notices and takes over, based on the heartbeat interval (approximately
> interval x 3), so there will be a delay of 3 seconds or more before the
> second node attempts to take over. There are several patches for
> sub-second intervals, some of which I'm familiar with (I wrote one of
> them :-) ), but they add their own issue: they can make the system
> react too fast and may not allow sufficient time for the failed node to
> cleanly detach from the volume.
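
(For anyone following along, a minimal sketch of the keepalived side of
that scenario might look like the config below; the interface, VIP,
router id, and notify script paths are all placeholders I made up, not
anything from Paul's actual setup.)

    vrrp_instance CINDER {
        state BACKUP
        interface eth1              # dedicated crossover link
        virtual_router_id 51
        priority 100
        advert_int 1                # roughly 3s failover, as noted above
        virtual_ipaddress {
            192.168.100.10/24
        }
        # placeholder scripts: attach the loop device and start
        # cinder-volume on MASTER; stop and detach on BACKUP/FAULT
        notify_master /usr/local/bin/cinder-takeover.sh
        notify_backup /usr/local/bin/cinder-release.sh
        notify_fault  /usr/local/bin/cinder-release.sh
    }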
> 
> 
> Scenario 2 is the easiest to implement, and despite the concerns it's
> the one I think is the safest, mostly because I don't like to fence
> nodes just because a single process or volume has an issue. My personal
> experience with fencing is that it usually causes more problems than it
> solves, although admittedly my opinion of fencing has been tainted by
> an Oracle stretch cluster I used to support, which liked to fence nodes
> any time someone halfway around the world sneezed.
> 
> So, does anyone have any opinions or comments?

I think that, given how Gluster works and provides redundancy through
multiple storage bricks and replication of the data, the above doesn't
seem necessary and would add a lot of overhead/complication to the
configuration.

But I'll let the Cinder/Gluster folks on this thread weigh in and let me
know if that's not correct.

Perry
