[rhos-list] clustering Cinder with Gluster

Paul Robert Marino prmarino1 at gmail.com
Fri Feb 15 18:00:14 UTC 2013


On Fri, Feb 15, 2013 at 6:21 AM, Perry Myers <pmyers at redhat.com> wrote:
> On 02/14/2013 08:19 PM, Paul Robert Marino wrote:
>> Hello
>>
>> I've been thinking of ways to make Cinder redundant without shared
>> storage, and I think I have two possible quick answers using Gluster,
>> but before I go down this rabbit hole in my test environment I wanted
>> to see if anyone has tried this before or if anyone could point out
>> any obvious problems.
>> Now, I know that support for exporting iSCSI block devices natively is
>> on the Gluster roadmap, but it doesn't look like it will happen soon.
>> Here is what I'm thinking:
>
> Using iSCSI from Gluster or even falling back to NFS in Gluster isn't
> strictly necessary.
>
> The right thing to do is to have a native gluster driver for Cinder,
> which our engineers are busy working on.  I've cc'd Eric Harney from my
> team who has been pushing forward on that effort with help from the Red
> Hat Storage (Gluster) team.
>
> We're tracking inclusion of this Cinder Gluster driver here for RHOS 2.1:
> https://bugzilla.redhat.com/show_bug.cgi?id=892686
>
> It's not guaranteed it'll get into 2.1, because it still needs to get in
> upstream by G-3 in order for that to happen.  But that's the target.
>
> In the future, this Cinder Gluster Driver can be updated to utilize the
> qemu native support for Gluster, which should make things perform
> better.  But that functionality is not yet in RHEL 6, so we wouldn't be
> able to use it quite yet.  Hopefully by RHEL 6.5 it will be there.
>
>> Scenario 1
>> Similar to the examples in the guide, I'm thinking of creating a disk
>> image with the truncate command.
>> The big difference is that I'm planning to create it on a Gluster
>> share, then create a clustered LVM volume group on it and manage it
>> with the HA add-on.
>
> What does clvm provide here?
Well, as I said, this is a workaround in the meantime until Gluster can
be used natively (or even indirectly) by Cinder.
What clvm would be doing is ensuring that the volume group and the
logical volumes it contains on the disk image are accessible by both
the primary and backup Cinder hosts at the same time, without having to
worry too much about race conditions. But, as I also said, it is
complicated and does add significant risks of its own. The one good
thing I can say about this method is that it could theoretically allow
multiple hosts running Cinder to export the same logical volumes via
iSCSI at the same time, thus allowing a load balancer to handle
failover and distribution of the traffic. Also keep in mind that even
after Cinder gets native support for Gluster, this would still be
useful for shared physical disks such as Fibre Channel SANs and
external drive chassis that support being connected to multiple hosts
concurrently. With standard (non-clustered) LVM, locking prevents you
from accessing the same logical volume on multiple hosts concurrently,
so it really isn't possible without clvm.
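
For concreteness, a rough sketch of the setup I have in mind (all
paths, names and sizes here are placeholders, and this assumes the
Gluster volume is already mounted at /mnt/gluster and that cman/clvmd
are running on both Cinder hosts):

    # sparse backing file on the Gluster mount
    truncate -s 500G /mnt/gluster/cinder-volumes.img
    # attach it as a block device
    losetup /dev/loop0 /mnt/gluster/cinder-volumes.img
    # clustered volume group for Cinder to carve LVs out of
    pvcreate /dev/loop0
    vgcreate -cy cinder-volumes /dev/loop0
    # e.g. what Cinder would then do per volume
    lvcreate -L 10G -n volume-test cinder-volumes

Again, just a sketch of the idea, not something I've tested yet.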

>
> Ideally, use the Cinder Driver mentioned above, but in the absence of
> that, if you had a Gluster storage environment set up and had that
> glusterfs mounted on every Compute Node, I don't see why clvm would be
> necessary, and in fact clvm would create a lot of architectural
> complexity here.
>
> The redundancy would be in the Gluster cluster itself (i.e. once you
> point your compute node at one of the gluster bricks to mount the fs, my
> understanding is that if a single brick fails, as long as the data you
> need is accessible via replication on another brick, things will just
> fail over w/o you needing add'l HA software like RHEL HA/CLVM)
Well, I partly agree: Gluster does provide data redundancy via
replicated bricks, but it doesn't provide process failover for the
Cinder service itself, so either way something will have to manage
that.
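
Just to illustrate what that "something" could look like if you went
the HA add-on / Pacemaker route rather than keepalived -- a minimal
sketch only, and the resource names, VIP and init script name below are
my guesses, not a recipe:

    # floating VIP for the iSCSI/Cinder endpoint
    crm configure primitive p_cinder_vip ocf:heartbeat:IPaddr2 \
        params ip=192.168.10.50 cidr_netmask=24 \
        op monitor interval=10s
    # the cinder-volume service itself
    crm configure primitive p_cinder_vol lsb:openstack-cinder-volume \
        op monitor interval=30s
    # keep the VIP and the service together and fail them over as a unit
    crm configure group g_cinder p_cinder_vip p_cinder_vol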
>
> I've cc'd a Gluster expert (Vijay) who can correct me if I have that
> horribly wrong :)
>
>> It should be fairly simple for me to create an init script to create
>> and remove loop devices via losetup.
>> In this scenario, the thing that concerns me is the possibility of a
>> system getting fenced on boot before the Gluster volume is ready.
>>
>> Scenario 2
>> This one is a little simpler. Since I'm very familiar with keepalived,
>> I could create a VRRP instance with a floating VIP.
>> When a node becomes primary, it could run a script to set up the loop
>> device and then start the Cinder service.
>> On a fault, or if the node becomes backup, I could have it ensure
>> Cinder has been stopped and then remove the loop device.
>> There are two things that worry me with this scenario:
>> 1) Since keepalived doesn't understand the concept of a quorum, the
>> nodes going split-brain could cause a significant problem. I can
>> mitigate this risk by connecting the Cinder nodes with a pair of
>> dedicated crossover cables (preferably run via separate cable trays),
>> but that can never absolutely eliminate the possibility.
>> I could also add a check script that does a file-based secondary
>> heartbeat, but that would be a little more complicated and wouldn't
>> help if Gluster were split-brained as well.
>> 2) When a fault happens in keepalived, there is a lag before the
>> backup notices and takes over, based on the heartbeat interval
>> (approximately interval x 3), so there will be a delay of 3 seconds
>> or more before the second node attempts to take over. There are
>> several patches for sub-second intervals, some of which I'm familiar
>> with (I wrote one of them :-) ), but they add their own issue: they
>> can make the system react too fast and may not allow sufficient time
>> for the failed node to cleanly detach from the volume.
>>
>>
>> Scenario 2 is the easiest to implement, and despite the concerns it's
>> the one I think is the safest, mostly because I don't like to fence
>> nodes just because a single process or volume has an issue. My
>> personal experience with fencing is that it usually causes more
>> problems than it solves, although admittedly my opinion of fencing
>> has been tainted by an Oracle stretch cluster I used to support,
>> which liked to fence nodes any time someone halfway around the world
>> sneezed.
>>
>> So, does anyone have any opinions or comments?
>
> I think given how Gluster works and provides redundancy by having
> multiple storage bricks and replication of the data, the above doesn't
> seem like it is necessary and would provide a lot of
> overhead/complication to the configuration.

Again, there are two different kinds of redundancy here: yes, Gluster
provides data replication, but it doesn't handle process failover for
Cinder itself.
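
For what it's worth, the keepalived side of scenario 2 would look
roughly like this (interface, VIP, router ID and script paths are all
placeholders, and the notify scripts are only sketched):

    vrrp_instance CINDER {
        state BACKUP
        interface eth1
        virtual_router_id 51
        priority 100
        advert_int 1
        virtual_ipaddress {
            192.168.10.50/24
        }
        # promote: attach the loop device, then start cinder-volume
        notify_master /usr/local/bin/cinder-takeover.sh
        # demote or fault: stop cinder-volume, then detach the loop device
        notify_backup /usr/local/bin/cinder-release.sh
        notify_fault  /usr/local/bin/cinder-release.sh
    }

with cinder-takeover.sh being little more than
"losetup /dev/loop0 /mnt/gluster/cinder-volumes.img && service
openstack-cinder-volume start" (or whatever the Cinder volume service
is called on your install), and cinder-release.sh the reverse, plus
whatever sanity checks turn out to be needed.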

>
> But I'll let the Cinder/Gluster folks on the thread here weigh in and
> let me know if that's not correct
>
> Perry



