[rhos-list] clustering Cinder with Gluster

Paul Robert Marino prmarino1 at gmail.com
Fri Feb 15 01:19:33 UTC 2013


Hello

I've been thinking of ways to make Cinder redundant without shared
storage, and I think I have two possible quick answers using Gluster,
but before I go down this rabbit hole in my test environment I wanted
to see if anyone has tried this before, or if anyone could point out
any obvious problems.
Now I know that support for exporting iSCSI block devices natively is
on the Gluster roadmap, but it doesn't look like it will happen soon.
Here is what I'm thinking.
Scenario 1
Similar to the examples in the guide, I'm thinking of creating a disk
image with the truncate command.
The big difference is that I plan to create it on a Gluster share,
create a clustered LVM volume on it, and manage it with the HA add-on.
It should be fairly simple for me to write an init script that creates
and removes the loop devices via losetup.
In this scenario the thing that concerns me is the possibility of a
system getting fenced on boot before the Gluster volume is ready.
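For reference, a minimal sketch of the storage setup I mean. The paths and names are assumptions on my part, and the loop/LVM steps need root and a running clvmd, so they are shown commented out:

```shell
# Sparse backing file, as in the install guide examples. A demo path is
# used here; the real file would live on the Gluster mount, e.g.
# /mnt/gluster/cinder-volumes.img.
truncate -s 1G /tmp/cinder-volumes.img

# Loop device plus clustered volume group (root only, names assumed):
# LOOP=$(losetup --find --show /mnt/gluster/cinder-volumes.img)
# pvcreate "$LOOP"
# vgcreate --clustered y cinder-volumes "$LOOP"
```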

Scenario 2
This one is a little simpler. Since I'm very familiar with keepalived,
I could create a VRRP instance with a floating VIP.
When a node becomes primary, it could run a script that sets up the
loop device and then starts the Cinder service.
On a fault, or if the node becomes backup, it would ensure Cinder has
been stopped and then remove the loop device.
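Roughly what I have in mind for the notify handler (keepalived invokes a notify script as TYPE NAME STATE; the image path and service name here are assumptions, and this is only a sketch):

```shell
IMG=/mnt/gluster/cinder-volumes.img   # assumed location on the Gluster share

cinder_failover() {
    case "$1" in
        MASTER)
            # Attach the loop device, then start Cinder on the new primary.
            losetup --find --show "$IMG"
            service openstack-cinder-volume start
            ;;
        BACKUP|FAULT)
            # Make sure Cinder is down before detaching the loop device.
            service openstack-cinder-volume stop
            loopdev=$(losetup --associated "$IMG" | cut -d: -f1)
            [ -n "$loopdev" ] && losetup --detach "$loopdev"
            ;;
    esac
}

# keepalived would point at this via a "notify" line in the vrrp_instance
# block, and the script body would end with:  cinder_failover "$3"
```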
There are two things that worry me with this scenario:
1) Since keepalived doesn't understand the concept of a quorum, a
split-brain could cause a significant problem. I can mitigate this
risk by connecting the Cinder nodes with a pair of dedicated crossover
cables (preferably run through separate cable trays), but that can
never absolutely eliminate the possibility.
I could also add a secondary, file-based heartbeat check script, but
that would be a little more complicated and wouldn't help if Gluster
was split-brained as well.
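The file-based secondary heartbeat could look something like this: each node periodically writes an epoch timestamp to its own file on the shared mount, and a peer is treated as alive if its file is fresh. The directory name and threshold are assumptions, and again this is only a sketch:

```shell
HB_DIR=/mnt/gluster/.heartbeat   # assumed shared directory
MAX_AGE=10                       # seconds before a peer is declared dead

beat() {
    # Record "I am alive" for the named node.
    date +%s > "$HB_DIR/$1"
}

peer_alive() {
    # Succeed only if the peer's timestamp file is recent enough.
    f="$HB_DIR/$1"
    [ -f "$f" ] || return 1
    last=$(cat "$f")
    [ $(( $(date +%s) - last )) -le "$MAX_AGE" ]
}
```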
2) When a fault happens in keepalived, there is a lag before the
backup notices and takes over, based on the heartbeat interval
(approximately interval x 3), so there will be a delay of three
seconds or more before the second node attempts to take over. There
are several patches for sub-second intervals, some of which I'm
familiar with (I wrote one of them :-) ), but they add their own
issue: they can make the system react too fast and may not allow
sufficient time for the failed node to cleanly detach from the volume.


Scenario 2 is the easiest to implement, and despite the concerns it's
the one I think is the safest, mostly because I don't like fencing
nodes just because a single process or volume has an issue. My
personal experience with fencing is that it usually causes more
problems than it solves, although admittedly my opinion of fencing has
been tainted by an Oracle stretch cluster I used to support, which
liked to fence nodes any time someone halfway around the world
sneezed.

So, does anyone have any opinions or comments?



