[rhelv6-list] Highly available OpenLDAP

Collins, Kevin [Contractor Acquisition Program] KCollins at chevron.com
Mon Feb 3 17:10:01 UTC 2014


Thanks - I'll take a look at this when I get a bit of time, but unless "KVM" translates directly to "VMWare" it won't happen in our environment, unfortunately. The current "answer" to VMs in our environment is "VMWare", and we don't really have any say beyond that.

Kevin
-----Original Message-----
From: Digimer [mailto:lists at alteeve.ca] 
Sent: Friday, January 31, 2014 5:05 PM
To: Red Hat Enterprise Linux 6 (Santiago) discussion mailing-list; rhelv5-list at redhat.com
Cc: Collins, Kevin [Contractor Acquisition Program]
Subject: Re: [rhelv6-list] Highly available OpenLDAP

On 31/01/14 06:14 PM, Collins, Kevin [Contractor Acquisition Program] wrote:
> Hi all,
>
>              I'm looking for a little input on what other folks are
> doing to solve a problem we are trying to address. The scenario is as
> follows:
>
> We were an NIS shop for many, many years. Our environment was (and still
> is) heavily dependent on NIS, and netgroups in particular, to function
> correctly.
>
> About 5 or 6 years ago we migrated from NIS to LDAP (using RFC2307 to
> provide NIS maps via LDAP). The environment at the time consisted of
> less than 200 servers (150 in primary site, the rest in a secondary
> site), mostly HP-UX with Linux playing the part of "utility" services
> (LDAP, DNS, mysql, httpd, VNC).
>
> We use LDAP only to provide the standard NIS "maps" (with a few small
> custom maps, too).
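>
> For reference, a netgroup entry in the RFC2307 schema looks roughly
> like this (the names here are placeholders, not our real data):
>
>    dn: cn=sysadmins,ou=netgroup,dc=example,dc=com
>    objectClass: top
>    objectClass: nisNetgroup
>    cn: sysadmins
>    nisNetgroupTriple: (hpux01.example.com,kcollins,)
>    memberNisNetgroup: dba-hosts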
>
> We maintain our LDAP servers with the RHEL-provided OpenLDAP: a single
> master in our primary site, plus 2 replica servers in the primary site
> and 2 replica servers in our secondary site. Replication used the
> slurpd mechanism (we started on RHEL3).
>
> Life was good :)
>
> Fast forward to current environment, and a merger with a different Unix
> team (and migrating that environment from NIS to LDAP as well). We now
> have close to 1000 servers (mix of physical and VM): roughly 400 each
> for our 2 primary sites and the rest scattered across another 3 sites.
> The mix is now much more heavily Linux (70%), with the remaining 30%
> split between HP-UX and Solaris.
>
> We have increased the number of replicas, adding 2 more in each of the
> new sites.
>
> We are still (mostly) using slurpd for replication, although with the
> impending migration of our LDAP master from RHEL5 to RHEL6, we must
> change to using sync-repl. No problem, as this is (IMO) a much better
> replication method and relieves the worries and headaches that occur
> when a replica for some reason becomes "broken" for some period of time.
> We have already started this migration, and our master now handles both
> slurpd (to old replicas) and sync-repl (from new replicas).
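>
> For anyone unfamiliar, the consumer side of sync-repl is a single
> stanza in slapd.conf along these lines (hostname, suffix and
> credentials below are placeholders):
>
>    syncrepl rid=001
>             provider=ldap://ldap-master.example.com
>             type=refreshAndPersist
>             retry="60 +"
>             searchbase="dc=example,dc=com"
>             bindmethod=simple
>             binddn="cn=replicator,dc=example,dc=com"
>             credentials=secret
>
> The consumer pulls from the master and catches itself back up after an
> outage, which is what removes the "broken replica" headache.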
>
> In our environment, each site is configured to point to LDAP services
> by IP address: two IP addresses per site, which are "load-balanced" by
> alternating which IP is first and second in the config files based on
> whether the last octet of the client IP address is even or odd. This is
> a very basic way to distribute the load.
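>
> In other words, the uri line in each client config is generated one of
> two ways (addresses here are examples):
>
>    # client with an even last octet:
>    uri ldap://10.1.1.11/ ldap://10.1.1.12/
>    # client with an odd last octet:
>    uri ldap://10.1.1.12/ ldap://10.1.1.11/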
>
> Now comes the crux of the problem: what happens when an LDAP server
> becomes unavailable for some reason?
>
> If the client is HP-UX (ldapclientd), Solaris (ldap_cachemgr) or RHEL6
> (nslcd) there is not much of an issue as long as 1 LDAP replica in each
> site is functioning. The specific LDAP-daemon for each platform will
> have a small hiccup while it times out and fails over to the next LDAP
> replica... a few seconds, not a big deal.
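>
> For example, on RHEL6 the nslcd.conf just lists both site replicas
> with short timeouts, and nslcd handles the failover for everyone
> (addresses and values are illustrative):
>
>    uri ldap://10.1.1.11/
>    uri ldap://10.1.1.12/
>    bind_timelimit 5
>    timelimit 10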
>
> If, however, the client is RHEL4 (yes, still!) or RHEL5 then the problem
> is much bigger! On these versions, each process that needs to use LDAP
> must go through the exact same timeout process - the systems become very
> bogged down, or even unusable depending on the server load.
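>
> For what it's worth, the nss_ldap knobs in /etc/ldap.conf can shorten
> that per-process stall, but not eliminate it - something like this
> (values illustrative):
>
>    # fail immediately on a dead server instead of retrying with backoff
>    bind_policy soft
>    bind_timelimit 2
>    timelimit 5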
>
> In one subset of our larger environment (about 40%), we run nscd which
> can help alleviate some of this issue but not all of it. We are planning
> to enable nscd on the remainder very soon - the historical reasoning for
> why those servers do not use nscd is unknown.
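>
> A typical /etc/nscd.conf fragment for this looks like (TTLs are
> examples):
>
>    enable-cache          passwd  yes
>    positive-time-to-live passwd  600
>    negative-time-to-live passwd  20
>    enable-cache          group   yes
>    positive-time-to-live group   3600
>    negative-time-to-live group   60
>
> Note that nscd on these releases does not cache netgroup lookups, which
> is part of why it helps with some of the issue but not all of it.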
>
> Last year, I started investigating and testing the use of LVS (Linux
> Virtual Server) to provide a highly available (aka, clustered),
> load-balanced front-end that would direct client requests for a single
> IP address (per site) to the backend LDAP servers. Results were very
> good, and I proposed this plan to our management.
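>
> To give an idea of the scale of it, the LDAP virtual service can be
> expressed in a handful of lines (here sketched with keepalived as the
> director; VIP and real-server addresses are examples):
>
>    virtual_server 10.1.1.10 389 {
>        delay_loop 10
>        lb_algo wlc
>        lb_kind DR
>        protocol TCP
>
>        real_server 10.1.1.11 389 {
>            TCP_CHECK {
>                connect_timeout 3
>            }
>        }
>        real_server 10.1.1.12 389 {
>            TCP_CHECK {
>                connect_timeout 3
>            }
>        }
>    }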
>
> DENIED!
>
> It was deemed to be "too complex to manage" by our team, and redundant
> to the BigIP F5 service offering within the company. I tend to favor
> self-management of infrastructure components which are critical to
> maintaining system functionality, but what do I know? :)
>
> So, we are now looking down the route of using F5 (managed by another
> team) to front-end our LDAP service.
>
> But, another option has been proposed: what if we make each Linux server
> an LDAP replica that keeps itself up to date with sync-repl and have
> each server use only itself for LDAP services? The setup of this would
> be fairly straightforward, and could be easily integrated into our build
> process.
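>
> In that model, each box would carry the same sync-repl consumer stanza
> shown above (pointed at the master), and the client side collapses to a
> single line, e.g. in nslcd.conf or /etc/ldap.conf:
>
>    uri ldap://127.0.0.1/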
>
> Since we don't make massive volumes of changes, I feel like the network
> load for LDAP would probably drop significantly, and we don't have to
> worry about many of these other issues. I know that this solves the
> problem only for Linux, but Solaris and HP-UX already handle the problem
> case and are being phased out of our environment.
>
> Anyway, thanks for reading this novel - had not intended to write so
> much, but wanted to set the foundation for my question.
>
> What are you people doing to solve this problem? Are you using F5? Do
> you think the "every server a replica" approach makes sense?
>
> I am posting to both RHEL5 and RHEL6 lists, sorry if you see it twice.
>
> Thanks in advance for your input.
>
> Kevin

Hi Kevin,

   Full disclosure; I am recommending something I helped design, so I am 
biased. :)

   We've created a (totally open source) HA platform based on RHEL 6 for 
KVM VMs running in a two-node HA setup. The reason I am mentioning this 
is because I think it might directly address your manager's concern 
about complexity. We've built a web front-end for managing the HA side 
designed to be used by non-IT people. This said, it's a *pure* HA 
solution, no load balancing, which might make it ineligible for you.

   Here are the build instructions:

   https://alteeve.ca/w/AN!Cluster_Tutorial_2

   That is designed for people who want to build things from the ground 
up, so the WebUI isn't highlighted very strongly. For that, you can get 
a better idea from the still-being-written manual here:

   https://alteeve.ca/w/AN!CDB

   Understanding that this might come off as spamming, I want to 
underline that it's all open code and design. The platform itself has 
been field tested for years in mission-critical (mostly manufacturing, 
some scientific/imaging) environments.

   One more note;

   The design was built around not touching the guests at all. Beyond 
the (optional) virtio block and net drivers, the guests are effectively 
oblivious to the HA underneath them. This is nothing special to our 
project, of course. This is a general benefit of KVM. We've also tested 
this with Solaris 11, FreeBSD (which I don't think you use) and several 
flavours of Linux and Windows. The only things not tested are the 
remaining true Unixes, though I suspect you're not concerned about that 
here.

   Hope this helps offer an option for your predicament. :)

-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without 
access to education?
