[rhelv6-list] IPsec Endpoint Failover/Redundancy

Mon Nov 19 16:07:39 UTC 2012

We're connecting a handful of remote leaf nodes back to a central network
via IPsec, but we seem to have run into a problem with redundancy and
failover.  I'm wondering if I can get an idea of what other people are doing
for IPsec and failover, and maybe figure out a way to stabilize our tunnels.

It's been a while since I've played with IPsec, so my grasp of most of these
concepts is fairly weak.  I find myself getting lost in the various bits of
documentation, and it seems that available docs on Openswan and NETKEY
(the implementation used in RHEL6?) are minimal.  While there are lots of
configuration examples, there's not much covering terminology, and many of
the options for the 'ipsec' binary don't seem to be available for NETKEY.

Configuration:

We've kept it as simple as possible.  The remote nodes need to communicate
with two local nodes, so we have two host-to-host tunnels configured.  RSA
is used for authentication, and to work around various firewalls in the
communications path, we've had to enable NAT-T (even though no NAT is being
used).  Main mode is used everywhere, as well as DPD with an action of
'restart'.
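
For reference, the setup described above would look roughly like the
following ipsec.conf sketch.  This is not our literal config: the connection
name, addresses, key values and DPD timers are placeholders, and
'forceencaps=yes' is an assumption about how NAT-T gets forced when no NAT is
actually present.

```text
# /etc/ipsec.conf (illustrative sketch only, all values are placeholders)
config setup
    protostack=netkey
    nat_traversal=yes

conn leaf1-to-hostA
    # host-to-host tunnel; left is the local endpoint, right the remote leaf
    type=tunnel
    left=192.0.2.10
    right=198.51.100.20
    # RSA authentication, main mode (aggressive mode off)
    authby=rsasig
    leftrsasigkey=PLACEHOLDER_LOCAL_PUBLIC_KEY
    rightrsasigkey=PLACEHOLDER_REMOTE_PUBLIC_KEY
    aggrmode=no
    # assumed: force UDP encapsulation even though no NAT is detected
    forceencaps=yes
    # DPD with an action of restart; the timers are made-up values
    dpddelay=30
    dpdtimeout=120
    dpdaction=restart
    auto=start
```

The second host-to-host tunnel would be another 'conn' section of the same
shape with a different 'left' address.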

The problems:

1) The local 'endpoint' is actually two machines that handle failover via
ucarp.  Unfortunately, pluto doesn't seem to like having its interfaces
change underneath it, meaning pluto needs to be stopped/started in vip-down
and vip-up, respectively.  The alternative is to not terminate on a VIP at
all, and to figure out another way to route that traffic on the remote
endpoints.

2) Occasionally, we've seen the VIP be removed from the interface when
running 'service ipsec restart'.  It seems to be fine with a manual
stop/start.  ucarp itself doesn't log anything, so we're unclear as to why
or how this happens.

3) There are times when the local endpoint gets into a situation where
'service ipsec status' reports >5x the number of configured tunnels.
Looking at 'ipsec whack --status', it looks like there are multiple SAs
(both IPsec and ISAKMP) established per remote endpoint.  This appears to be
fine, as there are times when the tunnels continue to operate properly, but
we've seen situations where the endpoint is unable/unwilling to use any of
the SAs, and does not respond to DPD R_U_THERE packets.

4) We've seen regular occurrences where the remote endpoint believes the
tunnel is up, and the local endpoint believes the tunnel is down, and
they're unable to come to a mutually-beneficial agreement.  This happened
when we ran with 'dpdaction=clear' on the local endpoint; we have since moved
to 'restart' to fix it.  Unfortunately, it seems that at times the local
endpoint holds on to the route for the remote endpoint even when the SA is
cleared, possibly resulting in problem #3 above: DPD fails from the local
endpoint, causing it to renegotiate the SA, but the route remains broken,
causing the new SA to fail DPD, etc.  NB: this is just a theory -- I don't
know how to IPsec.
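
To put numbers on problem #3, something like the following could tally
established SAs per connection from the status output.  It is only a sketch:
it assumes each state line in 'ipsec whack --status' carries the connection
name in double quotes and the text "ISAKMP SA established" or "IPsec SA
established", and the function name count_sas is made up for illustration.

```shell
#!/bin/sh
# count_sas (hypothetical helper): read `ipsec whack --status` output on
# stdin and print one line per connection with its established SA counts.
# Assumes state lines contain the quoted connection name plus either
# "ISAKMP SA established" or "IPsec SA established".
count_sas() {
    awk '
        # Extract the first double-quoted token on the line (connection name).
        function conn() {
            if (match($0, /"[^"]+"/))
                return substr($0, RSTART + 1, RLENGTH - 2)
            return "?"
        }
        /ISAKMP SA established/ { c = conn(); isakmp[c]++; seen[c] = 1 }
        /IPsec SA established/  { c = conn(); ipsec[c]++;  seen[c] = 1 }
        END {
            for (c in seen)
                printf "%s isakmp=%d ipsec=%d\n", c, isakmp[c], ipsec[c]
        }
    ' | sort
}
```

Usage would be along the lines of 'ipsec whack --status | count_sas', run
periodically to see whether the counts per remote endpoint keep growing.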

I'm about to work towards managing ipsec from ucarp, where the service is
stopped and started as is appropriate via vip-up/vip-down.  This is largely
to address problem #1, which seems otherwise insurmountable (short of possibly
routing the VIP over one tunnel), but may end up resolving some of our other
issues.
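
A minimal sketch of what those ucarp hooks might look like is below.  It
relies on ucarp passing the interface name as the first argument and the
virtual IP as the second; the function names, the /24 prefix length and the
DRY_RUN switch are assumptions added for illustration, and it uses separate
stop/start rather than 'restart' because of problem #2.

```shell
#!/bin/sh
# Hypothetical helpers for ucarp's vip-up / vip-down scripts.
# ucarp calls the hook with: $1 = interface name, $2 = virtual IP.
# Set DRY_RUN=1 to print the commands instead of running them.

PREFIX_LEN="${PREFIX_LEN:-24}"   # assumed prefix length for the VIP

run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

vip_up() {
    # Add the VIP first so pluto sees the address when it starts.
    run /sbin/ip addr add "$2/$PREFIX_LEN" dev "$1" || return 1
    run /sbin/service ipsec start
}

vip_down() {
    # Stop pluto before the address disappears underneath it.
    run /sbin/service ipsec stop
    run /sbin/ip addr del "$2/$PREFIX_LEN" dev "$1"
}
```

The actual vip-up script would then end with 'vip_up "$1" "$2"' and vip-down
with 'vip_down "$1" "$2"'.  Running with DRY_RUN=1 first shows the commands
without touching the interface or the service.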

But before I do this, I figured I'd ask:

How are other people handling ipsec endpoint failovers?  Is there a 'best'
way to access a floating virtual IP, given that 'overlapip' doesn't seem to be
available (KLIPS/mast only), and 'modecfg*' doesn't seem to offer an option
for routing?