[Freeipa-devel] certmonger/oddjob for DNSSEC key maintenance

Dmitri Pal dpal at redhat.com
Wed Sep 4 13:08:41 UTC 2013


On 09/03/2013 04:01 PM, Simo Sorce wrote:
> On Tue, 2013-09-03 at 12:36 -0400, Dmitri Pal wrote:
>> On 09/02/2013 09:42 AM, Petr Spacek wrote:
>>> On 27.8.2013 23:08, Dmitri Pal wrote:
>>>> On 08/27/2013 03:05 PM, Rob Crittenden wrote:
>>>>> Dmitri Pal wrote:
>>>>>> On 08/09/2013 08:30 AM, Petr Spacek wrote:
>>>>>>> Hello,
>>>>>>>
>>>>>>> I would like to get opinions about key maintenance for DNSSEC.
>>>>>>>
>>>>>>> Problem summary:
>>>>>>> - FreeIPA will support DNSSEC
>>>>>>> - DNSSEC deployment requires <2,n> cryptographic keys for each DNS
>>>>>>> zone (i.e. objects in LDAP)
>>>>>>> - The same keys are shared by all FreeIPA servers
>>>>>>> - Keys have a limited lifetime and have to be re-generated on a
>>>>>>> monthly basis (as a very first approximation; the interval will be
>>>>>>> configurable and will differ for different key types)
>>>>>>> - The plan is to store keys in LDAP and let 'something' (i.e.
>>>>>>> certmonger or oddjob?) generate and store the new keys back into
>>>>>>> LDAP
>>>>>>> - There are command line tools for key-generation (dnssec-keygen from
>>>>>>> the package bind-utils)
>>>>>>> - We plan to select one super-master which will handle regular
>>>>>>> key-regeneration (i.e. do the same as we do for special CA
>>>>>>> certificates)
>>>>>>> - Keys stored in LDAP will be encrypted somehow, most probably with
>>>>>>> a symmetric key shared among all IPA DNS servers
>>>>>>>
>>>>>>> Could certmonger or oddjob do key maintenance for us? I can imagine
>>>>>>> something like this (roughly sketched in code below):
>>>>>>> - watch some attributes in LDAP and wait until some key expires
>>>>>>> - run the dnssec-keygen utility
>>>>>>> - read the resulting keys and encrypt them with a given 'master key'
>>>>>>> - store the resulting blobs in LDAP
>>>>>>> - wait until another key reaches its expiration timestamp
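>>>>>>>
>>>>>>> A very rough Python sketch of that loop (the attribute names, DNs,
>>>>>>> dnssec-keygen options and the encryption scheme below are only
>>>>>>> placeholders for illustration, not a design):
>>>>>>>
>>>>>>>   import subprocess, time
>>>>>>>   import ldap
>>>>>>>   from cryptography.fernet import Fernet
>>>>>>>
>>>>>>>   def rotate_zone_keys(conn, zone_dn, zone, master_key, keydir):
>>>>>>>       # run dnssec-keygen; it prints the generated key name on stdout
>>>>>>>       keyname = subprocess.check_output(
>>>>>>>           ["dnssec-keygen", "-a", "RSASHA256", "-b", "2048",
>>>>>>>            "-K", keydir, zone]).decode().strip()
>>>>>>>       # encrypt the private key with the shared symmetric 'master key'
>>>>>>>       with open("%s/%s.private" % (keydir, keyname), "rb") as f:
>>>>>>>           blob = Fernet(master_key).encrypt(f.read())
>>>>>>>       # store the encrypted blob back into the zone entry in LDAP
>>>>>>>       conn.modify_s(zone_dn,
>>>>>>>                     [(ldap.MOD_ADD, "idnsSecKeyBlob", [blob])])
>>>>>>>
>>>>>>>   def watch_keys(conn, master_key, keydir,
>>>>>>>                  base="cn=dns,dc=example,dc=com"):
>>>>>>>       # wait until some key passes its expiration timestamp, rotate it
>>>>>>>       while True:
>>>>>>>           expired = conn.search_s(
>>>>>>>               base, ldap.SCOPE_SUBTREE,
>>>>>>>               "(idnsSecKeyExpire<=%d)" % int(time.time()))
>>>>>>>           for zone_dn, entry in expired:
>>>>>>>               rotate_zone_keys(conn, zone_dn, entry["idnsName"][0],
>>>>>>>                                master_key, keydir)
>>>>>>>           time.sleep(3600)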
>>>>>>>
>>>>>>> It is simplified, because there will be multiple keys with different
>>>>>>> lifetimes, but the idea is the same. All the gory details are in the
>>>>>>> thread '[Freeipa-devel] DNSSEC support design considerations: key
>>>>>>> material handling':
>>>>>>> https://www.redhat.com/archives/freeipa-devel/2013-July/msg00129.html
>>>>>>> https://www.redhat.com/archives/freeipa-devel/2013-August/msg00086.html
>>>>>>>
>>>>>>>
>>>>>>> Nalin and others, what do you think? Is certmonger or oddjob the
>>>>>>> right
>>>>>>> place to do something like this?
>>>>>>>
>>>>>>> Thank you for your time!
>>>>>>>
>>>>>> Was there any discussion of this mail?
>>>>>>
>>>>> I think at least some of this was covered in another thread, "DNSSEC
>>>>> support design considerations: key material handling" at
>>>>> https://www.redhat.com/archives/freeipa-devel/2013-August/msg00086.html
>>>>>
>>>>> rob
>>>>>
>>>>>
>>>> Yes, I have found that thread, though it does not seem to have reached a
>>>> conclusion or a firm plan.
>>>> I will leave it to Petr to summarize the outstanding issues and repost them.
>>> All questions stated in the first e-mail in this thread are still open:
>>> https://www.redhat.com/archives/freeipa-devel/2013-August/msg00089.html
>>>
>>> There was no reply to these questions during my vacation, so I don't
>>> have much to add at the moment.
>>>
>>> Nalin, please, could you provide your opinion?
>>> How modular/extensible is certmonger?
>>> Does it make sense to add DNSSEC key management to certmonger?
>>> What about the CA rotation problem? Can we share some algorithms (e.g. for
>>> super-master election) between the CA rotation and DNSSEC key rotation
>>> mechanisms?
>>>
>>>> BTW I like the idea of masters being responsible for generating a subset
>>>> of the keys as Loris suggested.
>>> E-mail from Loris in archives:
>>> https://www.redhat.com/archives/freeipa-devel/2013-August/msg00100.html
>>>
>>> The idea seems really nice and simple, but I'm afraid that there could
>>> be some serious race conditions.
>>>
>>> - How will it work when the topology changes?
>>> - What if the number of masters is greater than the number of days in a
>>> month? (=> auto-tune the interval from a month to a smaller time period
>>> => again, what should we do after a topology change?)
>>> - What should we do if the topology was changed while a master was
>>> disconnected from the rest of the network? (I.e. a WAN link was down at
>>> the moment of the change.) What will happen after it re-connects to the
>>> topology?
>>>
>>> Example:
>>> Time 0: Masters A, B; topology:  A---B
>>> Time 1: Master A has lost its connection to master B
>>> Time 2: Master C was added; topology:  A  x  B---C  (A-B link still down)
>>> Time 3 (Day 3): A and C both performed a rotation at the same time
>>> Time 4: Connection was restored;  topology:  A---B---C
>>>
>>> Now what?
>>>
>>>
>>> I have a feeling that we need something like a quorum protocol for
>>> writes (only for sensitive operations like CA cert and DNSSEC key
>>> rotations).
>>>
>>> http://en.wikipedia.org/wiki/Quorum_(distributed_computing)
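>>>
>>> To make the quorum idea concrete, a toy check in Python (can_reach() is
>>> a made-up helper; the real thing would have to use the replication
>>> topology information):
>>>
>>>   def have_quorum(other_masters, can_reach):
>>>       # count this server plus every other master we can reach right now
>>>       reachable = 1 + sum(1 for m in other_masters if can_reach(m))
>>>       total = len(other_masters) + 1
>>>       # a sensitive write (key/cert rotation) proceeds only with a
>>>       # strict majority of the whole topology
>>>       return reachable > total // 2
>>>
>>>   # e.g.: if have_quorum(["B", "C"], can_reach): start_rotation()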
>>>
>>>
>>> The other question is how we should handle catastrophic situations
>>> where more than half of the masters are lost. (Two of three data centres
>>> were destroyed by a tornado, etc.)
>>>
>> It becomes more and more obvious that there is no simple solution that
>> we can use out of the box.
>> Let us start with a single nominated server. If that server is lost, the
>> key rotation responsibility can be moved to some other server manually.
>> Not optimal, but at least it is a first step.
>>
>> The next step would be the ability to define alternative (failover)
>> servers. Here is an example.
>> Let us say we have masters A, B, C in the topology A - B - C.
>> Master A is responsible for the key rotation; B is the failover.
>> The key rotation time would be recorded in some way in the replication
>> agreement(s) between A & B.
>> If at the moment of the scheduled rotation the A <-> B connection is not
>> present, A would skip the rotation and B would start it. If A comes
>> back and connects to B (or the connection is simply restored), replication
>> will update the keys on A. If A is lost, the keys are taken care of by B
>> for itself and C.
>> There will be a short race-condition window, but IMO it can be
>> mitigated. If A's clock is behind B's, then if A manages to connect to B
>> it will notice that B has already started the rotation. If B's clock is
>> behind and A connects to B before B has started the rotation, A still has
>> to perform the rotation (a sort of "just made it" case).
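>>
>> Roughly, the decision rule I have in mind, as a Python sketch (names are
>> made up; "primary" is A and "failover" is B in the example above):
>>
>>   def should_rotate(role, rotation_due, peer_reachable, peer_rotated):
>>       if not rotation_due:
>>           return False
>>       if role == "primary":
>>           # A rotates unless it cannot see B at the scheduled time;
>>           # after reconnecting it also checks whether B already did it
>>           return peer_reachable and not peer_rotated
>>       if role == "failover":
>>           # B steps in only when A is unreachable at rotation time
>>           return not peer_reachable
>>       return False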
>>
>> Later, if we want more complexity, we can define subsets of the keys to
>> renew, assign them to different replicas, and then define failover
>> servers per set.
>> But this is all complexity we can add later, when we see the real
>> problems with the single-server approach.
> Actually I thought about this for a while, and I think I have an idea
> about how to handle this for DNSSEC (it may not apply to other cases like
> the CA).
>
> IIRC keys are generated well in advance of the time they are used, and
> old keys and new keys are used side by side for a while, until the old
> keys finally expire and only the new keys are around.
>
> This is regulated by a series of date attributes that determine when
> keys are in use, when they expire, and so on.
>
> Now the idea I have is to add yet another step.
>
> Assume we have key "generation 1" (G1) in use and we are approaching the
> time G1 will expire and generation 2 (G2) is needed; G2 is created X
> months in advance and everything is signed with both G1 and G2 for a
> period.
>
> Now, if we have a pre-G2 period, we can let multiple servers try to
> generate the G2 set, say 1 month in advance of the time the keys would
> normally be used to start signing anything. Only after that 1 month are
> they actually put into service.
>
> How does this help? Well, it helps in that even if multiple servers
> generate keys and we have duplicates, they have all the time they need to
> see that there are duplicates (because two servers raced).
> Now, if we keep a subsecond 'creation' timestamp for the new keys, then
> when replication goes around, all servers can check and use only the set
> of keys that was created first, and the servers that created the sets
> that lose the race will just remove the duplicates.
> Given that we have 1 month between the creation and the time the keys
> will actually be used, there is plenty of time to let servers sort out
> whether keys are available or not and prune out duplicates.
>
> A diagram, in case I have not been clear enough:
>
>
> Assume servers A, B, C; they all randomize (within a week) the time at
> which they will attempt to create new keys, if it is time to do so and
> none are available already.
>
> Say the time comes to create G2. A, B, C each roll a die and it turns
> out A will do it in 35000 seconds, B in 40000 seconds, and C in 32000
> seconds, so C should do it first, and there should be enough time for
> the others to see that new keys popped up and just discard their own
> attempts.
>
> However, if A or C is temporarily disconnected, they may still both end
> up generating new keys, so we have G2-A and G2-C. Once they get
> reconnected and replication flows again, all servers see that instead of
> a single G2 set we have 2 G2 sets available:
> G2-A created at timestamp X+35000 and G2-C created at timestamp X+32000,
> so all servers know they should ignore G2-A, and they all ignore it.
> When A comes around to realize this itself, it will just go and delete
> the G2-A set. Only the G2-C set is left, and that will be the final,
> official G2.
>
> If we give a week of time for this operation to go on, I think it will be
> easy to resolve any race or temporary disconnection that may happen.
> Also, because all servers can attempt (within that week) to create keys,
> there is no real single point of failure.
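>
> In code, the pruning step every server would run during that spare month
> could look roughly like this (the data structures are made up, just to
> show the "earliest creation timestamp wins" rule; the creator name breaks
> exact timestamp ties deterministically):
>
>   def prune_generation(candidate_sets, my_name, delete_set):
>       # candidate_sets: replicated entries for the same generation, e.g.
>       #   {"creator": "C", "created": 1378300000.123, "keys": [...]}
>       if not candidate_sets:
>           return None  # nobody generated yet, maybe it is our turn
>       # every server deterministically picks the set created first
>       winner = min(candidate_sets,
>                    key=lambda s: (s["created"], s["creator"]))
>       # a server that created a losing duplicate removes its own set
>       for s in candidate_sets:
>           if s is not winner and s["creator"] == my_name:
>               delete_set(s)
>       return winner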
>
> HTH,
> please poke holes in my reasoning :)
>
> Simo.
>

Reasonable; I just have a couple of comments.
If there are many keys and many replicas, the chances are that there will
be a lot of load. Generating keys is computationally costly.
Replication is costly too.
Also, you assume that the topology works fine. I am mostly concerned about
the case where some replication is not working and data from one part of
the topology is not replicated to another. The concern is that people
would not notice that things are not replicating. So if there is a
problem and we let all these keys be generated all over the place, it
would be pretty hard to untie this knot later.

I would actually suggest that if a replica X needs the keys a month after
moment A, the keys have not arrived within the first 3 days after moment A,
and this replica is not entitled to generate keys, then it should start
sending messages to the admin. That way there will be enough time for the
admin to sort out what is wrong and nominate another replica to generate
the keys if needed. There should be a command as simple as:

ipa dnssec-keymanager-set <replica>

that would make the given replica the key generator.
There could be other commands, like:

ipa dnssec-keymanager-info

Appointed server: <server>
Key store: <path>
Last time keys generated: <some time>
Next time keys need to be generated: <...>
...
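
A rough sketch of the "start nagging the admin" check (the names here are
made up and the 3-day grace period would of course be configurable):

  import time

  GRACE = 3 * 24 * 3600  # the first 3 days after moment A

  def check_expected_keys(now, moment_a, keys_arrived,
                          i_am_generator, notify_admin):
      # only replicas that are NOT entitled to generate keys complain
      if i_am_generator or keys_arrived:
          return
      if now - moment_a > GRACE:
          notify_admin("expected DNSSEC keys have not replicated in; "
                       "check replication or run "
                       "'ipa dnssec-keymanager-set <replica>'")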




IMO in this case we need to help the admin see that there is a problem
and provide tools to easily mitigate it, rather than trying to solve it
ourselves and building a complex algorithm.

-- 
Thank you,
Dmitri Pal

Sr. Engineering Manager for IdM portfolio
Red Hat Inc.







