[Freeipa-devel] certmonger/oddjob for DNSSEC key maintenance

Dmitri Pal dpal at redhat.com
Wed Sep 4 16:53:03 UTC 2013


On 09/04/2013 10:17 AM, Petr Spacek wrote:
> On 4.9.2013 15:50, Alexander Bokovoy wrote:
>> On Wed, 04 Sep 2013, Dmitri Pal wrote:
>>> On 09/04/2013 09:08 AM, Dmitri Pal wrote:
>>>> On 09/03/2013 04:01 PM, Simo Sorce wrote:
>>>>> On Tue, 2013-09-03 at 12:36 -0400, Dmitri Pal wrote:
>>>>>> On 09/02/2013 09:42 AM, Petr Spacek wrote:
>>>>>>> On 27.8.2013 23:08, Dmitri Pal wrote:
>>>>>>>> On 08/27/2013 03:05 PM, Rob Crittenden wrote:
>>>>>>>>> Dmitri Pal wrote:
>>>>>>>>>> On 08/09/2013 08:30 AM, Petr Spacek wrote:
>>>>>>>>>>> Hello,
>>>>>>>>>>>
>>>>>>>>>>> I would like to get opinions about key maintenance for DNSSEC.
>>>>>>>>>>>
>>>>>>>>>>> Problem summary:
>>>>>>>>>>> - FreeIPA will support DNSSEC
>>>>>>>>>>> - DNSSEC deployment requires <2,n> cryptographic keys for each
>>>>>>>>>>> DNS zone (i.e. objects in LDAP)
>>>>>>>>>>> - The same keys are shared by all FreeIPA servers
>>>>>>>>>>> - Keys have a limited lifetime and have to be re-generated on a
>>>>>>>>>>> monthly basis (as a first approximation; it will be configurable
>>>>>>>>>>> and the interval will differ for different key types)
>>>>>>>>>>> - The plan is to store keys in LDAP and let 'something' (i.e.
>>>>>>>>>>> certmonger or oddjob?) generate the new keys and store them
>>>>>>>>>>> back into LDAP
>>>>>>>>>>> - There are command-line tools for key generation
>>>>>>>>>>> (dnssec-keygen from the package bind-utils)
>>>>>>>>>>> - We plan to select one super-master which will handle regular
>>>>>>>>>>> key-regeneration (i.e. do the same as we do for special CA
>>>>>>>>>>> certificates)
>>>>>>>>>>> - Keys stored in LDAP will be encrypted somehow, most probably
>>>>>>>>>>> by some symmetric key shared among all IPA DNS servers
>>>>>>>>>>>
>>>>>>>>>>> Could certmonger or oddjob do key maintenance for us? I can
>>>>>>>>>>> imagine
>>>>>>>>>>> something like this:
>>>>>>>>>>> - watch some attributes in LDAP and wait until some key expires
>>>>>>>>>>> - run dnssec-keygen utility
>>>>>>>>>>> - read resulting keys and encrypt them with given 'master key'
>>>>>>>>>>> - store resulting blobs in LDAP
>>>>>>>>>>> - wait until another key reaches expiration timestamp
>>>>>>>>>>>
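
A minimal sketch of that loop in Python, assuming python-ldap and the
dnssec-keygen utility from bind-utils; the attribute names
(idnssecKeyExpiry, idnssecKeyBlob) and the encrypt() helper are
hypothetical, not part of any agreed design:

    import subprocess
    import time

    import ldap  # python-ldap

    def rotate_expired_keys(conn, zone_dn, master_key):
        # watch the expiry attribute in LDAP
        entry = conn.search_s(zone_dn, ldap.SCOPE_BASE,
                              attrlist=['idnssecKeyExpiry'])[0][1]
        if int(entry['idnssecKeyExpiry'][0]) > time.time():
            return  # nothing has expired yet

        # run the key-generation utility; it prints the new key's name
        keyname = subprocess.check_output(
            ['dnssec-keygen', '-a', 'RSASHA256', '-b', '2048',
             'example.com.']).strip()

        # read the resulting key and encrypt it with the shared master key
        with open(keyname + b'.private', 'rb') as f:
            blob = encrypt(master_key, f.read())  # encrypt() is made up

        # store the resulting blob back into LDAP
        conn.modify_s(zone_dn,
                      [(ldap.MOD_REPLACE, 'idnssecKeyBlob', [blob])])
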
>>>>>>>>>>> It is simplified, because there will be multiple keys with
>>>>>>>>>>> different lifetimes, but the idea is the same. All the gory
>>>>>>>>>>> details are in the thread '[Freeipa-devel] DNSSEC support
>>>>>>>>>>> design considerations: key material handling':
>>>>>>>>>>> https://www.redhat.com/archives/freeipa-devel/2013-July/msg00129.html
>>>>>>>>>>>
>>>>>>>>>>> https://www.redhat.com/archives/freeipa-devel/2013-August/msg00086.html
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Nalin and others, what do you think? Is certmonger or oddjob
>>>>>>>>>>> the right place to do something like this?
>>>>>>>>>>>
>>>>>>>>>>> Thank you for your time!
>>>>>>>>>>>
>>>>>>>>>> Was there any discussion of this mail?
>>>>>>>>>>
>>>>>>>>> I think at least some of this was covered in another thread,
>>>>>>>>> "DNSSEC support design considerations: key material handling" at
>>>>>>>>> https://www.redhat.com/archives/freeipa-devel/2013-August/msg00086.html
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> rob
>>>>>>>>>
>>>>>>>>>
>>>>>>>> Yes, I have found that thread, though I did not see it come to
>>>>>>>> a conclusion and a firm plan.
>>>>>>>> I will leave it to Petr to summarize the outstanding issues and
>>>>>>>> repost them.
>>>>>>> All questions stated in the first e-mail in this thread are
>>>>>>> still open:
>>>>>>> https://www.redhat.com/archives/freeipa-devel/2013-August/msg00089.html
>>>>>>>
>>>>>>>
>>>>>>> There was no reply to these questions during my vacation, so I
>>>>>>> don't have much to add at the moment.
>>>>>>>
>>>>>>> Nalin, please, could you provide your opinion?
>>>>>>> How modular/extensible is certmonger?
>>>>>>> Does it make sense to add DNSSEC key-management to certmonger?
>>>>>>> What about the CA rotation problem? Can we share some algorithms
>>>>>>> (e.g. for super-master election) between the CA rotation and DNSSEC
>>>>>>> key rotation mechanisms?
>>>>>>>
>>>>>>>> BTW I like the idea of masters being responsible for generating
>>>>>>>> a subset of the keys, as Loris suggested.
>>>>>>> E-mail from Loris in archives:
>>>>>>> https://www.redhat.com/archives/freeipa-devel/2013-August/msg00100.html
>>>>>>>
>>>>>>>
>>>>>>> The idea seems really nice and simple, but I'm afraid that there
>>>>>>> could be some serious race conditions.
>>>>>>>
>>>>>>> - How will it work when the topology changes?
>>>>>>> - What if the number of masters is greater than the number of days
>>>>>>> in a month? (=> Auto-tune the interval from a month to a smaller
>>>>>>> time period => Again, what should we do after a topology change?)
>>>>>>> - What should we do if the topology was changed while a master was
>>>>>>> disconnected from the rest of the network? (I.e. a WAN link was
>>>>>>> down at the moment of the change.) What will happen after
>>>>>>> re-connection to the topology?
>>>>>>>
>>>>>>> Example:
>>>>>>> Time 0: Masters A, B; topology:  A---B
>>>>>>> Time 1: Master A has lost connection to master B
>>>>>>> Time 2: Master C was added; topology:  A § B---C
>>>>>>> Time 3 (Day 3): A + C did rotation at the same time
>>>>>>> Time 4: Connection was restored;  topology: A---B---C
>>>>>>>
>>>>>>> Now what?
>>>>>>>
>>>>>>>
>>>>>>> I have a feeling that we need something like a quorum protocol for
>>>>>>> writes (only for sensitive operations like CA cert and DNSSEC key
>>>>>>> rotations).
>>>>>>>
>>>>>>> http://en.wikipedia.org/wiki/Quorum_(distributed_computing)
>>>>>>>
>>>>>>>
>>>>>>> The other question is how we should handle catastrophic situations
>>>>>>> where more than half of the masters are lost. (Two of three data
>>>>>>> centres were destroyed by a tornado, etc.)
>>>>>>>
>>>>>> It becomes more and more obvious that there is no simple solution
>>>>>> that we can use out of the box.
>>>>>> Let us start with a single nominated server. If that server is lost,
>>>>>> the key rotation responsibility can be moved to some other server
>>>>>> manually. Not optimal, but at least it is a first step.
>>>>>>
>>>>>> The next step would be to be able to define alternative (failover)
>>>>>> servers. Here is an example.
>>>>>> Let us say we have masters A, B, C in the topology A - B - C.
>>>>>> Master A is responsible for the key rotation; B is the failover.
>>>>>> The key rotation time would be in some way recorded in the
>>>>>> replication agreement(s) between A & B.
>>>>>> If at the moment of the scheduled rotation the A <-> B connection is
>>>>>> not present, A would skip the rotation and B would start it. If A
>>>>>> comes back and connects to B (or the connection is just restored),
>>>>>> replication will update the keys on A. If A is lost, the keys are
>>>>>> taken care of by B for itself and C.
>>>>>> There will be a short race-condition window, but IMO it can be
>>>>>> mitigated. If A's clock is behind B's, then if A manages to connect
>>>>>> to B it will notice that B has already started the rotation. If B's
>>>>>> clock is behind and A connects to B before B starts the rotation, A
>>>>>> still has to perform the rotation (a sort of "just made it" case).
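
A minimal sketch of that failover rule in Python; the helpers
(is_reachable, rotation_started, rotate_keys) are hypothetical:

    def maybe_rotate(me, owner, failover, now, scheduled):
        """Decide whether this server should run the key rotation."""
        if now < scheduled or rotation_started():
            return  # too early, or the rotation already happened

        if me == owner:
            # the owner rotates only while it can reach its failover;
            # otherwise it skips and lets the failover take over
            if is_reachable(failover):
                rotate_keys()
        elif me == failover:
            # the failover steps in only when the owner is unreachable
            if not is_reachable(owner):
                rotate_keys()
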
>>>>>>
>>>>>> Later, if we want more complexity, we can define subsets of the keys
>>>>>> to renew, assign them to different replicas, and then define
>>>>>> failover servers per set.
>>>>>> But this is all complexity we can add later, when we see the real
>>>>>> problems with the single-server approach.
>>>>> Actually I thought about this for a while, and I think I have an idea
>>>>> about how to handle this for DNSSEC (it may not apply to other cases
>>>>> like the CA).
>>>>>
>>>>> IIRC keys are generated well in advance of the time they are used,
>>>>> and old keys and new keys are used side by side for a while, until
>>>>> the old keys finally expire and only the new keys are around.
>>>>>
>>>>> This is regulated by a series of date attributes that determine when
>>>>> keys are in use, when they expire, and so on.
>>>>>
>>>>> Now the idea I have is to add yet another step.
>>>>>
>>>>> Assume we have key "generation 1" (G1) in use and we approach the
>>>>> time when generation 1 will expire and generation 2 (G2) is needed;
>>>>> G2 is created X months in advance and everything is signed with both
>>>>> G1 and G2 for a period.
>>>>>
>>>>> Now, if we have a pre-G2 period, we can have a window in which we let
>>>>> multiple servers try to generate the G2 series, say 1 month in
>>>>> advance of the time the keys would normally be used to start signing
>>>>> anything. Only after that 1 month are they actually put into service.
>>>>>
>>>>> How does this help? Well, it helps in that even if multiple servers
>>>>> generate keys and we have duplicates, they have plenty of time to see
>>>>> that there are duplicates (because two servers raced).
>>>>> Now, if we keep a subsecond 'creation' timestamp for the new keys,
>>>>> then once replication goes around, all servers can check and use only
>>>>> the set of keys that was created first, and the servers that created
>>>>> the sets that lost the race will just remove the duplicates.
>>>>> Given we have 1 month between creation and the time the keys will
>>>>> actually be used, there is plenty of time to let servers sort out
>>>>> which keys are available and prune out the duplicates.
>>>>>
>>>>> A diagram in case I have not been clear enough
>>>>>
>>>>>
>>>>> Assume servers A, B, C; they all randomize (within a week) the time
>>>>> at which they will attempt to create new keys, if it is time to and
>>>>> none are available already.
>>>>>
>>>>> Say the time comes to create G2. A, B, C each throw a die, and it
>>>>> turns out A will do it in 35000 seconds, B in 40000 seconds, and C
>>>>> in 32000 seconds, so C should do it first and there should be enough
>>>>> time for the others to see that new keys popped up and just discard
>>>>> their attempts.
>>>>>
>>>>> However, if A and C are temporarily disconnected from each other,
>>>>> they may still both end up generating new keys, so we have G2-A and
>>>>> G2-C. Once they get reconnected and replication flows again, all
>>>>> servers see that instead of a single G2 set there are two G2 sets
>>>>> available: G2-A created at timestamp X+35000 and G2-C created at
>>>>> timestamp X+32000. So all servers know they should ignore G2-A, and
>>>>> they all ignore it. When A comes around to realize this itself, it
>>>>> will just go and delete the G2-A set. Only the G2-C set is left, and
>>>>> that is what will be the final official G2.
>>>>>
>>>>> If we give this operation a week to play out, I think it will be easy
>>>>> to resolve any race or temporary disconnection that may happen.
>>>>> Also, because all servers can attempt (within that week) to create
>>>>> keys, there is no real single point of failure.
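
A minimal sketch of that resolution step in Python; the key-set
representation and the delete_key_set() helper are hypothetical:

    import random

    # each server waits a random time within the one-week window
    # before attempting to generate the next key generation
    delay = random.uniform(0, 7 * 24 * 3600)

    def resolve_duplicates(candidate_sets, my_name):
        """candidate_sets: list of (creator, creation_ts, keys) tuples
        with subsecond timestamps, as seen once replication settles."""
        winner = min(candidate_sets, key=lambda s: s[1])
        for creator, created, keys in candidate_sets:
            # servers that lost the race delete their own duplicate sets
            if created != winner[1] and creator == my_name:
                delete_key_set(keys)  # delete_key_set() is made up
        return winner  # the set all servers treat as the official G2
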
>>>>>
>>>>> HTH,
>>>>> please poke holes in my reasoning :)
>>>>>
>>>>> Simo.
>>>>>
>>>> Reasonable; I just have a couple of comments.
>>>> If there are many keys and many replicas, chances are there will be a
>>>> lot of load. Generating keys is computationally costly. Replication is
>>>> costly too.
>>>> Also, you assume that the topology works fine. I am mostly concerned
>>>> about the case when some replication is not working and data from one
>>>> part of the topology is not replicated to another. The concern is that
>>>> people would not notice that things are not replicating. So if there
>>>> is a problem and we let all these keys be generated all over the
>>>> place, it would be pretty hard to untie this knot later.
>>>>
>>>> I would actually suggest that if a replica X needs the keys a month
>>>> after moment A, the keys have not arrived within the first 3 days
>>>> after moment A, and this replica is not entitled to generate keys, it
>>>> should start sending messages to the admin. That way there will be
>>>> enough time for the admin to sort out what is wrong and nominate
>>>> another replica to generate the keys if needed. There should be a
>>>> command as simple as:
>>>>
>>>> ipa dnssec-keymanager-set <replica>
>>>>
>>>> that would make the mentioned replica the key generator.
>>>> There could be other commands, like:
>>>>
>>>> ipa dnssec-keymanager-info
>>>>
>>>> Appointed server: <server>
>>>> Keys store: <path>
>>>> Last time keys generated: <some time>
>>>> Next time keys need to be generated: <...>
>>>> ...
>>>>
>>>>
>>>>
>>>>
>>>> IMO in this case we need to help the admin see that there is a problem
>>>> and provide tools to easily mitigate it, rather than try to solve it
>>>> ourselves and build a complex algorithm.
>>>>
>>> Thinking even more about this:
>>> maybe we should start with a command that would be something like:
>>>
>>> ipa health
>>>
>>> This command would detect the topology, try to connect to all replicas,
>>> check that they are all up and running and replicating, and that
>>> nothing is stuck, and report any issues.
>>> The output of the command could be logged somewhere or mailed to the
>>> admin.
>>>
>>> Then it could be run periodically from cron on a couple of servers, and
>>> if there is any problem the admin would know quite soon.
>>> The admin would learn things like:
>>> 1) The CRL generating server is down/unreachable
>>> 2) The DNSSEC key generating server is down/unreachable
>>> 3) Some CAs are unreachable
>>> 4) The server that rotates certificates is down/unreachable
>>> 5) The server that does AD sync is down/unreachable
>>>
>>> There might be other things.
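
A minimal sketch of the reachability part of such a check in Python,
with made-up replica names; real checks (replication status, the CRL
and key-rotation masters, etc.) would have to go much further:

    import socket

    def check_replicas(replicas, port=389, timeout=5):
        """Report replicas whose LDAP port cannot be reached."""
        problems = []
        for host in replicas:
            try:
                socket.create_connection((host, port), timeout).close()
            except (socket.error, socket.timeout):
                problems.append('%s is down/unreachable' % host)
        return problems

    print(check_replicas(['ipa1.example.com', 'ipa2.example.com']))
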
>>> IMO we have enough single-point-of-failure services already. Adding
>>> DNSSEC key generation to that set is not a big deal, but a utility like
>>> this would really go a long way toward making IPA more usable,
>>> manageable and useful.
>>>
>>> Should I file an RFE?
>> The tool you describe above would need to perform operations on the
>> master. It is in general better not to put master-specific operations
>> into a client tool that could be run from an arbitrary host where the
>> ipa admin tools are installed.
>>
>> What about plugging the functionality into ipa-advise?
>>
>>    ipa-advise health-check-{cert|replication|dnssec|...}
>
> I agree with the health-check idea and also with the modular approach
> proposed by Alexander.

I assume you mean "a" master rather than "the" master :-)
Running this command on any master would be fine.

If it makes sense as an ipa-advise option I am fine with it too.

If others agree then let us open a ticket to add this functionality to
ipa-advise.

>
> Side note: I think that the tool should have an option to enable
> machine-parseable output, because it would allow third parties to
> connect it to monitoring systems like Zabbix etc.

Yes.
Agree.
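
For illustration, a machine-readable report could look something like
this (a made-up format, not an agreed-upon design):

    {"status": "ERROR",
     "checks": [
       {"check": "replication", "server": "ipa1.example.com",
        "ok": true},
       {"check": "dnssec-keymanager", "server": "ipa2.example.com",
        "ok": false, "detail": "server unreachable"}]}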

>
> Today I spent some time analyzing Simo's proposal and I wasn't able to
> find a hole in it so far. It seems like a good idea, and the added code
> complexity should be relatively small. For that reason I vote for
> implementing it before we declare DNSSEC 'stable'.

Should we treat this functionality independently from the tool?
I am concerned about the volume of load and replication. I think it
should be an option: either a single master generates the keys, or you
can enable other masters to generate them, and if they are enabled they
would follow the algorithm proposed by Simo.

>
> Don't forget that the whole infrastructure will break if the DNSSEC keys
> are not updated in time, and that the rotation happens several times
> each month.
>
True, but it is better if it is clear why it breaks, the fix is easy, and
getting back online does not require a complex procedure.

-- 
Thank you,
Dmitri Pal

Sr. Engineering Manager for IdM portfolio
Red Hat Inc.

