[Freeipa-devel] [PATCH] 1079 address CA subsystem renewal issues

Tue Jan 29 17:54:37 UTC 2013

Petr Viktorin wrote:
> On 01/15/2013 05:15 PM, Rob Crittenden wrote:
>> Petr Viktorin wrote:
>>> On 01/15/2013 03:41 PM, Petr Viktorin wrote:
>>>> On 01/14/2013 10:56 PM, Rob Crittenden wrote:
>>>>> Petr Viktorin wrote:
>>>>>> On 01/12/2013 12:49 AM, Rob Crittenden wrote:
>>>>>>> Rob Crittenden wrote:
>>>>>>>> Petr Viktorin wrote:
>>>>>>>>> On 01/07/2013 05:42 PM, Rob Crittenden wrote:
>>>>>>>>>> Petr Viktorin wrote:
>>>>>>>>>>> On 01/07/2013 03:09 PM, Rob Crittenden wrote:
>>>>>>>>>>>> Petr Viktorin wrote:
>>>>>>>>> [...]
>>>>>>>>>>>>>
>>>>>>>>>>>>> Works for me, but I have some questions (this is an area I
>>>>>>>>>>>>> know
>>>>>>>>>>>>> little
>>>>>>>>>>>>> about).
>>>>>>>>>>>>>
>>>>>>>>>>>>> Can we be 100% sure these certs are always renewed
>>>>>>>>>>>>> together? Is
>>>>>>>>>>>>> certmonger the only possible mechanism to update them?
>>>>>>>>>>>>
>>>>>>>>>>>> You raise a good point. If though some mechanism someone
>>>>>>>>>>>> replaces
>>>>>>>>>>>> one of
>>>>>>>>>>>> these certs it will cause the script to fail. Some
>>>>>>>>>>>> notification of
>>>>>>>>>>>> this
>>>>>>>>>>>> failure will be logged though, and of course, the certs
>>>>>>>>>>>> won't be
>>>>>>>>>>>> renewed.
>>>>>>>>>>>>
>>>>>>>>>>>> One could conceivably manually renew one of these certificates.
>>>>>>>>>>>> It is
>>>>>>>>>>>> probably a very remote possibility but it is non-zero.
>>>>>>>>>>>>
>>>>>>>>>>>>> Can we be sure certmonger always does the updates in parallel?
>>>>>>>>>>>>> If it
>>>>>>>>>>>>> managed to update the audit cert before starting on the
>>>>>>>>>>>>> others,
>>>>>>>>>>>>> we'd
>>>>>>>>>>>>> get
>>>>>>>>>>>>> no CA restart for the others.
>>>>>>>>>>>>
>>>>>>>>>>>> These all get issued at the same time so should expire at the
>>>>>>>>>>>> same
>>>>>>>>>>>> time
>>>>>>>>>>>> as well (see problem above). The script will hang around for 10
>>>>>>>>>>>> minutes
>>>>>>>>>>>> waiting for the renewal to complete, then give up.
>>>>>>>>>>>
>>>>>>>>>>> The certs might take different amounts of time to update, right?
>>>>>>>>>>> Eventually, the expirations could go out of sync enough for
>>>>>>>>>>> it to
>>>>>>>>>>> matter.
>>>>>>>>>>> AFAICS, without proper locking we still get a race condition
>>>>>>>>>>> when
>>>>>>>>>>> the
>>>>>>>>>>> other certs start being renewed some time (much less than 10
>>>>>>>>>>> min)
>>>>>>>>>>> after
>>>>>>>>>>> the audit one:
>>>>>>>>>>>
>>>>>>>>>>> (time axis goes down)
>>>>>>>>>>>
>>>>>>>>>>>          audit cert                  other cert
>>>>>>>>>>>          ----------                  ----------
>>>>>>>>>>>      certmonger does renew                .
>>>>>>>>>>>    post-renew script starts               .
>>>>>>>>>>>   check state of other certs: OK          .
>>>>>>>>>>>              .                   certmonger starts renew
>>>>>>>>>>>   certutil modifies NSS DB  +  certmonger modifies NSS DB  ==
>>>>>>>>>>> boom!
>>>>>>>>>>
>>>>>>>>>> This can't happen because we count the # of expected certs and
>>>>>>>>>> wait
>>>>>>>>>> until all are in MONITORING before continuing.
>>>>>>>>>
>>>>>>>>> The problem is that they're also in MONITORING before the whole
>>>>>>>>> renewal
>>>>>>>>> starts. If the script happens to check just before the state
>>>>>>>>> changes
>>>>>>>>> from MONITORING to GENERATING_CSR or whatever, we can get
>>>>>>>>> corruption.
>>>>>>>>>
>>>>>>>>>> The worse that would
>>>>>>>>>> happen is the trust wouldn't be set on the audit cert and dogtag
>>>>>>>>>> wouldn't be restarted.
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> The state the system would be in is this:
>>>>>>>>>>>>
>>>>>>>>>>>> - audit cert trust not updated, so next restart of CA will fail
>>>>>>>>>>>> - CA is not restarted so will not use updated certificates
>>>>>>>>>>>>
>>>>>>>>>>>>> And anyway, why does certmonger do renewals in parallel? It
>>>>>>>>>>>>> seems
>>>>>>>>>>>>> that
>>>>>>>>>>>>> if it did one at a time, always waiting until the post-renew
>>>>>>>>>>>>> script is
>>>>>>>>>>>>> done, this patch wouldn't be necessary.
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>  From what Nalin told me certmonger has some coarse locking
>>>>>>>>>>>> such
>>>>>>>>>>>> that
>>>>>>>>>>>> renewals in a the same NSS database are serialized. As you
>>>>>>>>>>>> point
>>>>>>>>>>>> out, it
>>>>>>>>>>>> would be nice to extend this locking to the post renewal
>>>>>>>>>>>> scripts. We
>>>>>>>>>>>> can
>>>>>>>>>>>> ask Nalin about it. That would fix the potential corruption
>>>>>>>>>>>> issue.
>>>>>>>>>>>> It is
>>>>>>>>>>>> still much nicer to not have to restart dogtag 4 times.
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Well, three extra restarts every few years seems like a small
>>>>>>>>>>> price to
>>>>>>>>>>> pay for robustness.
>>>>>>>>>>
>>>>>>>>>> It is a bit of a problem though because the certs all renew
>>>>>>>>>> within
>>>>>>>>>> seconds so end up fighting over who is restarting dogtag. This
>>>>>>>>>> can
>>>>>>>>>> cause
>>>>>>>>>> some renewals go into a failure state to be retried later.
>>>>>>>>>> This is
>>>>>>>>>> fine
>>>>>>>>>> functionally but makes QE a bit of a pain. You then have to make
>>>>>>>>>> sure
>>>>>>>>>> that renewal is basically done, then restart certmonger and check
>>>>>>>>>> everything again, over and over until all the certs are renewed.
>>>>>>>>>> This is
>>>>>>>>>> difficult to automate.
>>>>>>>>>
>>>>>>>>> So we need to extend the certmonger lock, and wait until Dogtag is
>>>>>>>>> back
>>>>>>>>> up before exiting the script. That way it'd still take longer
>>>>>>>>> than 1
>>>>>>>>> restart, but all the renews should succeed.
>>>>>>>>>
>>>>>>>>
>>>>>>>> Right, but older dogtag versions don't have the handy servlet to
>>>>>>>> tell
>>>>>>>> that the service is actually up and responding. So it is
>>>>>>>> difficult to
>>>>>>>> tell from tomcat alone whether the CA is actually up and handling
>>>>>>>> requests.
>>>>>>>>
>>>>>>>
>>>>>>> Revised patch that takes advantage of new version of certmonger.
>>>>>>> certmonger-0.65 adds locking from the time renewal begins to the
>>>>>>> end of
>>>>>>> the post_save_command. This lets us be sure that no other certmonger
>>>>>>> renewals will have the NSS database open in read-write mode.
>>>>>>>
>>>>>>> We need to be sure that tomcat is shut down before we let certmonger
>>>>>>> save the certificate to the NSS database because dogtag opens its
>>>>>>> database read/write and two writers can cause corruption.
>>>>>>>
>>>>>>> rob
>>>>>>>
>>>>>>
>>>>>> stop_pkicad and start_pkicad need the Dogtag version check to select
>>>>>> pki_cad/pki_tomcatd.
>>>>>
>>>>> Fixed.
>>>>>
>>>>>>
>>>>>> A more serious issue is that stop_pkicad needs to be installed on
>>>>>> upgrades. Currently the whole enable_certificate_renewal step in
>>>>>> ipa-upgradeconfig is skipped if it was done before.
>>>>>
>>>>> I added a separate upgrade test for this. It currently won't work in
>>>>> SELinux enforcing mode because certmonger isn't allowed to talk to
>>>>> dbus
>>>>> in an rpm post script. It's being looked at.
>>>>>
>>>>>> In stop_pkicad can you change the first log message to "certmonger
>>>>>> stopping %sd"? It's before the action so we don't want past tense.
>>>>>
>>>>> Fixed.
>>>>>
>>>>> rob
>>>>
>>>> I get a bunch of errors when installing the RPM:
>>>>
>>> [...]
>>>>
>>>
>>> This is the SELinux issue you were talking about. Sorry for not catching
>>> that.
>>>
>>> With enforcing off, the patch looks & works well for me. I'm just
>>> concerned about this change in ipa-upgradeconfig:
>>>
>>> @@ -707,7 +754,7 @@ def main():
>>>           # configuration has changed, restart the name server
>>>           root_logger.info('Changes to named.conf have been made,
>>> restart named')
>>>           bindinstance.BindInstance(fstore).restart()
>>> -    ca_restart = ca_restart or enable_certificate_renewal(ca) or
>>> upgrade_ipa_profile(ca, api.env.domain, fqdn)
>>> +    ca_restart = ca_restart or enable_certificate_renewal(ca) or
>>> upgrade_ipa_profile(ca, api.env.domain, fqdn) or
>>> certificate_renewal_stop_ca(ca)
>>>
>>> If the enable_certificate_renewal step was done already, but
>>> upgrade_ipa_profile requests a CA restart, then the short-circuiting
>>> `or` will be satisfied and certificate_renewal_stop_ca won't be run.
>>>
>>> Since each upgrade step has its own checking, I think it would be safer
>>> to use something like:
>>>      ca_restart = certificate_renewal_stop_ca(ca) or ca_restart
>>>
>>> or even:
>>> ca_restart = any([
>>>      ca_restart,
>>>      enable_certificate_renewal(ca),
>>>      upgrade_ipa_profile(ca, api.env.domain, fqdn),
>>>      certificate_renewal_stop_ca(ca),
>>> ])
>>>
>>
>> I like this suggestion very much. Updated patch attached.
>>
>> rob
>>
>
> ACK, just remove the trailing space in the `]) ` line.
>
>
> We'll need to make sure the SELinux issue isn't forgotten.
>

I was waiting for a fixed selinux-policy package to get pushed and it 
finally has.

Pushed patch to master, ipa-3-1 and ipa-3-0

rob