[Freeipa-devel] [PATCH] 1079 address CA subsystem renewal issues

Tue Jan 15 15:53:23 UTC 2013

On 01/15/2013 03:41 PM, Petr Viktorin wrote:
> On 01/14/2013 10:56 PM, Rob Crittenden wrote:
>> Petr Viktorin wrote:
>>> On 01/12/2013 12:49 AM, Rob Crittenden wrote:
>>>> Rob Crittenden wrote:
>>>>> Petr Viktorin wrote:
>>>>>> On 01/07/2013 05:42 PM, Rob Crittenden wrote:
>>>>>>> Petr Viktorin wrote:
>>>>>>>> On 01/07/2013 03:09 PM, Rob Crittenden wrote:
>>>>>>>>> Petr Viktorin wrote:
>>>>>> [...]
>>>>>>>>>>
>>>>>>>>>> Works for me, but I have some questions (this is an area I know
>>>>>>>>>> little
>>>>>>>>>> about).
>>>>>>>>>>
>>>>>>>>>> Can we be 100% sure these certs are always renewed together? Is
>>>>>>>>>> certmonger the only possible mechanism to update them?
>>>>>>>>>
>>>>>>>>> You raise a good point. If though some mechanism someone replaces
>>>>>>>>> one of
>>>>>>>>> these certs it will cause the script to fail. Some notification of
>>>>>>>>> this
>>>>>>>>> failure will be logged though, and of course, the certs won't be
>>>>>>>>> renewed.
>>>>>>>>>
>>>>>>>>> One could conceivably manually renew one of these certificates.
>>>>>>>>> It is
>>>>>>>>> probably a very remote possibility but it is non-zero.
>>>>>>>>>
>>>>>>>>>> Can we be sure certmonger always does the updates in parallel?
>>>>>>>>>> If it
>>>>>>>>>> managed to update the audit cert before starting on the others,
>>>>>>>>>> we'd
>>>>>>>>>> get
>>>>>>>>>> no CA restart for the others.
>>>>>>>>>
>>>>>>>>> These all get issued at the same time so should expire at the same
>>>>>>>>> time
>>>>>>>>> as well (see problem above). The script will hang around for 10
>>>>>>>>> minutes
>>>>>>>>> waiting for the renewal to complete, then give up.
>>>>>>>>
>>>>>>>> The certs might take different amounts of time to update, right?
>>>>>>>> Eventually, the expirations could go out of sync enough for it to
>>>>>>>> matter.
>>>>>>>> AFAICS, without proper locking we still get a race condition when
>>>>>>>> the
>>>>>>>> other certs start being renewed some time (much less than 10 min)
>>>>>>>> after
>>>>>>>> the audit one:
>>>>>>>>
>>>>>>>> (time axis goes down)
>>>>>>>>
>>>>>>>>          audit cert                  other cert
>>>>>>>>          ----------                  ----------
>>>>>>>>      certmonger does renew                .
>>>>>>>>    post-renew script starts               .
>>>>>>>>   check state of other certs: OK          .
>>>>>>>>              .                   certmonger starts renew
>>>>>>>>   certutil modifies NSS DB  +  certmonger modifies NSS DB  == boom!
>>>>>>>
>>>>>>> This can't happen because we count the # of expected certs and wait
>>>>>>> until all are in MONITORING before continuing.
>>>>>>
>>>>>> The problem is that they're also in MONITORING before the whole
>>>>>> renewal
>>>>>> starts. If the script happens to check just before the state changes
>>>>>> from MONITORING to GENERATING_CSR or whatever, we can get corruption.
>>>>>>
>>>>>>> The worse that would
>>>>>>> happen is the trust wouldn't be set on the audit cert and dogtag
>>>>>>> wouldn't be restarted.
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>> The state the system would be in is this:
>>>>>>>>>
>>>>>>>>> - audit cert trust not updated, so next restart of CA will fail
>>>>>>>>> - CA is not restarted so will not use updated certificates
>>>>>>>>>
>>>>>>>>>> And anyway, why does certmonger do renewals in parallel? It seems
>>>>>>>>>> that
>>>>>>>>>> if it did one at a time, always waiting until the post-renew
>>>>>>>>>> script is
>>>>>>>>>> done, this patch wouldn't be necessary.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>  From what Nalin told me certmonger has some coarse locking such
>>>>>>>>> that
>>>>>>>>> renewals in a the same NSS database are serialized. As you point
>>>>>>>>> out, it
>>>>>>>>> would be nice to extend this locking to the post renewal
>>>>>>>>> scripts. We
>>>>>>>>> can
>>>>>>>>> ask Nalin about it. That would fix the potential corruption issue.
>>>>>>>>> It is
>>>>>>>>> still much nicer to not have to restart dogtag 4 times.
>>>>>>>>>
>>>>>>>>
>>>>>>>> Well, three extra restarts every few years seems like a small
>>>>>>>> price to
>>>>>>>> pay for robustness.
>>>>>>>
>>>>>>> It is a bit of a problem though because the certs all renew within
>>>>>>> seconds so end up fighting over who is restarting dogtag. This can
>>>>>>> cause
>>>>>>> some renewals go into a failure state to be retried later. This is
>>>>>>> fine
>>>>>>> functionally but makes QE a bit of a pain. You then have to make
>>>>>>> sure
>>>>>>> that renewal is basically done, then restart certmonger and check
>>>>>>> everything again, over and over until all the certs are renewed.
>>>>>>> This is
>>>>>>> difficult to automate.
>>>>>>
>>>>>> So we need to extend the certmonger lock, and wait until Dogtag is
>>>>>> back
>>>>>> up before exiting the script. That way it'd still take longer than 1
>>>>>> restart, but all the renews should succeed.
>>>>>>
>>>>>
>>>>> Right, but older dogtag versions don't have the handy servlet to tell
>>>>> that the service is actually up and responding. So it is difficult to
>>>>> tell from tomcat alone whether the CA is actually up and handling
>>>>> requests.
>>>>>
>>>>
>>>> Revised patch that takes advantage of new version of certmonger.
>>>> certmonger-0.65 adds locking from the time renewal begins to the end of
>>>> the post_save_command. This lets us be sure that no other certmonger
>>>> renewals will have the NSS database open in read-write mode.
>>>>
>>>> We need to be sure that tomcat is shut down before we let certmonger
>>>> save the certificate to the NSS database because dogtag opens its
>>>> database read/write and two writers can cause corruption.
>>>>
>>>> rob
>>>>
>>>
>>> stop_pkicad and start_pkicad need the Dogtag version check to select
>>> pki_cad/pki_tomcatd.
>>
>> Fixed.
>>
>>>
>>> A more serious issue is that stop_pkicad needs to be installed on
>>> upgrades. Currently the whole enable_certificate_renewal step in
>>> ipa-upgradeconfig is skipped if it was done before.
>>
>> I added a separate upgrade test for this. It currently won't work in
>> SELinux enforcing mode because certmonger isn't allowed to talk to dbus
>> in an rpm post script. It's being looked at.
>>
>>> In stop_pkicad can you change the first log message to "certmonger
>>> stopping %sd"? It's before the action so we don't want past tense.
>>
>> Fixed.
>>
>> rob
>
> I get a bunch of errors when installing the RPM:
>
[...]
>

This is the SELinux issue you were talking about. Sorry for not catching 
that.

With enforcing off, the patch looks & works well for me. I'm just 
concerned about this change in ipa-upgradeconfig:

@@ -707,7 +754,7 @@ def main():
          # configuration has changed, restart the name server
          root_logger.info('Changes to named.conf have been made, 
restart named')
          bindinstance.BindInstance(fstore).restart()
-    ca_restart = ca_restart or enable_certificate_renewal(ca) or 
upgrade_ipa_profile(ca, api.env.domain, fqdn)
+    ca_restart = ca_restart or enable_certificate_renewal(ca) or 
upgrade_ipa_profile(ca, api.env.domain, fqdn) or 
certificate_renewal_stop_ca(ca)

If the enable_certificate_renewal step was done already, but 
upgrade_ipa_profile requests a CA restart, then the short-circuiting 
`or` will be satisfied and certificate_renewal_stop_ca won't be run.

Since each upgrade step has its own checking, I think it would be safer 
to use something like:
     ca_restart = certificate_renewal_stop_ca(ca) or ca_restart

or even:
ca_restart = any([
     ca_restart,
     enable_certificate_renewal(ca),
     upgrade_ipa_profile(ca, api.env.domain, fqdn),
     certificate_renewal_stop_ca(ca),
])

-- 
Petr³