Planning TLS certificate renewals – define a process
As part of the KeyChest re-design, we created a sequence of meaningful checks for TLS certificates, so that they are always renewed before your web services go down.
You have probably received renewal reminders for your web or other certificates. They arrive a fixed number of days before expiration. One example of such a system is Let’s Encrypt, which sends the first reminder 20 days before expiration and the last one when your certificate expires (assuming you failed to renew it, or rather when it thinks you didn’t renew it).
Drawbacks of fixed reminders
I was never happy with this “fixed days” approach. I like to see a reason behind actions. As there’s little more than a “countdown” behind the usual reminders, people often fail to act on them until the last moment. Alternatively, we create our own ad-hoc definition of urgency, start ignoring reminders that come too early, and occasionally ignore one too many.
This increases the probability of our web services becoming unavailable, as we start relying on the “last reminder”, which we may miss or which may never arrive for some reason.
What I wanted to design was a process that would be simple to describe, with a definition that is easy to automate, and that sounds sensible to anyone without knowledge of certificates, PKI, or encryption.
Dissecting key / certificate renewals
When you look at the process of renewing certificates, there are three actions that need to be completed – regardless of whether you use Let’s Encrypt to get your free certificates, or whether you use expensive certificates with Extended Validation.
- request a new certificate – it includes generation of a new key and submitting a proper request to your issuing authority;
- issue the certificate – this step is managed by the issuing authority but you need to provide information and co-operation as needed; and
- deploy the new certificate – that’s the step when you put the new certificate where it is needed and make your application(s) start using it.
Each of these steps needs some time to complete. I imagine you’d say that you use Let’s Encrypt and it’s all done and dusted in about 10 seconds, and you’re right … if all goes well. However, we want the new certificate to be deployed on time EVERY time. As there is no such thing as 100%, let’s aim for something like 99.9% or 99.99%, which is 10-100 times above the reliability of the Let’s Encrypt CA (Let’s Encrypt uptime is 99.9%, or 98.8% without defects, in 2017). In terms of MTBF (mean time between failures) and MDT (mean down time), it was 11.3 days and 0.24 days, respectively.
Configuring renewal process
So how do we make sure a new certificate is always there on time? We need to set proper lead times, and for that we need to answer three “how long” questions:
- how long does it take to request a certificate?
- how long does it take to validate a request and get a certificate issued?
- how long does it take to apply a certificate to my application?
As Let’s Encrypt works instantly for most users, one may be forgiven for thinking that steps 1 and 2 take “zero” time, and step 3 as well. It is an understandable belief when your experience comes from a few dozen certificate renewals.
The longest downtime of LE was just under 7 days in its first 18 months of being operational. Whilst this seems to be an outlier based on my analysis of incidents, we may say that 14 days is a “safe” maximum time for issuing a certificate – i.e., step 2.
Similarly, if you use DNS-based validation of your requests, you may need some time to update the DNS records to allow request validation. If this is the case, the lead time needs an appropriate adjustment.
Step 1 will depend, in the situation of automated LE renewals, on how often your renewal scripts are running. It would usually be a day, but it can be more than that.
Step 3 can be done instantaneously when applications pick up changes automatically, or it can be delayed if a manual application restart or reload is needed – let’s say up to 5 days, as we don’t work at weekends and we may be off for a couple of days.
These parameters define the time window for certificate renewals and the only other parameter is an offset from the “technical expiration”.
In this case, the renewal window starts 27 days before expiration. This number is the sum of the following lead times:
- Step 1 – 1 day
- Step 2 – 14 days
- Step 3 – 5 days
- Offset – 7 days before the certificate expires, i.e., the point at which the main boss will be told that someone has screwed up.
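The arithmetic behind the 27 days can be sketched in a few lines of Python. The names `LeadTimes` and `renewal_window_start` are hypothetical and exist only for this illustration; the default values are the ones from the list above, and the expiration date is made up.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class LeadTimes:
    """Lead times in days; the defaults are the values from the list above."""
    request_days: int = 1    # Step 1
    issue_days: int = 14     # Step 2
    deploy_days: int = 5     # Step 3
    offset_days: int = 7     # safety offset before the technical expiration

    def total_days(self) -> int:
        """Number of days before expiration at which the window opens."""
        return (self.request_days + self.issue_days
                + self.deploy_days + self.offset_days)


def renewal_window_start(not_after: datetime, lead: LeadTimes) -> datetime:
    """Return the moment the renewal window opens for a certificate."""
    return not_after - timedelta(days=lead.total_days())


# Example with a made-up expiration date.
expires = datetime(2018, 3, 31)
print(LeadTimes().total_days())                     # 27
print(renewal_window_start(expires, LeadTimes()))   # 2018-03-04 00:00:00
```

Keeping the four values separate, rather than storing only the total, makes it possible to tune each one per service later.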
Now, the nice thing is that we can start automating the computation of good values for some of the parameters:
- Step 1 – computing smallest common denominator for …;
- Step 2 – from statistical / performance data of the certificate issuer and of your own certificates;
- Step 3 – from historical data of each service.
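As a rough illustration of how such a value might be derived from historical data, the sketch below takes a list of observed durations (issuer outages for Step 2, or deployment delays for Step 3) and turns the worst observed case into a whole-day lead time. The function name `lead_time_from_history`, the safety factor of 2 and the sample data are all assumptions made for this example; they are not taken from KeyChest.

```python
import math
from typing import Sequence


def lead_time_from_history(durations_days: Sequence[float],
                           safety_factor: float = 2.0,
                           minimum_days: int = 1) -> int:
    """Turn observed durations (in days) into a whole-day lead time.

    The worst observed duration is multiplied by a safety factor and
    rounded up; an empty history falls back to the minimum.
    """
    if not durations_days:
        return minimum_days
    worst = max(durations_days)
    return max(minimum_days, math.ceil(worst * safety_factor))


# Made-up outage durations in days; the longest is just under 7 days.
issuer_outages = [0.1, 0.24, 0.5, 6.9]
print(lead_time_from_history(issuer_outages))   # 14
```

Using the maximum is deliberately pessimistic. A high percentile would be a reasonable alternative if the history contains a single extreme outlier.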
These 3 or 4 parameters fully describe the whole renewal process and define when the renewal time window should start. They also define three points at which it makes sense to send notifications of possible problems. While the operations team should get a kick-off message, the subsequent notifications can be escalations.
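One possible reading of these notification points, sketched in Python: the kick-off is sent when the window opens, and an escalation becomes due whenever the deadline for a step passes without that step being completed. The function `notification_schedule`, the message texts and the mapping of deadlines to steps are assumptions made for illustration.

```python
from datetime import date, timedelta
from typing import List, Tuple


def notification_schedule(expires: date,
                          request_days: int = 1,
                          issue_days: int = 14,
                          deploy_days: int = 5,
                          offset_days: int = 7) -> List[Tuple[date, str]]:
    """Return (date, message) pairs: a kick-off plus one escalation per step."""
    total = request_days + issue_days + deploy_days + offset_days
    kickoff = expires - timedelta(days=total)
    request_due = kickoff + timedelta(days=request_days)
    issue_due = request_due + timedelta(days=issue_days)
    deploy_due = issue_due + timedelta(days=deploy_days)
    return [
        (kickoff, "kick-off: renewal window is open"),
        (request_due, "escalation: certificate not requested yet"),
        (issue_due, "escalation: certificate not issued yet"),
        (deploy_due, "escalation: certificate not deployed yet"),
    ]


# Made-up expiration date; each line also shows how many days remain.
expires = date(2018, 3, 31)
for when, message in notification_schedule(expires):
    print(when, (expires - when).days, message)
```

With the default lead times, the four messages fall 27, 26, 12 and 7 days before expiration. An escalation would only be sent if the corresponding step is still open on that date.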
As a result, we can replace a “random” five or more messages with a set of very specific messages linked to particular steps of the renewal.
The second picture shows the security aspect of the certificate renewal process. The first risk is when a certificate is renewed before it’s due. The second notification is mostly relevant to situations when the certificate is compromised and it should be revoked within a defined time.
I believe this reasoning is more suitable in the context of IT operations and it also potentially improves responsiveness of people on the receiving side. It is the default approach in our re-designed KeyChest service, which we are preparing for launch within weeks.