If you think SSL certificates are dangerous, consider the dangers of NOT using them, especially for a service that is a central repository of artifacts meant to be automatically deployed.
It is not about encryption (for that, a self-signed certificate lasting until 2035 would suffice) but verification: who am I talking to? Reaching the right server can be subverted via DNS or routing, among other things. Yes, that adds complexity, but this is more about trust than technology.
And once you recognize that it is essential to have a trusted service, give it the proper instrumentation to ensure that it works properly, including monitoring, expiration alerts, and documentation, rather than dismissing it with "it works".
May we retitle the post as "The dangers of not understanding SSL Certificates"?
Debian’s apt does not use SSL as far as I know, and I am not aware of any serious security disaster. Their packages are signed and the content is not considered confidential.
Debian 13 uses https://deb.debian.org by default. Even the upgrade docs from 12 to 13 mention the https variant. They were quite hostile to https for a while, but now it seems they have bitten the bullet.
If I'm not mistaken, apt repositories have very similar failure modes - just using PGP certs instead of SSL certs. The repository signing key can still expire or get revoked, and you'll have an even harder time getting every client to install a new one...
The selection of packages installed on a server should be treated as confidential, but you could probably infer it from file sizes.
You need external monitoring of certificate validity. Your ACME client might not be sending failure notifications properly (as happened to Bazel here). The client could also think everything is OK because it acquired a new cert, while the certificate isn't actually installed properly (e.g., a service wasn't reloaded and keeps using the old cert).
I have a simple Python script that runs every day and checks the certificates of multiple sites.
One time this script signaled that a cert was close to expiring even though I saw a newer cert in my browser. It turned out that I had accidentally launched another reverse proxy instance which was stuck on the old cert. Requests were randomly passed to either instance. The script helped me correct this mistake before it caused issues.
100%, I've run into this too. I wrote some minimal scripts in Bash, Python, Ruby, Node.js (JavaScript), Go, and PowerShell that send a request and alert if the expiration is less than 14 days away (https://heyoncall.com/blog/barebone-scripts-to-check-ssl-cer...), because anyone who's operating a TLS-secured website (which is... basically anyone with a website) should have at least that level of automated sanity check. We're talking about ~10 lines of Python!
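Roughly, the Python version boils down to a sketch like this (not the exact script from the post; the host list, threshold, and alert action are placeholders to adapt):

    import socket
    import ssl
    import time

    HOSTS = ["example.com", "www.example.org"]  # placeholders: the sites you actually serve
    THRESHOLD_DAYS = 14

    def days_until_expiry(host, port=443):
        # Note: an already-expired cert will fail the handshake here instead of returning a number.
        ctx = ssl.create_default_context()
        with socket.create_connection((host, port), timeout=10) as sock:
            with ctx.wrap_socket(sock, server_hostname=host) as tls:
                not_after = ssl.cert_time_to_seconds(tls.getpeercert()["notAfter"])
        return (not_after - time.time()) / 86400

    for host in HOSTS:
        remaining = days_until_expiry(host)
        if remaining < THRESHOLD_DAYS:
            # Replace this print with whatever actually pages you (mail, Slack, PagerDuty, ...).
            print(f"ALERT: {host} cert expires in {remaining:.1f} days")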
There is a Prometheus exporter called ssl_exporter that lets Grafana display a dashboard of all your certs and their expirations. The trick, though, is that you need to know where all your certs are located. We were using Venafi for auto-discovery, but a simple script that essentially nmaps your network provides the same functionality.
TLS certificates… SSL is some old Java anachronism.
> There’s no natural signal back to the operators that the SSL certificate is getting close to expiry.
There is. The notAfter date is right there in the certificate itself. Just look at it with openssl x509 -noout -enddate (or -text) and set yourself up some alerts… it's so frustrating having to refute such random bs every time I talk to clients, because some guy on the internet has no idea but blogs about their own inefficiencies.
Furthermore, their auto-renew should have been failing loud and clear; everyone should have known from metrics or logs… but nobody noticed anything.
> TLS certificates… SSL is some old Java anachronism.
OpenSSL is still called OpenSSL. Despite "SSL" not being the proper name anymore, people are still going to use it.
By the way, TLS 1.3 is actually SSL v3.4 :)
You are so confused, it’s not funny. There is no such thing as SSL 3.4. OpenSSL is not SSL. There were three SSL versions: 1.0, 2.0, and 3.0. After 3.0, the protocol was renamed to TLS. As of 2025, all versions of SSL (1.0, 2.0, 3.0) and early versions of TLS (1.0, 1.1) are considered insecure and have been deprecated by major browsers and the IETF. Modern secure communications rely exclusively on TLS 1.2 and TLS 1.3.
If we're being picky, they're X.509 certificates, not TLS or SSL.
Thanks for the correction.
I don’t think this is as simple as it seems. For example, we have our own CA and issue mTLS certificates, with hundreds of them currently in use across our machines. We need to check every single one (which we don’t do yet) because there is an additional distribution step that might fail selectively. And that’s not even touching on expiring CAs, which is a total nightmare.
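Even a crude per-machine sweep of the installed leaf certs would catch most of those distribution failures. A minimal sketch of what I have in mind (placeholder paths, and it assumes the third-party cryptography package):

    from datetime import datetime, timezone
    from pathlib import Path
    from cryptography import x509  # third-party: pip install cryptography

    # Placeholder paths: wherever the distribution step drops the mTLS certs on this machine.
    CERT_PATHS = [Path("/etc/pki/service-a/client.pem")]

    for path in CERT_PATHS:
        cert = x509.load_pem_x509_certificate(path.read_bytes())
        days_left = (cert.not_valid_after_utc - datetime.now(timezone.utc)).days  # cryptography >= 42
        # Reporting the serial lets the CA side verify that the cert it expected actually landed here.
        print(f"{path}: serial={cert.serial_number:x} days_left={days_left}")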
Why would it be difficult? You have a single CA, so a single place where certs are issued. That means there’s a single place with the knowledge of which certs are issued for which identity, how long they are valid, and whether a new cert has been issued for that identity before the previous cert expires. It could not be simpler, in fact.
If you have your own CA, you log every certificate with the expiry details. It's easier compared to an external CA because you automatically get the full asset list as long as you care to preserve it.
X.509 certificates
I agree with this. Certs are designed to function as a digital cliff: they will either be accepted or they won't, with no safe middle ground. Therefore all certs in a chain can only be as reliable as the least understood cert in your certificate management.
Operationally, the issue comes down to basic monitoring and an accurate inventory. The article is apt: “With SSL certificates, you usually don’t have the opportunity to build up operational experience working with them, unless something goes wrong”
You can prepare for the expiry ahead of time by appending the new cert's -----BEGIN CERTIFICATE----- block to the same file that holds the old cert.
But you also need to know where all your certificates are located. We were using Venafi for auto-discovery and email notifications; Prometheus ssl_exporter with Grafana integration and email alerts works just as well. The problem is knowing where all the hosts, containers, and systems that have certs are. A simple nmap-style scan of all endpoints can help, but you might also have certs inside containers or baked into VM images. And there are all sorts of other arrangements, like storing the cert in a CI/CD global variable, bind-mounting secrets, the Vault Secret Injector, etc.
But it’s all rooted in maintaining a valid, up-to-date TLS inventory, and that’s hard. As the article states: “There’s no natural signal back to the operators that the SSL certificate is getting close to expiry. To make things worse, there’s no staging of the change that triggers the expiration, because the change is time, and time marches on for everyone. You can’t set the SSL certificate expiration so it kicks in at different times for different cohorts of users.”
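For the scanning part mentioned above, a rough stdlib sketch (the address range and ports are placeholders; it only finds endpoints, it doesn't solve the containers/VM-image problem):

    import ipaddress
    import socket
    import ssl

    SUBNET = ipaddress.ip_network("10.0.0.0/28")  # placeholder range to sweep
    PORTS = (443, 8443)                           # placeholder TLS ports

    ctx = ssl.create_default_context()
    ctx.check_hostname = False       # discovery pass: we only want to find TLS endpoints,
    ctx.verify_mode = ssl.CERT_NONE  # not to validate them

    endpoints = []
    for ip in SUBNET.hosts():
        for port in PORTS:
            try:
                with socket.create_connection((str(ip), port), timeout=1) as sock:
                    with ctx.wrap_socket(sock):
                        endpoints.append((str(ip), port))
            except OSError:
                pass  # closed port, no TLS, timeout, ...

    # Feed each discovered endpoint into an expiry check like the scripts mentioned upthread.
    print(endpoints)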
Every time this happens you whack-a-mole another fix. You get better at it, but not before you lose some credibility.
> the failure mode is the opposite of graceful degradation. It’s not like there’s an increasing percentage of requests that fail as you get closer to the deadline. Instead, in one minute, everything’s working just fine, and in the next minute, every http request fails.
This has given me some interesting food for thought. I wonder how feasible it would be to create a toy webserver that did exactly this (failing an increasing percentage of requests as the deadline approaches)? My thought would be to start failing some requests as the deadline approaches a point where most would consider it "far too late" (e.g. 4 hours before `notAfter`). At this point, start responding to some percentage of requests with a custom HTTP status code (599 for the sake of example).
Probably a lot less useful than just monitoring each webserver endpoint's TLS cert using synthetics, but it's given me an idea for a fun project if nothing else.
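If I ever build it, the core would be something like this (the hardcoded expiry, the 4-hour window, and the 599 status are all arbitrary choices for the demo; a real version would read notAfter from the cert it actually serves):

    import random
    import time
    from http.server import BaseHTTPRequestHandler, HTTPServer

    NOT_AFTER = time.time() + 4 * 3600  # placeholder: pretend the cert expires 4 hours from now
    FAIL_WINDOW = 4 * 3600              # start degrading this many seconds before expiry

    class DegradingHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            remaining = NOT_AFTER - time.time()
            # Failure probability ramps linearly from 0 to 1 over the last FAIL_WINDOW seconds.
            p_fail = min(1.0, max(0.0, 1.0 - remaining / FAIL_WINDOW))
            status = 599 if random.random() < p_fail else 200  # 599 is made up for the demo
            self.send_response(status)
            self.end_headers()
            self.wfile.write(b"degrading-cert demo\n")

    if __name__ == "__main__":
        HTTPServer(("127.0.0.1", 8080), DegradingHandler).serve_forever()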
Your idea shifts monitoring to end users, which doesn’t sound awesome.
Just check the expiration of the actively served certificate and alert if it’s under a threshold, say 1 week. Assuming you auto-renew when the cert is 3 weeks from expiry, still serving a cert with only 1 week left is enough signal that something went wrong.
Then you just need to test that your alerting system is reliable. No need to use your users as canaries.
Oh absolutely, I wouldn't use this for any production system. It would be a toy hobby project. I just find the notion of turning a no-degradation failure mode into a gradual-degradation one fascinating for some reason.
For a fun project it certainly is a fun idea.
In real life, I guess there are people who don't monitor at all; for them, failing requests would go unnoticed. For everyone else, monitoring needs to be easy.
But I think the core thing might be to make monitoring SSL lifetime the "obvious" default: all the Grafana dashboards etc. should have such an entry.
Then, as soon as I set up a monitoring stack, I get that reminder as well.
This canary is a good thought. The problem the article highlights is that people don’t practice updates enough and assume someone else or something is handling it. You only get better at it the more often it happens which is partly why long expirations are not ideal. This is what the article is highlighting as the main issue.
It’s not a good thought. Run a single client (Uptime Kuma) and ask it to alert you on expiration proximity, i.e. implement proper monitoring and alerting. There's no need to randomly degrade your users’ experience and hope they’ll notify you instead of shrugging and going to a site that doesn’t throw made-up HTTP errors at them randomly.
If a “canary” is degrading users, it’s misdesigned.
The canary narrows the blast radius and time-to-detection.
Happened on the first day of my first on-call rotation - a cert for one of the key services expired. Autorenew failed, because one of the subdomains on the cert no longer resolved.
The main lesson we took from this: you absolutely need monitoring for cert expiration, with an alert when (valid_to - now) becomes less than the typical refresh window.
It's easy to forget this, especially when it's not strictly part of your app, but it's essential nonetheless.
We need a way to set multiple SSL certificates with overlapping validity periods, so that if one certificate expires, the backup certificate takes over. If the overlap is a couple of months, you have plenty of time to detect and fix the issue.
Having only one SSL certificate is a single point of failure, and we have eliminated single points of failure almost everywhere else.
> We need a way to set multiple SSL certificates with overlapping duration.
Both Apache (SSLCertificateFile) and nginx (ssl_certificate) allow for multiple files, though they cannot be of the same algorithm: you can have one RSA, one ECC, etc, but not (say) an ECC and another ECC. (This may be a limitation of OpenSSL.)
So if the RSA expires on Feb 1, you can have the ECC expire on Feb 14 or Mar 1.
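For reference, the nginx side of that setup looks roughly like this (paths and names are placeholders; nginx 1.11.0+ serves whichever cert matches the negotiated key type):

    server {
        listen 443 ssl;
        server_name example.com;   # placeholder

        # RSA pair, e.g. expiring Feb 1
        ssl_certificate     /etc/ssl/example.com.rsa.crt;
        ssl_certificate_key /etc/ssl/example.com.rsa.key;

        # ECDSA pair on a staggered renewal schedule, e.g. expiring Feb 14
        ssl_certificate     /etc/ssl/example.com.ecdsa.crt;
        ssl_certificate_key /etc/ssl/example.com.ecdsa.key;
    }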
You can do this pretty easily with Let’s Encrypt, to my knowledge. You can request reissuance every 30 days, for example, which would give you a ladder of three 90-day certificates.
Edit: but to be clear, I don’t understand why you’d want this. If you’re worried about your CA going offline, you should shorten your renewal period instead.
Do services such as K8S ingress and Azure web apps allow you to specify multiple certificates?
Update: looks like the answer is yes. So then the issue is people not taking advantage of this technique.
I don’t think there’s a ton of benefit to the technique. If you’re worried about getting too close to your certificate expiry via automation, the solution is to renew earlier rather than complicate things with a ladder of valid certs.
Exactly. It's not like the backup certificate has a validity period starting at a future date.
Yes, the backup certificate can have a validity period starting at a future date. You just need to wait until that future date to create it.
And it gets worse: they are reducing the maximum validity to 47 days in 2029.
On the other hand, as the time gets shorter, it'll become less likely that something will go undetected for a long time.
There are plenty of other technologies whose failure mode is a total outage, it’s not exclusive to a failed certificate renewal.
A certificate renewal process has several points at which failure can be detected and action taken, and it sounds like this team was relying only on a “failed to renew” alert/monitor.
A broken alerting system is mentioned (“didn’t alert for whatever reason”).
If this certificate is so critical, they should also have something that alerts if you’re still serving a certificate with less than 2 weeks validity - by that time you should have already obtained and rotated in a new certificate. This gives plenty of time for someone to manually inspect and fix.
Sounds like a case of “nothing in this automated process can fail, so we only need this one trivial monitor which also can’t fail so meh” attitude.
Additionally, warnings can be built into the clients themselves. If you connect to a host whose cert expires in less than 2 weeks, print a warning in your client. That will be a further incentive not to let certs go unrenewed.
For corporations, institutions, and for-profits this matters and there's no real good solution.
But for human persons and personal websites, HTTP+HTTPS fixes this easily and completely. You get the best of both worlds: fragile, short-lifetime pseudo-privacy if you want it (HTTPS) and long-term stable access no matter what via HTTP. HTTPS-only does more harm than good. HTTP+HTTPS is far better than either alone.