Why Your TLS Certificates Keep Expiring (And How to Fix It)

If your team has ever had a certificate expire in production, you know the drill. The monitoring fires. Someone checks and finds that nginx is serving the old cert. Slack lights up. An engineer digs up the renewal process from a ticket or a wiki page. An hour later — maybe more — the cert is renewed and deployed, and the post-mortem includes an action item to "improve certificate management."

Then it happens again six months later.

It's tempting to treat this as a people problem — someone forgot to renew, someone ignored the calendar reminder. But that's not actually what's happening. The failure is structural. Manual certificate management is fundamentally incompatible with the way modern infrastructure grows.

The spreadsheet trap

Most teams start with a spreadsheet. It works fine at first. A dozen certificates, each with an expiry date, a domain, and a note about where it's deployed. Someone sets a calendar reminder 30 days out. The cert gets renewed before it expires. No incidents.

Then the team grows. New services get added. Microservices proliferate. You add staging environments, DR environments, and maybe a second region. Each one has its own certificates. Some of those certificates cover multiple domains through Subject Alternative Names (SANs). Some are internal certs from your own CA that don't show up in external monitoring.

The spreadsheet has 80 rows now. The person who created it left the company. The calendar reminders didn't transfer. You don't actually know if all the certificates in the spreadsheet are still active — some services were deprecated, some were renamed. And there are definitely certificates deployed in your infrastructure that aren't in the spreadsheet, because they were provisioned by a contractor or during an incident and nobody updated the tracker.

This is the spreadsheet trap: it works until it doesn't, and by the time it fails, you've lost visibility into your own certificate inventory.

What actually causes cert expiry incidents

When you examine cert expiry post-mortems honestly, a pattern emerges. It's rarely "nobody knew this cert existed." It's usually one of these:

1. The renewal was someone's responsibility, but that person was on vacation. Or changed teams. Or left the company. Cert management is often tribal knowledge held by one or two engineers. When they're unavailable, the process breaks.

2. The cert was renewed but not delivered. This is surprisingly common. Someone generates a new cert and puts it on their laptop or in a shared folder, meaning to deploy it, then gets pulled into something else. The cert expires because the new one never replaced the old one.

3. The certificate was orphaned. A service exists that nobody's actively managing. It was set up by a contractor, or during a crisis, or by someone who has since left. The cert expires and the service breaks, which is how the cert gets discovered.

4. The reminder fired but was ignored. "30 days to expiry" feels like a long time when you're in the middle of a sprint. People defer it. Then 14 days feels like enough time. Then 7 days is urgent but it's Friday. Then it expires on Monday morning.

None of these are failures of intelligence or diligence. They're failures of system design. Manual processes that depend on humans remembering and acting at the right time will always fail eventually.

The 90-day problem

Let's Encrypt's 90-day certificate lifetimes, and the CA/Browser Forum's move toward a 47-day maximum lifetime for public TLS certificates, have made this problem more acute. Shorter lifetimes are good for security (less time for a compromised cert to cause damage), but they also mean more frequent renewals.

A certificate with a 1-year validity requires 1 renewal per year. At 90 days, that's about 4 renewals per year per certificate. At 47 days, it's about 8. Those figures assume you renew on the day of expiry; renewing ahead of time, as any safe process does, pushes them higher. Multiply that by the number of certificates in your infrastructure, and you have a renewal cadence that's simply not compatible with manual processes.
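
To make the arithmetic concrete, here is a small sketch. It assumes each certificate is renewed a fixed number of days before it expires, so the renewal interval is the lifetime minus that buffer. The function name, the 30-day buffer, and the 80-certificate fleet are illustrative assumptions, not anything a CA mandates.

```python
def renewals_per_year(lifetime_days: int, renew_before_days: int = 0) -> float:
    """Average renewals per year needed to keep one certificate valid.

    Assumes (for illustration) that every renewal happens exactly
    `renew_before_days` before the current certificate expires.
    """
    interval = lifetime_days - renew_before_days
    if interval <= 0:
        raise ValueError("renewal buffer must be shorter than the lifetime")
    return 365 / interval


FLEET_SIZE = 80  # hypothetical number of certificates being tracked

for lifetime in (365, 90, 47):
    at_expiry = renewals_per_year(lifetime)
    with_buffer = renewals_per_year(lifetime, renew_before_days=30)
    print(
        f"{lifetime:>3}-day certs: {at_expiry:.1f}/yr at expiry, "
        f"{with_buffer:.1f}/yr with a 30-day buffer, "
        f"{FLEET_SIZE * with_buffer:.0f}/yr across {FLEET_SIZE} certs"
    )
```

With a 30-day buffer, a 90-day certificate is renewed every 60 days, which is about 6 renewals a year rather than 4. The shorter the lifetime, the more the buffer dominates the total.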

The industry is moving toward shorter lifetimes by design — specifically to force automation. Organizations that haven't automated their certificate renewal process will feel increasing pressure as lifetimes continue to shrink.

Why monitoring isn't enough

Many teams respond to this by adding better monitoring — alerts at 30 days, 14 days, 7 days. This is better than nothing, but it doesn't fix the underlying problem. It just adds more alerts that people have to act on.

Alert fatigue is real. When monitoring fires for everything, nothing feels urgent. A 30-day expiry alert competes with deployment alerts, error rate alerts, latency alerts, and whatever else is in the on-call queue. It gets acknowledged and deferred.

Monitoring tells you what is happening. It doesn't make renewal and delivery happen automatically. The only thing that eliminates cert expiry incidents at scale is automation that handles renewal and delivery without human intervention.

What automated certificate management looks like

In a well-automated setup, the certificate lifecycle looks like this:

A certificate is issued and registered in a central system. The system tracks its expiry date and all the services that use it. 30 days before expiry, renewal is triggered automatically — no human required. The renewed certificate is delivered to every service that uses it, also automatically. The service reloads and starts serving the new cert. An audit log records everything that happened.

The only time a human gets involved is if something fails — if the renewal can't complete because of an ACME challenge failure, or if a delivery fails because a target is unreachable. That's what monitoring should fire on: automation failures, not pending renewals that automation should be handling.

This is the model that CertLocker implements. The lifecycle — issuance, tracking, renewal, delivery — is owned by the platform, not by a person. Humans review the audit log, set policies, and act on escalations. They don't renew certificates by hand.
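
The lifecycle described in this section can be sketched in a few lines. This is an illustrative sketch only, not CertLocker's API or implementation: `ManagedCert`, `RunResult`, `run_cycle`, and the `renew` and `deliver` callables are all hypothetical names. The point it tries to show is that a certificate that is not yet due produces no alert at all, and a human only hears about it when renewal or delivery fails.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import Callable, List

RENEW_BEFORE = timedelta(days=30)  # renewal window used in the text above


@dataclass
class ManagedCert:
    name: str
    not_after: datetime
    targets: List[str]  # services that use this certificate


@dataclass
class RunResult:
    audit_log: List[str] = field(default_factory=list)
    escalations: List[str] = field(default_factory=list)  # all a human sees


def run_cycle(
    certs: List[ManagedCert],
    now: datetime,
    renew: Callable[[ManagedCert], datetime],     # hypothetical: returns new expiry
    deliver: Callable[[ManagedCert, str], None],  # hypothetical: raises on failure
) -> RunResult:
    """One pass of the renew-and-deliver loop over every tracked certificate."""
    result = RunResult()
    for cert in certs:
        if cert.not_after - now > RENEW_BEFORE:
            continue  # not due yet: no alert, no human action
        try:
            new_expiry = renew(cert)
        except Exception as exc:
            result.escalations.append(f"renewal failed for {cert.name}: {exc}")
            continue
        # Simplification: a real system would track expiry per target,
        # since a target whose delivery failed still serves the old cert.
        cert.not_after = new_expiry
        result.audit_log.append(f"renewed {cert.name}")
        for target in cert.targets:
            try:
                deliver(cert, target)
                result.audit_log.append(f"delivered {cert.name} to {target}")
            except Exception as exc:
                result.escalations.append(
                    f"delivery of {cert.name} to {target} failed: {exc}"
                )
    return result
```

In a setup like this, `escalations` is what would feed the on-call queue, while `audit_log` is what humans review after the fact.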

Getting from here to there

If your team is currently managing certificates manually, the path to automation doesn't require ripping everything out at once. A few practical steps:

Start with inventory. Before you can automate, you need to know what certs you have. Run a scan of your infrastructure to find all certificates, including internal ones. Most teams discover certs they didn't know existed.
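
A first-pass inventory scan can be as simple as connecting to each known endpoint and reading the expiry date off the certificate it serves. The sketch below uses only Python's standard library; the function names are hypothetical, and the list of hosts to scan is something you would have to supply yourself.

```python
import socket
import ssl
from datetime import datetime, timezone
from typing import Iterable, List, Optional, Tuple


def days_until(not_after: str, now: datetime) -> float:
    """Days from `now` to a notAfter string such as 'Jan  1 00:00:00 2030 GMT'."""
    expiry = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return (expiry - now).total_seconds() / 86400


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Return the notAfter field of the certificate served at host:port."""
    # The default context verifies against the system trust store, so a
    # cert from an internal CA (or one that has already expired) raises a
    # verification error instead of returning a date.
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def scan(
    hosts: Iterable[str], now: Optional[datetime] = None
) -> List[Tuple[str, Optional[float], Optional[str]]]:
    """Return (host, days_left, error) for each host; one of the last two is None."""
    now = now or datetime.now(timezone.utc)
    report = []
    for host in hosts:
        try:
            report.append((host, days_until(fetch_not_after(host), now), None))
        except OSError as exc:  # covers timeouts, refusals, and TLS errors
            report.append((host, None, str(exc)))
    return report
```

A scan like this only finds certificates on endpoints you already know about. The rows that come back with an error are often as informative as the ones with a date: a verification failure usually points at an internal CA or a cert that has already expired.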

Identify the highest-risk certs. Which certificates, if expired, would cause the most severe incident? Prioritize automating those first. Public-facing certs for customer-critical services are usually at the top of this list.

Pick a delivery model that fits your infrastructure. Pull-based delivery (where machines fetch their own certs using scoped tokens) tends to scale better than push-based delivery (where a central system pushes certs to machines), because pull doesn't require inbound connectivity to every target.
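
Here is a minimal sketch of what the machine-side half of pull-based delivery might look like. Everything named here is hypothetical: `fetch` stands in for an authenticated request to your central system using the machine's scoped token, and `reload_service` stands in for whatever reloads the service (for nginx, running `nginx -s reload`). The parts worth copying are the atomic file replacement and the "reload only if something changed" check.

```python
import os
import tempfile
from pathlib import Path
from typing import Callable


def install_if_changed(
    pem: bytes, dest: Path, reload_service: Callable[[], None]
) -> bool:
    """Atomically replace `dest` with `pem`; return True if the cert changed."""
    if dest.exists() and dest.read_bytes() == pem:
        return False  # already current: no write, no reload
    # Write to a temp file in the same directory so os.replace() is an
    # atomic rename and the service never sees a half-written file.
    fd, tmp = tempfile.mkstemp(dir=str(dest.parent), prefix=dest.name + ".")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(pem)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, dest)
    except BaseException:
        os.unlink(tmp)  # clean up the temp file if anything went wrong
        raise
    reload_service()
    return True


def pull_once(
    fetch: Callable[[], bytes],          # hypothetical: fetch cert with scoped token
    dest: Path,
    reload_service: Callable[[], None],  # hypothetical: reload the service
) -> bool:
    """One pull cycle: fetch the current cert and install it if it differs."""
    return install_if_changed(fetch(), dest, reload_service)
```

Run on a timer, a loop like this makes each machine responsible for its own certificate, and the central system never needs a route into the machine.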

Test rotation before you depend on it. The first automated renewal should happen in a non-production environment, where you can verify that the new cert is delivered and the service reloads correctly before it matters.
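
One way to check that a rotation actually took effect is to compare the fingerprint of the certificate you just issued with the fingerprint of the certificate the service is serving. The sketch below uses Python's standard library; the function names are hypothetical, and it assumes you have the newly issued certificate available as PEM text.

```python
import hashlib
import ssl


def fingerprint(pem: str) -> str:
    """SHA-256 fingerprint (hex) of a PEM-encoded certificate."""
    der = ssl.PEM_cert_to_DER_cert(pem)
    return hashlib.sha256(der).hexdigest()


def served_fingerprint(host: str, port: int = 443) -> str:
    """Fingerprint of the certificate currently served at host:port."""
    # get_server_certificate() does not validate the certificate unless
    # ca_certs is given, which suits staging setups with internal CAs.
    return fingerprint(ssl.get_server_certificate((host, port)))


def rotation_succeeded(expected_pem: str, host: str, port: int = 443) -> bool:
    """True if the service is serving exactly the certificate we issued."""
    return served_fingerprint(host, port) == fingerprint(expected_pem)
```

If this check fails after a renewal in your non-production environment, the usual suspects are a delivery that never landed or a service that was never reloaded.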

The goal is to get to a state where an approaching expiry produces a notification that the renewal went fine, not a Slack ping about an outage.

Summary

TLS certificates keep expiring in production because manual cert management doesn't scale to the complexity of modern infrastructure. Spreadsheets get stale. Reminders get deferred. Delivery is forgotten. Tribal knowledge leaves with people who leave.

The fix isn't more monitoring or more reminders — it's removing humans from the renewal and delivery loop entirely. Automated certificate lifecycle management handles issuance, tracking, renewal, and delivery without human intervention, so the only alerts you get are about failures in the automation, not about tasks you need to go do.