What Is TLS Certificate Management? A Guide for Infrastructure Teams

Most conversations about certificate management focus on renewal. Don't let certs expire. Set up a reminder. Use Let's Encrypt. That's not wrong, but it covers maybe a quarter of what certificate management actually involves.

For infrastructure teams — the people responsible for HAProxy, application servers, internal APIs, VPNs, and the dozen other services that depend on TLS — certificate management is the full lifecycle of every certificate: where it came from, where it lives, what systems depend on it, how it gets to those systems, when it expires, and what happens when it's revoked or replaced.

Getting any one of those pieces wrong produces an incident. Getting all of them right, consistently, across every cert in your inventory, requires deliberate system design.

What certificate management actually covers

A complete TLS certificate management practice has seven distinct stages. Most teams have some of these under control. Few have all of them.

1. Discovery — Knowing every certificate in your infrastructure. This sounds basic, but most teams discover gaps when something expires unexpectedly. Discovery isn't a one-time exercise; it's an ongoing inventory that tracks every cert, where it's deployed, what issued it, when it expires, and who owns it. Without this, everything else is guesswork.

2. Provisioning — Obtaining a certificate from a Certificate Authority. For public-facing infrastructure, this usually means Let's Encrypt via ACME or a commercial CA. For internal services, it might mean an internal PKI. Provisioning decisions — which CA, which validation method, DV vs OV vs EV, wildcard vs per-service — have long-tail consequences for every other lifecycle stage.

3. Deployment — Getting the certificate from wherever it was issued to wherever it needs to be used. This is the stage most commonly treated as an afterthought. HAProxy needs a PEM file that combines private key, certificate, and chain. IIS needs the cert imported into the Windows certificate store and bound to the right site. Kubernetes needs a TLS secret in the right namespace. Deployment is not a copy operation — it's a service-specific installation process that needs to complete correctly or the service won't serve the new cert.

4. Monitoring — Continuous visibility into the state of your certificate inventory. Monitoring should track expiry dates, flag approaching renewals, confirm that deployed certs match the expected cert, and alert when something diverges from expected state. Monitoring is distinct from alerting: you want to know the state of your certs continuously, not just when they're about to expire.

5. Renewal — Issuing a new certificate before the current one expires. With ACME, renewal can be fully automated. The renewal threshold — how far in advance of expiry to trigger renewal — should be large enough to allow for retries if the first attempt fails. For 90-day certs, renewing at 30 days remaining is standard. For 47-day certs, renewing at roughly 15 days remaining is reasonable.

6. Revocation — Invalidating a certificate with the issuing CA and pulling it out of service before its natural expiry. Revocation happens when a private key is compromised, when a server is decommissioned, or when domain ownership changes. Effective revocation requires knowing everywhere a cert is deployed so all copies can be removed or replaced at the same time.

7. Retirement — Cleaning up expired or replaced certificates. Stale certs sitting in certificate stores, configuration files, or vault systems create confusion and can be mistaken for active certs. Retirement is the bookkeeping stage that keeps the inventory clean.
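
The inventory and renewal-threshold ideas in the stages above can be sketched in a few lines of Python. This is a minimal illustration, not a reference schema: the CertRecord fields and the renewal_due helper are hypothetical names, and the one-third fraction is an assumption chosen because it reproduces the thresholds mentioned above (30 days remaining on a 90-day cert, a little under 16 days on a 47-day cert).

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone


@dataclass
class CertRecord:
    """One inventory entry. Field names are illustrative, not a standard schema."""
    name: str
    issuer: str
    owner: str
    not_before: datetime
    not_after: datetime
    deployed_to: list[str] = field(default_factory=list)

    @property
    def lifetime(self) -> timedelta:
        return self.not_after - self.not_before


def renewal_due(cert: CertRecord, now: datetime, fraction: float = 1 / 3) -> bool:
    """Return True once remaining validity drops to `fraction` of the lifetime."""
    remaining = cert.not_after - now
    return remaining <= cert.lifetime * fraction


# Usage: a 90-day cert checked 61 days after issuance (29 days remaining).
issued = datetime(2025, 1, 1, tzinfo=timezone.utc)
api_cert = CertRecord(
    name="api.example.com",
    issuer="Example CA",          # hypothetical issuer
    owner="platform-team",        # hypothetical owner
    not_before=issued,
    not_after=issued + timedelta(days=90),
    deployed_to=["lb-1", "lb-2"],
)
print(renewal_due(api_cert, issued + timedelta(days=61)))
```

Expressing the threshold as a fraction of the lifetime rather than a fixed number of days means the same rule keeps working as lifetimes shrink.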

Where infrastructure teams specifically get stuck

The website owner's certificate management problem is largely solved by Let's Encrypt and Certbot. One server, one cert, one ACME client that handles everything. If you're only running a web server, the tooling is mature and the process is nearly frictionless.

Infrastructure teams have a different problem. You're not managing one cert for one server. You're managing:

Multiple services, each with different certificate formats and installation requirements
Multiple environments (dev, staging, production) that need the same cert lifecycle but shouldn't share certificates
Internal services that need certificates from an internal CA, not a public CA
Long-lived infrastructure components (HAProxy, VPNs, database clusters) with specific reload requirements
Compliance requirements that mandate audit trails for every certificate action
Access control requirements that prevent any single system from holding all private keys

The deployment stage is where this complexity bites hardest. Each service has its own certificate format requirements, its own installation path, and its own reload mechanism. An HAProxy cert needs to be assembled differently from an Nginx cert. An IIS cert goes through a completely different process from a Linux service cert. Managing this diversity with a single ACME client and a shell script works until it doesn't, and it tends to stop working at the worst possible time.
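
To make the format difference concrete, here is a rough sketch of assembling the same issued material for HAProxy and for Nginx. The function names and output file names are hypothetical. The ordering inside the HAProxy bundle (leaf, then chain, then key) is a common convention and is treated here as an assumption. The Nginx side assumes the usual split between the ssl_certificate and ssl_certificate_key directives.

```python
import os
from pathlib import Path


def _pem(text: str) -> str:
    """Normalize a PEM block so it ends with exactly one newline."""
    return text.strip() + "\n"


def haproxy_bundle(cert: str, chain: str, key: str) -> str:
    """One combined PEM: leaf certificate, then chain, then private key."""
    return _pem(cert) + _pem(chain) + _pem(key)


def nginx_files(cert: str, chain: str, key: str) -> dict[str, str]:
    """Two files: full chain for ssl_certificate, key for ssl_certificate_key."""
    return {
        "fullchain.pem": _pem(cert) + _pem(chain),
        "privkey.pem": _pem(key),
    }


def write_private(path: Path, content: str) -> None:
    """Write key material with owner-only permissions, then swap it into place.

    Writing to a temporary name and renaming avoids a service ever reading a
    half-written file. The 0o600 mode applies on POSIX systems.
    """
    tmp = path.with_name(path.name + ".tmp")
    fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as fh:
        fh.write(content)
    os.replace(tmp, path)
```

Neither function touches the service itself. Reloading and verifying that the new cert is actually being served are separate steps.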

The delivery problem

The gap between "cert issued" and "cert deployed and serving traffic" is where most infrastructure certificate incidents originate. A cert can be renewed successfully and still cause an outage if the delivery step fails or is never triggered.

Delivery problems come in several forms:

The cert renewed but the service didn't reload. HAProxy has the new cert file on disk, but it loaded the old cert into memory when it started. Without a reload, it keeps serving the old cert right up to and past its expiry date. This is probably the most common cert-related outage mode.

The cert was deployed to some targets but not all. A load balancer pool has six nodes. The deployment script ran successfully on four but failed silently on two. Now two nodes serve an expired cert while four serve the new one. Monitoring that checks a single endpoint misses this entirely.

The cert was deployed but to the wrong path. The service expects the cert at /etc/ssl/certs/api.pem. The deployment wrote it to /etc/ssl/certs/api.crt. The service continues serving the old cert from the path it was configured to use.

Good certificate management treats delivery as a first-class operation with verification: confirm that every target received the cert, confirm that the service is actually serving the new cert, and confirm that the cert being served matches what was issued. Not just "did the file write succeed."
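
One way to implement that verification is to compare the SHA-256 fingerprint of the issued certificate with the fingerprint each target actually presents on the wire. The sketch below uses only the Python standard library. The helper names are hypothetical, and it assumes the issued PEM passed in contains the leaf certificate only, not the full chain.

```python
import hashlib
import socket
import ssl
from typing import Optional


def fingerprint(der: bytes) -> str:
    """SHA-256 fingerprint of a DER-encoded certificate."""
    return hashlib.sha256(der).hexdigest()


def issued_fingerprint(leaf_pem: str) -> str:
    """Fingerprint of the cert as issued (single leaf PEM, no chain)."""
    return fingerprint(ssl.PEM_cert_to_DER_cert(leaf_pem))


def served_fingerprint(host: str, port: int = 443,
                       server_name: Optional[str] = None,
                       timeout: float = 5.0) -> str:
    """Fingerprint of the cert a target presents right now.

    Trust verification is disabled on purpose: the goal is to inspect
    whatever is being served, including an expired or internal-CA cert.
    """
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with ctx.wrap_socket(sock, server_hostname=server_name or host) as tls:
            return fingerprint(tls.getpeercert(binary_form=True))


def stale_targets(expected: str, served: dict[str, str]) -> list[str]:
    """Names of targets whose served fingerprint differs from the issued one."""
    return sorted(name for name, fp in served.items() if fp != expected)
```

Calling served_fingerprint once per node, by node address with the public hostname passed as server_name, is what catches the partial-deployment case. A check against the load balancer's single public endpoint would not.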

The access control problem

Every certificate is paired with a private key. The certificate itself is public; the private key is a credential. How you control access to those credentials matters as much as how you manage the certificates themselves.

Common patterns that create risk:

Shared storage — all certificates in a shared S3 bucket, NFS mount, or secrets manager path accessible to all services. Any compromised service can read every private key in your inventory.

Push via configuration management — Ansible or Chef writes cert files to servers. The system doing the writing has credentials that allow it to write to every server in the inventory. Those credentials are a valuable target.

Wildcard certificates shared across many services — one *.example.com cert deployed to 30 services means a compromise of any one of those services exposes the private key for all of them.

The better model is scoped access: each service gets a credential that only allows it to retrieve its own certificate. A token for service-a cannot be used to fetch service-b's private key. This limits the blast radius of any single compromise to that service's own certificate and nothing else.
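
A minimal sketch of the scoped-access check might look like the following. ScopedTokenStore and its methods are hypothetical names for illustration, not any particular product's API, and a real system would add expiry, rotation, and persistent storage. Only a digest of each token is kept, so reading the store does not reveal usable tokens.

```python
import hashlib
import secrets


class ScopedTokenStore:
    """Maps each token to exactly one certificate name (illustrative design)."""

    def __init__(self) -> None:
        # Keyed by SHA-256 digest of the token, never the token itself.
        self._scope_by_digest: dict[str, str] = {}

    @staticmethod
    def _digest(token: str) -> str:
        return hashlib.sha256(token.encode()).hexdigest()

    def issue(self, cert_name: str) -> str:
        """Create a token that can fetch `cert_name` and nothing else."""
        token = secrets.token_urlsafe(32)
        self._scope_by_digest[self._digest(token)] = cert_name
        return token

    def may_fetch(self, token: str, cert_name: str) -> bool:
        """True only if the token exists and is scoped to this certificate."""
        return self._scope_by_digest.get(self._digest(token)) == cert_name


# Usage: service-a's token cannot retrieve service-b's key material.
store = ScopedTokenStore()
token_a = store.issue("service-a")
print(store.may_fetch(token_a, "service-a"))
print(store.may_fetch(token_a, "service-b"))
```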

Why 47-day certificate lifetimes make all of this urgent

The CA/Browser Forum is cutting maximum lifetimes for publicly trusted TLS certificates in stages: to 200 days in March 2026, to 100 days in March 2027, and to 47 days in March 2029. At these timescales, any manual step in the certificate lifecycle creates an unsustainable operational burden.

At 47-day lifetimes, a modest inventory of 20 certificates requires at least 155 renewal-and-delivery cycles per year even if every cert runs to its final day, and roughly 230 if each cert is renewed with 15 days remaining. Each one needs to succeed completely (issuance, delivery, reload, verification) without human intervention. Teams with manual steps in that chain will spend 2027 onward in continuous firefighting mode.
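
The cycle count is simple arithmetic, and it depends heavily on how early you renew. The helper below is a hypothetical illustration: it divides the year by the effective interval between renewals (lifetime minus the days remaining at renewal) and multiplies by the inventory size.

```python
def cycles_per_year(cert_count: int, lifetime_days: int,
                    renew_at_days_remaining: int = 0) -> float:
    """Renewal-and-delivery cycles per year across the whole inventory."""
    interval = lifetime_days - renew_at_days_remaining
    if interval <= 0:
        raise ValueError("renewal threshold must be shorter than the lifetime")
    return cert_count * 365 / interval


# 20 certs, 47-day lifetime, renewed only at expiry: about 155 cycles.
print(round(cycles_per_year(20, 47)))
# Same inventory, renewed with 15 days remaining (every 32 days): about 228.
print(round(cycles_per_year(20, 47, 15)))
# For comparison, 90-day certs renewed with 30 days remaining: about 122.
print(round(cycles_per_year(20, 90, 30)))
```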

The teams that respond to shorter lifetimes by treating certificate management as fully automated infrastructure — the same way they treat deployment pipelines or database backups — will have a dramatically better experience than teams trying to keep up manually.

What a mature certificate management system looks like

A mature system has the following properties:

Complete inventory — every certificate is tracked, including those you didn't issue yourself. Shadow certs discovered through scanning are part of the inventory too.
Automated renewal — renewals trigger without human intervention at a configured threshold, retry automatically on failure, and alert only when retries are exhausted.
Automated delivery — new certs reach their targets through a defined, verified process, not a script someone wrote and forgot about.
Scoped access — private keys are retrievable only by the service that needs them, using a credential scoped to that certificate.
Reload automation — services reload after delivery is confirmed, gracefully, with verification that the new cert is being served.
Audit trail — every certificate action (issuance, delivery, renewal, revocation) is logged with timestamps and the identity of what triggered it.
Observable state — the current state of every cert in the inventory is visible without SSHing into servers or digging through logs.
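
Two of these properties, retry-then-alert renewal and the audit trail, fit naturally together. The sketch below is one possible shape, under stated assumptions: the renew, log, and alert callables are placeholders for whatever ACME client, log sink, and paging system you use, and the JSON field names and the "scheduler" actor are illustrative.

```python
import json
import time
from datetime import datetime, timezone
from typing import Callable


def audit_event(action: str, cert: str, actor: str, outcome: str) -> str:
    """Serialize one audit record as a JSON line (field names are illustrative)."""
    return json.dumps({
        "ts": datetime.now(timezone.utc).isoformat(),
        "action": action,
        "cert": cert,
        "actor": actor,
        "outcome": outcome,
    }, sort_keys=True)


def renew_with_retry(cert: str,
                     renew: Callable[[str], None],
                     log: Callable[[str], None],
                     alert: Callable[[str], None],
                     attempts: int = 4,
                     base_delay: float = 60.0,
                     sleep: Callable[[float], None] = time.sleep) -> bool:
    """Try a renewal with exponential backoff; alert only after the last failure."""
    for attempt in range(1, attempts + 1):
        try:
            renew(cert)
        except Exception as exc:
            # Every failed attempt is recorded, but nobody is paged yet.
            log(audit_event("renewal", cert, "scheduler",
                            f"failed attempt {attempt}: {exc}"))
            if attempt < attempts:
                sleep(base_delay * 2 ** (attempt - 1))
        else:
            log(audit_event("renewal", cert, "scheduler", "ok"))
            return True
    alert(f"renewal of {cert} failed after {attempts} attempts")
    return False
```

Passing sleep as a parameter keeps the function testable without real delays. The renewal threshold should leave enough room for the full backoff schedule to play out before expiry.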

A system with all of these properties is one where certificate expiry incidents essentially stop happening. Not because the team is more diligent, but because the human is no longer in the critical path.