Certificate Rotation Best Practices for Infrastructure Teams

Certificate rotation is one of those processes that looks simple in theory and turns out to be surprisingly complex in practice. The naive view: renew the cert before it expires, put the new one on the server. Done.

The reality at scale: dozens or hundreds of certs, multiple environments, services with different reload mechanisms, delivery that has to reach every consumer of a cert (a rotation that leaves some services on the old cert is not finished), an audit trail for compliance, and a process that works even when the person who built it is unavailable.

Here's what good certificate rotation actually looks like, from the principles down to implementation.

Principle 1: Rotate early, not just before expiry

Many teams treat rotation as an emergency: renew when the cert is about to expire. That's the wrong mental model. Rotation should be a routine maintenance event that happens so far ahead of expiry that the expiry date becomes irrelevant.

The standard recommendation is to renew at 30 days remaining — one third of a 90-day cert's lifetime. This gives you:

Time to detect and fix a failed renewal attempt
Time to identify delivery problems
A buffer if something in the renewal process needs manual intervention
Confidence that even a weeks-long delay won't cause an expiry

With 90-day Let's Encrypt certs, that means triggering renewal at day 60 (30 days remaining). With shorter lifetimes (the CA/Browser Forum has approved a phased reduction of the maximum lifetime to 47 days), the same one-third rule puts the trigger at roughly day 30 to 32 (15 to 17 days remaining).

The point is that renewal should be boring and predictable, not urgent.

Principle 2: Treat delivery as part of rotation

A common failure mode: rotation is defined as "issue a new cert," with delivery treated as a separate step. This creates a window where the cert has been renewed in the management system but hasn't been delivered to the services that need it — and the old cert is still running.

Delivery should be part of the rotation process, not a follow-on action. The rotation isn't complete until:

The new cert is issued
The new cert is delivered to all targets
All services have confirmed they're serving the new cert
The old cert is marked as retired

This is why certificate delivery needs to be a first-class feature of your cert management system, not an afterthought. Systems that treat issuance and delivery as separate concerns tend to have incidents in the gap between them.
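
One way to make "not complete until delivered and verified" concrete is to model a rotation as a record that refuses to call itself done while any of the four conditions is open. This is a hypothetical data model for illustration only; the class and field names are assumptions, not taken from any specific product.

```python
from dataclasses import dataclass, field


@dataclass
class Rotation:
    """Tracks one rotation of one certificate (hypothetical model)."""
    targets: set[str]                       # services that must receive the cert
    issued: bool = False
    delivered: set[str] = field(default_factory=set)
    verified: set[str] = field(default_factory=set)
    old_cert_retired: bool = False

    def pending(self) -> list[str]:
        """List everything that still blocks completion."""
        blockers = []
        if not self.issued:
            blockers.append("new cert not issued")
        for target in sorted(self.targets - self.delivered):
            blockers.append(f"not delivered to {target}")
        for target in sorted(self.targets - self.verified):
            blockers.append(f"{target} not verified serving new cert")
        if not self.old_cert_retired:
            blockers.append("old cert not retired")
        return blockers

    def is_complete(self) -> bool:
        return not self.pending()
```

With this shape, "renewed but not delivered" shows up as an incomplete rotation with named blockers instead of a silent gap.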

Principle 3: Never share private keys between services

It's tempting to use a wildcard cert (*.example.com) for everything and deploy the same private key to dozens of services. This is convenient but creates a significant blast radius: compromise one service, and the private key that's on 30 other services is compromised too.

Best practice is to issue separate certificates, each with its own private key, for separate services. If service-a and service-b both serve traffic on example.com, they should have different certs (or at minimum, each service's copy of the private key should be stored separately, with access scoped to that service).

The usual objection is operational overhead: more certs means more to manage. The answer is that your cert management system should handle that overhead automatically. With good automated lifecycle management, running 200 per-service certs instead of 20 wildcard certs costs very little operationally and dramatically reduces your blast radius.
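
A quick way to see how much key sharing you have today is to group services by the fingerprint of the key they use. The sketch below assumes you can already export an inventory mapping each service name to a key fingerprint; that inventory format and the function name are hypothetical.

```python
from collections import defaultdict


def find_shared_keys(inventory: dict[str, str]) -> dict[str, list[str]]:
    """Map each key fingerprint used by 2+ services to those services.

    inventory maps service name -> fingerprint of the key it uses
    (assumed input format).
    """
    by_key: dict[str, list[str]] = defaultdict(list)
    for service, key_fingerprint in inventory.items():
        by_key[key_fingerprint].append(service)
    # Keep only fingerprints that appear on more than one service.
    return {fp: sorted(svcs) for fp, svcs in by_key.items() if len(svcs) > 1}
```

Each entry in the result is a blast radius: every service listed under a fingerprint is exposed if any one of them is compromised.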

Principle 4: Use scoped access for cert retrieval

How do machines get access to their certificates? The usual answers are "we put the cert in a shared location" or "we push it via Ansible." Both approaches have problems:

Shared location means any service that can reach that location can access any cert — not just its own. An over-privileged service or a compromised machine can exfiltrate private keys it shouldn't have.

Push-via-Ansible means the system doing the pushing has credentials that allow it to write files to every target — which is itself a large blast radius if those credentials are compromised.

The better model is certificate-scoped tokens: each service gets a unique token that only allows it to fetch its own certificate. A token for service-a can't be used to fetch service-b's private key, no matter who presents it. This is the model CertLocker uses — and it means a compromised service only exposes its own cert, not your entire certificate inventory.
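
The core of a certificate-scoped token is a binding between one token and one cert, checked on every fetch. The following in-memory sketch illustrates that idea only; it is not CertLocker's API or implementation, and the class and method names are assumptions.

```python
import hashlib
import secrets


class TokenStore:
    """In-memory sketch; keeps only token hashes, never raw tokens."""

    def __init__(self) -> None:
        self._cert_by_token_hash: dict[str, str] = {}

    @staticmethod
    def _hash(token: str) -> str:
        return hashlib.sha256(token.encode()).hexdigest()

    def issue(self, cert_id: str) -> str:
        """Create a random token bound to exactly one cert."""
        token = secrets.token_urlsafe(32)
        self._cert_by_token_hash[self._hash(token)] = cert_id
        return token

    def may_fetch(self, token: str, cert_id: str) -> bool:
        """Allow the fetch only if the token is bound to this cert."""
        return self._cert_by_token_hash.get(self._hash(token)) == cert_id
```

Because the check compares the requested cert against the token's binding, a token issued for service-a is rejected for service-b regardless of which machine presents it.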

Principle 5: Automate reload hooks

Rotating a certificate doesn't help if the service keeps serving the old one because it hasn't reloaded. Services like HAProxy, Nginx, and OpenVPN cache their TLS configuration in memory — updating the cert file on disk doesn't automatically make them use the new cert.

Your rotation process needs to trigger service reloads after delivery. These should be:

Conditional — only reload if the cert actually changed, not on every check
Graceful — use reload rather than restart to avoid connection interruption (HAProxy's graceful reload, Nginx's -s reload)
Verified — confirm the service is actually serving the new cert after reload
Logged — record that the reload happened and when
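
The four properties above can be sketched as a single hook. This is a minimal illustration under stated assumptions: reload_cmd is whatever graceful reload command your service uses (for example ["nginx", "-s", "reload"]), and verify_serving is a caller-supplied function that checks the live service, for instance with a TLS handshake. Both parameter names are hypothetical.

```python
import hashlib
import logging
import subprocess
from pathlib import Path
from typing import Callable

log = logging.getLogger("cert-reload")


def install_and_reload(new_pem: bytes, cert_path: Path, reload_cmd: list[str],
                       verify_serving: Callable[[bytes], bool]) -> bool:
    """Install new_pem and reload the service only if the cert changed.

    Returns True if a reload happened, False if the cert was unchanged.
    Raises RuntimeError if the service is not serving the new cert afterwards.
    """
    new_digest = hashlib.sha256(new_pem).hexdigest()

    # Conditional: skip everything if the file on disk is already identical.
    if cert_path.exists():
        if hashlib.sha256(cert_path.read_bytes()).hexdigest() == new_digest:
            return False

    # Write to a temp file, then rename, so readers never see a partial cert.
    tmp = cert_path.with_suffix(cert_path.suffix + ".tmp")
    tmp.write_bytes(new_pem)
    tmp.replace(cert_path)

    # Graceful: the caller passes a reload command, not a restart.
    subprocess.run(reload_cmd, check=True)

    # Verified: ask the caller-supplied check whether the new cert is live.
    if not verify_serving(new_pem):
        raise RuntimeError(f"{cert_path}: service is not serving the new cert")

    # Logged: record what was reloaded and which cert is now in place.
    log.info("reloaded with %s, cert sha256=%s", reload_cmd, new_digest)
    return True
```

Keeping the verification step injectable lets each service decide what "serving the new cert" means without changing the hook itself.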

Principle 6: Test rotation in non-production first

Before your automated rotation process runs in production for the first time, test it end-to-end in a staging or development environment. Specifically verify:

Renewal actually completes (ACME challenges succeed, CA responds)
Delivery reaches every expected target
Services reload and serve the new cert
Old cert is properly retired
Audit log captures all expected events

A rotation process that's never been tested end-to-end is a process that will fail at the worst possible time — under time pressure, when the cert is actually expiring.
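
A staging run is easier to repeat if the checklist above is executable. The sketch below assumes you supply one zero-argument function per item that returns True on success; the runner and its names are hypothetical, and the checks themselves depend entirely on your environment.

```python
from typing import Callable


def run_rotation_checks(checks: dict[str, Callable[[], bool]]) -> list[str]:
    """Run every check and return the names of those that failed.

    A check that raises counts as failed. The run always continues,
    so one failure does not hide the others.
    """
    failed = []
    for name, check in checks.items():
        try:
            passed = check()
        except Exception:
            passed = False
        if not passed:
            failed.append(name)
    return failed
```

An empty result means every item on the checklist passed; anything else is the list of steps to fix before relying on the process in production.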

Principle 7: Build for team handoffs

Cert rotation is often set up by one engineer and then forgotten about — until something goes wrong and that engineer isn't available. Good cert management builds for this explicitly:

Documentation — the rotation process should be documented enough that someone unfamiliar with it can diagnose a failure. This doesn't mean extensive runbooks; it means the system's UI and logs are self-explanatory.

Audit trail — every cert action should be logged so that when something goes wrong, you can see exactly what happened and when.

Escalation paths — if automated rotation fails, who gets notified? Is it clear who can take manual action to fix it?

No tribal knowledge dependencies — the rotation process should work without someone knowing a special command or having a specific file on their laptop.

What good rotation looks like in practice

A mature certificate rotation process looks something like this:

Day 60 of a 90-day cert: The management system detects that renewal is due. It triggers issuance from the configured CA. If ACME, it handles the challenge automatically. The new cert is issued and stored securely.

The delivery process starts: every service registered as a consumer of this cert gets notified (or fetches, in a pull model). Each service receives the new cert file and private key, validates them, and executes its reload hook. A verification check confirms each service is serving the new cert.

The audit log records: cert renewed, issued at timestamp X, delivered to services A, B, C at timestamp Y, each service verified serving new cert at timestamp Z. Slack notification (or equivalent): "Cert for api.example.com renewed and deployed to 3 services. No action required."
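
The audit entries and the notification described above could be produced along these lines. This is a sketch with an assumed record shape (one JSON object per line with "at", "action", and "cert" fields); the function names and field names are illustrative, not a prescribed format.

```python
import json
from datetime import datetime, timezone


def audit_event(action: str, cert: str, at: datetime, **details) -> str:
    """Serialize one audit event as a single JSON line (assumed format)."""
    record = {
        "at": at.astimezone(timezone.utc).isoformat(),
        "action": action,
        "cert": cert,
        **details,
    }
    return json.dumps(record, sort_keys=True)


def summary(cert: str, services: list[str]) -> str:
    """Build the human-facing notification text."""
    return (f"Cert for {cert} renewed and deployed to "
            f"{len(services)} services. No action required.")


# Example: one line per event, appended to a log in the order they occur.
now = datetime.now(timezone.utc)
print(audit_event("renewed", "api.example.com", now))
print(audit_event("delivered", "api.example.com", now, service="A"))
print(summary("api.example.com", ["A", "B", "C"]))
```

One event per line keeps the log append-only and easy to search when someone unfamiliar with the system has to reconstruct what happened.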

That's the goal: rotation that's invisible to operations because it's working exactly as intended.

Summary

Certificate rotation best practices aren't complicated, but they do require deliberate system design:

Rotate early (with about a third of the lifetime remaining, which is 30 days for a 90-day cert), not in response to urgency
Treat delivery as part of rotation, not a separate step
Issue separate certs for separate services, not shared wildcards
Use scoped access tokens for cert retrieval, not shared credentials
Automate reload hooks — a deployed cert is useless until the service reloads
Test end-to-end before depending on it in production
Build for team handoffs, not individual knowledge

A system that implements all of these is one where cert expiry incidents essentially stop happening — because the human is never in the critical path.