Certificate Rotation Best Practices for Infrastructure Teams
Good certificate rotation isn't just about renewing certs before they expire. It's about building a process that's automated, audited, and resilient — one that doesn't break when the person who set it up leaves.
Certificate rotation is one of those processes that looks simple in theory and turns out to be surprisingly complex in practice. The naive view: renew the cert before it expires, put the new one on the server. Done.
The reality at scale: dozens or hundreds of certs, multiple environments, services with different reload mechanisms, delivery that has to happen atomically (you can't have some services using the new cert while others still use the old one), an audit trail for compliance, and a process that works even when the person who built it is unavailable.
Here's what good certificate rotation actually looks like, from the principles down to implementation.
Principle 1: Rotate early, not just before expiry
Many teams treat rotation as an emergency: renew when the cert is about to expire. That's the wrong mental model. Rotation should be a routine maintenance event that happens so far ahead of expiry that the expiry date becomes irrelevant.
The standard recommendation is to renew at 30 days remaining — one third of a 90-day cert's lifetime. This gives you:
- Time to detect and fix a failed renewal attempt
- Time to identify delivery problems
- A buffer if something in the renewal process needs manual intervention
- Confidence that even a weeks-long delay won't cause an expiry
With 90-day Let's Encrypt certs, triggering renewal at day 60 (30 days remaining) is the standard. With shorter lifetimes, such as the 47-day maximum the CA/Browser Forum has approved for 2029, you'd trigger at roughly day 30 (15-17 days remaining).
The point is that renewal should be boring and predictable, not urgent.
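The one-third rule above is simple enough to write down directly. Here is a minimal sketch in Python; the function names and the `remaining_fraction` parameter are made up for illustration, and a real scheduler would read the validity dates from the certificate itself.

```python
from datetime import datetime, timedelta, timezone


def renewal_due_at(not_before: datetime, not_after: datetime,
                   remaining_fraction: float = 1 / 3) -> datetime:
    """Return the moment renewal should start: when `remaining_fraction`
    of the cert's total lifetime is still left."""
    lifetime = not_after - not_before
    return not_after - lifetime * remaining_fraction


def should_renew(now: datetime, not_before: datetime, not_after: datetime,
                 remaining_fraction: float = 1 / 3) -> bool:
    """True once we are inside the renewal window."""
    return now >= renewal_due_at(not_before, not_after, remaining_fraction)


if __name__ == "__main__":
    issued = datetime(2025, 1, 1, tzinfo=timezone.utc)
    for days in (90, 47):
        expires = issued + timedelta(days=days)
        due = renewal_due_at(issued, expires)
        print(f"{days}-day cert: renew from day {(due - issued).days}")
```

Expressing the threshold as a fraction rather than a fixed "30 days" means the same check keeps working when cert lifetimes shrink.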
Principle 2: Treat delivery as part of rotation
A common failure mode: rotation is defined as "issue a new cert," with delivery treated as a separate step. This creates a window where the cert has been renewed in the management system but hasn't been delivered to the services that need it — and the old cert is still running.
Delivery should be part of the rotation process, not a follow-on action. The rotation isn't complete until:
- The new cert is issued
- The new cert is delivered to all targets
- All services have confirmed they're serving the new cert
- The old cert is marked as retired
This is why certificate delivery needs to be a first-class feature of your cert management system, not an afterthought. Systems that treat issuance and delivery as separate concerns tend to have incidents in the gap between them.
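One way to make that definition of "complete" hard to ignore is to model it explicitly. The sketch below is an assumption about how you might track it, not a description of any particular product; the `Rotation` class and its field names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Rotation:
    """Tracks one rotation until every step in the checklist is done."""
    cert_name: str
    targets: set[str]
    issued: bool = False
    delivered: set[str] = field(default_factory=set)
    verified: set[str] = field(default_factory=set)
    old_cert_retired: bool = False

    def pending(self) -> list[str]:
        """List the steps still outstanding, in checklist order."""
        steps = []
        if not self.issued:
            steps.append("issue new cert")
        steps += [f"deliver to {t}" for t in sorted(self.targets - self.delivered)]
        steps += [f"verify {t}" for t in sorted(self.targets - self.verified)]
        if not self.old_cert_retired:
            steps.append("retire old cert")
        return steps

    def is_complete(self) -> bool:
        # Issuance alone never counts: all four conditions must hold.
        return not self.pending()
```

With this shape, a rotation that has been issued but not delivered shows up as incomplete work rather than as a success.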
Principle 3: Never share private keys between services
It's tempting to use a wildcard cert (*.example.com) for everything and deploy the same private key to dozens of services. This is convenient but creates a significant blast radius: compromise one service, and the private key that's on 30 other services is compromised too.
Best practice is to issue separate certificates for separate services. If service-a and service-b both serve traffic on example.com, they should have different certs (or at minimum, each service's copy of the private key should be stored separately, with access scoped to that service).
The objection to this is operational overhead — more certs means more to manage. The answer to this objection is that your cert management system should handle this automatically. When you have good automated lifecycle management, having 200 certs instead of 20 wildcard certs costs very little operationally and dramatically reduces your blast radius.
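If you want to see how much key sharing you have today, a quick inventory check helps. This is a rough sketch that assumes you can already produce a mapping from service name to public-key fingerprint; how you collect that mapping is up to you, and the function name is invented for this example.

```python
from collections import defaultdict


def shared_keys(inventory: dict[str, str]) -> dict[str, list[str]]:
    """Given {service_name: key_fingerprint}, return every fingerprint
    that is deployed to more than one service, with the services using it."""
    by_key: dict[str, list[str]] = defaultdict(list)
    for service, fingerprint in inventory.items():
        by_key[fingerprint].append(service)
    return {fp: sorted(svcs) for fp, svcs in by_key.items() if len(svcs) > 1}


if __name__ == "__main__":
    inventory = {
        "service-a": "fp-wildcard",   # placeholder fingerprints
        "service-b": "fp-wildcard",
        "service-c": "fp-own-key",
    }
    for fp, services in shared_keys(inventory).items():
        print(f"{fp} is shared by {len(services)} services: {services}")
```

The length of each list is the blast radius for that key: the number of services exposed if any one of them is compromised.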
Principle 4: Use scoped access for cert retrieval
How do machines get access to their certificates? This is a question that's often answered with "we put the cert in a shared location" or "we push it via Ansible." Both approaches have problems:
Shared location means any service that can reach that location can access any cert — not just its own. An over-privileged service or a compromised machine can exfiltrate private keys it shouldn't have.
Push-via-Ansible means the system doing the pushing has credentials that allow it to write files to every target — which is itself a large blast radius if those credentials are compromised.
The better model is certificate-scoped tokens: each service gets a unique token that only allows it to fetch its own certificate. A token for service-a can't be used to fetch service-b's private key, no matter who presents it. This is the model CertLocker uses — and it means a compromised service only exposes its own cert, not your entire certificate inventory.
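To make the idea concrete, here is a minimal sketch of a token that is cryptographically bound to a single certificate ID. It is an illustration only and not how CertLocker or any other product implements its tokens; the function names are hypothetical, and a real system would also need expiry, revocation, and safe secret storage.

```python
import hashlib
import hmac


def issue_token(cert_id: str, secret: bytes) -> str:
    """Create a token that embeds the cert ID plus an HMAC over it."""
    mac = hmac.new(secret, cert_id.encode(), hashlib.sha256).hexdigest()
    return f"{cert_id}.{mac}"


def authorize(token: str, requested_cert_id: str, secret: bytes) -> bool:
    """Allow the fetch only if the token is genuine AND scoped to the
    certificate being requested."""
    cert_id, _, mac = token.rpartition(".")
    expected = hmac.new(secret, cert_id.encode(), hashlib.sha256).hexdigest()
    genuine = hmac.compare_digest(mac.encode(), expected.encode())
    in_scope = hmac.compare_digest(cert_id.encode(), requested_cert_id.encode())
    return genuine and in_scope


if __name__ == "__main__":
    secret = b"demo-secret-do-not-use-in-production"
    token_a = issue_token("service-a", secret)
    print(authorize(token_a, "service-a", secret))  # own cert
    print(authorize(token_a, "service-b", secret))  # someone else's cert
```

The scope check lives on the server side, so it does not matter who presents the token: it can only ever unlock the one certificate it was issued for.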
Principle 5: Automate reload hooks
Rotating a certificate doesn't help if the service keeps serving the old one because it hasn't reloaded. Services like HAProxy, Nginx, and OpenVPN cache their TLS configuration in memory — updating the cert file on disk doesn't automatically make them use the new cert.
Your rotation process needs to trigger service reloads after delivery. These should be:
- Conditional: only reload if the cert actually changed, not on every check
- Graceful: use reload rather than restart to avoid connection interruption (HAProxy's graceful reload, Nginx's -s reload)
- Verified: confirm the service is actually serving the new cert after reload
- Logged: record that the reload happened and when
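A reload hook covering the conditional, graceful, and logged points might look like the sketch below. The function names are hypothetical, the reload action is passed in as a callable so the same code works for any service, and verification is deliberately left out because it needs a network check against the running service.

```python
import logging
import os
import subprocess
import tempfile
from pathlib import Path
from typing import Callable, Sequence

log = logging.getLogger("cert-reload")


def install_and_reload(new_pem: bytes, cert_path: Path,
                       reload: Callable[[], None]) -> bool:
    """Install new_pem at cert_path and call reload() only if the content
    changed. Returns True if a reload was triggered."""
    if cert_path.exists() and cert_path.read_bytes() == new_pem:
        return False  # Conditional: nothing changed, so do not reload.

    # Write to a temp file in the same directory, then rename over the
    # target, so the service never reads a half-written file.
    fd, tmp_name = tempfile.mkstemp(dir=cert_path.parent)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(new_pem)
        os.replace(tmp_name, cert_path)
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise

    reload()  # Graceful: the caller supplies a reload, not a restart.
    log.info("installed new cert at %s and triggered reload", cert_path)
    return True


def command_reloader(argv: Sequence[str]) -> Callable[[], None]:
    """Build a reload callable that runs a command and fails loudly."""
    def run() -> None:
        subprocess.run(list(argv), check=True)
    return run
```

For Nginx, the reload callable could be `command_reloader(["nginx", "-s", "reload"])`; the certificate path you pass in depends on your own layout.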
Principle 6: Test rotation in non-production first
Before your automated rotation process runs in production for the first time, test it end-to-end in a staging or development environment. Specifically verify:
- Renewal actually completes (ACME challenges succeed, CA responds)
- Delivery reaches every expected target
- Services reload and serve the new cert
- Old cert is properly retired
- Audit log captures all expected events
A rotation process that's never been tested end-to-end is a process that will fail at the worst possible time — under time pressure, when the cert is actually expiring.
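The "services reload and serve the new cert" check is the easiest one to skip and the most useful to automate. Here is a rough sketch using only Python's standard library: it fetches whatever certificate the endpoint presents and compares fingerprints with the cert you just issued. The helper names are made up, and it assumes the endpoint presents the same leaf certificate regardless of SNI (or that your Python version sends SNI in `ssl.get_server_certificate`).

```python
import hashlib
import ssl

_HEADER = "-----BEGIN CERTIFICATE-----"
_FOOTER = "-----END CERTIFICATE-----"


def pem_fingerprint(pem: str) -> str:
    """SHA-256 fingerprint of the first certificate in a PEM string.
    Taking only the first block lets this accept a full chain file."""
    start = pem.index(_HEADER)
    end = pem.index(_FOOTER, start) + len(_FOOTER)
    der = ssl.PEM_cert_to_DER_cert(pem[start:end])
    return hashlib.sha256(der).hexdigest()


def served_fingerprint(host: str, port: int = 443) -> str:
    """Fingerprint of the cert the endpoint is presenting right now.
    No CA validation is done, which suits staging CAs; this checks
    identity of the cert, not its trust chain."""
    return pem_fingerprint(ssl.get_server_certificate((host, port)))


def is_serving(host: str, expected_pem: str, port: int = 443) -> bool:
    """True if the live endpoint presents the expected certificate."""
    return served_fingerprint(host, port) == pem_fingerprint(expected_pem)
```

Running `is_serving` against every expected target after a staging rotation turns two items on the checklist into a pass/fail result instead of a manual spot check.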
Principle 7: Build for team handoffs
Cert rotation is often set up by one engineer and then forgotten about — until something goes wrong and that engineer isn't available. Good cert management builds for this explicitly:
Documentation — the rotation process should be documented enough that someone unfamiliar with it can diagnose a failure. This doesn't mean extensive runbooks; it means the system's UI and logs are self-explanatory.
Audit trail — every cert action should be logged so that when something goes wrong, you can see exactly what happened and when.
Escalation paths — if automated rotation fails, who gets notified? Is it clear who can take manual action to fix it?
No tribal knowledge dependencies — the rotation process should work without someone knowing a special command or having a specific file on their laptop.
What good rotation looks like in practice
A mature certificate rotation process looks something like this:
Day 60 of a 90-day cert: The management system detects that renewal is due. It triggers issuance from the configured CA. If ACME, it handles the challenge automatically. The new cert is issued and stored securely.
The delivery process starts: every service registered as a consumer of this cert gets notified (or fetches, in a pull model). Each service receives the new cert file and private key, validates them, and executes its reload hook. A verification check confirms each service is serving the new cert.
The audit log records: cert renewed, issued at timestamp X, delivered to services A, B, C at timestamp Y, each service verified serving new cert at timestamp Z. Slack notification (or equivalent): "Cert for api.example.com renewed and deployed to 3 services. No action required."
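The audit entries and the notification described above could be produced with something as small as the following sketch. The field names (`ts`, `action`, `cert`) and both function names are assumptions chosen for this example; use whatever schema your logging pipeline expects.

```python
import json
from datetime import datetime, timezone


def audit_event(action: str, cert: str, **details) -> str:
    """Serialize one audit record as a single JSON line with a UTC timestamp."""
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "action": action,
        "cert": cert,
        **details,
    }
    return json.dumps(record, sort_keys=True)


def summary(cert: str, verified_services: list[str]) -> str:
    """Human-readable notification for a fully verified rotation."""
    return (f"Cert for {cert} renewed and deployed to "
            f"{len(verified_services)} services. No action required.")


if __name__ == "__main__":
    services = ["A", "B", "C"]
    print(audit_event("renewed", "api.example.com"))
    print(audit_event("delivered", "api.example.com", targets=services))
    print(audit_event("verified", "api.example.com", targets=services))
    print(summary("api.example.com", services))
```

One JSON object per line keeps the log greppable for a human during an incident and easy to ship to whatever log store you already run.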
That's the goal: rotation that's invisible to operations because it's working exactly as intended.
Summary
Certificate rotation best practices aren't complicated, but they do require deliberate system design:
- Rotate early (with about a third of the lifetime remaining, e.g. 30 days on a 90-day cert), not in response to urgency
- Treat delivery as part of rotation, not a separate step
- Issue separate certs for separate services, not shared wildcards
- Use scoped access tokens for cert retrieval, not shared credentials
- Automate reload hooks — a deployed cert is useless until the service reloads
- Test end-to-end before depending on it in production
- Build for team handoffs, not individual knowledge
A system that implements all of these is one where cert expiry incidents essentially stop happening — because the human is never in the critical path.
Automate certificate rotation with CertLocker
CertLocker implements all of these best practices — automated renewal, scoped delivery, reload hooks, and full audit trail.