When 'Is the Cert Valid?' Is the Wrong Question

A common incident pattern: something breaks. An engineer opens an investigation with "let me check the cert." They check the cert on the public-facing endpoint. It's valid, not expired, chain is intact. They report back: "cert's fine." The investigation stalls.

Meanwhile, the actual problem is a backend service whose internal cert expired three days ago. Or the load balancer's cert is valid but does not match the private key the process has loaded. Or the monitoring system is checking a different hostname than the one clients actually use.

"Is the cert valid?" sounds like a diagnostic question. In multi-certificate environments it's almost useless without the follow-up: valid where? Which cert? Against which trust store? Serving which traffic?

The architecture that created this problem

A typical production infrastructure for a web-facing service involves at least four distinct certificate surfaces, each with different trust requirements:

Edge / public-facing. The certificate presented to browsers and API clients. Issued by a public CA (Let's Encrypt, DigiCert, etc.), trusted by the standard root store. This is the cert most people think of when they say "the certificate."

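Load balancer to backend. After TLS terminates at the load balancer (HAProxy, NGINX, AWS ALB), traffic is often re-encrypted on its way to backend services. This internal cert may be issued by your own internal CA. It's not in any browser's root store, and it doesn't need to be, because the load balancer is the only client that could validate it. Whether it actually does depends on the product and its configuration: NGINX verifies upstream certs only when proxy_ssl_verify is enabled, and AWS ALB does not validate target certificates at all.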

Service-to-service. Backend services calling each other over mTLS. Each service presents a certificate that identifies it to its peers, and validates their certificates against an internal CA. These certs usually have shorter lifetimes by design and rotate more frequently.

Management and monitoring plane. Certificates used by your infrastructure tooling — Ansible, Prometheus, Vault, CertLocker itself. These need to trust a different set of roots than your public-facing services, and are often managed by separate teams with separate tooling.

Each of these surfaces can fail independently. An expired edge cert causes browser errors. An expired backend cert typically surfaces as a 502 at the load balancer, provided the load balancer validates backend certs. An expired service-to-service cert can break a microservice call chain with nothing more visible than a generic upstream error. An expired monitoring cert means your observability goes blind exactly when you need it most.

Four ways the mental model breaks down

The "one service, one cert" mental model was never entirely accurate, but for a long time the gap between the model and reality was small enough to ignore. As infrastructure became more distributed and layered, the gap grew. Here's where it breaks down in practice.

Validity isn't universal — it's scoped to a trust store. When you check "is this cert valid?" using OpenSSL or a browser, you're checking validity against a specific trust store. A cert that's valid against the system root store may not be valid against the custom CA bundle your internal services use, and vice versa. "The cert is valid" means nothing without specifying what's validating it.
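
To make the scoping concrete, here is a minimal Python sketch, standard library only, that asks the same question of the same endpoint against one chosen trust store. The helper name validates and its parameters are assumptions for illustration, not part of any particular tool.

```python
import socket
import ssl
from typing import Optional, Tuple


def validates(host: str, port: int = 443,
              ca_bundle: Optional[str] = None,
              timeout: float = 5.0) -> Tuple[bool, str]:
    """Check whether host:port presents a cert that ONE trust store accepts.

    ca_bundle=None uses the system default roots; a PEM file path restricts
    trust to the CAs in that file only.
    """
    ctx = ssl.create_default_context(cafile=ca_bundle)
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            # The TLS handshake runs here; verification failures raise.
            with ctx.wrap_socket(sock, server_hostname=host):
                return True, "ok"
    except ssl.SSLCertVerificationError as exc:
        # The cert was rejected by THIS trust store; say why.
        return False, exc.verify_message
    except OSError as exc:
        # Refused, timed out, DNS failure, or a non-verification TLS error.
        return False, f"connection error: {exc}"
```

Calling validates(host, port) and then validates(host, port, ca_bundle=path_to_internal_bundle) against the same internal endpoint would be expected to give two different answers, which is the point: the result describes the pairing of cert and trust store, not the cert alone.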

Multiple teams manage different certificate surfaces. The infra team owns the HAProxy cert. The platform team manages the service mesh certs. The security team controls the internal CA. Each team knows its own surface is healthy. Nobody has a complete view. When something breaks at the intersection of two surfaces — a load balancer cert issued by an internal CA that just had its intermediate rotated — nobody owns the problem because nobody has the full picture.

Monitoring checks one path but traffic uses another. A healthcheck monitors https://api.example.com/health and reports the cert as valid. But 20% of API traffic arrives on a different hostname that maps to a different virtual host with a different cert that expired last week. The monitor passes. The traffic fails. The two are measuring different things.
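
One way to surface this gap is to compare the hostnames your checks cover with the hostnames that actually appear in request logs. The sketch below assumes you can extract a Host or SNI value per request; the function name and the hostnames in the usage note are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable


def unmonitored_traffic(monitored_hosts: Iterable[str],
                        request_hosts: Iterable[str]) -> Dict[str, float]:
    """Return {hostname: share of requests} for hostnames no check covers."""
    covered = {h.lower() for h in monitored_hosts}
    counts = Counter(h.lower() for h in request_hosts)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {host: n / total
            for host, n in counts.items()
            if host not in covered}
```

For example, with a single check on api.example.com and a log in which two of every ten requests arrive on a second hostname, the function returns that second hostname with a share of 0.2: a fifth of traffic whose cert nobody is watching.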

Cert inventory diverges from deployment reality over time. The cert management system says a cert was renewed on Tuesday. But the deployment step failed silently on two of the five backend nodes. The cert is "current" in the management system but expired in production on those two nodes. Without delivery verification — confirming that every target is actually serving the new cert — the gap between "issued" and "deployed" is invisible.
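
Delivery verification can be sketched as a fingerprint comparison: fetch whatever each target is serving right now and compare it with the fingerprint of the cert that was just issued. This is an illustrative outline, not any product's implementation; the function names are assumptions, and a real check would also need to handle SNI and non-default ports per target.

```python
import hashlib
import ssl
from typing import Callable, Iterable, List


def served_fingerprint(host: str, port: int = 443) -> str:
    """SHA-256 of the leaf cert a node serves (fetched without validation)."""
    pem = ssl.get_server_certificate((host, port))
    der = ssl.PEM_cert_to_DER_cert(pem)
    return hashlib.sha256(der).hexdigest()


def stale_nodes(expected_fp: str,
                nodes: Iterable[str],
                fetch: Callable[[str], str] = served_fingerprint) -> List[str]:
    """Return nodes that are NOT confirmed to serve the expected cert."""
    stale = []
    for node in nodes:
        try:
            fp = fetch(node)
        except OSError:
            fp = None  # unreachable counts as unverified, not as healthy
        if fp != expected_fp:
            stale.append(node)
    return stale
```

Treating an unreachable node as stale is a deliberate choice in this sketch: "could not confirm" should never be reported as "deployed".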

How to map your trust surfaces correctly

The starting point for reasoning clearly about certificates in a complex environment is a trust surface map: a diagram of every point where TLS termination or certificate validation occurs in your infrastructure, with the answer to three questions at each point:

What certificate is presented? — which cert, which CA, what validity period
What is validating it? — which trust store, which client, what validation logic
Who owns this cert? — which team manages it, what tooling renews it, where is it tracked

For a typical production service, this produces a table something like:

Surface	Cert issued by	Validated by	Owner
HAProxy edge	Let's Encrypt	Browser / OS root store	Infra team
HAProxy → backend	Internal CA	HAProxy CA bundle	Platform team
Service mesh mTLS	Internal CA (short-lived)	Sidecar proxy	Platform team
Monitoring / ops	Internal CA	Prometheus / Grafana	SRE team

Building this table for your environment takes an afternoon. Updating it when the architecture changes takes five minutes. Having it when an incident occurs saves hours of investigation.
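
If you would rather keep the map next to your code than in a wiki, the same table can live as plain data. The sketch below simply transcribes the four rows above into Python; the class and function names are assumptions, and the arrow is written as "->" to keep the source ASCII.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class TrustSurface:
    name: str          # where TLS terminates or a cert is validated
    issued_by: str     # which CA issues the cert presented here
    validated_by: str  # which client / trust store checks it
    owner: str         # which team is responsible


SURFACES = [
    TrustSurface("HAProxy edge", "Let's Encrypt",
                 "Browser / OS root store", "Infra team"),
    TrustSurface("HAProxy -> backend", "Internal CA",
                 "HAProxy CA bundle", "Platform team"),
    TrustSurface("Service mesh mTLS", "Internal CA (short-lived)",
                 "Sidecar proxy", "Platform team"),
    TrustSurface("Monitoring / ops", "Internal CA",
                 "Prometheus / Grafana", "SRE team"),
]


def surfaces_by_owner(surfaces: Iterable[TrustSurface]) -> Dict[str, List[str]]:
    """Group surface names by owning team, preserving table order."""
    index: Dict[str, List[str]] = {}
    for surface in surfaces:
        index.setdefault(surface.owner, []).append(surface.name)
    return index
```

Grouping by owner answers the first incident question quickly: for a given surface, who do you page?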

The monitoring implication

Once you have a trust surface map, the right monitoring strategy becomes clear: you need a check for each surface, using the right trust store for that surface.

A single synthetic check against the public hostname tells you the edge cert is valid. It tells you nothing about the backend cert, the service mesh certs, or the ops tooling certs. If your monitoring is a single openssl s_client command checking the public endpoint, your coverage is one-quarter of the picture at best.

More importantly: the check for an internal cert should use the internal CA bundle, not the system root store. If you check an internal cert against the system root store, verification fails even when the cert is completely healthy, simply because that store does not contain the issuing CA. This produces false positives that train engineers to ignore cert alerts, which is how a real expiry goes unnoticed.
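
One hedge against that alert fatigue is to label verification failures by cause, so "this store does not trust the issuer" never looks the same as "this cert has expired". The sketch below maps the OpenSSL verification codes that Python exposes on ssl.SSLCertVerificationError.verify_code to coarse labels; the label names and the grouping are assumptions made for illustration.

```python
import ssl

# OpenSSL X509_V_ERR_* codes, as surfaced in SSLCertVerificationError.verify_code
NOT_YET_VALID = 9
EXPIRED = 10
HOSTNAME_MISMATCH = 62
# Issuer missing from the store, self-signed leaf or chain, unverifiable leaf
UNTRUSTED_ISSUER = {2, 18, 19, 20, 21}


def classify(verify_code: int) -> str:
    """Map an OpenSSL verify code to a coarse alert label."""
    if verify_code in (EXPIRED, NOT_YET_VALID):
        return "validity-period"
    if verify_code in UNTRUSTED_ISSUER:
        return "untrusted-by-this-store"
    if verify_code == HOSTNAME_MISMATCH:
        return "hostname-mismatch"
    return "other"


def classify_error(exc: ssl.SSLCertVerificationError) -> str:
    """Convenience wrapper for an exception caught during a handshake."""
    return classify(exc.verify_code)
```

An "untrusted-by-this-store" result usually means the check ran against the wrong trust store or the chain is incomplete. It says nothing about expiry, so the right response is to rerun the check with the correct bundle rather than to page or to dismiss.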

Centralized visibility across surfaces

The problem with certificate surfaces being owned by different teams is that the aggregate view doesn't exist anywhere. Infra knows the HAProxy cert is healthy. Platform knows the service mesh is healthy. But nobody knows the combined state of all certificate surfaces simultaneously.

This is where certificate management tooling provides the most value beyond automation: a single view of every cert in the inventory, regardless of which surface it covers or which CA issued it. Not just "is this cert expired?" but "which team manages this, when was it last renewed, what are its targets, and has delivery been verified?"

Without that view, the response to "is anything cert-related broken?" is a multi-team investigation that starts by figuring out what certs even exist. With it, the answer is a dashboard that shows the state of every cert surface at a glance.
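
As a rough sketch of what such a view computes, assume each inventory record carries its surface, owner, expiry, and a delivery-verified flag. The record fields, status labels, and 14-day warning window below are all assumptions chosen for illustration, not a description of any specific product.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class CertRecord:
    name: str
    surface: str
    owner: str
    not_after: datetime      # expiry, timezone-aware
    delivery_verified: bool  # every target confirmed serving this cert


def status(rec: CertRecord, now: datetime, warn_days: int = 14) -> str:
    """Worst applicable condition for one record."""
    if rec.not_after <= now:
        return "expired"
    if not rec.delivery_verified:
        return "unverified"
    if rec.not_after - now <= timedelta(days=warn_days):
        return "expiring"
    return "ok"


_SEVERITY = {"expired": 0, "unverified": 1, "expiring": 2, "ok": 3}


def overview(records: Iterable[CertRecord],
             now: Optional[datetime] = None) -> List[Tuple[str, str, str, str]]:
    """Rows of (status, surface, owner, name), worst first."""
    now = now or datetime.now(timezone.utc)
    rows = [(status(r, now), r.surface, r.owner, r.name) for r in records]
    return sorted(rows, key=lambda row: _SEVERITY[row[0]])
```

Sorting worst-first means the top row answers "is anything cert-related broken?" and the owner column says who to ask next.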

The practical takeaway

The next time an incident starts with a cert check, ask the follow-up questions before concluding anything:

Which cert, on which surface? The edge cert, a backend cert, an internal service cert?
Valid against which trust store? Public root store? Internal CA bundle? A specific service's trust configuration?
What's actually validating this cert in production? Your openssl check uses your local trust store, which may differ from what the failing service uses.
Is this the only cert surface that could be causing the observed failure? Could the problem be on a different surface that this check doesn't cover?

These questions don't require a sophisticated tooling investment to ask. But they do require a mental model that treats "the cert" as shorthand for a multi-surface, multi-team problem space rather than a single file on a single server.

Getting comfortable with that model is the difference between an investigation that resolves in 15 minutes and one that takes three hours and involves four team members.
