CertLocker Active-Passive HA: How We Built Two-Node Failover
When certificate infrastructure goes down, it takes everything that depends on it down too. This is a field report on how we designed and drilled active-passive HA for CertLocker: two nodes, one watchdog, and the hard lessons from a dozen full-node poweroff tests.
Why this matters for a certificate platform
CertLocker manages TLS certificates, SSH access tokens, secrets, and ACME workflows. In a single-node setup, if that node goes down, so does the platform responsible for keeping everything else up. HAProxy ACME renewals stall. SSH access tokens cannot be issued. Audit events stop. The tooling meant to prevent outages is itself a single point of failure.
The design requirement was straightforward: two nodes, automatic failover, no manual steps for the common failure modes. Process crash, container death, and Postgres failure should all recover on their own. Full node or datacenter loss should promote the surviving node automatically. What follows is how that works, what we learned drilling it, and what still has an open problem.
The architecture in three layers
The design is Active-Passive with a warm standby. Both nodes run a complete copy of the CertLocker stack at all times. The secondary is not idle; it serves traffic as a slave while the primary handles schedulers, the Router, and all write decisions. Only one thing ever migrates between nodes: the primary role, which moves as a single unit.
[Diagram: a load balancer (Cloudflare / AWS ALB / GCP LB; dev stand-in: test VM running HAProxy) health-checks GET /edge/rest/health and forwards to the healthy backend HAProxy (:443) on node-1 and node-2. Each node, whose role is primary or standby, runs HAProxy → Edge → Gateway → DataService → pgpool → Postgres (repmgr), plus a Router (runs when primary) and a cl-watchdog timer (10s). Streaming replication links the two Postgres instances, and each node's Gateway registers with the active Router, which may be on the other node.]
Layer 1: Entry point
In production, the entry point is a cloud load balancer: Cloudflare, AWS ALB/NLB, GCP Cloud Load Balancing, or whatever your provider offers. The two CertLocker nodes sit behind it as backends. The cloud LB handles TLS termination or TCP passthrough, health-checks both nodes, and routes traffic away from a failing one automatically with no extra infrastructure to run or maintain on your end.
For our two-node trial on a private cloud tenant, we did not have a cloud LB available, so a small test VM running native HAProxy filled the role: HTTP-to-HTTPS redirect on port 80 and TCP-passthrough on port 443 to the two backend nodes. The configuration below is that stand-in. Swap it for your cloud LB of choice in production — the only thing that must carry over is the health check target.
The health check must be L7, not TCP. The target is /edge/rest/health over HTTPS with the correct Host header. A TCP probe can pass while the application layer is broken: containers starting, Router registry empty after a restart, Edge stuck in a STARTING state. Only the L7 check distinguishes a node that is reachable from one that is actually serving traffic.
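To reproduce the L7 check by hand against a single backend, a minimal sketch follows. It assumes curl is installed; the function names, the 3-second timeout, and the example IP in the usage line are illustrative and not part of CertLocker.

```shell
#!/usr/bin/env bash
# Sketch: an L7 probe against one backend node.
# HEALTH_HOST and the timeout are assumptions for illustration.
HEALTH_HOST="${HEALTH_HOST:-certlocker.example.com}"
HEALTH_PATH="/edge/rest/health"

# Print the HTTP status code returned by one backend.
# --resolve pins the hostname to the node IP, so SNI and the Host header
# carry the public name while the request goes to a single node.
probe_status() {
  local node_ip="$1"
  curl --silent --insecure --max-time 3 \
       --output /dev/null --write-out '%{http_code}' \
       --resolve "${HEALTH_HOST}:443:${node_ip}" \
       "https://${HEALTH_HOST}${HEALTH_PATH}"
}

# Succeed only when the application layer answers 200.
# curl prints 000 when no HTTP response arrived at all.
node_is_serving() {
  [ "$(probe_status "$1")" = "200" ]
}
```

Calling `node_is_serving 10.0.0.11 && echo serving` (with a hypothetical node IP) fails for a node whose port 443 accepts connections but whose Edge is not yet answering, which is exactly the case a TCP probe would miss.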
# test VM HAProxy: stand-in for a cloud load balancer
frontend https_in
    bind *:443
    mode tcp
    default_backend certlocker_nodes
backend certlocker_nodes
    mode tcp
    option httpchk GET /edge/rest/health   # L7 check: TCP alone misses broken registries
    http-check send hdr Host certlocker.example.com hdr SNI certlocker.example.com
    server node-1 :443 check check-ssl verify none check-sni certlocker.example.com
    server node-2 :443 check check-ssl verify none check-sni certlocker.example.com
    # NOTE: no `ssl` keyword on server lines.
    # Frontend is TCP passthrough; adding ssl here makes HAProxy terminate
    # and re-originate TLS, which breaks real client traffic.
Layer 2: Application — Jeremy slave mode
CertLocker's application services (Gateway, DataService, Edge, Bastion) use an internal service registry called the Router. The Router runs in-memory and assigns roles: the first service of each type to register becomes master; all later registrations of the same type become slaves. Slave services serve requests but never run scheduled jobs: no ACME housekeeping, no certificate expiry checks, no probe polling.
The trick for two-node HA is that only one Router is active at any time, and both nodes' services register with it. On node-2, the compose environment overrides the Router address to point at node-1's private IP, and node-2 runs no Router of its own. Node-1's Gateway, DataService, Edge, and Bastion register first and are elected masters; node-2's register with the same Router afterwards, are elected slaves, and never run schedulers. When node-1 dies, node-2 starts its own Router, its services re-register with it, and schedulers start for the first time on node-2.
There is no distributed consensus involved. The Router is the single authority. Which node the Router lives on is the only question, and the watchdog answers it.
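The first-to-register rule can be sketched in a few lines of bash. This is an illustration of the rule as described here, not the Router's actual code; the function names, the ASSIGNED_ROLE variable, and the instance labels are hypothetical. It needs bash 4 or later for associative arrays.

```shell
#!/usr/bin/env bash
# Sketch of first-to-register election; not the Router's implementation.
declare -A MASTER_OF=()   # service type -> instance that registered first

# Register an instance; the assigned role is left in ASSIGNED_ROLE.
register() {
  local type="$1" instance="$2"
  if [ -z "${MASTER_OF[$type]:-}" ]; then
    MASTER_OF[$type]="$instance"
    ASSIGNED_ROLE="master"
  else
    ASSIGNED_ROLE="slave"
  fi
}

# A Router restart (or a new Router on the other node) starts with an
# empty in-memory registry, so the next registration wins again.
reset_registry() {
  MASTER_OF=()
}

register GATEWAY node-1; echo "node-1 gateway: $ASSIGNED_ROLE"   # master
register GATEWAY node-2; echo "node-2 gateway: $ASSIGNED_ROLE"   # slave
reset_registry                                                   # node-1 lost, node-2 starts a Router
register GATEWAY node-2; echo "node-2 gateway: $ASSIGNED_ROLE"   # master
```

The last three lines show why moving the Router is enough to move the schedulers: against a fresh registry, node-2's services are the first of their type and come up as masters.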
Layer 3: Database — repmgr and pgpool
Both nodes run bitnamilegacy/postgresql-repmgr:17.6.0. repmgr handles streaming replication and failover promotion. It detects primary unavailability and promotes the standby without any manual step. Replication traffic between the two nodes runs over TLS 1.3 — pg_hba.conf is configured with hostssl only, so plaintext TCP replication connections are rejected at the database level.
Each node also runs bitnamilegacy/pgpool:4.6.3, and every CertLocker service connects to its local pgpool rather than directly to Postgres. pgpool tracks which Postgres backend is currently primary using streaming-replication checks and routes all connections there. Load balancing is disabled; every query goes to the primary. This means that when repmgr promotes a new primary on node-2, node-1's pgpool detects the change and starts routing to node-2's Postgres without any service restart.
# Node 1 (ha-primary profile)
router:
  profiles: [ha-primary]
  image: ghcr.io/certlocker-io/jeremy-router:${CERTLOCKER_JEREMY_TAG}
  network_mode: host   # must be host: Router records registration IPs
  environment:
    SERVER_PORT: 9000
    REGISTRA_PORT: 40905
gateway:
  image: ghcr.io/certlocker-io/jeremy-gateway:${CERTLOCKER_JEREMY_TAG}
  network_mode: host
  environment:
    APPLICATION_ROUTER: http://${HA_ROUTER_IP}:9000
    ApplicationRegistra: ${HA_ROUTER_IP}:40905
    # On node-2 (ha-secondary), HA_ROUTER_IP = node-1's tenant IP.
    # Gateway registers there → elected slave → no schedulers run.
# ─── Database layer (both profiles) ───────────────────────────────────────────
postgres:
  image: bitnamilegacy/postgresql-repmgr:17.6.0
  environment:
    REPMGR_NODE_NAME: ${CERTLOCKER_HA_NODE_NAME}   # node-1 | node-2
    REPMGR_NODE_NETWORK_NAME: ${CERTLOCKER_HA_NODE_IP}
    REPMGR_PARTNER_NODES: ${CERTLOCKER_HA_NODE1_IP},${CERTLOCKER_HA_NODE2_IP}
    REPMGR_USE_PASSFILE: "true"   # required: inline passwords break on metacharacters
  ports:
    - "${CERTLOCKER_HA_NODE_IP}:5432:5432"
pgpool:
  image: bitnamilegacy/pgpool:4.6.3
  environment:
    PGPOOL_BACKEND_NODES: "0:${NODE1_IP}:5432,1:${NODE2_IP}:5432"
    PGPOOL_ENABLE_LOAD_BALANCING: "no"   # all queries go to the primary; no read/write split
    PGPOOL_SR_CHECK_USER: repmgr
The watchdog: three rules, no election logic
The watchdog is a bash script deployed by Ansible to /usr/local/bin/cl-watchdog on both nodes. A systemd timer fires it every 10 seconds. A host-local flock prevents overlapping runs. It does not implement any election protocol of its own. It asks one question per tick: is the world consistent with what repmgr decided?
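The overlap guard is a standard flock pattern. A minimal sketch, assuming the flock utility from util-linux is available; the lock path and the function name are illustrative and may differ from the real watchdog:

```shell
#!/usr/bin/env bash
# Sketch: skip a tick when the previous one is still running.
# LOCK_FILE is an illustrative path.
LOCK_FILE="${LOCK_FILE:-/tmp/cl-watchdog.lock}"

run_tick() {
  # Open the lock file on fd 9. The lock is tied to that descriptor and
  # is released when the process exits and the descriptor closes.
  exec 9>"$LOCK_FILE"
  if ! flock --nonblock 9; then
    echo "skipped: previous tick still running"
    return 0
  fi
  echo "tick: checking rules"
  # ... Rule 1, Rule 2, Rule 3 would run here ...
}
```

Because the timer starts a fresh process every 10 seconds, a slow convergence run simply causes the following ticks to exit early instead of piling up behind it.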
Rule 1 compares the local Postgres role with COMPOSE_PROFILES in .env; on a mismatch it flips the profile and runs docker compose up -d. This covers node death, node return, and planned switchover automatically.
# Rule 1: role follows DB
# repmgr is the only election brain.
# If local postgres role disagrees with COMPOSE_PROFILES, flip and converge.
if [ "$pg_recovering" = "f" ] && [ "$current_profile" = "ha-secondary" ]; then
    set_env "COMPOSE_PROFILES" "ha-primary"
    set_env "HA_ROUTER_IP" "$LOCAL_IP"
    remove_local_router
    converge_stack   # starts Router, promotes Jeremy services to master
fi
# Rule 2: Router registry repair
# Router restarts empty (in-memory registry). If our services are missing,
# restart them so they re-register. Heals amnesia on both nodes.
missing=$(check_missing_node_types)   # DATASERVICE GATEWAY EDGE BASTION
if [ -n "$missing" ]; then
    repair_missing_services "$missing"
fi
# Rule 3: datacenter failover
# If active Router unreachable for ROUTER_ABSENT_THRESHOLD (5s) and peer
# Router also unreachable, secondary starts its own Router without waiting.
if router_absent_long_enough && ! peer_has_router; then
    start_local_router_and_promote_app_layer
fi
When Rule 1 fires (a DB-driven role change), it clears the Rule 3 absence timer so the two paths never conflict. The watchdog never tries to implement its own quorum or election; it just mirrors what repmgr has already decided into compose state.
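Since every tick is a fresh process, the absence timer has to live outside the script. One way to do that is a small state file, sketched below. STATE_FILE, the helper names other than router_absent_long_enough, and the explicit "now" argument (added here so the logic can be exercised without waiting) are assumptions, not the real watchdog's code.

```shell
#!/usr/bin/env bash
# Sketch of the Rule 3 absence timer, persisted between ticks.
ROUTER_ABSENT_THRESHOLD="${ROUTER_ABSENT_THRESHOLD:-5}"   # seconds
STATE_FILE="${STATE_FILE:-/tmp/cl-watchdog.router-absent-since}"

# Call on every tick where the active Router did not answer.
# Records the time of the first miss and keeps it on later misses.
note_router_absent() {
  local now="$1"
  [ -s "$STATE_FILE" ] || echo "$now" > "$STATE_FILE"
}

# Call when the Router answers again, or when Rule 1 fires.
clear_absence_timer() {
  rm -f "$STATE_FILE"
}

# Succeeds once the Router has been missing for at least the threshold.
router_absent_long_enough() {
  local now="$1" since
  [ -s "$STATE_FILE" ] || return 1
  since="$(cat "$STATE_FILE")"
  [ $(( now - since )) -ge "$ROUTER_ABSENT_THRESHOLD" ]
}
```

In a real tick the argument would be `$(date +%s)`. Clearing the file from the Rule 1 path means a DB-driven role change always restarts the Rule 3 countdown from zero.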
Ordered convergence: the lesson that hurt the most
Early versions of the watchdog started the primary app layer by converging the whole stack at once with a single docker compose up -d. This consistently produced two failure modes that did not appear in any individual component test.
The first was a DataService JDBC pool crash: DataService would start before pgpool had completed its first successful repmgr status check, try to open a connection to 127.0.0.1:5432, fail, and never recover without a manual restart. The fix was a hard gate: before starting DataService, the watchdog waits until a SELECT 1 routed through pgpool returns successfully.
The second was Edge stuck in Router registry STARTING status while Docker reported the container healthy. Docker health and Router-registry ACTIVE are two different things. Docker health checks a local HTTP endpoint; the Router only marks a service ACTIVE after it has fully completed registration. If Edge took too long to register, HAProxy's backend health check timed out and the node fell out of the LB rotation even though every container was green. The fix: after starting Edge, the watchdog waits 20 seconds and then checks the Router registry directly. If Edge is not ACTIVE in the registry, it restarts Edge specifically, without touching anything else.
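Both fixes come down to waiting for a real condition instead of trusting container state. A minimal sketch of such a gate follows; wait_until and pgpool_answers are illustrative names, and the psql call assumes that psql is installed and that PGHOST, PGPORT, PGUSER, and PGPASSFILE already point at the local pgpool.

```shell
#!/usr/bin/env bash
# Sketch: poll a check command until it succeeds or a deadline passes.
# Usage: wait_until <timeout-seconds> <interval-seconds> <command...>
wait_until() {
  local timeout="$1" interval="$2"
  shift 2
  local deadline=$(( SECONDS + timeout ))
  until "$@"; do
    if [ "$SECONDS" -ge "$deadline" ]; then
      return 1   # gave up: the condition never became true
    fi
    sleep "$interval"
  done
}

# Example gate: a trivial query routed through pgpool.
pgpool_answers() {
  psql --no-psqlrc --tuples-only --no-align --command 'SELECT 1' >/dev/null 2>&1
}

# Example use inside a convergence step (not executed here):
#   wait_until 60 2 pgpool_answers && docker compose up -d --no-deps dataservice
```

The same helper fits the Edge case: pass it a check that asks the Router registry whether Edge is ACTIVE, and restart only Edge when the wait times out.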
# Ordered primary convergence after failover
# (learned the hard way — starting everything at once caused
# dataservice JDBC pool failures and Edge stuck in STARTING)
1. Force-recreate pgpool
2. Wait for: SELECT 1 via pgpool succeeds ← gate
3. Start dataservice (--no-deps, avoids Compose health-gate waits)
4. Start gateway + bastion-server
5. Start edge
6. Wait 20s
7. If EDGE not ACTIVE in Router registry → restart edge only
8. Start ui + haproxy
Failure scenario table
| Failure | Who recovers it | RTO | Status |
|---|---|---|---|
| Container crash (any service, either node) | Docker restart: always | 10–20s | Passes |
| Router container crash on primary | Docker restarts; watchdog Rule 2 triggers re-registration on both nodes | ~30s | Passes |
| Postgres crash on primary | Docker restarts; repmgr sorts out who is primary; watchdog Rule 1 converges compose | ~36s observed | Passes |
| Primary node return after failover | repmgr rejoins as standby; watchdog demotes app layer, removes stale Router | 1 watchdog tick | Passes |
| Full node / OpenStack poweroff | Rule 3 + Rule 1; secondary auto-promotes, repmgr promotes DB | Seconds to low minutes depending on repmgr promotion timing | Automatic |
| Network partition (both nodes alive, link dead) | Rule 3 promotes secondary; node-1 DB still primary; two live Routers | Automatic, but split-brain risk | Mitigated by witness VM in prod |
Drill results: what the polls actually showed
All poweroff drills were run on our two-node trial pair on an OpenStack tenant subnet. The test: issue an OpenStack poweroff to the active primary node, poll the public health endpoint every second for three minutes, count consecutive non-200 responses as the bad window.
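The measurement itself is easy to reproduce. The sketch below assumes curl and awk are available; the function names are illustrative, and the URL in the usage line is an example.

```shell
#!/usr/bin/env bash
# Sketch: poll a health URL once per second, one status code per line.
poll_health() {
  local url="$1" polls="$2" i
  for (( i = 0; i < polls; i++ )); do
    # curl prints 000 when it gets no HTTP response at all.
    curl --silent --insecure --max-time 1 --output /dev/null \
         --write-out '%{http_code}\n' "$url" || true
    sleep 1
  done
}

# Longest run of consecutive non-200 lines read from stdin.
longest_bad_window() {
  awk '
    $1 != "200" { run++; if (run > max) max = run; next }
                { run = 0 }
    END         { print max + 0 }
  '
}
```

Running `poll_health https://certlocker.example.com/edge/rest/health 180 | longest_bad_window` prints the bad window as a number of polls. At one poll per second that is roughly the outage in seconds, slightly understated when individual requests run into the 1-second timeout.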
# Reproduce a component crash inside a container
# (docker kill is not valid — it can trigger Docker's manual-stop path)
docker exec certlocker-gateway pkill -9 java
# Simulate a clean node power-off for watchdog drills
# Disable watchdog timer first so it doesn't interfere with timing
systemctl stop cl-watchdog.timer
docker compose --project-directory /opt/certlocker/ha/stack down
# Check Router registry on the active node
curl -s http://:9000/router/rest/status | jq .
# Check which node owns the DB primary
docker exec certlocker-postgres bash -c \
'PGPASSWORD=$POSTGRESQL_POSTGRES_PASSWORD psql -U postgres -h 127.0.0.1 -tAc "SELECT pg_is_in_recovery()"'
# f = primary, t = standby
After a dozen iterations, the pattern became clear: the bottleneck is not the watchdog logic. It is the sequential dependency chain that must complete before an L7 health check returns 200:
- OpenStack detects the primary is gone (~5–10s depending on heartbeat interval)
- repmgr detects Postgres is unreachable and votes to promote (~15–25s depending on repmgr reconnect timing)
- pgpool detects the new primary and re-routes queries (~5s)
- Watchdog Rule 1 sees the DB role change and starts the app layer in dependency order (~20–30s)
- Edge completes Router registration and HAProxy L7 health check passes (~10s)
Each step has its own timeout and retry window. The dominant cost is repmgr's promotion delay — by design it waits to be confident the old primary is dead before acting. That conservatism is correct; shortcutting it risks split-brain. Once repmgr promotes, the watchdog picks up the role change and the application layer converges automatically.
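Adding up the per-step estimates from the list above gives a rough end-to-end budget. The sketch below only does that arithmetic; the step labels are shorthand for this example.

```shell
#!/usr/bin/env bash
# Sketch: sum the per-step estimates (seconds). Entries are "label:min:max".
STEPS=(
  "node loss detected:5:10"
  "repmgr promotes standby:15:25"
  "pgpool re-routes:5:5"
  "watchdog converges app layer:20:30"
  "Edge registers, L7 check passes:10:10"
)

rto_budget() {
  local min=0 max=0 step lo hi
  for step in "${STEPS[@]}"; do
    IFS=: read -r _ lo hi <<< "$step"
    min=$(( min + lo ))
    max=$(( max + hi ))
  done
  echo "${min}-${max}s"
}

rto_budget   # prints 55-80s
```

The steps are sequential, so the estimates add rather than overlap: about 55 to 80 seconds in total, of which repmgr promotion and app-layer convergence account for 35 to 55.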
Hard-won lessons from the drills
REPMGR_USE_PASSFILE=true is required
Bitnami repmgr generates shell commands with the repmgr password inline. Passwords containing shell metacharacters silently break promotion with a shell syntax error. With REPMGR_USE_PASSFILE=true, credentials go through a pgpass file instead.
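For reference, a pgpass entry has the form hostname:port:database:username:password, with any colon or backslash inside a field escaped by a backslash. The sketch below builds such a line to show that shell metacharacters pass through untouched. The Bitnami image writes its own passfile, so this is an illustration of the format only; the function names and sample values are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: build one pgpass entry, escaping ':' and '\' in every field.
pgpass_escape() {
  local value="$1"
  value="${value//\\/\\\\}"   # escape backslashes first
  value="${value//:/\\:}"     # then colons
  printf '%s' "$value"
}

pgpass_line() {
  local out="" field
  for field in "$@"; do
    out+="${out:+:}$(pgpass_escape "$field")"
  done
  printf '%s\n' "$out"
}

# Arguments: host port database user password
pgpass_line 10.0.0.11 5432 replication repmgr 'a:b\c$d;e'
# prints 10.0.0.11:5432:replication:repmgr:a\:b\\c$d;e
```

The password is read from the file by libpq and never passes through a shell, so characters such as `$`, `;`, or quotes cannot break a generated command. libpq ignores a passfile whose permissions are looser than 0600.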
Router and Jeremy services must use host network mode
The Router records each service's source IP from the TCP registration socket. Bridge networking advertises Docker-internal IPs that the peer node cannot route to. Host network mode fixes both: services register with the real node IP, and there is no Docker bridge hairpin issue.
Never put pull_policy: always on watchdog-restarted services
Runtime recovery should not block on a GitHub Container Registry pull. Image pulls belong in the deploy path; the watchdog should only start or recreate images that are already present on the host.
pgpool must be force-recreated every deploy
Stale pgpool_status files can wedge pgpool into a state where it refuses to start. A clean container recreation on each deploy avoids this entirely.
Docker health ≠ Router registry ACTIVE
Docker reports a container healthy as soon as its health-check endpoint returns 200. The Router marks a service ACTIVE only after full registration. HAProxy's L7 probe uses the real traffic path, so a node where Edge is Docker-healthy but Router-STARTING will still fail LB checks and drop out of the pool.
Use private fixed IPs for inter-node traffic, not public or NAT addresses
On cloud providers that use floating/elastic IPs as NAT overlays, you cannot bind a service to that address from inside the VM. All inter-node communication (repmgr replication, Router registration) must use the private fixed IPs on the tenant subnet, or WireGuard tunnel IPs when the nodes are on different networks or providers.
Postgres replication must be TLS-only — the default is plaintext
Out of the box, repmgr streaming replication will connect over plaintext TCP if you let it. That means your entire database replication stream — every write, every WAL segment — crosses the network unencrypted. The fix is to set hostssl (not host) in pg_hba.conf for the replication user so the database rejects any plaintext replication connection at the authentication layer. Live verification from our primary confirmed the standby connection as ssl=t version=TLSv1.3 cipher=TLS_AES_256_GCM_SHA384. This is now enforced in the Ansible deploy so it cannot regress.
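A deploy-time guard against regressions can be as small as a scan of pg_hba.conf. The sketch below flags any rule of type host or hostnossl whose database column includes the replication keyword. It looks only at those two columns, and the function name is illustrative rather than taken from the Ansible role.

```shell
#!/usr/bin/env bash
# Sketch: fail if pg_hba.conf admits a non-TLS replication connection.
# Prints each offending rule; returns 0 only when none is found.
check_replication_tls_only() {
  awk '
    /^[[:space:]]*(#|$)/ { next }   # skip comments and blank lines
    ($1 == "host" || $1 == "hostnossl") && $2 ~ /(^|,)replication(,|$)/ {
      print
      found = 1
    }
    END { exit found ? 1 : 0 }
  ' "$1"
}
```

For a live check, joining pg_stat_replication to pg_stat_ssl on pid on the primary shows the ssl, version, and cipher columns for each standby connection.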
What the Ansible setup looks like
The entire HA stack is managed by Ansible. There are no manual steps after the initial secrets are uploaded to CertLocker Trust. The role hierarchy is:
[Diagram: the play (targets ha-dev group) applies baseline (apt / OS hardening), docker (engine + compose), ufw (firewall rules), certlocker_ha_stack (compose, watchdog, .env), and certlocker_lb.yml (cloud LB / test HAProxy). certlocker_ha_stack in turn renders env.j2 (COMPOSE_PROFILES, HA_ROUTER_IP, node IPs, secrets from Trust), watchdog.sh (systemd service + timer, 10s), and docker-compose.ha.yml (postgres/repmgr, pgpool, router, gateway, edge, bastion, ui, haproxy).]
Secrets live in CertLocker Trust, not in the repo. The init script generates all HA passwords, uploads them as named secrets, and writes an Ansible vault file with lookup references. The deploy run resolves them at play time using a gateway API token. Re-running the init script regenerates and rotates secrets without touching hosts.yml or all.yml.
What we are working toward
The current design meets its goals for the failure modes we drilled. Process crashes and container deaths stay local and self-heal in seconds. A full Postgres process crash promotes the standby automatically and the watchdog converges the application layer behind it. Planned switchovers are a single Ansible variable change. Node return after failover is automatic and leaves no stale state. Full node poweroff is handled end-to-end without any manual step. The open problem is the network-partition case.
A production deployment would add a witness VM for quorum on the Rule 3 network-partition path, preventing the split-brain edge case where both nodes see each other as unreachable and both try to promote their app layer.
On the orchestration side, we have deliberately kept the current design on plain Docker Compose with a bash watchdog rather than pulling in a full scheduler. Kubernetes is the obvious question, and the answer for now is no: the operational overhead of running a cluster, the network model changes, and the way Kubernetes handles stateful workloads add more complexity than they solve for a two-node active-passive setup at this scale. The problems we are solving are already solved by repmgr, pgpool, and a 200-line watchdog.
Nomad is worth a closer look down the road. It is a much lighter fit for this kind of mixed workload (Docker containers alongside system services), handles bare metal and VM fleets without a dedicated control plane cluster, and does not require you to rethink your networking model to use it. If we move beyond two nodes or need more dynamic scheduling, Nomad is the direction we would go before considering Kubernetes.
The HA design is live and running on our two-node trial today. If you are deploying CertLocker into an environment where uptime matters, reach out and we can talk through what the right topology looks like for your infrastructure.