CertLocker Active-Passive HA: How We Built Two-Node Failover

Why this matters for a certificate platform

CertLocker manages TLS certificates, SSH access tokens, secrets, and ACME workflows. In a single-node setup, if that node goes down, so does the platform responsible for keeping everything else up. HAProxy ACME renewals stall. SSH access tokens cannot be issued. Audit events stop. The tooling meant to prevent outages is itself a single point of failure.

The design requirement was straightforward: two nodes, automatic failover, no manual steps for the common failure modes. Process crash, container death, and Postgres failure should all recover on their own. Full node or datacenter loss should promote the surviving node automatically. What follows is how that works, what we learned drilling it, and which problem is still open.

The architecture in three layers

The design is Active-Passive with a warm standby. Both nodes run a complete copy of the CertLocker stack at all times. The secondary is not idle; it serves traffic as a slave while the primary handles schedulers, the Router, and all write decisions. Only one thing ever migrates between nodes: the primary role, which moves as a single unit.

flowchart TB
    User[Users / operators] --> DNS[certlocker.io]
    DNS --> LB["Cloud load balancer<br/>Cloudflare / AWS ALB / GCP LB<br/>health: GET /edge/rest/health<br/>dev stand-in: test VM running HAProxy"]
    LB -->|healthy backend| N1HA["node-1 HAProxy<br/>:443"]
    LB -->|healthy backend| N2HA["node-2 HAProxy<br/>:443"]
    subgraph N1["node-1 · role: primary or standby"]
        N1HA --> N1Edge[Edge]
        N1Edge --> N1GW[Gateway]
        N1GW --> N1DS[DataService]
        N1DS --> N1Pool[pgpool]
        N1Pool --> N1PG[("Postgres<br/>repmgr")]
        N1Router["Router<br/>(runs when primary)"]
        N1Watch[cl-watchdog timer · 10s]
    end
    subgraph N2["node-2 · role: primary or standby"]
        N2HA --> N2Edge[Edge]
        N2Edge --> N2GW[Gateway]
        N2GW --> N2DS[DataService]
        N2DS --> N2Pool[pgpool]
        N2Pool --> N2PG[("Postgres<br/>repmgr")]
        N2Router["Router<br/>(runs when primary)"]
        N2Watch[cl-watchdog timer · 10s]
    end
    N1PG <-.->|streaming replication| N2PG
    N1GW -.->|registers with active Router| N2Router
    N2GW -.->|registers with active Router| N1Router

Layer 1: Entry point

In production, the entry point is a cloud load balancer: Cloudflare, AWS ALB/NLB, GCP Cloud Load Balancing, or whatever your provider offers. The two CertLocker nodes sit behind it as backends. The cloud LB handles TLS termination or TCP passthrough, health-checks both nodes, and routes traffic away from a failing one automatically with no extra infrastructure to run or maintain on your end.

For our two-node trial on a private cloud tenant, we did not have a cloud LB available, so a small test VM running native HAProxy filled the role: HTTP-to-HTTPS redirect on port 80 and TCP-passthrough on port 443 to the two backend nodes. The configuration below is that stand-in. Swap it for your cloud LB of choice in production — the only thing that must carry over is the health check target.

The health check must be L7, not TCP. The target is /edge/rest/health over HTTPS with the correct Host header. A TCP probe can pass while the application layer is broken: containers starting, Router registry empty after a restart, Edge stuck in a STARTING state. Only the L7 check distinguishes a node that is reachable from one that is actually serving traffic.
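
To try the same probe by hand, a small curl wrapper is enough. This is a minimal sketch, not part of CertLocker: `is_healthy_status` and `probe_node` are hypothetical helper names, and the host name and IP in the usage comment are placeholders. Only the path /edge/rest/health comes from the text above.

```shell
# Hypothetical helpers; only the health path comes from the article.

# An L7 check treats exactly HTTP 200 as healthy.
# curl reports "000" when it received no HTTP response at all.
is_healthy_status() {
  [ "$1" = "200" ]
}

# Probe one node over HTTPS. --resolve pins the public host name to the
# node's IP, so SNI and the Host header both carry the public name while
# the request is sent to that specific node.
probe_node() {
  local host="$1" ip="$2" code
  code=$(curl -sk -o /dev/null -w '%{http_code}' --max-time 3 \
    --resolve "${host}:443:${ip}" "https://${host}/edge/rest/health")
  if is_healthy_status "$code"; then
    echo "healthy"
  else
    echo "unhealthy (HTTP ${code})"
  fi
}

# Usage (placeholder values, not run here):
#   probe_node certlocker.example.com 10.0.0.11
```

A node whose port 443 accepts connections but whose Edge is still starting would pass a TCP probe and fail this one, which is the distinction the load balancer needs.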

Cloud LB equivalents

Cloudflare

Load Balancing with HTTP health monitors. Point two origins at your node IPs. Use /edge/rest/health as the health check path.

AWS

NLB for TCP passthrough (preserves client TLS) or ALB for termination. Target group health check: HTTPS on port 443, path /edge/rest/health.

GCP / others

Any provider with HTTPS health checks and two backend instances works. The nodes themselves run HAProxy for intra-node routing regardless of the cloud LB in front.

dev stand-in: native HAProxy on a test VM

# test VM HAProxy — stand-in for a cloud load balancer
frontend https_in
    bind *:443
    mode tcp
    default_backend certlocker_nodes

backend certlocker_nodes
    mode tcp
    option httpchk GET /edge/rest/health   # L7 check; TCP alone misses broken registries
    http-check send hdr Host certlocker.example.com   # SNI is not an HTTP header; check-sni sets it
    server node-1 :443 check check-ssl verify none check-sni certlocker.example.com
    server node-2 :443 check check-ssl verify none check-sni certlocker.example.com

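# NOTE: no `ssl` keyword on server lines; `check-ssl` only encrypts the health checks.
# The frontend is TCP passthrough, so the client's TLS session must reach the node
# untouched. Adding `ssl` would make HAProxy wrap that stream in a second TLS
# session to the backend, which breaks real client traffic.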

Layer 2: Application — Jeremy slave mode

CertLocker's application services (Gateway, DataService, Edge, Bastion) use an internal service registry called the Router. The Router runs in-memory and assigns roles: the first service of each type to register becomes master; all later registrations of the same type become slaves. Slave services serve requests but never run scheduled jobs: no ACME housekeeping, no certificate expiry checks, no probe polling.

The trick for two-node HA is that only one Router is active at any time, and both nodes' services register with it. On node-2, the compose environment overrides the Router address to point at node-1's private IP, and node-2 runs no Router of its own while it is the secondary. Node-1's Gateway, DataService, Edge, and Bastion register first and become masters; node-2's services register with the same Router afterwards, are elected slaves, and never run schedulers. When node-1 dies, node-2 starts its own Router, its services re-register with it as masters, and schedulers start on node-2 for the first time.

There is no distributed consensus involved. The Router is the single authority. Which node the Router lives on is the only question, and the watchdog answers it.
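
The first-registration-wins rule is simple enough to model in a few lines. The sketch below is a toy model under that one assumption, not the Router's code: `register` and its output format are made up for illustration.

```shell
# Toy model of "the first service of each type to register becomes master".
# Not the Router's implementation; names and output format are illustrative.
seen_types=""

register() {
  svc_type="$1"; node="$2"
  case " $seen_types " in
    *" $svc_type "*) role="slave" ;;                          # type already has a master
    *) role="master"; seen_types="$seen_types $svc_type" ;;   # first of its type
  esac
  echo "$node $svc_type $role"
}

register GATEWAY node-1       # node-1 GATEWAY master
register GATEWAY node-2       # node-2 GATEWAY slave
register DATASERVICE node-1   # node-1 DATASERVICE master

# Failover: a new Router on node-2 starts with an empty in-memory registry,
# so the first service to re-register there becomes master.
seen_types=""
register GATEWAY node-2       # node-2 GATEWAY master
```

The model also shows why no consensus protocol is needed: the role depends only on registration order at whichever Router is currently active.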

Layer 3: Database — repmgr and pgpool

Both nodes run bitnamilegacy/postgresql-repmgr:17.6.0. repmgr handles streaming replication and failover promotion. It detects primary unavailability and promotes the standby without any manual step. Replication traffic between the two nodes runs over TLS 1.3 — pg_hba.conf is configured with hostssl only, so plaintext TCP replication connections are rejected at the database level.

Each node also runs bitnamilegacy/pgpool:4.6.3, and every CertLocker service connects to its local pgpool rather than directly to Postgres. pgpool tracks which Postgres backend is currently primary using streaming-replication checks and routes all connections there. Load balancing is disabled; every query goes to the primary. This means that when repmgr promotes a new primary on node-2, node-1's pgpool detects the change and starts routing to node-2's Postgres without any service restart.

docker-compose.ha.yml — database and app layer sketch

# Node 1 (ha-primary profile): the Router runs only under this profile
router:
  profiles: [ha-primary]
  image: ghcr.io/certlocker-io/jeremy-router:${CERTLOCKER_JEREMY_TAG}
  network_mode: host          # must be host — Router records registration IPs
  environment:
    SERVER_PORT: 9000
    REGISTRA_PORT: 40905

gateway:
  image: ghcr.io/certlocker-io/jeremy-gateway:${CERTLOCKER_JEREMY_TAG}
  network_mode: host
  environment:
    APPLICATION_ROUTER: http://${HA_ROUTER_IP}:9000
    ApplicationRegistra: ${HA_ROUTER_IP}:40905
    # On node-2 (ha-secondary), HA_ROUTER_IP = node-1's tenant IP.
    # Gateway registers there → elected slave → no schedulers run.

# ─── Database layer (both profiles) ───────────────────────────────────────────
postgres:
  image: bitnamilegacy/postgresql-repmgr:17.6.0
  environment:
    REPMGR_NODE_NAME: ${CERTLOCKER_HA_NODE_NAME}         # node-1 | node-2
    REPMGR_NODE_NETWORK_NAME: ${CERTLOCKER_HA_NODE_IP}
    REPMGR_PARTNER_NODES: ${CERTLOCKER_HA_NODE1_IP},${CERTLOCKER_HA_NODE2_IP}
    REPMGR_USE_PASSFILE: "true"      # required — inline passwords break on metacharacters
  ports:
    - "${CERTLOCKER_HA_NODE_IP}:5432:5432"

pgpool:
  image: bitnamilegacy/pgpool:4.6.3
  environment:
    PGPOOL_BACKEND_NODES: "0:${CERTLOCKER_HA_NODE1_IP}:5432,1:${CERTLOCKER_HA_NODE2_IP}:5432"
    PGPOOL_ENABLE_LOAD_BALANCING: "no"    # every query goes to the primary; no read/write split
    PGPOOL_SR_CHECK_USER: repmgr

The watchdog: three rules, no election logic

The watchdog is a bash script deployed by Ansible to /usr/local/bin/cl-watchdog on both nodes. A systemd timer fires it every 10 seconds. A host-local flock prevents overlapping runs. It does not implement any election protocol of its own. It asks one question per tick: is the world consistent with what repmgr decided?
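
A single-run guard of that kind can be sketched with the `flock` utility from util-linux. The lock path and the `run_exclusively` name are assumptions for illustration, not the deployed script.

```shell
# Single-run guard sketch. The lock path is an assumption, not the real one.
LOCK_FILE="${LOCK_FILE:-/tmp/cl-watchdog.lock}"

run_exclusively() {
  # Open the lock file on fd 9 inside a subshell, so the lock is released
  # when the subshell exits, even if the wrapped command fails.
  (
    exec 9>"$LOCK_FILE" || exit 1
    # -n: give up immediately instead of queueing behind a running tick.
    flock -n 9 || { echo "previous tick still running, skipping"; exit 0; }
    "$@"
  )
}

run_exclusively echo "tick"
```

Skipping rather than waiting matters here: with a 10-second timer, queued ticks would pile up behind a slow convergence instead of letting the next tick start from fresh state.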

Three reconcile rules

Role follows DB. repmgr is the only election brain. If the local Postgres role disagrees with COMPOSE_PROFILES in .env, flip the profile and run docker compose up -d. Covers node death, node return, and planned switchover automatically.

Router registry repair. The Router's registry is in-memory. If the Router container dies and Docker restarts it (restart: always), it comes back empty and no services re-register on their own. The watchdog asks the active Router for its node list; if any required service types are missing, it restarts them to force re-registration.

Datacenter failover. If the active Router has been unreachable for 5 seconds and the peer node is not running a reachable Router either, the secondary promotes its own app layer without waiting for repmgr. This handles a full node loss where the DB promotion may not have completed yet.
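
The first two rules reduce to small, testable helpers. The following is a hedged sketch: `desired_profile`, `set_env_in`, and `missing_types` are hypothetical names, and the env-file handling is one possible approach rather than the deployed one. The profile names and the four required service types come from this article.

```shell
# Rule 1: map the output of SELECT pg_is_in_recovery() to a compose profile.
desired_profile() {
  case "$1" in
    f) echo "ha-primary" ;;     # not in recovery: local Postgres is the primary
    t) echo "ha-secondary" ;;   # in recovery: local Postgres is a standby
    *) return 1 ;;              # empty or unexpected: change nothing this tick
  esac
}

# Replace KEY=... in an env file, or append it if the key is absent.
# The rewritten key ends up on the last line; mv swaps the file in one step.
set_env_in() {
  local file="$1" key="$2" value="$3" tmp
  tmp="${file}.tmp.$$"
  grep -v "^${key}=" "$file" > "$tmp" 2>/dev/null || true
  printf '%s=%s\n' "$key" "$value" >> "$tmp"
  mv "$tmp" "$file"
}

# Rule 2: read the service types the Router currently lists (one per line
# on stdin) and print the required types that are missing.
REQUIRED_TYPES="DATASERVICE GATEWAY EDGE BASTION"
missing_types() {
  local present out="" t
  present=$(cat)
  for t in $REQUIRED_TYPES; do
    printf '%s\n' "$present" | grep -qx "$t" || out="$out $t"
  done
  echo "${out# }"
}

printf 'GATEWAY\nEDGE\n' | missing_types   # DATASERVICE BASTION
```

Keeping the decisions in pure functions like these means the tick itself only has to gather inputs (the Postgres role, the registry listing) and act on the result.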

watchdog.sh — rule sketches

# Rule 1 — role follows DB
# repmgr is the only election brain.
# If local postgres role disagrees with COMPOSE_PROFILES, flip and converge.
if [ "$pg_recovering" = "f" ] && [ "$current_profile" = "ha-secondary" ]; then
  set_env "COMPOSE_PROFILES" "ha-primary"
  set_env "HA_ROUTER_IP" "$LOCAL_IP"
  remove_local_router
  converge_stack   # starts Router, promotes Jeremy services to master
fi

# Rule 2 — Router registry repair
# Router restarts empty (in-memory registry). If our services are missing,
# restart them so they re-register. Heals amnesia on both nodes.
missing=$(check_missing_node_types)   # DATASERVICE GATEWAY EDGE BASTION
if [ -n "$missing" ]; then
  repair_missing_services "$missing"
fi

# Rule 3 — datacenter failover
# If active Router unreachable for ROUTER_ABSENT_THRESHOLD (5s) and peer
# Router also unreachable, secondary starts its own Router without waiting.
if router_absent_long_enough && ! peer_has_router; then
  start_local_router_and_promote_app_layer
fi

When Rule 1 fires (a DB-driven role change), it clears the Rule 3 absence timer so the two paths never conflict. The watchdog never tries to implement its own quorum or election; it just mirrors what repmgr has already decided into compose state.
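
One way to keep such an absence timer between ticks is a small state file. This is a sketch under assumptions: the file path and the names `record_router_probe` and `clear_absence_timer` are hypothetical, while `router_absent_long_enough`, the 5-second threshold, and the reset by Rule 1 come from the text above.

```shell
# Absence timer kept in a state file between watchdog ticks.
# The path is an assumption; the 5-second threshold comes from the article.
ABSENT_FILE="${ABSENT_FILE:-/tmp/cl-router-absent-since}"
ROUTER_ABSENT_THRESHOLD=5

# Call once per tick with the probe result ("up" or "down").
# The optional second argument is the current epoch time, for testing.
record_router_probe() {
  local state="$1" now="${2:-$(date +%s)}"
  if [ "$state" = "up" ]; then
    rm -f "$ABSENT_FILE"              # Router is back: reset the timer
  elif [ ! -f "$ABSENT_FILE" ]; then
    echo "$now" > "$ABSENT_FILE"      # first tick on which it was seen down
  fi
}

# Succeeds once the Router has been down for at least the threshold.
router_absent_long_enough() {
  local now="${1:-$(date +%s)}" since
  [ -f "$ABSENT_FILE" ] || return 1
  since=$(cat "$ABSENT_FILE")
  [ $((now - since)) -ge "$ROUTER_ABSENT_THRESHOLD" ]
}

# Rule 1 would call this after a DB-driven role change,
# so Rule 3 starts counting from zero again.
clear_absence_timer() {
  rm -f "$ABSENT_FILE"
}
```

With a 10-second timer and a 5-second threshold, a timer built this way crosses the threshold on the second consecutive failed tick, so a single missed probe never promotes anything.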

Ordered convergence: the lesson that hurt the most

Early versions of the watchdog started the primary app layer by converging the whole stack at once with a single docker compose up -d. This consistently produced two failure modes that did not appear in any individual component test.

The first was a DataService JDBC pool crash: DataService would start before pgpool had completed its first successful streaming-replication check, try to open a connection to 127.0.0.1:5432, fail, and never recover without a manual restart. The fix was a hard gate: before starting DataService, the watchdog waits until a SELECT 1 routed through pgpool returns successfully.

The second was Edge stuck in Router registry STARTING status while Docker reported the container healthy. Docker health and Router-registry ACTIVE are two different things. Docker health checks a local HTTP endpoint; the Router only marks a service ACTIVE after it has fully completed registration. If Edge took too long to register, HAProxy's backend health check timed out and the node fell out of the LB rotation even though every container was green. The fix: after starting Edge, the watchdog waits 20 seconds and then checks the Router registry directly. If Edge is not ACTIVE in the registry, it restarts Edge specifically, without touching anything else.

Primary convergence order

# Ordered primary convergence after failover
# (learned the hard way — starting everything at once caused
#  dataservice JDBC pool failures and Edge stuck in STARTING)

1. Force-recreate pgpool
2. Wait for: SELECT 1 via pgpool succeeds          ← gate
3. Start dataservice     (--no-deps, avoids Compose health-gate waits)
4. Start gateway + bastion-server
5. Start edge
6. Wait 20s
7. If EDGE not ACTIVE in Router registry → restart edge only
8. Start ui + haproxy
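
The gate in step 2 is a bounded retry loop. A minimal sketch follows, assuming a generic helper: `wait_until` and the commands named in the usage comment are hypothetical and stand in for whatever the real watchdog uses.

```shell
# Retry a command until it succeeds or the attempts run out.
# Usage: wait_until ATTEMPTS DELAY_SECONDS command [args...]
wait_until() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while :; do
    "$@" && return 0                      # gate passed
    [ "$i" -ge "$attempts" ] && return 1  # out of attempts: gate failed
    i=$((i + 1))
    sleep "$delay"
  done
}

# Usage (hypothetical probe and start functions, not run here):
#   wait_until 30 2 select_one_via_pgpool && start_dataservice
```

Bounding the attempts keeps a broken pgpool from wedging the tick forever; the next watchdog run simply tries the sequence again.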

Failure scenario table

Failure	Who recovers it	RTO	Status
Container crash (any service, either node)	Docker `restart: always`	10–20s	Passes
Router container crash on primary	Docker restarts; watchdog Rule 2 triggers re-registration on both nodes	~30s	Passes
Postgres crash on primary	Docker restarts; repmgr sorts out who is primary; watchdog Rule 1 converges compose	~36s observed	Passes
Primary node return after failover	repmgr rejoins as standby; watchdog demotes app layer, removes stale Router	1 watchdog tick	Passes
Full node / OpenStack poweroff	Rule 3 + Rule 1; secondary auto-promotes, repmgr promotes DB	Seconds to low minutes depending on repmgr promotion timing	Automatic
Network partition (both nodes alive, link dead)	Rule 3 promotes secondary; node-1 DB still primary; two live Routers	Automatic, but split-brain risk	Mitigated by witness VM in prod

Drill results: what the polls actually showed

All poweroff drills were run on our two-node trial pair on an OpenStack tenant subnet. The test: issue an OpenStack poweroff to the active primary node, poll the public health endpoint every second for three minutes, count consecutive non-200 responses as the bad window.
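
The bad-window measurement can be reproduced with a short filter over the polled status codes. This is an illustrative sketch rather than the drill harness itself: `bad_window` is a hypothetical name and `$HEALTH_URL` is a placeholder.

```shell
# Longest run of consecutive non-200 samples, one status code per line.
# At one sample per second the result approximates the bad window in seconds.
bad_window() {
  awk '
    $1 != "200" { run++; if (run > max) max = run; next }
    { run = 0 }
    END { print max + 0 }
  '
}

printf '200\n000\n503\n200\n000\n000\n000\n200\n' | bad_window   # prints 3

# Polling loop as described above (not run here; HEALTH_URL is a placeholder):
#   for i in $(seq 1 180); do
#     curl -sk -o /dev/null -w '%{http_code}\n' --max-time 1 "$HEALTH_URL"
#     sleep 1
#   done | bad_window
```

curl prints 000 when it gets no HTTP response, so connection failures count as bad samples alongside 5xx responses. Failed requests that run into the timeout stretch the interval between samples, so the count slightly understates the wall-clock window.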

Useful drill commands

# Reproduce a component crash inside a container
# (docker kill is not a valid test: Docker can treat it as a manual stop,
#  in which case restart: always does not bring the container back)
docker exec certlocker-gateway pkill -9 java

# Simulate a clean node power-off for watchdog drills
# Disable watchdog timer first so it doesn't interfere with timing
systemctl stop cl-watchdog.timer
docker compose --project-directory /opt/certlocker/ha/stack down

# Check Router registry on the active node
curl -s http://:9000/router/rest/status | jq .

# Check which node owns the DB primary
docker exec certlocker-postgres bash -c \
  'PGPASSWORD=$POSTGRESQL_POSTGRES_PASSWORD psql -U postgres -h 127.0.0.1 -tAc "SELECT pg_is_in_recovery()"'
# f = primary, t = standby

After a dozen iterations, the pattern became clear: the bottleneck is not the watchdog logic. It is the sequential dependency chain that must complete before an L7 health check returns 200:

1. OpenStack detects the primary is gone (~5–10s depending on heartbeat interval)
2. repmgr detects Postgres is unreachable and votes to promote (~15–25s depending on repmgr reconnect timing)
3. pgpool detects the new primary and re-routes queries (~5s)
4. Watchdog Rule 1 sees the DB role change and starts the app layer in dependency order (~20–30s)
5. Edge completes Router registration and HAProxy L7 health check passes (~10s)

Each step has its own timeout and retry window. The dominant cost is repmgr's promotion delay — by design it waits to be confident the old primary is dead before acting. That conservatism is correct; shortcutting it risks split-brain. Once repmgr promotes, the watchdog picks up the role change and the application layer converges automatically.
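
Adding up the approximate per-step figures quoted above gives a rough end-to-end budget. The helper below only does that arithmetic; `sum_budget` is a hypothetical name, and steps quoted with a single figure use it for both bounds.

```shell
# Sum "min:max" ranges in seconds and print the total range.
sum_budget() {
  local lo=0 hi=0 step
  for step in "$@"; do
    lo=$((lo + ${step%%:*}))   # part before the colon
    hi=$((hi + ${step##*:}))   # part after the colon
  done
  echo "${lo}-${hi}"
}

# detection, repmgr promotion, pgpool re-route, app convergence, Edge + L7 check
sum_budget 5:10 15:25 5:5 20:30 10:10   # prints 55-80
```

That puts a full poweroff at roughly 55 to 80 seconds if every step lands inside its quoted range, with the repmgr and convergence steps accounting for well over half of it.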

Hard-won lessons from the drills

⚠️

REPMGR_USE_PASSFILE=true is required

Bitnami repmgr generates shell commands with the repmgr password inline. Passwords containing shell metacharacters silently break promotion with a shell syntax error. With REPMGR_USE_PASSFILE=true, credentials go through a pgpass file instead.
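
For reference, a pgpass entry is a single line of five colon-separated fields, and only a colon or backslash inside a field needs escaping. The sketch below shows that format; it is not what the Bitnami image runs internally, and `pgpass_escape` and `pgpass_line` are hypothetical helper names.

```shell
# Escape a value for a .pgpass field: backslashes first, then colons.
pgpass_escape() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/:/\\:/g'
}

# Build one entry: hostname:port:database:username:password
pgpass_line() {
  printf '%s:%s:%s:%s:%s\n' "$1" "$2" "$3" "$4" "$(pgpass_escape "$5")"
}

# Shell metacharacters such as $ ; and backticks need no treatment at all,
# because the file is read as data and never parsed by a shell.
pgpass_line '*' 5432 '*' repmgr 'p@ss:w$rd;`x`'
```

PostgreSQL ignores a password file whose permissions are looser than 0600, so a hand-written file also needs a `chmod 600`.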

⚠️

Router and Jeremy services must use host network mode

The Router records each service's source IP from the TCP registration socket. Bridge networking advertises Docker-internal IPs that the peer node cannot route to. Host network mode avoids both problems: services register with the real node IP, and traffic never hairpins through a Docker bridge.

⚠️

Never put pull_policy: always on watchdog-restarted services

Runtime recovery should not block on a GitHub Container Registry pull. Image pulls belong in the deploy path; the watchdog should only start or recreate images that are already present on the host.

⚠️

pgpool must be force-recreated every deploy

Stale pgpool_status files can wedge pgpool into a state where it refuses to start. A clean container recreation on each deploy avoids this entirely.

⚠️

Docker health ≠ Router registry ACTIVE

Docker reports a container healthy as soon as its health-check endpoint returns 200. The Router marks a service ACTIVE only after full registration. HAProxy's L7 probe uses the real traffic path, so a node where Edge is Docker-healthy but Router-STARTING will still fail LB checks and drop out of the pool.

⚠️

Use private fixed IPs for inter-node traffic, not public or NAT addresses

On cloud providers that use floating/elastic IPs as NAT overlays, you cannot bind a service to that address from inside the VM. All inter-node communication (repmgr replication, Router registration) must use the private fixed IPs on the tenant subnet, or WireGuard tunnel IPs when the nodes are on different networks or providers.

🔒

Postgres replication must be TLS-only — the default is plaintext

Out of the box, repmgr streaming replication will connect over plaintext TCP if you let it. That means your entire database replication stream — every write, every WAL segment — crosses the network unencrypted. The fix is to set hostssl (not host) in pg_hba.conf for the replication user so the database rejects any plaintext replication connection at the authentication layer. Live verification from our primary confirmed the standby connection as ssl=t version=TLSv1.3 cipher=TLS_AES_256_GCM_SHA384. This is now enforced in the Ansible deploy so it cannot regress.
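
A check like that can be scripted by joining `pg_stat_replication` with `pg_stat_ssl` on the primary and inspecting each row. The sketch below assumes psql's unaligned output (`-tA`, fields separated by `|`); the function and variable names are hypothetical.

```shell
# Query for the primary: TLS state of every replication (walsender) session.
REPL_TLS_SQL='SELECT s.ssl, s.version, s.cipher
  FROM pg_stat_replication r JOIN pg_stat_ssl s USING (pid)'

# Check one "ssl|version|cipher" row as printed by psql -tA.
# Succeeds only for an encrypted TLS 1.3 session with a cipher reported.
replication_row_is_tls13() {
  local ssl version cipher
  IFS='|' read -r ssl version cipher <<EOF
$1
EOF
  [ "$ssl" = "t" ] && [ "$version" = "TLSv1.3" ] && [ -n "$cipher" ]
}

replication_row_is_tls13 't|TLSv1.3|TLS_AES_256_GCM_SHA384' && echo "replication is TLS 1.3"
```

Running the query through psql with `-tAc "$REPL_TLS_SQL"` and feeding each row to the function would turn the one-off verification into a repeatable post-deploy check.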

What the Ansible setup looks like

The entire HA stack is managed by Ansible. There are no manual steps after the initial secrets are uploaded to CertLocker Trust. The role hierarchy is:

flowchart LR
    Play["certlocker_ha_deploy.yml<br/>(targets ha-dev group)"] --> Baseline["baseline<br/>apt / OS hardening"]
    Play --> Docker["docker<br/>engine + compose"]
    Play --> UFW["ufw<br/>firewall rules"]
    Play --> HAStack["certlocker_ha_stack<br/>compose · watchdog · .env"]
    Play --> LB["certlocker_lb.yml<br/>cloud LB / test HAProxy"]
    HAStack --> EnvJ2["env.j2<br/>COMPOSE_PROFILES<br/>HA_ROUTER_IP<br/>NODE IPs<br/>secrets from Trust"]
    HAStack --> WatchdogSH["watchdog.sh<br/>systemd service + timer 10s"]
    HAStack --> ComposeHa["docker-compose.ha.yml<br/>postgres/repmgr · pgpool<br/>router · gateway · edge<br/>bastion · ui · haproxy"]

Secrets live in CertLocker Trust, not in the repo. The init script generates all HA passwords, uploads them as named secrets, and writes an Ansible vault file with lookup references. The deploy run resolves them at play time using a gateway API token. Re-running the init script regenerates and rotates secrets without touching hosts.yml or all.yml.

What we are working toward

The current design meets its goals for every failure mode in the table above except one. Process crashes and container deaths stay local and self-heal in seconds. When Postgres crashes on the primary, repmgr settles which node is primary and the watchdog converges the application layer behind it. Planned switchovers are a single Ansible variable change. Node return after failover is automatic and leaves no stale state. Full node poweroff is handled end-to-end without any manual step. The exception is a network partition between two live nodes, which can still leave two active Routers.

A production deployment would add a witness VM for quorum on the Rule 3 network-partition path. That prevents the split-brain edge case where each node sees the other as unreachable, the secondary promotes its app layer, and the old primary keeps running its own.

On the orchestration side, we have deliberately kept the current design on plain Docker Compose with a bash watchdog rather than pulling in a full scheduler. Kubernetes is the obvious question, and the answer for now is no: the operational overhead of running a cluster, the network model changes, and the way Kubernetes handles stateful workloads add more complexity than they solve for a two-node active-passive setup at this scale. The problems we are solving are already solved by repmgr, pgpool, and a 200-line watchdog.

Nomad is worth a closer look down the road. It is a much lighter fit for this kind of mixed workload (Docker containers alongside system services), handles bare metal and VM fleets without a dedicated control plane cluster, and does not require you to rethink your networking model to use it. If we move beyond two nodes or need more dynamic scheduling, Nomad is the direction we would go before considering Kubernetes.

The HA design is live and running on our two-node trial today. If you are deploying CertLocker into an environment where uptime matters, reach out and we can talk through what the right topology looks like for your infrastructure.