Build Notes: Production HA, AI-Readable Docs, and Running Shannon Against Our Own API

There is a version of this update where we describe a careful phased rollout across multiple services that landed cleanly. That version is technically accurate and completely misleading. The honest version is that HA went live in production across two nodes in different European datacentres, the docs site and llms.txt shipped, MCP security was hardened and wired into the lifecycle runner, we ran the Shannon autonomous pentester against a live mock version of the CertLocker API, and then we launched on Instagram.

Here is what actually shipped.

Active-passive HA is live in production

The biggest technical change this week was also the least glamorous to write about: two-node active-passive high-availability went live in production.

Both nodes run the full CertLocker stack. repmgr handles Postgres primary/standby streaming replication. pgpool sits in front. A watchdog process monitors repmgr state and promotes the standby automatically when the primary stops responding. Ansible manages the whole thing, including the secrets for the HA deployment, which are stored in CertLocker's own trust store. So the HA deployment uses CertLocker to manage its own credentials. That is either elegant or recursive depending on how you feel about it.

The development run across dev03 and dev04 took eleven iterations before it stabilised. The problems were predictable in retrospect and invisible until we ran into them live: tenant fixed IPs instead of floating IPs, because Neutron NAT does not expose a bindable address; router and backend services moved to the host network namespace, so registration source IPs are routable from both nodes without hairpin routing; pgpool force-recreated on every deploy, because a stale pgpool_status file wedges it silently; and repmgr node names that must match the format ^.*-[0-9]+$, which is documented, though not anywhere obvious.
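The node name rule is the kind of condition that is cheap to check before a deploy. A minimal sketch of such a pre-flight check, using the pattern quoted above; the function name and the example host names are ours for illustration and are not part of the CertLocker playbooks:

```python
import re

# Pattern quoted in the notes above: the name must end in "-<digits>".
NODE_NAME_PATTERN = re.compile(r"^.*-[0-9]+$")


def is_valid_node_name(name: str) -> bool:
    """Return True if the whole name matches the required format."""
    # fullmatch also rejects a trailing newline, which "$" alone would tolerate.
    return NODE_NAME_PATTERN.fullmatch(name) is not None


# Hypothetical node names, for illustration only.
for candidate in ["certlocker-db-0", "certlocker-db-1", "primary", "node-a"]:
    status = "ok" if is_valid_node_name(candidate) else "rejected"
    print(f"{candidate}: {status}")
```

Failing fast on a bad name in the playbook is cheaper than discovering it from a node that never registers.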

None of these were bugs. They were conditions. Infrastructure automation only handles conditions once you have encountered them in a running system.

We also enabled TLS for the Postgres replication channel between nodes, because plaintext replication over a network you do not fully control is not something you want to explain after an incident.

After development, the production cutover went to the two live nodes. Streaming replication was verified with a cross-node smoke test. repmgr confirmed primary/standby state on both sides. The UI returned 200. The watchdog timers ran quiet. If one node disappears, the standby promotes, services resume, and the customer does not need to do anything. That is what HA is supposed to mean, but there is a lot of boring plumbing between the description and the working system.
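One piece of that plumbing is deciding when "the primary stopped responding" is real rather than a single dropped health check. The sketch below shows one common way to make that decision, by requiring several consecutive failures before promoting. The threshold, the class name, and the function name are assumptions for illustration; this is not the CertLocker watchdog, and the actual promotion step is deliberately left out.

```python
from dataclasses import dataclass


@dataclass
class WatchdogState:
    """Hypothetical state the watchdog keeps between health checks."""
    consecutive_failures: int = 0
    promoted: bool = False


def observe(state: WatchdogState, primary_responding: bool, threshold: int = 3) -> bool:
    """Record one health check; return True when the standby should be promoted now.

    The threshold of 3 is an assumed value, not one taken from the notes.
    """
    if state.promoted:
        return False  # never promote twice
    if primary_responding:
        state.consecutive_failures = 0  # a healthy check resets the count
        return False
    state.consecutive_failures += 1
    if state.consecutive_failures >= threshold:
        state.promoted = True
        return True
    return False
```

Resetting the counter on any healthy check means a brief network blip does not trigger a failover, at the cost of a slightly slower reaction to a real outage.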

The longer write-up on the HA architecture, including the watchdog design and the pgpool/repmgr interaction, is in the CertLocker Active-Passive HA article.

docs.certlocker.io launched with AI crawler support

AI agents increasingly start their discovery from a small set of predictable URLs: /robots.txt, /sitemap.xml, /llms.txt, and an OpenAPI JSON spec. If those are missing or inconsistent, an agent either skips the product, guesses from HTML, or infers API behaviour from marketing copy. For an infrastructure security tool, any of those outcomes is a problem.

This week we launched docs.certlocker.io with a full public OpenAPI spec, crawler discovery files, and the metadata an AI agent needs to interact with the CertLocker API correctly:

Public OpenAPI spec at /public-agent-openapi.json covering all 22 public paths across /api/agent/* and /api/v1/*
llms.txt at both certlocker.io/llms.txt and docs.certlocker.io/llms.txt
robots.txt with LLMs: hints pointing to both discovery files on both domains
<link rel="llms.txt"> in the HTML of both sites
A full sitemap.xml for the docs site
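To make the robots.txt item concrete, here is a small sketch of how a crawler might pull those hints out of the file. It assumes each hint is a line of the form "LLMs: <url>", which is our reading of the description above and not a published standard; the function name is hypothetical.

```python
def llms_hints(robots_txt: str) -> list[str]:
    """Collect the values of every 'LLMs:' line in a robots.txt body."""
    hints = []
    for line in robots_txt.splitlines():
        # Split on the first colon only, so the URL's own "https:" stays intact.
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "llms":
            hints.append(value.strip())
    return hints


# Illustrative file content using the two llms.txt URLs named above.
sample = """User-agent: *
Allow: /
LLMs: https://certlocker.io/llms.txt
LLMs: https://docs.certlocker.io/llms.txt
"""
print(llms_hints(sample))
```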

The public OpenAPI spec is not a copy-paste of the internal spec. It strips raw secret values from schema definitions, marks the three endpoints that expose sensitive material as privileged, and runs through a validator that checks for secret field leaks before the spec ships. If an AI agent is going to read the API contract and decide how to interact with secrets, the contract needs to be honest about what it exposes.
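A leak check of this kind can be sketched as a recursive walk over the spec. The deny-list and the schema names below are hypothetical and chosen only to illustrate the idea; the real validator's rules are not described here.

```python
from typing import Any, Iterator

# Hypothetical deny-list of property names that should never appear in a public schema.
FORBIDDEN_FIELDS = {"secretValue", "password", "privateKey"}


def find_secret_fields(node: Any, path: str = "#") -> Iterator[str]:
    """Yield the path of every schema property whose name is on the deny-list."""
    if isinstance(node, dict):
        props = node.get("properties")
        if isinstance(props, dict):
            for name in props:
                if name in FORBIDDEN_FIELDS:
                    yield f"{path}/properties/{name}"
        for key, child in node.items():
            yield from find_secret_fields(child, f"{path}/{key}")
    elif isinstance(node, list):
        for index, child in enumerate(node):
            yield from find_secret_fields(child, f"{path}/{index}")


# Illustrative spec fragment with one clean schema and one leaky one.
spec = {
    "components": {
        "schemas": {
            "SecretMetadata": {
                "type": "object",
                "properties": {"name": {"type": "string"}, "tags": {"type": "array"}},
            },
            "Leaky": {
                "type": "object",
                "properties": {"name": {"type": "string"}, "password": {"type": "string"}},
            },
        }
    }
}
leaks = list(find_secret_fields(spec))
print(leaks)
```

In a build pipeline, a non-empty result would fail the job before the spec is published.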

The docs site deploys as a Docker container behind HAProxy with its own Ansible partial playbook. Documentation is a deployable artifact. It belongs in the HA rollout process like any other service, not as a manual update someone does when they remember.

If an AI agent has to infer your API contract from a marketing page, you have already lost control of the integration.

MCP security hardening: the boundary is policy, not the prompt

CertLocker has had MCP support since last month. This week we moved it from “the server exposes tools” toward “an agent can use those tools under real security conditions.”

Metadata-only secret access. Agents can call certlocker_get_secret_metadata and receive the secret's name, group, and tags, plus confirmation that it exists, but never the raw value. This is the most useful boundary for AI agent workflows: the agent can confirm a secret exists and belongs to the right group without needing to hold the credential itself. The agent gets enough to complete the job. The team does not have to expose the production password.
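The safest way to build that boundary is an allow-list: copy the permitted fields out of the stored record instead of deleting the sensitive ones. A minimal sketch under that assumption follows; the record shape, the "exists" key, and the function name are hypothetical and do not describe the actual response of certlocker_get_secret_metadata.

```python
from typing import Any, Optional

# Only these fields are ever copied into the response (allow-list, not deny-list).
METADATA_FIELDS = ("name", "group", "tags")


def secret_metadata(record: Optional[dict[str, Any]]) -> dict[str, Any]:
    """Project a stored secret record down to metadata; the raw value is never copied."""
    if record is None:
        return {"exists": False}
    meta = {field: record.get(field) for field in METADATA_FIELDS}
    meta["exists"] = True
    return meta


# Illustrative stored record; the value must not survive the projection.
stored = {"name": "db-admin", "group": "prod-db", "tags": ["postgres"], "value": "hunter2"}
print(secret_metadata(stored))
```

With an allow-list, a new sensitive field added to the record later stays hidden by default.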

MCP token TTL guardrails. Token expiry is now required, with a maximum of 90 days. A token without an expiry is rejected at creation time. This is the constraint that feels obvious in a design meeting and quietly gets skipped in the first implementation because it adds friction at setup.
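The guardrail itself is only a few lines, which is part of why it is easy to skip. A sketch of the creation-time check, assuming timezone-aware datetimes; the exception and function names are hypothetical, and the "must be in the future" rule is our addition for completeness:

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

MAX_TTL = timedelta(days=90)  # maximum lifetime stated above


class TokenPolicyError(ValueError):
    """Raised when a token request violates the expiry policy."""


def validate_expiry(expires_at: Optional[datetime], now: Optional[datetime] = None) -> datetime:
    """Return expires_at if it satisfies the policy, otherwise raise TokenPolicyError.

    Both datetimes are assumed to be timezone-aware.
    """
    now = now or datetime.now(timezone.utc)
    if expires_at is None:
        raise TokenPolicyError("expiry is required")
    if expires_at <= now:
        # Assumed rule: an already-expired token is rejected at creation.
        raise TokenPolicyError("expiry must be in the future")
    if expires_at - now > MAX_TTL:
        raise TokenPolicyError("expiry exceeds the 90-day maximum")
    return expires_at
```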

Token revocation. POST /api/agent/token/revoke revokes a token immediately and emits a TOKEN_STATUS_CHANGE audit event. From that point the token stops working. When a pentest engagement ends, a contractor finishes, or an agent session completes, you revoke the token and the access disappears.
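The important property is that revocation and the audit event happen together. An in-memory sketch of that pairing is below; the registry class, the event fields other than the event name, and the token IDs are all hypothetical and are not the server implementation.

```python
from dataclasses import dataclass, field


@dataclass
class TokenRegistry:
    """Hypothetical in-memory stand-in for the server's token store."""
    active: set[str] = field(default_factory=set)
    audit_log: list[dict] = field(default_factory=list)

    def revoke(self, token_id: str) -> bool:
        """Revoke a token and record the audit event; return False if it was not active."""
        if token_id not in self.active:
            return False
        self.active.remove(token_id)
        self.audit_log.append(
            {"event": "TOKEN_STATUS_CHANGE", "tokenId": token_id, "status": "revoked"}
        )
        return True

    def is_valid(self, token_id: str) -> bool:
        return token_id in self.active


registry = TokenRegistry(active={"pentest-session-1"})
registry.revoke("pentest-session-1")
print(registry.is_valid("pentest-session-1"), registry.audit_log)
```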

Dry-run mode. SSH session creation and probe creation both support a dry-run flag: the request validates scope and permissions without creating a resource. Agents can verify they have the right access before committing an action.
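The shape of a dry-run handler is simple: run exactly the same permission checks, then stop before the side effect. A sketch under assumed names follows; the scope string "probe:create", the class, and the response keys are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ProbeService:
    """Hypothetical service used to illustrate the dry-run flow."""
    granted_scopes: set[str]
    probes: list[str] = field(default_factory=list)

    def create_probe(self, target: str, dry_run: bool = False) -> dict:
        # The permission check is identical for real and dry-run requests.
        if "probe:create" not in self.granted_scopes:
            return {"allowed": False, "created": False}
        if dry_run:
            # Validation passed; stop before creating anything.
            return {"allowed": True, "created": False}
        self.probes.append(target)
        return {"allowed": True, "created": True}


service = ProbeService(granted_scopes={"probe:create"})
print(service.create_probe("example.internal:443", dry_run=True))
print(service.create_probe("example.internal:443"))
```

Sharing one code path for the checks is what makes the dry-run answer trustworthy: it cannot drift from what the real request would do.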

Machine-readable deny responses. When an agent is denied, the response now includes requiredScope, requiredGroup, and httpStatus. The agent gets enough information to report the failure meaningfully rather than receiving an opaque 403.
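As an illustration, here is how a deny payload with those three fields could be produced and then turned into a useful message on the agent side. Only the field names come from the description above; the function names and example values are hypothetical, and the real response may carry more than this.

```python
import json


def deny_response(required_scope: str, required_group: str, http_status: int = 403) -> str:
    """Build a deny body carrying the three fields named above."""
    return json.dumps(
        {
            "requiredScope": required_scope,
            "requiredGroup": required_group,
            "httpStatus": http_status,
        }
    )


def explain_denial(body: str) -> str:
    """Agent-side helper: turn the deny body into a sentence a human can act on."""
    data = json.loads(body)
    return (
        f"Denied (HTTP {data['httpStatus']}): requires scope "
        f"'{data['requiredScope']}' in group '{data['requiredGroup']}'"
    )


body = deny_response("secrets:read", "prod-db")
print(explain_denial(body))
```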

Discovery documents. GET /llms.txt and GET /.well-known/certlocker-ai.json tell agents what CertLocker is, what the API does, and which endpoints require careful handling.

The live MCP E2E test phase now covers: tool discovery, resource discovery, profile and permissions resource reads, allowed secret access, denied access for out-of-scope resources, unknown tool rejection, and audit expectations. The Slack summary from the lifecycle runner reports each check individually. A single “MCP: passed” message is not enough. You want to see which boundaries were actually tested.
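A per-check summary is a small formatting step. The sketch below uses check names from the list above; the function name and the exact message layout are assumptions, not the lifecycle runner's actual Slack format.

```python
def summarise(results: dict[str, bool]) -> str:
    """Render one line per check, followed by an overall count."""
    lines = [f"{'PASS' if ok else 'FAIL'} {name}" for name, ok in results.items()]
    passed = sum(results.values())
    lines.append(f"MCP: {passed}/{len(results)} checks passed")
    return "\n".join(lines)


# Illustrative results; a real run would fill these in from the test phase.
print(
    summarise(
        {
            "tool discovery": True,
            "resource discovery": True,
            "denied access for out-of-scope resources": True,
            "unknown tool rejection": False,
        }
    )
)
```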

We ran Shannon against our own API

Shannon is an open-source autonomous AI pentester built by Keygraph. It is white-box: you give it access to your source code, and it runs a multi-phase pipeline (reconnaissance, parallel vulnerability analysis, parallel exploitation) to find vulnerabilities and produce reproducible proof-of-concept exploits. Not a list of guesses. Actual working exploits against a running target, or nothing reported at all.

Shannon handles the whole workflow autonomously: browser automation, injection attacks, authentication bypass, SSRF, XSS, API abuse. It achieved a 96.15% success rate on the hint-free, source-aware XBOW benchmark and found over 20 critical vulnerabilities in OWASP Juice Shop in one run. It is genuinely capable, which is exactly why we wanted to point it at CertLocker.

The setup was Shannon running against a local mock version of the CertLocker API, with Codex CLI (gpt-5.5) as the executor inside a containerised worker. The goal was twofold: validate the MCP security hardening we shipped this week, and establish Shannon as a repeatable part of the security development cycle rather than a one-time exercise.

Getting the Codex executor to work inside the worker container involved some plumbing: preventing the host CODEX_HOME from overriding the container's mounted path, exposing the host model proxy via a temporary forwarder so the worker could reach it from inside Docker, and wiring in Codex as a selectable provider alongside Claude. None of that is Shannon's problem. It is the infrastructure tax you pay when you run any LLM tool in a sandboxed environment.

The pre_recon and recon phases both completed successfully. Vulnerability specialist agents started, produced output, and then hit Shannon's deliverable validation layer: their findings were rejected because they did not match the expected output schema.

That is the right behaviour. Shannon's validation layer exists because an autonomous pentester that reports unverified findings is just a slower way to generate noise. The agents tried to continue with partial output. The framework rejected it. The full exploit pipeline will complete once the specialist prompts are aligned with the schema.
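To show the idea of a validation gate, here is a generic sketch that rejects a finding unless every required field is present with the right type. The field names are invented for illustration; this is not Shannon's deliverable schema or its validation code.

```python
# Hypothetical required fields; Shannon's real schema is not reproduced here.
REQUIRED_FIELDS = {"title": str, "severity": str, "proof_of_concept": str}


def validation_errors(finding: dict) -> list[str]:
    """Return a list of problems; an empty list means the finding is accepted."""
    errors = []
    for name, expected_type in REQUIRED_FIELDS.items():
        if name not in finding:
            errors.append(f"missing field: {name}")
        elif not isinstance(finding[name], expected_type):
            errors.append(f"wrong type for {name}: expected {expected_type.__name__}")
    return errors


# A partial finding with no proof of concept is rejected rather than reported.
partial = {"title": "Possible IDOR", "severity": "high"}
print(validation_errors(partial))
```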

Running your own baseline security regression before the autonomous pentest matters too. Phase 10 of the CertLocker E2E lifecycle passed 5/5, covering unauthenticated access checks, node operation exposure, and email 2FA brute-force protection. Shannon is most useful when it is testing the subtle attack surfaces, not rediscovering the obvious ones. A clean baseline means its findings are signal, not noise.

We launched on Instagram

CertLocker now has an Instagram account and we are giving it a proper go. Nine short-form video reels went out this week covering the certificate spreadsheet problem, retail infrastructure secrets, mixed non-Kubernetes estates, vendor access, and the cert-renewed-but-production-still-went-down scenario, plus four trial reels for the two-week free trial campaign launching Monday 22nd.

No particular playbook. Just showing up, talking about real infrastructure problems, and seeing what sticks. If you are in DevOps or SRE and you want to follow along, we are building in public and the content is technical.

The pattern underneath all of it

The common thread across HA, the docs site, MCP hardening, and the Shannon pentest run is this: infrastructure that earns operational trust needs to be boring, auditable, and recoverable.

HA earns trust by recovering automatically when a node disappears. The docs site earns trust by being honest about what the API does and what agents should never do with it. The MCP security layer earns trust by enforcing policy at the server, not by trusting the model prompt. Shannon earns trust by rejecting findings it cannot prove and only reporting what it can actually exploit.

That is also just a description of what this week looked like in practice.

It was a great week for the parish, all in all.