Engineering field note

Ansible and CertLocker, Part 2: hardening the deployment path

Moving secrets into CertLocker was the first step. The next step was making the whole Ansible deployment path repeatable, recoverable, and less dependent on local secret files.

Runtime flow for Ansible resolving deployment secrets from CertLocker by stable name

In Part 1, we covered the big boundary change: Ansible still owns configuration, but CertLocker owns the sensitive material. That first piece explained the move away from raw Ansible Vault values and ID-based lookups toward stable CertLocker names such as ansible.dev03.certlocker_stack.vault_certlocker_postgres_password.

The commits from May 27, 2026 onward were about what happens after that first migration works once. We needed it to work across existing inventories, new development environments, SSH identities, Postgres TLS files, Git hooks, and repeat deploys where Docker or networking can fail for boring operational reasons.

So this is not another article about why stable names are better than IDs. This is the follow-up: the part where the migration becomes a system instead of a one-off script.

The new boundary is the inventory layout

The uploader now treats vault.yml as the operational boundary. If a value still lives in a vault file, the assumption is that it belongs in CertLocker. Non-sensitive values belong in all.yml. That keeps the rule simple enough for humans to follow and simple enough for automation to enforce.

all.yml    public hostnames, ports, image tags, feature flags, paths
vault.yml  passwords, PATs, tokens, SMTP secrets, private keys, bootstrap values

The script scans inventory vault files, decrypts Ansible Vault content when needed, skips values that are already CertLocker lookup templates, and ignores the old *_cl_id helper variables. The result is a cleaner conversion path: old-style values can be uploaded, then replaced with lookup-only references.

CERTLOCKER_API_URL=https://trust.certlocker.io/rest \
CERTLOCKER_TOKEN="$CERTLOCKER_TRUST_GATEWAY_TOKEN" \
./scripts/certlocker-upload-inventory-vaults.py \
  --inventory dev01,dev02,dev03,dev04,sire,certlocker \
  --include-empty \
  --write-lookups \
  --fail-fast

That command does three useful things in one pass: it uploads the remaining vault variables, creates records under stable names, and rewrites the local vault files to resolve those names at deploy time. Existing Ansible roles keep using the same variable names. The storage backend changes, not the role interface.
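
The scan rules described above can be sketched in a few lines of Python. This is a minimal illustration, not the uploader's actual code: the lookup-template pattern, the plan_uploads helper, and the scope argument (the certlocker_stack segment in the Part 1 example name) are assumptions made for the example.

```python
import re

# Assumption: a lookup-only value is a Jinja expression that calls a lookup
# plugin from the certlocker.secrets collection. The real template may differ.
LOOKUP_RE = re.compile(r"\{\{\s*lookup\(\s*['\"]certlocker\.secrets\.")


def stable_name(inventory: str, scope: str, variable: str) -> str:
    """Build a name shaped like the Part 1 example (scope is hypothetical)."""
    return f"ansible.{inventory}.{scope}.{variable}"


def plan_uploads(inventory, scope, vault_vars, include_empty=False):
    """Return (uploads, skipped): what to send, and why the rest was left alone."""
    uploads, skipped = {}, {}
    for key, value in vault_vars.items():
        if key.endswith("_cl_id"):
            # Old ID-based helper variables are ignored entirely.
            skipped[key] = "legacy id helper"
        elif isinstance(value, str) and LOOKUP_RE.search(value):
            # Already converted, so there is nothing to upload.
            skipped[key] = "already a lookup"
        elif value in ("", None) and not include_empty:
            skipped[key] = "empty"
        else:
            uploads[stable_name(inventory, scope, key)] = value
    return uploads, skipped
```

Returning the skipped entries with a reason, rather than dropping them silently, gives the operator something to review before any file is rewritten.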

Dry-runs became part of the migration contract

The migration script gained a more careful operator path. You can filter inventories with repeated or comma-separated --inventory arguments, run with --dry-run, include empty values deliberately, and decide whether certificate files or SSH PEMs should be part of a given run.

./scripts/certlocker-upload-inventory-vaults.py \
  --inventory dev03 \
  --include-empty \
  --dry-run

That matters because deployment secrets rarely move in a perfectly clean repo. Some inventories are already lookup-only. Some still have old encrypted values. Some need certificate material uploaded. Some should not touch SSH keys during a particular run. The tool now lets the operator narrow the blast radius before writing anything back.
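
As a rough sketch of how repeated and comma-separated --inventory values plus a --dry-run guard could be handled with Python's argparse: the flag names come from the commands above, while parse_args and apply_change are hypothetical helpers, not the script's real internals.

```python
import argparse


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Sketch of the uploader's operator flags")
    # action="append" collects every --inventory occurrence into a list.
    parser.add_argument("--inventory", action="append", default=[])
    parser.add_argument("--include-empty", action="store_true")
    parser.add_argument("--dry-run", action="store_true")
    args = parser.parse_args(argv)
    # Flatten comma-separated values and drop duplicates while keeping order.
    names = [n.strip() for item in args.inventory for n in item.split(",") if n.strip()]
    args.inventory = list(dict.fromkeys(names))
    return args


def apply_change(description, write, dry_run):
    """Run write() unless dry_run is set; return a line describing what happened."""
    if dry_run:
        return f"[dry-run] would {description}"
    write()
    return f"done: {description}"
```

Routing every write through one guard like apply_change keeps the dry-run promise easy to audit: nothing touches disk or the API unless it passes through that function.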

Certificate files are first-class deployment secrets

Part 1 mentioned PEM support, but the follow-up work made it operational for the stack. The uploader now includes certificate material from inventory/*/certs by default and gives those files predictable names:

ansible.<inventory>.certs.<directory>.<filename>
ansible.dev03.certs.postgres-primary.server.key
ansible.dev03.certs.postgres-primary.server.crt
ansible.dev03.certs.postgres-primary.ca.crt

Private keys are stored as PEM records. Public certificates and CA files are stored as configuration-style records. That distinction lets Ansible use the right CertLocker lookup plugin for the right material.
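
The naming rule and the key/certificate split can be illustrated with a short Python sketch. It assumes paths of the form inventory/<inventory>/certs/<directory>/<filename> and detects private keys by their PEM header. The "pem" and "config" labels are placeholders for the two record types, and the uploader may well classify files differently.

```python
from pathlib import PurePosixPath


def cert_record_name(path: str) -> str:
    """Map inventory/<inventory>/certs/<directory>/<filename> to a stable name."""
    parts = PurePosixPath(path).parts
    # Expect exactly: inventory, <inventory>, certs, <directory>, <filename>
    if len(parts) != 5 or parts[0] != "inventory" or parts[2] != "certs":
        raise ValueError(f"not an inventory cert path: {path}")
    _, inventory, _, directory, filename = parts
    return f"ansible.{inventory}.certs.{directory}.{filename}"


def record_kind(pem_text: str) -> str:
    """Assumed rule: anything containing a private key block is a PEM record."""
    # PEM private key headers all end in "PRIVATE KEY-----"
    # (PKCS#8, RSA, EC, and OpenSSH variants alike).
    return "pem" if "PRIVATE KEY-----" in pem_text else "config"
```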

certlocker_postgres_tls_source: certlocker

certlocker_postgres_tls_certlocker_map:
  postgres-primary:
    server.key:
      name: ansible.dev03.certs.postgres-primary.server.key
      lookup: pem_key

The deployment role now supports two Postgres TLS sources. It can still copy files from the controller inventory for older flows, or it can pull uploaded TLS records from CertLocker and write them onto the target host with the right ownership and permissions. That is the important shift: database TLS material follows the same trust plane as passwords and API tokens.

SSH identity PEMs moved into the same sync model

The next gap was SSH. It is not enough to move application secrets out of Git if the deployment path still depends on local PEM files being passed around manually.

The new keys/certlocker_ssh_pems.yml file is deliberately non-secret. It maps where a PEM should live locally to the CertLocker record that stores it:

ssh_pems:
  - path: keys/sire/production.pem
    name: ansible.keys.sire.production.pem

The uploader stores those mapped PEMs in CertLocker as PEM records. The new sync script can later fetch them back from CertLocker Trust and write them locally with 0600 permissions. It prints status and paths, not key contents.

CERTLOCKER_API_URL=https://trust.certlocker.io/rest \
CERTLOCKER_TOKEN="$CERTLOCKER_TRUST_GATEWAY_TOKEN" \
./scripts/certlocker-sync-ssh-pems.py
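
The write side of that sync could look roughly like the sketch below. It takes the already-parsed ssh_pems list and a fetch callable that returns PEM text for a record name. Both sync_ssh_pems and fetch are hypothetical stand-ins for illustration, not the script's real client code.

```python
import os


def write_private_file(path: str, content: str) -> None:
    """Write content so that only the owner can read or write the file."""
    parent = os.path.dirname(path)
    if parent:
        os.makedirs(parent, exist_ok=True)
    # Create with 0600 from the start so a new key file is never wider than that.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as handle:
        handle.write(content)
    # An existing file keeps its old mode on open, so tighten it explicitly.
    os.chmod(path, 0o600)


def sync_ssh_pems(mappings, fetch, root="."):
    """fetch(name) is a hypothetical callable returning PEM text for a record name."""
    report = []
    for entry in mappings:
        target = os.path.join(root, entry["path"])
        write_private_file(target, fetch(entry["name"]))
        # Report the record name and path only, never the key material.
        report.append(f"synced {entry['name']} -> {entry['path']}")
    return report
```

Passing the mode to os.open, instead of writing the file and calling chmod afterwards, avoids a window in which a freshly created key is readable by other users.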

The Git hooks now use that path too. After checkout, merge, or rewrite, the hook decrypts tracked vault-eligible files for local Ansible use and then attempts to sync SSH PEMs from CertLocker. In non-interactive contexts it will skip the sync unless CERTLOCKER_TOKEN is available, which keeps CI and automation from hanging on a token prompt.
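
The skip rule for non-interactive contexts reduces to one small decision. The sketch below expresses it in Python for illustration only: the real hook is a shell script, and should_sync_pems is a hypothetical name.

```python
import os
import sys


def should_sync_pems(env=None, interactive=None) -> bool:
    """Decide whether a hook should attempt the PEM sync."""
    env = os.environ if env is None else env
    if interactive is None:
        # Treat a terminal on stdin as "someone is here to answer a prompt".
        interactive = sys.stdin.isatty()
    has_token = bool(env.get("CERTLOCKER_TOKEN"))
    # With a token the sync never needs to prompt. Without one, it is only
    # worth attempting when a person can respond to the prompt.
    return has_token or interactive
```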

New environments are now generated, uploaded, and converted

The most useful operational change is the new environment initializer. A greenfield inventory should not require someone to handcraft a vault file, generate certificates separately, copy a PEM into the right place, upload values manually, and then remember to rewrite the vault into lookups.

CERTLOCKER_TOKEN="$CERTLOCKER_TRUST_GATEWAY_TOKEN" \
  ./scripts/init-new-environment.sh dev09 trust.certlocker.io

The script handles the whole sequence in one run:

  • creates the inventory from the certlocker.io template
  • writes host and group variable files
  • generates bootstrap secrets
  • generates Postgres TLS material
  • optionally copies an SSH PEM into the expected inventory key path
  • records the PEM mapping
  • uploads the vault values and certificate files to CertLocker
  • writes a lookup-only vault.yml

That is the difference between "we can migrate this environment" and "we can create the next one correctly by default." The compatibility wrappers for older init scripts now point into this path, so new work starts from the CertLocker-backed model.

The deploy role got less fragile

Not every commit was about secrets. Some of the hardening was about making the deployment path tolerate real infrastructure behavior.

  • Docker image pulls now register results, retry failed pulls, and use configurable retry counts and delays.
  • docker compose up now uses already-pulled images with --pull missing instead of pulling everything again during the up step.
  • The role can configure Docker daemon DNS servers so registry pulls do not fail because the host resolver is wrong.
  • The HAProxy ACME render step now only runs when the selected compose service set includes certlocker-haproxy.
  • The external HTTPS connectivity check now has explicit timeouts, so a stuck package install or network call cannot hang the play forever.

These are not glamorous changes, but they are the kind that make a deployment system feel less brittle. A trust platform still has to get through Docker pulls, DNS behavior, network reconciliation, and partial service deploys.
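
The retry behavior in the first bullet follows a familiar pattern. The sketch below shows that pattern in plain Python with a configurable attempt count and delay. It is not the role's implementation, which lives in Ansible tasks, and run_with_retries is a hypothetical helper.

```python
import time


def run_with_retries(action, retries=3, delay=5.0, sleep=time.sleep):
    """Call action() until it succeeds or the attempts run out.

    retries counts total attempts; delay is the pause in seconds between them.
    """
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            return action()
        except Exception as error:  # a real task would catch narrower errors
            last_error = error
            if attempt < retries:
                # Only pause when another attempt is still coming.
                sleep(delay)
    raise RuntimeError(f"failed after {retries} attempts") from last_error
```

Injecting sleep as a parameter keeps the helper testable without real waiting. In practice the action would wrap something like an image pull with its own timeout, so that a single hung attempt cannot stall the whole loop.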

What changed in the repo

The practical shape of the work from May 27 onward was:

  • The certlocker.secrets Ansible collection is now published and released at github.com/certlocker-io/ansilbe-certlocker, so the lookup plugins are no longer just internal deployment code.
  • scripts/certlocker-upload-inventory-vaults.py became the main migration tool for vault variables, inventory certificate files, SSH PEM mappings, dry-runs, existing-record checks, and lookup rewrites.
  • scripts/init-new-environment.sh became the greenfield environment path for generated secrets, generated Postgres TLS, CertLocker upload, and lookup-only vault output.
  • scripts/certlocker-sync-ssh-pems.py added the reverse path for downloading mapped SSH identities from CertLocker Trust.
  • scripts/certlocker-vault-hooks.sh kept the legacy encrypted-file safety net while adding post-checkout SSH PEM sync.
  • roles/certlocker_stack/tasks/copy_postgres_tls.yml gained a CertLocker-backed TLS source mode for Postgres primary and replica files.
  • The stack deploy tasks gained safer Docker pull/up behavior, clearer HAProxy render conditions, and bounded connectivity checks.

The result

CertLocker is now doing more than storing values that Ansible reads. It is becoming the operational source of truth for the sensitive material around the deployment itself: stack secrets, certificate private keys, public cert material, SSH deployment identities, inventory grouping, and audit history.

Ansible describes what should exist.
CertLocker stores the sensitive material.
The deployment path pulls only what it needs, when it needs it.

That is the real Part 2. The first migration proved the lookup model. This work made it repeatable across inventories and new environments, while keeping enough Ansible Vault compatibility to avoid a dangerous big-bang cutover.

There is still more to clean up, especially around making the operator docs easier to follow. But the collection is now released, and the direction is clear: fewer committed secrets, fewer local-only PEM assumptions, and a deployment path that uses CertLocker the same way we expect infrastructure teams to use it.