Replacing Ansible Vault Sprawl with CertLocker

This was good craic to build because it was not just a feature ticket. It was one of those infrastructure problems that only gives you a proper answer after you chase it through Ansible, Gateway tokens, PEM files, groups, deployment failures, and your own assumptions.

Ansible Vault was not the villain here. It did its job: encrypt sensitive values in a repo so we could keep deploying. The problem was the centre of gravity. As CertLocker matured, it became harder to justify keeping the sensitive deployment material in encrypted files beside the playbooks.

We are trying to be transparent about how CertLocker is being built. Some weeks are shiny UI work. Some weeks are HAProxy, Pebble, Gateway routes, Postgres keys, or deployment scripts. This was the second kind: useful infrastructure work that makes the deployment path cleaner.

CertLocker is meant to be an infrastructure trust control plane. If we still need a separate Vault-shaped dependency to deploy our own certificate platform, then the control plane is not finished. This migration is part of closing that loop.

What changed

The interesting part is the boundary this migration changes. We did not start with "replace Ansible Vault" as a slogan. We started with the pieces CertLocker already needed to support this kind of deployment:

Gateway support for creating and reading secrets with automation tokens.
Group-scoped access so records can belong to environments such as dev02 and dev03.
A Gateway lookup endpoint that resolves a secret by stable name.
PEM secret support for certificate private keys.
ACME, HAProxy, Cloudflare DNS, Pebble, and Let's Encrypt certificate delivery.
Deployment hardening in Ansible: V3 RAM-only secrets, vault hooks, cert material handling, and safer compose rendering.

This is not saying HashiCorp Vault is bad. For a lot of organisations it is the right answer. For this CertLocker deployment path, we now have enough native capability that a separate Vault cluster is not the right dependency.

Diagram showing CertLocker as the trust plane for Ansible secrets, HAProxy certificates, SSH access, groups, and audit

The first fix was not code

The first fix was deciding the boundary. The repo should describe the system, not carry the secrets.

Diagram showing non-sensitive Ansible config in all.yml and CertLocker lookup references in vault.yml

all.yml    non-sensitive operational configuration
vault.yml  values uploaded to CertLocker or resolved from CertLocker

Ports, image tags, public FQDNs, usernames, feature flags, backup schedules, and paths belong in all.yml.

PATs, passwords, master keys, ACME tokens, Cloudflare tokens, SMTP secrets, webhooks, and private keys belong in CertLocker.

The uploader no longer tries to guess whether a variable looks sensitive. If a value is still in vault.yml, it gets uploaded. The inventory layout is the source of truth.

The uploaded records

After the upload, the inventory names appear in CertLocker exactly where the Ansible lookup expects to resolve them.

CertLocker System Secrets filtered by dev02, showing Ansible records, certificate records, active status, types, and groups

ansible.dev02.certlocker_stack... for variable-backed records.
ansible.dev02.certs... for certificate material.
CONFIGURATION for ordinary secret values and public certificate material.
PEM for private keys.
ansible plus the inventory name as groups.
ACTIVE status for uploaded records.

The useful part is the shape: names, types, groups, status, and audit history all sit on the same record.
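
To make that shape concrete, here is a minimal Python sketch of one record as the list above describes it. The class and function names are hypothetical and exist only for illustration; this is not the CertLocker data model, and audit history is left out because the post does not describe its format.

```python
from dataclasses import dataclass
from typing import Tuple

# The two record types named in this post.
RECORD_TYPES = {"CONFIGURATION", "PEM"}


@dataclass(frozen=True)
class SecretRecord:
    """Hypothetical view of one uploaded record (illustration only)."""

    name: str                 # stable operational name
    type: str                 # "CONFIGURATION" or "PEM"
    groups: Tuple[str, ...]   # "ansible" plus the inventory name
    status: str = "ACTIVE"    # uploaded records show as ACTIVE

    def __post_init__(self) -> None:
        if self.type not in RECORD_TYPES:
            raise ValueError(f"unknown record type: {self.type!r}")


def record_for(inventory: str, name: str, type: str) -> SecretRecord:
    """Attach the two groups the uploader is described as using."""
    return SecretRecord(name=name, type=type, groups=("ansible", inventory))


if __name__ == "__main__":
    print(record_for("dev03", "ansible.dev03.certs.postgres-primary.server.key", "PEM"))
```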

Stable names instead of ID babysitting

Our first version used CertLocker IDs in Ansible:

vault_grafana_admin_password_cl_id: "b5f9830b6"
vault_grafana_admin_password: >-
  {{ lookup('certlocker.secrets.secret',
            id=vault_grafana_admin_password_cl_id,
            field='secret.value') }}

It worked until records were wiped or recreated. Then the IDs changed and someone had to patch Git. That is not infrastructure trust. That is ID babysitting.

The better contract is a stable operational name:

ansible.<inventory>.<group>.<variable>
ansible.dev03.certlocker_stack.vault_grafana_admin_password
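
As a rough sketch of that naming contract for variable-backed records, the helper below builds a name from its three parts. The function name is hypothetical, and the character rule is only our assumption of what "URI-friendly" could mean (letters, digits, underscore, hyphen); certificate names, which carry extra dots in the file part, are not covered here.

```python
import re

# Assumed rule for a URI-friendly segment: letters, digits, underscore, hyphen.
_SEGMENT = re.compile(r"[A-Za-z0-9_-]+")


def secret_name(inventory: str, group: str, variable: str) -> str:
    """Build ansible.<inventory>.<group>.<variable> from its three parts."""
    parts = (("inventory", inventory), ("group", group), ("variable", variable))
    for label, value in parts:
        # fullmatch rejects empty strings, spaces, dots and trailing newlines.
        if not _SEGMENT.fullmatch(value):
            raise ValueError(f"{label} {value!r} is not a URI-friendly segment")
    return f"ansible.{inventory}.{group}.{variable}"


if __name__ == "__main__":
    print(secret_name("dev03", "certlocker_stack", "vault_grafana_admin_password"))
```

Rejecting dots inside a segment keeps the three parts unambiguous when the name is read back.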

Gateway resolves that name:

GET /api/v1/secret/lookup?name=ansible.dev03.certlocker_stack.vault_grafana_admin_password

Then the Ansible collection fetches the current secret by the returned ID. The inventory never commits that ID.
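
The two-step flow can be sketched in Python against an in-memory stand-in for Gateway. Only the lookup path and the field='secret.value' convention come from this post; the "id" response key, the fetch-by-ID path, and every class and function name below are assumptions for illustration, not the collection's actual code.

```python
from typing import Any, Callable, Dict, Optional

# A transport takes a path plus optional query parameters and returns JSON.
Transport = Callable[[str, Optional[Dict[str, str]]], Dict[str, Any]]


def dig(record: Dict[str, Any], field: str) -> Any:
    """Follow a dotted path such as 'secret.value' through nested dicts."""
    current: Any = record
    for key in field.split("."):
        current = current[key]
    return current


def resolve(name: str, get_json: Transport, field: str = "secret.value") -> Any:
    # Step 1: the lookup endpoint maps the stable name to the current ID.
    found = get_json("/api/v1/secret/lookup", {"name": name})
    secret_id = found["id"]  # assumed response key
    # Step 2: fetch the current record by that ID (hypothetical path).
    record = get_json(f"/api/v1/secret/{secret_id}", None)
    return dig(record, field)


class FakeGateway:
    """In-memory stand-in so the flow runs without a real Gateway."""

    def __init__(self) -> None:
        self.ids: Dict[str, str] = {}
        self.records: Dict[str, Dict[str, Any]] = {}

    def put(self, name: str, secret_id: str, value: str) -> None:
        # Recreating a record changes the ID that the name points at.
        self.ids[name] = secret_id
        self.records[secret_id] = {"secret": {"value": value}}

    def get_json(self, path: str, params: Optional[Dict[str, str]]) -> Dict[str, Any]:
        if path == "/api/v1/secret/lookup" and params is not None:
            return {"id": self.ids[params["name"]]}
        return self.records[path.rsplit("/", 1)[-1]]


if __name__ == "__main__":
    name = "ansible.dev03.certlocker_stack.vault_grafana_admin_password"
    gateway = FakeGateway()
    gateway.put(name, "id-before-rebuild", "first-value")
    print(resolve(name, gateway.get_json))
    gateway.put(name, "id-after-rebuild", "second-value")
    print(resolve(name, gateway.get_json))
```

The second call resolves through a different ID without any change on the caller's side, which is the property the inventory relies on.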

Sequence diagram showing Ansible lookup by stable name through Gateway and CertLocker

The Ansible collection

We built and released a small Ansible collection, certlocker.secrets, so playbooks can resolve CertLocker records at runtime from the controller. The collection is published at github.com/certlocker-io/ansilbe-certlocker; the point is to make this reusable rather than a private deployment trick.

For another team, the runtime path is intentionally small: install the collection, set the Gateway URL and token on the controller, then replace raw vault values with lookups by name. You do not need the managed hosts to know anything about CertLocker.

ansible-galaxy collection install git+https://github.com/certlocker-io/ansilbe-certlocker.git

export CERTLOCKER_API_PROFILE=gateway
export CERTLOCKER_API_URL="https://dev01.certlocker.io/rest"
export CERTLOCKER_TOKEN="${CERTLOCKER_GATEWAY_TOKEN}"
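
Before a long playbook run, it can help to check on the controller that those three variables are actually set. The Python sketch below is a hypothetical preflight helper, not part of the collection; treating all three variables as required is our assumption.

```python
import os
from typing import Dict, Mapping

# The three variables exported above; assumed here to all be required.
REQUIRED = ("CERTLOCKER_API_PROFILE", "CERTLOCKER_API_URL", "CERTLOCKER_TOKEN")


def controller_settings(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Return the CertLocker settings, or fail naming what is missing."""
    missing = [key for key in REQUIRED if not env.get(key)]
    if missing:
        raise RuntimeError("missing controller settings: " + ", ".join(missing))
    return {key: env[key] for key in REQUIRED}


if __name__ == "__main__":
    try:
        settings = controller_settings()
    except RuntimeError as exc:
        print(exc)
    else:
        # Print the URL only; the token should never reach logs.
        print("CertLocker Gateway:", settings["CERTLOCKER_API_URL"])
```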

Normal secret lookup:

vault_certlocker_postgres_password: >-
  {{ lookup('certlocker.secrets.secret',
            name='ansible.dev03.certlocker_stack.vault_certlocker_postgres_password',
            field='secret.value') }}

PEM private key lookup:

- name: Install Postgres primary private key
  ansible.builtin.copy:
    content: "{{ lookup('certlocker.secrets.pem_key',
                         name='ansible.dev03.certs.postgres-primary.server.key') }}"
    dest: "{{ certlocker_postgres_primary_cert_dir }}/server.key"
    owner: root
    group: root
    mode: "0600"
  no_log: true

The managed host does not need the CertLocker token. The lookup runs on the Ansible controller. Existing roles can keep using the same variable names; the value resolves from CertLocker at deploy time.

The uploader

The uploader is only there to help convert existing Ansible Vault variables and sensitive inventory files. It is not needed on every deploy. It reads what you already have, creates CertLocker records, and lets the playbook move to stable lookups. That tooling now ships alongside the released collection.

For an existing inventory, the conversion looks like this:

cd cl-ansible

CERTLOCKER_API_URL=https://dev01.certlocker.io/rest \
CERTLOCKER_TOKEN="${CERTLOCKER_GATEWAY_TOKEN}" \
./scripts/certlocker-upload-inventory-vaults.py --inventory dev03 --insecure

It scans:

inventory/*/group_vars/*/vault.yml
inventory/*/certs/**

It decrypts Ansible Vault files with ansible-vault view, uploads every remaining variable in vault.yml, skips existing lookup templates and old *_cl_id helpers, checks that names are URI-friendly, and uploads certificate files as CONFIGURATION or PEM:

.key / .pem -> PEM
.crt / .srl -> CONFIGURATION
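
The selection and typing rules above can be sketched in a few lines of Python. This is an illustration, not the uploader's source: the function names are hypothetical, detecting a lookup template by substring is an assumption, and so is raising an error for a file suffix that has no mapping.

```python
from pathlib import PurePosixPath
from typing import Dict

PEM_SUFFIXES = {".key", ".pem"}
CONFIGURATION_SUFFIXES = {".crt", ".srl"}

# Assumed marker for a value that is already a CertLocker lookup template.
LOOKUP_MARKER = "lookup('certlocker.secrets."


def record_type(path: str) -> str:
    """Map a certificate file to its record type by file suffix."""
    suffix = PurePosixPath(path).suffix
    if suffix in PEM_SUFFIXES:
        return "PEM"
    if suffix in CONFIGURATION_SUFFIXES:
        return "CONFIGURATION"
    raise ValueError(f"no record type mapped for {path!r}")


def variables_to_upload(vault_vars: Dict[str, object]) -> Dict[str, object]:
    """Keep every vault.yml variable except ID helpers and lookups."""
    selected: Dict[str, object] = {}
    for key, value in vault_vars.items():
        if key.endswith("_cl_id"):
            continue  # old ID helper from the first version
        if isinstance(value, str) and LOOKUP_MARKER in value:
            continue  # already resolves from CertLocker
        selected[key] = value
    return selected


if __name__ == "__main__":
    print(record_type("inventory/dev03/certs/postgres-primary/server.key"))
    print(sorted(variables_to_upload({
        "vault_smtp_password": "example-value",
        "vault_grafana_admin_password_cl_id": "b5f9830b6",
    })))
```

Note that nothing here inspects a value to decide whether it looks sensitive; being in vault.yml is the only criterion.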

After that, vault.yml can become lookup-only. That is the part people should care about: the conversion is a one-time upload, and normal Ansible keeps reading variables from the same places.

New development environments

A new environment starts from our own components: Ansible inventory, generated Postgres TLS material, CertLocker Gateway, CertLocker secrets, HAProxy/ACME delivery, and the same group/audit model we expose to users.

cd cl-ansible

CERTLOCKER_TOKEN="${CERTLOCKER_GATEWAY_TOKEN}" \
  ./scripts/init-new-environment.sh dev09 trust.certlocker.io dev01.certlocker.io

The script creates the inventory, generates bootstrap secrets and Postgres certs, uploads the values and PEM files to CertLocker, then writes lookup-only references back into vault.yml.

vault_certlocker_postgres_password: >-
  {{ lookup('certlocker.secrets.secret',
            name='ansible.dev09.certlocker_stack.vault_certlocker_postgres_password',
            field='secret.value') }}
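
Writing those references is mostly string templating. The Python sketch below renders one lookup-only entry in the same layout as the example above; the function name is hypothetical and this is not taken from init-new-environment.sh, which may do it differently.

```python
def lookup_reference(inventory: str, group: str, variable: str) -> str:
    """Render one lookup-only vault.yml entry for a variable."""
    name = f"ansible.{inventory}.{group}.{variable}"
    return (
        f"{variable}: >-\n"
        # Plain (non-f) strings keep the Jinja2 braces literal.
        "  {{ lookup('certlocker.secrets.secret',\n"
        f"            name='{name}',\n"
        "            field='secret.value') }}\n"
    )


if __name__ == "__main__":
    print(lookup_reference("dev09", "certlocker_stack",
                           "vault_certlocker_postgres_password"), end="")
```

Because the entry is derived only from the inventory, group, and variable names, regenerating vault.yml never needs a CertLocker ID.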

Why this matters

We are always growing the product, and we are trying to be transparent about the work as we go. I enjoyed this one because it made us run CertLocker through our own deployment process, not just a prepared demo path.

CertLocker already handles certificates, ACME delivery, private secrets, tokens, groups, probes, bastion access, and audit evidence. Moving our own Ansible deployment secrets into CertLocker is where those pieces start to click together.

Ansible owns configuration.
CertLocker owns sensitive material.
Stable names connect the two.

The repo becomes easier to review. Rotations stop being Git edits. A database rebuild does not strand Ansible on dead IDs. PEM files follow the same access model as passwords and PATs. Groups give us an inventory boundary. Audit gives us the story afterwards.

The headline is not "we added an Ansible lookup". The point is that CertLocker is now being used as the trust plane for its own infrastructure.