Authentik deployment checklist
Purpose
This runbook is the operator path for deploying, verifying, and handing off Authentik as the homelab identity provider.
It covers:
- Preflight checks: secrets, Swarm state, storage, and network readiness.
- Deployment execution using the canonical Ansible playbook.
- Service convergence and health verification.
- Ingress and functional smoke tests against the live endpoint.
- Post-deploy hardening, evidence capture, and rollback guidance.
- Day-1 troubleshooting for common failure modes.
Scope
- Stack name: authentik
- Canonical playbook: ansible/playbooks/docker/deploy_authentik.yml
- Stack template: ansible/templates/stacks/authentik.stack.yml
- Target manager: swarm-manager-1 (10.0.0.211)
- Public URL: https://sso.castaldifamily.com
- Data root: /mnt/homelab/apps/authentik
- Services deployed: authentik-postgres, authentik-redis, authentik-server, authentik-worker
Important
This stack uses absolute bind mounts. The deploy playbook requires all data directories to exist before deployment. If any path is missing, the preflight asserts fail safe: they abort the run rather than bootstrap an empty installation over existing data.
Deployment flow
flowchart LR
preflight[Phase 1 — Preflight] --> validation[Phase 2 — Validation run]
validation --> deploy[Phase 3 — Deploy]
deploy --> convergence[Phase 4 — Convergence]
convergence --> ingress[Phase 5 — Ingress checks]
ingress --> handoff[Phase 6 — Handoff]
classDef phase fill:#dbeafe,stroke:#3b82f6;
class preflight,validation,deploy,convergence,ingress,handoff phase
Phase 1 — Preflight checklist
Complete all items in this phase before running any playbook command.
1.1 Change window and ownership
- Deployment owner is assigned.
- Rollback owner is assigned.
- Maintenance window is confirmed.
- No active cluster incidents in the latest Swarm audit (outputs/swarm_audit_*.md).
1.2 Control node readiness
Run from the ansible/ directory with the virtual environment active.
# Activate the Python virtual environment
source /home/chester/homelab/.venv/bin/activate
# Confirm Ansible version (must be >= 2.18.0)
ansible --version
# Confirm SSH access to all Swarm managers
ansible swarm_managers -i inventory/hosts.ini -m ping
- Ansible version is 2.18.0 or higher.
- All Swarm managers return pong.
- Vault password is available (.vault_pass file present or ANSIBLE_VAULT_PASSWORD_FILE set).
1.3 Secrets readiness
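The deploy playbook asserts that vault_authentik_secret_key and vault_authentik_postgres_password are defined, non-empty, and not placeholder strings. Verify them first. The command below prints only the length of the decrypted value, so the secret itself never reaches the terminal: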
ansible -i inventory/hosts.ini localhost \
-m ansible.builtin.debug \
-a "msg={{ vault_authentik_secret_key | length }}" \
-e "@group_vars/all.yml" \
--vault-password-file .vault_pass
Repeat for vault_authentik_postgres_password.
- vault_authentik_secret_key decrypts to a non-empty, non-placeholder value.
- vault_authentik_postgres_password decrypts to a non-empty, non-placeholder value.
- Neither value is any of: change-me, changeme, your-random-secret, your-db-password.
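The placeholder rule above can be sketched as a small shell predicate. This is an illustration only: `secret_is_usable` is a hypothetical helper name, and it mirrors the list in this checklist rather than the playbook's actual assert tasks, which are not shown here.

```shell
# Hypothetical helper: succeeds only if the value is non-empty and is not
# one of the placeholder strings listed in this checklist.
secret_is_usable() {
  case "$1" in
    ""|change-me|changeme|your-random-secret|your-db-password) return 1 ;;
    *) return 0 ;;
  esac
}
```

For example, `secret_is_usable "$candidate" || echo "placeholder or empty"` flags a bad value before it is written to the vault.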
1.4 Swarm cluster state
# Confirm target manager is active and is control-plane
ssh chester@10.0.0.211 \
"docker info --format '{{.Swarm.LocalNodeState}}|{{.Swarm.ControlAvailable}}'"
# Expected output: active|true
# Confirm all managers are active
# ({% raw %} stops Ansible from evaluating the Go format string as Jinja2)
ansible swarm_managers -i inventory/hosts.ini \
-m ansible.builtin.command \
-a "docker info --format '{% raw %}{{.Swarm.LocalNodeState}}{% endraw %}'"
- swarm-manager-1 returns active|true.
- All three managers return active.
- No node shows inactive, pending, or error.
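The `active|true` expectation can be checked mechanically instead of by eye. The sketch below assumes the exact `state|control` string produced by the `docker info` format in step 1.4; `swarm_manager_ready` is a hypothetical helper name.

```shell
# Hypothetical helper: takes the "LocalNodeState|ControlAvailable" string and
# succeeds only for an active node with control-plane availability.
swarm_manager_ready() {
  local state control
  state="${1%%|*}"    # text before the first "|"
  control="${1#*|}"   # text after the first "|"
  [ "$state" = "active" ] && [ "$control" = "true" ]
}
```

A typical call wraps the SSH command from this step in a command substitution and passes the result as the single argument, for example `swarm_manager_ready "$result" || echo "manager not ready"`.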
1.5 External overlay network
Authentik requires proxy-net to exist before stack deploy.
ssh chester@10.0.0.211 \
"docker network ls --filter name=proxy-net --format '{{.Name}}|{{.Driver}}|{{.Scope}}'"
# Expected: proxy-net|overlay|swarm
- proxy-net exists with overlay driver and swarm scope.
Warning
If proxy-net is missing, create it before continuing:
ssh chester@10.0.0.211 \
"docker network create --driver overlay --attachable proxy-net"
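The check and the create step can be combined into one idempotent function, sketched below. It is meant to run on the manager itself, and `ensure_proxy_net` is a hypothetical name. It uses `docker network inspect`, which looks up the exact name, whereas the `name=` filter of `docker network ls` also matches partial names.

```shell
# Hypothetical helper, run on the Swarm manager: creates proxy-net only if
# no network with that exact name exists yet.
ensure_proxy_net() {
  if docker network inspect proxy-net >/dev/null 2>&1; then
    echo "proxy-net already exists"
  else
    docker network create --driver overlay --attachable proxy-net
  fi
}
```

The function does not verify the driver or scope of an existing network, so the `proxy-net|overlay|swarm` check above is still worth running.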
1.6 Persistent data paths
All bind-mount paths must exist on swarm-manager-1 before deploying.
The playbook fails safe (aborts before deploying) if any are missing.
ssh chester@10.0.0.211 "for d in \
/mnt/homelab/apps/authentik \
/mnt/homelab/apps/authentik/data \
/mnt/homelab/apps/authentik/data/database \
/mnt/homelab/apps/authentik/data/redis \
/mnt/homelab/apps/authentik/data/media \
/mnt/homelab/apps/authentik/data/config \
/mnt/homelab/apps/authentik/data/blueprints; do
[ -d \"\$d\" ] && echo \"OK \$d\" || echo \"MISSING \$d\"
done"
- All 7 paths return OK.
- If any path is MISSING, create or restore from backup before proceeding.
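When the check runs directly on swarm-manager-1, the same loop can be wrapped in a function that also returns a failing exit status, which makes it usable in scripts. This is a sketch under that assumption; `check_authentik_paths` is a hypothetical name and the directory list is copied from the command above.

```shell
# Hypothetical helper: prints OK/MISSING for each required directory under the
# given data root and returns non-zero if at least one is missing.
check_authentik_paths() {
  local root="$1" missing=0 d
  for d in "$root" "$root/data" "$root/data/database" "$root/data/redis" \
           "$root/data/media" "$root/data/config" "$root/data/blueprints"; do
    if [ -d "$d" ]; then
      echo "OK $d"
    else
      echo "MISSING $d"
      missing=$((missing + 1))
    fi
  done
  [ "$missing" -eq 0 ]
}
```

Usage on the manager: `check_authentik_paths /mnt/homelab/apps/authentik || echo "do not deploy yet"`.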
To create paths for a fresh install (no existing data to protect):
ssh chester@10.0.0.211 "sudo mkdir -p \
/mnt/homelab/apps/authentik/data/database \
/mnt/homelab/apps/authentik/data/redis \
/mnt/homelab/apps/authentik/data/media \
/mnt/homelab/apps/authentik/data/config \
/mnt/homelab/apps/authentik/data/blueprints"
Warning
Do not create missing paths if you are restoring an existing Authentik install. Restore from backup first to avoid initialising an empty database over pre-existing data.
Phase 2 — Validation-only run
Run the playbook in validation mode to confirm all asserts pass before changing anything on the cluster.
cd /home/chester/homelab/ansible
ansible-playbook \
-i inventory/hosts.ini \
playbooks/docker/deploy_authentik.yml \
-e "stack_validate_only=true" \
--vault-password-file .vault_pass
- Playbook completes with 0 failed tasks.
- Secrets assertion tasks pass (no FAILED on assert blocks).
- Swarm manager state assertion passes.
- Data path assertions pass for all 7 required directories.
Stop here if any assert fails. Diagnose using the Troubleshooting matrix below, then re-run validation before proceeding.
Phase 3 — Deployment execution
Run the standard deploy and capture all playbook output for the evidence record.
cd /home/chester/homelab/ansible
ansible-playbook \
-i inventory/hosts.ini \
playbooks/docker/deploy_authentik.yml \
--vault-password-file .vault_pass \
2>&1 | tee ../outputs/authentik_deploy_$(date +%Y%m%dT%H%M%S).log
- Playbook completes without FAILED tasks.
- Deployment result block is printed confirming stack name, manager, and URL.
- Log file is saved to outputs/ with a timestamp.
Expected deployment result output:
"Authentik deployment complete."
"Stack : authentik"
"Manager : swarm-manager-1 (10.0.0.211)"
"URL : https://sso.castaldifamily.com"
"Data root : /mnt/homelab/apps/authentik"
"Services : authentik-postgres, authentik-redis, authentik-server, authentik-worker"
Phase 4 — Service convergence and health
Verify that all four services are running, stable, and healthy.
4.1 Service replica status
ssh chester@10.0.0.211 \
"docker service ls --filter label=com.docker.stack.namespace=authentik"
Expected replica counts:
| Service | Expected |
|---|---|
| authentik_authentik-postgres | 1/1 |
| authentik_authentik-redis | 1/1 |
| authentik_authentik-server | 1/1 |
| authentik_authentik-worker | 1/1 |
- All four services show 1/1 replicas.
- No service shows 0/1 or a failure count.
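To spot stragglers quickly, the replica column can be filtered rather than read by eye. The sketch below assumes the input comes from `docker service ls` with `--format '{{.Name}} {{.Replicas}}'`, so each line is a service name followed by its replica count; `unconverged_services` is a hypothetical name.

```shell
# Hypothetical helper: reads "name replicas" lines on stdin and prints the
# name of every service whose replica count is not exactly 1/1.
unconverged_services() {
  awk '$2 != "1/1" { print $1 }'
}
```

Empty output means all listed services have converged. Any name printed is a service to inspect with `docker service ps`.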
4.2 Service placement
All four services must be pinned to swarm-manager-1.
ssh chester@10.0.0.211 \
"docker service ps authentik_authentik-server --filter desired-state=running --format '{{.Node}} {{.CurrentState}}'"
# Expected: swarm-manager-1 Running ...
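Repeat the command for authentik_authentik-worker, authentik_authentik-postgres, and authentik_authentik-redis.
- authentik-server task is running on swarm-manager-1.
- authentik-worker task is running on swarm-manager-1.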
4.3 Container health checks
# postgres health (pg_isready)
ssh chester@10.0.0.211 \
"docker ps --filter name=authentik_authentik-postgres --format '{{.Status}}'"
# Expected: Up ... (healthy)
# redis health (redis-cli ping)
ssh chester@10.0.0.211 \
"docker ps --filter name=authentik_authentik-redis --format '{{.Status}}'"
# Expected: Up ... (healthy)
- authentik-postgres container shows (healthy).
- authentik-redis container shows (healthy).
4.4 Critical startup log checks
# Check server startup for migration and database connectivity
ssh chester@10.0.0.211 \
"docker service logs authentik_authentik-server --since 10m --no-task-ids 2>&1 | tail -40"
# Check worker for job queue connectivity
ssh chester@10.0.0.211 \
"docker service logs authentik_authentik-worker --since 10m --no-task-ids 2>&1 | tail -40"
- No FATAL or ERROR messages relating to database connection in server logs.
- No FATAL or ERROR messages relating to Redis connection in server or worker logs.
- Database migration messages complete without errors.
- No repeated container restart events (no task started two or more times).
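The first two log checks can be approximated with a filter over the captured log text. This is a rough sketch: the keyword patterns are assumptions, not Authentik's actual log format, so treat a hit as a prompt to read the surrounding lines. `scan_startup_logs` is a hypothetical name.

```shell
# Hypothetical helper: reads log text on stdin, prints FATAL/ERROR lines that
# mention the database or Redis, and fails if any such line is found.
scan_startup_logs() {
  if grep -iE 'fatal|error' | grep -iE 'postgres|database|redis'; then
    return 1
  fi
  return 0
}
```

Pipe the output of either `docker service logs` command above into `scan_startup_logs`. A non-zero exit status means at least one suspicious line was printed.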
4.5 Resource limits in effect
| Service | Memory limit | CPU limit |
|---|---|---|
| authentik-postgres | 1 G | 0.75 |
| authentik-redis | 512 M | 0.50 |
| authentik-server | 2 G | 1.0 |
| authentik-worker | 1 G | 0.75 |
ssh chester@10.0.0.211 \
"docker service inspect authentik_authentik-server \
--format '{{.Spec.TaskTemplate.Resources.Limits.MemoryBytes}}'"
# Expected: 2147483648 (2 GiB)
- Resource limits are present and match the table above.
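Docker reports memory limits in bytes and CPU limits in NanoCPUs (`.Spec.TaskTemplate.Resources.Limits.NanoCPUs`), so the table values need converting before they can be compared. The helpers below are a sketch with hypothetical names. They assume the G and M suffixes in the table are binary units, which is consistent with the expected 2147483648 for the 2 G limit.

```shell
# Hypothetical helper: converts a limit such as "2G" or "512M" to bytes,
# treating the suffixes as binary units (1G = 1024 * 1024 * 1024 bytes).
limit_to_bytes() {
  case "$1" in
    *G) echo $(( ${1%G} * 1024 * 1024 * 1024 )) ;;
    *M) echo $(( ${1%M} * 1024 * 1024 )) ;;
    *)  echo "$1" ;;
  esac
}

# Hypothetical helper: converts a CPU limit such as "0.75" to NanoCPUs.
cpus_to_nanocpus() {
  awk -v c="$1" 'BEGIN { printf "%.0f\n", c * 1000000000 }'
}
```

For the server row, `limit_to_bytes 2G` prints 2147483648 and `cpus_to_nanocpus 1.0` prints 1000000000. Compare these against the `MemoryBytes` and `NanoCPUs` values from `docker service inspect`.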
Phase 5 — Ingress and functional verification
5.1 Traefik route registration
Traefik routes are published via traefik-kop. Verify the route is active before
testing the public endpoint.
# Check Traefik router for the authentik rule
curl -fsS http://10.0.0.151:8080/api/http/routers/authentik@docker \
| python3 -m json.tool | grep -E '"rule"|"status"'
# Expected: "rule": "Host(...sso.castaldifamily.com...)", "status": "enabled"
- Traefik router authentik@docker exists and is enabled.
- Router rule matches Host('sso.castaldifamily.com').
- Middlewares include security-headers@file and ratelimit-basic@file.
5.2 HTTPS endpoint reachability
# TLS handshake and HTTP 200/302 response
curl -fsS -o /dev/null -w "%{http_code} %{ssl_verify_result}" \
https://sso.castaldifamily.com
# Expected: 200 0 (or 302 0 for a redirect to login)
- curl returns HTTP 200 or 302.
- ssl_verify_result is 0 (certificate valid).
- Response is not a Traefik 404 or 502.
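The two-field curl output can be evaluated in a script with a small predicate. This sketch assumes the exact `"<http_code> <ssl_verify_result>"` string produced by the `-w` format above; `endpoint_ok` is a hypothetical name.

```shell
# Hypothetical helper: takes the "<http_code> <ssl_verify_result>" string and
# succeeds only for HTTP 200 or 302 with a verify result of 0.
endpoint_ok() {
  local code verify
  code="${1%% *}"     # first field: HTTP status code
  verify="${1##* }"   # last field: TLS verification result
  case "$code" in
    200|302) ;;
    *) return 1 ;;
  esac
  [ "$verify" = "0" ]
}
```

For example, `endpoint_ok "$(curl -sS -o /dev/null -w '%{http_code} %{ssl_verify_result}' https://sso.castaldifamily.com)"` returns success only when both conditions hold.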
5.3 Login page load
Open https://sso.castaldifamily.com in a browser.
- Authentik login page loads without JavaScript errors.
- Page title includes "authentik" or "Sign in".
- No TLS certificate warning from the browser.
5.4 Admin UI readiness (if initial deploy)
Navigate to https://sso.castaldifamily.com/if/flow/initial-setup/
- Initial setup flow is reachable on first-run bootstrap.
- Skip this step if the instance already existed; do not re-run initial setup on an existing install.
Phase 6 — Post-deploy handoff
6.1 Monitoring integration
Authentik is referenced as the SSO provider in group_vars/all.yml:
monitoring:
authentik_host: "https://sso.castaldifamily.com"
- Uptime Kuma has a monitor for https://sso.castaldifamily.com.
- Prometheus (or the health check system in use) alerts when the authentik_authentik-server replica count drops below 1.
6.2 Backup verification
- /mnt/homelab/apps/authentik/data/database is included in backup scope.
- A manual backup snapshot was taken before or immediately after deploy.
- Restore procedure is documented and tested (or explicitly deferred).
6.3 Secret rotation awareness
| Secret | Rotation procedure |
|---|---|
| vault_authentik_secret_key | Update vault → redeploy stack → running sessions are invalidated |
| vault_authentik_postgres_password | Update vault AND postgres user password → redeploy |
- Rotation procedure is known to the deployment owner.
6.4 Evidence capture
# Save service state snapshot
ssh chester@10.0.0.211 \
"docker service ls --filter label=com.docker.stack.namespace=authentik" \
> ../outputs/authentik_service_snapshot_$(date +%Y%m%dT%H%M%S).txt
- Deploy log saved to outputs/authentik_deploy_<timestamp>.log.
- Service state snapshot saved to outputs/authentik_service_snapshot_<timestamp>.txt.
- Deployment timestamp and verification timestamp recorded in this checklist.
6.5 Deployment sign-off
| Field | Value |
|---|---|
| Deployment owner | |
| Deployment timestamp | |
| Verification timestamp | |
| Endpoint verified | https://sso.castaldifamily.com |
| Final status | ☐ GREEN — all phases passed |
Rollback procedure
If deployment fails or causes instability, remove the stack and preserve data.
cd /home/chester/homelab/ansible
ansible-playbook \
-i inventory/hosts.ini \
playbooks/docker/deploy_authentik.yml \
-e "authentik_deploy_state=absent" \
--vault-password-file .vault_pass
Warning
authentik_deploy_state=absent removes the Swarm stack (containers, services, configs) but does not delete the bind-mount data directories. Data at /mnt/homelab/apps/authentik is preserved for re-deploy or restore.
- Stack removed cleanly (docker stack ls shows no authentik entry).
- Data directories still intact on swarm-manager-1.
- Root cause identified before re-deploying.
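The first rollback check can be scripted by matching the stack name exactly against the stack list. The sketch assumes the input is the output of `docker stack ls --format '{{.Name}}'`, one stack name per line; `stack_absent` is a hypothetical name.

```shell
# Hypothetical helper: reads stack names on stdin (one per line) and succeeds
# only if the given name is not among them (exact, whole-line match).
stack_absent() {
  ! grep -qxF "$1"
}
```

On the manager, `docker stack ls --format '{{.Name}}' | stack_absent authentik && echo "stack removed"` confirms the removal without matching similarly named stacks.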
Troubleshooting matrix
Validation assert fails: secrets not defined or placeholder
Symptom: Playbook fails on Assert vault_authentik_secret_key is defined or
Assert Authentik secrets are not placeholders.
Check (this prints the decrypted value to the terminal, so run it only in a private session):
ansible -i inventory/hosts.ini localhost \
-m ansible.builtin.debug \
-a "var=vault_authentik_secret_key" \
-e "@group_vars/all.yml" \
--vault-password-file .vault_pass
Fix: Encrypt and store the correct value:
ansible-vault encrypt_string 'YOUR-KEY' \
--name 'vault_authentik_secret_key' \
--vault-password-file .vault_pass
# Paste output into group_vars/vault/all.yml
Validation assert fails: data paths missing
Symptom: Playbook fails on Assert required Authentik paths exist before deploy.
Check:
ssh chester@10.0.0.211 "ls -la /mnt/homelab/apps/authentik/"
Fix (fresh install only):
ssh chester@10.0.0.211 "sudo mkdir -p \
/mnt/homelab/apps/authentik/data/{database,redis,media,config,blueprints}"
Fix (existing install): Restore from backup before creating directories.
Swarm assert fails: manager not active or not control plane
Symptom: Playbook fails on Assert target is an active Swarm manager.
Check:
ssh chester@10.0.0.211 "docker info --format '{{.Swarm.LocalNodeState}}|{{.Swarm.ControlAvailable}}'"
Fix: Investigate Swarm manager health. Do not proceed until a healthy quorum manager is the deploy target.
Services not converging to 1/1
Symptom: docker service ls shows 0/1 or a service cycles through restarts.
Check:
ssh chester@10.0.0.211 \
"docker service ps authentik_authentik-server --no-trunc"
Look for failure reasons in the Error column.
Common causes:
| Cause | Evidence in logs | Fix |
|---|---|---|
| Secret key mismatch | cryptography error or key invalid in server logs | Re-check vault value, redeploy |
| Postgres not healthy yet | connection refused in server logs | Wait for postgres (healthy), then check server |
| Redis not reachable | redis connection error in server or worker logs | Confirm authentik-redis is 1/1 healthy first |
| Missing bind-mount path | no such file or directory in container start | Create path, redeploy |
| Insufficient memory | OOM kill in docker service ps error column | Check node resources, adjust limits if needed |
Traefik route not registered or 502 response
Symptom: curl https://sso.castaldifamily.com returns 502 Bad Gateway or
connection refused.
Check:
# Confirm traefik-kop is running (Swarm stack)
ssh chester@10.0.0.211 \
"docker service ls --filter name=traefik-kop"
# Confirm the server task is running (it must be up before it can serve on port 9000)
ssh chester@10.0.0.211 \
"docker service ps authentik_authentik-server --filter desired-state=running"
Common causes:
- traefik-kop is not running → deploy monitoring stack first.
- authentik-server is not bound on port 9000 → check replica and restart.
- edge_routing.swarm.bind_ip is incorrect in group_vars/all.yml → verify it resolves to an active Swarm node.
- Cloudflare DNS is not pointing to 10.0.0.151 → verify DNS record for sso.castaldifamily.com.
Database migration errors on first boot
Symptom: Server logs show migration errors or relation does not exist.
Check:
ssh chester@10.0.0.211 \
"docker service logs authentik_authentik-server --since 5m 2>&1 | grep -i 'migrat\|error\|fatal'"
Fix: Migrations run automatically on startup. If they fail:
- Check postgres is (healthy) and accepting connections.
- Check vault_authentik_postgres_password in vault matches the running postgres password.
- Restart the server service to trigger a re-run:
ssh chester@10.0.0.211 \
"docker service update --force authentik_authentik-server"
Reference
| Resource | Location |
|---|---|
| Deploy playbook | ansible/playbooks/docker/deploy_authentik.yml |
| Stack template | ansible/templates/stacks/authentik.stack.yml |
| Shared variables | ansible/group_vars/all.yml |
| Vault secrets | ansible/group_vars/vault/all.yml |
| Authentik docs | https://goauthentik.io/docs |
| Authentik changelog | https://github.com/goauthentik/authentik/releases |
| Swarm cluster baseline | outputs/swarm_audit_20260314T122134.md |