# Authentik deployment checklist

## Purpose

This runbook is the operator path for deploying, verifying, and handing off
Authentik as the homelab identity provider. It covers:

- Preflight checks: secrets, Swarm state, storage, and network readiness.
- Deployment execution using the canonical Ansible playbook.
- Service convergence and health verification.
- Ingress and functional smoke tests against the live endpoint.
- Post-deploy hardening, evidence capture, and rollback guidance.
- Day-1 troubleshooting for common failure modes.

## Scope

- **Stack name:** `authentik`
- **Canonical playbook:** `ansible/playbooks/docker/deploy_authentik.yml`
- **Stack template:** `ansible/templates/stacks/authentik.stack.yml`
- **Target manager:** `swarm-manager-1` (`10.0.0.211`)
- **Public URL:** `https://sso.castaldifamily.com`
- **Data root:** `/mnt/homelab/apps/authentik`
- **Services deployed:** `authentik-postgres`, `authentik-redis`, `authentik-server`, `authentik-worker`

> [!IMPORTANT]
> This stack uses **absolute bind mounts**. The deploy playbook requires all data
> directories to exist before deployment. If any path is missing, the preflight
> asserts will fail-safe and abort rather than bootstrap an empty installation
> over existing data.

---

## Deployment flow

```mermaid
flowchart LR
    preflight[Phase 1 — Preflight] --> validation[Phase 2 — Validation run]
    validation --> deploy[Phase 3 — Deploy]
    deploy --> convergence[Phase 4 — Convergence]
    convergence --> ingress[Phase 5 — Ingress checks]
    ingress --> handoff[Phase 6 — Handoff]

    classDef phase fill:#dbeafe,stroke:#3b82f6;
    class preflight,validation,deploy,convergence,ingress,handoff phase
```

---

## Phase 1 — Preflight checklist

Complete all items in this phase before running any playbook command.

### 1.1 Change window and ownership

- [ ] Deployment owner is assigned.
- [ ] Rollback owner is assigned.
- [ ] Maintenance window is confirmed.
- [ ] No active cluster incidents in the latest Swarm audit (`outputs/swarm_audit_*.md`).

### 1.2 Control node readiness

Run from the `ansible/` directory with the virtual environment active.

```bash
# Confirm Python environment
source /home/chester/homelab/.venv/bin/activate

# Confirm Ansible version (must be >= 2.18.0)
ansible --version

# Confirm SSH access to all Swarm managers
ansible swarm_managers -i inventory/hosts.ini -m ping
```

- [ ] Ansible version is `2.18.0` or higher.
- [ ] All Swarm managers return `pong`.
- [ ] Vault password is available (`.vault_pass` file present or `ANSIBLE_VAULT_PASSWORD_FILE` set).

### 1.3 Secrets readiness

The deploy playbook asserts that `vault_authentik_secret_key` and
`vault_authentik_postgres_password` are defined, non-empty, and not placeholder
strings. Verify them first. The command below prints only the length of the
value, so the secret itself is not echoed:

```bash
ansible -i inventory/hosts.ini localhost \
  -m ansible.builtin.debug \
  -a "msg={{ vault_authentik_secret_key | length }}" \
  -e "@group_vars/all.yml" \
  --vault-password-file .vault_pass
```

Repeat for `vault_authentik_postgres_password`.

- [ ] `vault_authentik_secret_key` decrypts to a non-empty, non-placeholder value.
- [ ] `vault_authentik_postgres_password` decrypts to a non-empty, non-placeholder value.
- [ ] Neither value is any of: `change-me`, `changeme`, `your-random-secret`, `your-db-password`.

### 1.4 Swarm cluster state

```bash
# Confirm target manager is active and is control-plane
ssh chester@10.0.0.211 \
  "docker info --format '{{.Swarm.LocalNodeState}}|{{.Swarm.ControlAvailable}}'"
# Expected output: active|true

# Confirm all managers are active.
# Ad-hoc module arguments are templated by Jinja2, so the Go template braces
# are wrapped in {% raw %} to pass them through to Docker unchanged.
ansible swarm_managers -i inventory/hosts.ini \
  -m ansible.builtin.command \
  -a "docker info --format '{% raw %}{{.Swarm.LocalNodeState}}{% endraw %}'"
```

- [ ] `swarm-manager-1` returns `active|true`.
- [ ] All three managers return `active`.
- [ ] No node shows `inactive`, `pending`, or `error`.

### 1.5 External overlay network

Authentik requires `proxy-net` to exist before stack deploy.

```bash
ssh chester@10.0.0.211 \
  "docker network ls --filter name=proxy-net --format '{{.Name}}|{{.Driver}}|{{.Scope}}'"
# Expected: proxy-net|overlay|swarm
```

- [ ] `proxy-net` exists with `overlay` driver and `swarm` scope.

> [!WARNING]
> If `proxy-net` is missing, create it before continuing:
> ```bash
> ssh chester@10.0.0.211 \
>   "docker network create --driver overlay --attachable proxy-net"
> ```

### 1.6 Persistent data paths

All bind-mount paths must exist on `swarm-manager-1` **before** deploying. The
playbook will fail-safe if any are missing.

```bash
ssh chester@10.0.0.211 "for d in \
  /mnt/homelab/apps/authentik \
  /mnt/homelab/apps/authentik/data \
  /mnt/homelab/apps/authentik/data/database \
  /mnt/homelab/apps/authentik/data/redis \
  /mnt/homelab/apps/authentik/data/media \
  /mnt/homelab/apps/authentik/data/config \
  /mnt/homelab/apps/authentik/data/blueprints; do
    [ -d \"\$d\" ] && echo \"OK \$d\" || echo \"MISSING \$d\"
  done"
```

- [ ] All 7 paths return `OK`.
- [ ] If any path is `MISSING`, create or restore from backup before proceeding.

To create paths for a **fresh install** (no existing data to protect):

```bash
ssh chester@10.0.0.211 "sudo mkdir -p \
  /mnt/homelab/apps/authentik/data/database \
  /mnt/homelab/apps/authentik/data/redis \
  /mnt/homelab/apps/authentik/data/media \
  /mnt/homelab/apps/authentik/data/config \
  /mnt/homelab/apps/authentik/data/blueprints"
```

> [!WARNING]
> Do not create missing paths if you are restoring an existing Authentik install.
> Restore from backup first to avoid initialising an empty database over
> pre-existing data.

---

## Phase 2 — Validation-only run

Run the playbook in validation mode to confirm all asserts pass before changing
anything on the cluster.

```bash
cd /home/chester/homelab/ansible

ansible-playbook \
  -i inventory/hosts.ini \
  playbooks/docker/deploy_authentik.yml \
  -e "stack_validate_only=true" \
  --vault-password-file .vault_pass
```

- [ ] Playbook completes with `0` failed tasks.
- [ ] Secrets assertion tasks pass (no `FAILED` on assert blocks).
- [ ] Swarm manager state assertion passes.
- [ ] Data path assertions pass for all 7 required directories.

**Stop here if any assert fails.** Diagnose using the
[Troubleshooting matrix](#troubleshooting-matrix) below, then re-run validation
before proceeding.

---

## Phase 3 — Deployment execution

Run the standard deploy. All playbook output should be captured for the evidence
record.

```bash
cd /home/chester/homelab/ansible

ansible-playbook \
  -i inventory/hosts.ini \
  playbooks/docker/deploy_authentik.yml \
  --vault-password-file .vault_pass \
  2>&1 | tee ../outputs/authentik_deploy_$(date +%Y%m%dT%H%M%S).log
```

- [ ] Playbook completes without `FAILED` tasks.
- [ ] Deployment result block is printed confirming stack name, manager, and URL.
- [ ] Log file is saved to `outputs/` with a timestamp.

**Expected deployment result output:**

```
"Authentik deployment complete."
"Stack : authentik"
"Manager : swarm-manager-1 (10.0.0.211)"
"URL : https://sso.castaldifamily.com"
"Data root : /mnt/homelab/apps/authentik"
"Services : authentik-postgres, authentik-redis, authentik-server, authentik-worker"
```

---

## Phase 4 — Service convergence and health

Verify that all four services are running, stable, and healthy.

### 4.1 Service replica status

```bash
ssh chester@10.0.0.211 \
  "docker service ls --filter label=com.docker.stack.namespace=authentik"
```

Expected replica counts:

| Service | Expected |
| :--- | :---: |
| `authentik_authentik-postgres` | `1/1` |
| `authentik_authentik-redis` | `1/1` |
| `authentik_authentik-server` | `1/1` |
| `authentik_authentik-worker` | `1/1` |

- [ ] All four services show `1/1` replicas.
- [ ] No service shows `0/1` or a failure count.

### 4.2 Service placement

All four services must be pinned to `swarm-manager-1`.

```bash
ssh chester@10.0.0.211 \
  "docker service ps authentik_authentik-server --filter desired-state=running --format '{{.Node}} {{.CurrentState}}'"
# Expected: swarm-manager-1 Running ...
```

- [ ] `authentik-server` task is running on `swarm-manager-1`.
- [ ] `authentik-worker` task is running on `swarm-manager-1`.

### 4.3 Container health checks

```bash
# postgres health (pg_isready)
ssh chester@10.0.0.211 \
  "docker ps --filter name=authentik_authentik-postgres --format '{{.Status}}'"
# Expected: Up ... (healthy)

# redis health (redis-cli ping)
ssh chester@10.0.0.211 \
  "docker ps --filter name=authentik_authentik-redis --format '{{.Status}}'"
# Expected: Up ... (healthy)
```

- [ ] `authentik-postgres` container shows `(healthy)`.
- [ ] `authentik-redis` container shows `(healthy)`.

### 4.4 Critical startup log checks

```bash
# Check server startup for migration and database connectivity
ssh chester@10.0.0.211 \
  "docker service logs authentik_authentik-server --since 10m --no-task-ids 2>&1 | tail -40"

# Check worker for job queue connectivity
ssh chester@10.0.0.211 \
  "docker service logs authentik_authentik-worker --since 10m --no-task-ids 2>&1 | tail -40"
```

- [ ] No `FATAL` or `ERROR` messages relating to database connection in server logs.
- [ ] No `FATAL` or `ERROR` messages relating to Redis connection in server or worker logs.
- [ ] Database migration messages complete without errors.
- [ ] No repeated container restart events (no `started 2+ times`).

### 4.5 Resource limits in effect

| Service | Memory limit | CPU limit |
| :--- | :---: | :---: |
| `authentik-postgres` | 1 G | 0.75 |
| `authentik-redis` | 512 M | 0.50 |
| `authentik-server` | 2 G | 1.0 |
| `authentik-worker` | 1 G | 0.75 |

```bash
ssh chester@10.0.0.211 \
  "docker service inspect authentik_authentik-server \
   --format '{{.Spec.TaskTemplate.Resources.Limits.MemoryBytes}}'"
# Expected: 2147483648 (2 GB)
```

- [ ] Resource limits are present and match the table above.

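
The replica check in 4.1 can be reduced to a single pass/fail result. The following is a minimal sketch and is not part of the canonical playbook: `check_replicas` is a hypothetical helper name, and it assumes input lines of the form `<name> <replicas>`, as produced by `docker service ls --format '{{.Name}} {{.Replicas}}'`.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the repository).
# Reads "<service-name> <replicas>" lines on stdin and returns non-zero if any
# of the four expected Authentik services is missing or not at 1/1.
check_replicas() {
  local input rc svc replicas
  input="$(cat)"
  rc=0
  for svc in authentik-postgres authentik-redis authentik-server authentik-worker; do
    # Pick the replica column for this exact service name.
    replicas="$(printf '%s\n' "$input" \
      | awk -v n="authentik_${svc}" '$1 == n { print $2 }')"
    if [ "$replicas" = "1/1" ]; then
      echo "OK authentik_${svc} ${replicas}"
    else
      echo "FAIL authentik_${svc} ${replicas:-missing}"
      rc=1
    fi
  done
  return "$rc"
}

# Usage (assumes the same SSH access as the commands in this phase):
# ssh chester@10.0.0.211 \
#   "docker service ls --filter label=com.docker.stack.namespace=authentik \
#    --format '{{.Name}} {{.Replicas}}'" | check_replicas
```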
---

## Phase 5 — Ingress and functional verification

### 5.1 Traefik route registration

Traefik routes are published via `traefik-kop`. Verify the route is active before
testing the public endpoint.

```bash
# Check Traefik router for the authentik rule
curl -fsS http://10.0.0.151:8080/api/http/routers/authentik@docker \
  | python3 -m json.tool | grep -E '"rule"|"status"'
# Expected: "rule": "Host(...sso.castaldifamily.com...)", "status": "enabled"
```

- [ ] Traefik router `authentik@docker` exists and is `enabled`.
- [ ] Router rule matches `Host('sso.castaldifamily.com')`.
- [ ] Middlewares include `security-headers@file` and `ratelimit-basic@file`.

### 5.2 HTTPS endpoint reachability

```bash
# TLS handshake and HTTP 200/302 response
curl -fsS -o /dev/null -w "%{http_code} %{ssl_verify_result}" \
  https://sso.castaldifamily.com
# Expected: 200 0 (or 302 0 for a redirect to login)
```

- [ ] curl returns HTTP `200` or `302`.
- [ ] `ssl_verify_result` is `0` (certificate valid).
- [ ] Response is not a Traefik 404 or 502.

### 5.3 Login page load

Open `https://sso.castaldifamily.com` in a browser.

- [ ] Authentik login page loads without JavaScript errors.
- [ ] Page title includes "authentik" or "Sign in".
- [ ] No TLS certificate warning from the browser.

### 5.4 Admin UI readiness (if initial deploy)

Navigate to `https://sso.castaldifamily.com/if/flow/initial-setup/`

- [ ] Initial setup flow is reachable on first-run bootstrap.
- [ ] Skip this step if the instance already existed; do not re-run initial setup on an existing install.

---

## Phase 6 — Post-deploy handoff

### 6.1 Monitoring integration

Authentik is referenced as the SSO provider in `group_vars/all.yml`:

```yaml
monitoring:
  authentik_host: "https://sso.castaldifamily.com"
```

- [ ] Uptime Kuma has a monitor for `https://sso.castaldifamily.com`.
- [ ] Prometheus or health check system is alerting on `authentik_authentik-server` replica count dropping below 1.

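
A scripted health check needs a pass/fail signal rather than raw `curl` output. As a minimal sketch, assuming the same `%{http_code} %{ssl_verify_result}` format string used in 5.2, the hypothetical helper `check_endpoint_result` below applies the 5.2 acceptance criteria to that string. It is an illustration only and is not wired into Uptime Kuma or Prometheus.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the repository).
# Takes one argument of the form "<http_code> <ssl_verify_result>" and applies
# the 5.2 criteria: certificate verification result 0 and HTTP 200 or 302.
check_endpoint_result() {
  local http_code ssl_result
  http_code="${1%% *}"   # text before the first space
  ssl_result="${1##* }"  # text after the last space

  if [ "$ssl_result" != "0" ]; then
    echo "FAIL ssl_verify_result=${ssl_result}"
    return 1
  fi
  case "$http_code" in
    200|302)
      echo "OK http_code=${http_code}"
      return 0
      ;;
    *)
      echo "FAIL http_code=${http_code}"
      return 1
      ;;
  esac
}

# Usage (-f is omitted so that 4xx/5xx codes are reported by the helper
# instead of aborting curl):
# check_endpoint_result "$(curl -sS -o /dev/null \
#   -w '%{http_code} %{ssl_verify_result}' https://sso.castaldifamily.com)"
```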
### 6.2 Backup verification

- [ ] `/mnt/homelab/apps/authentik/data/database` is included in backup scope.
- [ ] A manual backup snapshot was taken before or immediately after deploy.
- [ ] Restore procedure is documented and tested (or explicitly deferred).

### 6.3 Secret rotation awareness

| Secret | Rotation procedure |
| :--- | :--- |
| `vault_authentik_secret_key` | Update vault → redeploy stack → running sessions are invalidated |
| `vault_authentik_postgres_password` | Update vault AND postgres user password → redeploy |

- [ ] Rotation procedure is known to the deployment owner.

### 6.4 Evidence capture

```bash
# Save service state snapshot
ssh chester@10.0.0.211 \
  "docker service ls --filter label=com.docker.stack.namespace=authentik" \
  > ../outputs/authentik_service_snapshot_$(date +%Y%m%dT%H%M%S).txt
```

- [ ] Deploy log saved to `outputs/authentik_deploy_.log`.
- [ ] Service state snapshot saved to `outputs/authentik_service_snapshot_.txt`.
- [ ] Deployment timestamp and verification timestamp recorded in this checklist.

### 6.5 Deployment sign-off

| Field | Value |
| :--- | :--- |
| Deployment owner | |
| Deployment timestamp | |
| Verification timestamp | |
| Endpoint verified | `https://sso.castaldifamily.com` |
| Final status | ☐ GREEN — all phases passed |

---

## Rollback procedure

If deployment fails or causes instability, remove the stack and preserve data.

```bash
cd /home/chester/homelab/ansible

ansible-playbook \
  -i inventory/hosts.ini \
  playbooks/docker/deploy_authentik.yml \
  -e "authentik_deploy_state=absent" \
  --vault-password-file .vault_pass
```

> [!WARNING]
> `authentik_deploy_state=absent` removes the **Swarm stack** (containers,
> services, configs) but does **not** delete the bind-mount data directories.
> Data at `/mnt/homelab/apps/authentik` is preserved for re-deploy or restore.

- [ ] Stack removed cleanly (`docker stack ls` shows no `authentik` entry).
- [ ] Data directories still intact on `swarm-manager-1`.
- [ ] Root cause identified before re-deploying.

---

## Troubleshooting matrix

### Validation assert fails: secrets not defined or placeholder

**Symptom:** Playbook fails on `Assert vault_authentik_secret_key is defined` or
`Assert Authentik secrets are not placeholders`.

**Check:**

```bash
ansible -i inventory/hosts.ini localhost \
  -m ansible.builtin.debug \
  -a "var=vault_authentik_secret_key" \
  -e "@group_vars/all.yml" \
  --vault-password-file .vault_pass
```

**Fix:** Encrypt and store the correct value:

```bash
ansible-vault encrypt_string 'YOUR-KEY' \
  --name 'vault_authentik_secret_key' \
  --vault-password-file .vault_pass
# Paste output into group_vars/vault/all.yml
```

---

### Validation assert fails: data paths missing

**Symptom:** Playbook fails on `Assert required Authentik paths exist before deploy`.

**Check:**

```bash
ssh chester@10.0.0.211 "ls -la /mnt/homelab/apps/authentik/"
```

**Fix (fresh install only):**

```bash
ssh chester@10.0.0.211 "sudo mkdir -p \
  /mnt/homelab/apps/authentik/data/{database,redis,media,config,blueprints}"
```

**Fix (existing install):** Restore from backup before creating directories.

---

### Swarm assert fails: manager not active or not control plane

**Symptom:** Playbook fails on `Assert target is an active Swarm manager`.

**Check:**

```bash
ssh chester@10.0.0.211 "docker info --format '{{.Swarm.LocalNodeState}}'"
```

**Fix:** Investigate Swarm manager health. Do not proceed until a healthy quorum
manager is the deploy target.

---

### Services not converging to 1/1

**Symptom:** `docker service ls` shows `0/1` or a service cycles through restarts.

**Check:**

```bash
ssh chester@10.0.0.211 \
  "docker service ps authentik_authentik-server --no-trunc"
```

Look for failure reasons in the `Error` column.

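
When the task history is long, it can help to print only the tasks that report an error. The following is a minimal sketch: `list_task_errors` is a hypothetical helper name, and it assumes pipe-delimited input produced with `--format '{{.Name}}|{{.CurrentState}}|{{.Error}}'`.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the repository).
# Reads "<task name>|<current state>|<error>" lines on stdin and prints only
# the tasks whose error field is non-empty, as tab-separated columns.
list_task_errors() {
  awk -F'|' '$3 != "" { printf "%s\t%s\t%s\n", $1, $2, $3 }'
}

# Usage (assumes the same SSH access as the check above):
# ssh chester@10.0.0.211 \
#   "docker service ps authentik_authentik-server --no-trunc \
#    --format '{{.Name}}|{{.CurrentState}}|{{.Error}}'" | list_task_errors
```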

**Common causes:**

| Cause | Evidence in logs | Fix |
| :--- | :--- | :--- |
| Secret key mismatch | `cryptography error` or `key invalid` in server logs | Re-check vault value, redeploy |
| Postgres not healthy yet | `connection refused` in server logs | Wait for postgres `(healthy)`, then check server |
| Redis not reachable | `redis connection error` in server or worker logs | Confirm `authentik-redis` is `1/1` healthy first |
| Missing bind-mount path | `no such file or directory` in container start | Create path, redeploy |
| Insufficient memory | OOM kill in `docker service ps` error column | Check node resources, adjust limits if needed |

---

### Traefik route not registered or 502 response

**Symptom:** `curl https://sso.castaldifamily.com` returns `502 Bad Gateway` or
connection refused.

**Check:**

```bash
# Confirm traefik-kop is running (Swarm stack)
ssh chester@10.0.0.211 \
  "docker service ls --filter name=traefik-kop"

# Check server is listening on port 9000
ssh chester@10.0.0.211 \
  "docker service ps authentik_authentik-server --filter desired-state=running"
```

**Common causes:**

- `traefik-kop` is not running → deploy monitoring stack first.
- `authentik-server` is not bound on port `9000` → check replica and restart.
- `edge_routing.swarm.bind_ip` is incorrect in `group_vars/all.yml` → verify it resolves to an active Swarm node.
- Cloudflare DNS is not pointing to `10.0.0.151` → verify DNS record for `sso.castaldifamily.com`.

---

### Database migration errors on first boot

**Symptom:** Server logs show migration errors or `relation does not exist`.

**Check:**

```bash
ssh chester@10.0.0.211 \
  "docker service logs authentik_authentik-server --since 5m 2>&1 | grep -i 'migrat\|error\|fatal'"
```

**Fix:** Migrations run automatically on startup. If they fail:

1. Check postgres is `(healthy)` and accepting connections.
2. Check `vault_authentik_postgres_password` in vault matches the running postgres password.
3. Restart the server service to trigger a re-run:

   ```bash
   ssh chester@10.0.0.211 \
     "docker service update --force authentik_authentik-server"
   ```

---

## Reference

| Resource | Location |
| :--- | :--- |
| Deploy playbook | `ansible/playbooks/docker/deploy_authentik.yml` |
| Stack template | `ansible/templates/stacks/authentik.stack.yml` |
| Shared variables | `ansible/group_vars/all.yml` |
| Vault secrets | `ansible/group_vars/vault/all.yml` |
| Authentik docs | |
| Authentik changelog | |
| Swarm cluster baseline | `outputs/swarm_audit_20260314T122134.md` |