# Watchtower monitoring onboarding and self-healing runbook

## Purpose

This runbook is the operator path for deploying, validating, and maintaining the full Watchtower monitoring stack.

It covers:

- Monitoring stack onboarding (all services).
- Integration points between services and external Traefik.
- Day-1 troubleshooting, including Authentik outpost restart loops.
- Self-healing execution with safe, repeatable reconciliation.

## Scope

The canonical Watchtower monitoring scope is:

- traefik-kop
- Prometheus
- Grafana
- Uptime Kuma
- node-exporter
- watchtower-cadvisor
- Dozzle
- Authentik outpost for Dozzle
- Loki
- Promtail
- blackbox-exporter

## Architecture summary

- External Traefik ingress runs on `10.0.0.151` and is not migrated into Swarm.
- Swarm exporters run on Swarm nodes.
- Watchtower hosts aggregation, storage, visualization, and logging services.
- Traefik labels are used for HTTPS-routed UIs (Grafana, Dozzle, Uptime Kuma).

## Prerequisites

1. Inventory groups are defined and reachable: `swarm_managers`, `swarm_workers`, `swarm_hosts`, and `watchtower`.
2. Docker is installed on all target nodes.
3. Overlay network `proxy-net` exists for Swarm workloads.
4. Vault file exists at `ansible/group_vars/vault/all.yml` or equivalent secrets are provided through secure environment variables.
5. Required secrets are present:
   - `vault_grafana_admin_password`
   - `vault_authentik_outpost_dozzle_token`

If Authentik token is not available yet, set `monitoring_enable_authentik_outpost=false` for bootstrap deployment and keep Dozzle private until token onboarding is complete.

> [!WARNING]
> Never hardcode tokens or passwords in compose files, playbooks, or helper scripts.
> Use Vault variables and rotate credentials if any plaintext secret was committed.

## Deployment order

1. Exporters on Swarm nodes (`node-exporter`, `cAdvisor`).
2. Dozzle agent on Swarm managers.
3. Watchtower stack (`traefik-kop`, Prometheus, Grafana, Uptime Kuma, Dozzle, Authentik outpost, Loki, Promtail).
4. Post-deploy verification and dashboard bootstrap.

## Deploy commands

Run from `ansible/`:

```bash
ansible-playbook -i inventory/hosts.ini playbooks/monitoring/deploy_swarm_monitoring.yml
```

Target only Swarm exporters:

```bash
ansible-playbook -i inventory/hosts.ini playbooks/monitoring/deploy_swarm_monitoring.yml --tags swarm
```

Target only Watchtower stack:

```bash
ansible-playbook -i inventory/hosts.ini playbooks/monitoring/deploy_swarm_monitoring.yml --tags watchtower
```

## Service-by-service onboarding checks

### traefik-kop

- Verify service starts and can reach Redis endpoint `10.0.0.151:6379`.
- Verify route updates are visible from external Traefik behavior.

### Prometheus

- Verify readiness endpoint:

  ```bash
  curl -fsS http://10.0.0.200:9091/-/ready
  ```

- Verify targets include expected managers, workers, and Watchtower node-exporter.

### Grafana

- Verify HTTPS route at configured domain.
- Confirm login with admin user and vault-provided password.
- Add data sources:
  - Prometheus: `http://prometheus:9090`
  - Loki: `http://loki:3100`

### Uptime Kuma

- Verify HTTPS route and UI load.
- Add core checks for:
  - External Traefik endpoint
  - Watchtower host health
  - Swarm manager API reachability

### node-exporter and cAdvisor

- Verify metrics endpoints are reachable from each node.
- Confirm Prometheus scrape status is `up` for all exporters.
- Verify local Watchtower cAdvisor endpoint:

  ```bash
  curl -fsS http://10.0.0.200:18080/metrics | head
  ```

### Dozzle and Authentik outpost

- Verify Dozzle HTTPS route.
- Verify Authentik outpost endpoint routing under `/outpost.goauthentik.io/`.
- Verify forward-auth middleware is attached and blocking unauthenticated access.

### Loki and Promtail

- Verify Loki API health via container logs and ingestion behavior.
- Verify Promtail discovers Docker logs and labels streams by project/service.
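
The plain HTTP checks in the subsections above can be swept in one pass instead of being run one `curl` at a time. The following is a minimal sketch, not part of the repository: `check_url` and `run_checks` are hypothetical helper names, and the 5-second timeout is an arbitrary choice.

```bash
#!/usr/bin/env bash
# Sketch only: check_url and run_checks are hypothetical helpers,
# not scripts that ship with this repository.

# Return 0 if curl gets a response with a status below 400 within 5 seconds.
check_url() {
  local url="$1"
  curl -fsS --max-time 5 -o /dev/null "$url"
}

# Check every URL passed as an argument, print one result line per URL,
# and return non-zero if any single check failed.
run_checks() {
  local failed=0 url
  for url in "$@"; do
    if check_url "$url"; then
      echo "OK   $url"
    else
      echo "FAIL $url"
      failed=1
    fi
  done
  return "$failed"
}
```

With the endpoints already named in this runbook, a run could look like `run_checks http://10.0.0.200:9091/-/ready http://10.0.0.200:18080/metrics`. Loki also serves a `/ready` endpoint, so `http://loki:3100/ready` could be appended when the sweep runs somewhere that can resolve the `loki` service name; whether port 3100 is published on the Watchtower host is an assumption this runbook does not confirm.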
### blackbox-exporter (network and endpoint probes)

- Verify Blackbox exporter is reachable:

  ```bash
  curl -fsS http://10.0.0.200:9115/metrics | head
  ```

- Verify Prometheus shows probe targets in `blackbox-probes` job.
- Add probe targets through `monitoring_probe_targets` in group vars.

## Day-1 troubleshooting

### Authentik outpost restart loop

1. Verify token presence in rendered `.env` for stack directory.
1. Confirm token matches active Authentik outpost token in Authentik admin.
1. Confirm Traefik middleware label references the same outpost service.
1. Check container logs:

   ```bash
   docker logs authentik-outpost-dozzle --tail 200
   ```

1. Reconcile stack after token correction:

   ```bash
   ansible-playbook -i inventory/hosts.ini playbooks/monitoring/deploy_swarm_monitoring.yml --tags watchtower
   ```

### Backlog item: Authentik token pending

1. Keep `monitoring_enable_authentik_outpost=false` while token is unavailable.
1. Do not expose Dozzle publicly without Authentik forward-auth.
1. Re-enable outpost after token handoff and re-run watchtower tag.

### Prometheus missing targets

1. Confirm inventory contains correct node IPs and groups.
2. Re-run deployment to re-render scrape config.
3. Query target API and inspect dropped targets.

### Blackbox probes failing

1. Confirm target is reachable from Watchtower network path.
1. Confirm probe module matches target protocol (`icmp`, `tcp_connect`, `http_2xx`).
1. Confirm Prometheus relabeling routes probes to `watchtower_ip:9115`.

### Dozzle cannot see remote logs

1. Confirm `dozzle-agent` service is healthy on manager nodes.
2. Confirm remote agent endpoints and ports are reachable.
3. Confirm Docker socket mount is present and read-only where expected.

## Self-healing model

Self-healing is implemented as scheduled reconciliation, not ad-hoc manual edits.
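
To illustrate what one scheduled reconciliation pass might look like, here is a minimal sketch of a wrapper that runs a reconcile command once, skips the pass if the previous one is still running, and appends the outcome to a log. It is not the repository's implementation: `reconcile_once`, `LOCK_DIR`, and `LOG_FILE` are hypothetical names, and the `/tmp` defaults are placeholders that a real deployment would move to a persistent path.

```bash
#!/usr/bin/env bash
# Sketch only: reconcile_once, LOCK_DIR and LOG_FILE are hypothetical names.
# The /tmp defaults are placeholders, not paths used by this repository.

LOCK_DIR="${LOCK_DIR:-/tmp/watchtower-self-heal.lock}"
LOG_FILE="${LOG_FILE:-/tmp/watchtower-self-heal.log}"

# Run the given command once and log its exit code.
# mkdir either creates the directory or fails atomically, so it serves
# as a simple lock that prevents two passes from overlapping.
reconcile_once() {
  if ! mkdir "$LOCK_DIR" 2>/dev/null; then
    echo "$(date -u +%FT%TZ) skipped: previous run still active" >>"$LOG_FILE"
    return 0
  fi
  local rc=0
  "$@" >>"$LOG_FILE" 2>&1 || rc=$?
  echo "$(date -u +%FT%TZ) finished rc=$rc" >>"$LOG_FILE"
  rmdir "$LOCK_DIR"
  return "$rc"
}
```

A scheduler could then call, from `ansible/`, `reconcile_once ansible-playbook -i inventory/hosts.ini playbooks/monitoring/deploy_swarm_monitoring.yml --tags watchtower`. One limitation of this sketch: if the process is killed mid-run, the lock directory is left behind and must be removed by hand before the next pass will execute.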
### Current helper script status

- `ansible/scripts/pi_pull_updates.sh` is retained as a helper and now expects configurable environment variables instead of embedded credentials.
- `ansible/scripts/pi_init.sh` is optional for operator bootstrap and is not required for monitoring stack reconciliation.

### Recommended execution pattern

1. Use `ansible-pull` to sync and apply `ansible/playbooks/self-heal/watchtower.yml`.
2. Run through a scheduler (prefer `systemd` timer for reliability and observability).
3. Keep logs in a persistent path and alert on repeated failures.

Example manual run:

```bash
REPO_URL=git@git.castaldifamily.com:nathan/homelab.git \
PLAYBOOK_PATH=ansible/playbooks/self-heal/watchtower.yml \
/home/chester/homelab/ansible/scripts/pi_pull_updates.sh
```

> [!IMPORTANT]
> If your repository is private, use SSH deploy keys or vault-backed secret injection.
> Do not place long-lived personal access tokens in script files.

## Idempotency and rollback

- Re-running deployment playbooks is expected and safe; desired state is reconciled.
- Keep stack definitions in Git and avoid manual edits in `/opt/stacks`.
- Rollback method:
  1. Revert the offending commit in Git.
  2. Re-run deployment playbook.
  3. Validate endpoints and target health.

## Operational safety rules

- Do not run services as root unless technically required and documented.
- Avoid broad host mounts unless required for telemetry collection.
- Keep exposed admin ports behind Traefik and authentication middleware.
- Validate health and auth behavior before declaring changes complete.
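
The last two rules can be spot-checked from the command line by confirming that an unauthenticated request to a protected UI is not served directly. The sketch below is illustrative only: `is_blocked_status` and `auth_is_enforced` are hypothetical helpers, and treating redirects, 401, and 403 as "blocked" is an assumption about how the forward-auth middleware responds in this setup.

```bash
#!/usr/bin/env bash
# Sketch only: is_blocked_status and auth_is_enforced are hypothetical helpers.

# Classify the status code returned to an unauthenticated request.
# Assumption: a redirect (to a login flow), 401, or 403 means the
# request was intercepted; any other code means it was not.
is_blocked_status() {
  case "$1" in
    301|302|303|307|308|401|403) return 0 ;;
    *) return 1 ;;
  esac
}

# Fetch only the status code, without credentials and without following
# redirects. Returns 0 if blocked, 1 if served, 2 if curl itself failed.
auth_is_enforced() {
  local url="$1" code
  code="$(curl -s -o /dev/null --max-time 5 -w '%{http_code}' "$url")" || return 2
  if is_blocked_status "$code"; then
    echo "blocked ($code): $url"
  else
    echo "NOT blocked ($code): $url"
    return 1
  fi
}
```

For example, `auth_is_enforced https://dozzle.example.com` would exercise the Dozzle route, where `dozzle.example.com` is a placeholder for the configured domain. A "NOT blocked" result with a 200 status suggests the middleware is not attached to that router.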