Watchtower monitoring onboarding and self-healing runbook
Purpose
This runbook is the operator path for deploying, validating, and maintaining the full Watchtower monitoring stack.
It covers:
- Monitoring stack onboarding (all services).
- Integration points between services and external Traefik.
- Day-1 troubleshooting, including Authentik outpost restart loops.
- Self-healing execution with safe, repeatable reconciliation.
Scope
The canonical Watchtower monitoring scope is:
- traefik-kop
- Prometheus
- Grafana
- Uptime Kuma
- node-exporter
- watchtower-cadvisor
- Dozzle
- Authentik outpost for Dozzle
- Loki
- Promtail
- blackbox-exporter
Architecture summary
- External Traefik ingress runs on 10.0.0.151 and is not migrated into Swarm.
- Swarm exporters run on Swarm nodes.
- Watchtower hosts aggregation, storage, visualization, and logging services.
- Traefik labels are used for HTTPS-routed UIs (Grafana, Dozzle, Uptime Kuma).
Prerequisites
- Inventory groups are defined and reachable: swarm_managers, swarm_workers, swarm_hosts, and watchtower.
- Docker is installed on all target nodes.
- Overlay network proxy-net exists for Swarm workloads.
- Vault file exists at ansible/group_vars/vault/all.yml or equivalent secrets are provided through secure environment variables.
- Required secrets are present:
  - vault_grafana_admin_password
  - vault_authentik_outpost_dozzle_token
If the Authentik token is not available yet, set monitoring_enable_authentik_outpost=false for the bootstrap deployment and keep Dozzle private until token onboarding is complete.
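The prerequisites above can be checked from the control node before the first deploy. The following is a minimal sketch, not part of the repository: the function names are hypothetical, it assumes you run it from ansible/ with the same inventory path used by the deploy commands, and it only covers the Vault-file variant of secret delivery.

```shell
# Preflight sketch. All function names are hypothetical helpers.
# Assumes the working directory is ansible/.

# Succeeds only if the vault file exists and is non-empty.
vault_file_present() {
  [ -s "${1:-group_vars/vault/all.yml}" ]
}

# Ansible ping module against every group this runbook depends on.
inventory_reachable() {
  ansible -i inventory/hosts.ini \
    'swarm_managers:swarm_workers:swarm_hosts:watchtower' -m ping
}

# Ask the Swarm managers whether the overlay network exists.
proxy_net_exists() {
  ansible -i inventory/hosts.ini swarm_managers \
    -m command -a 'docker network inspect proxy-net' >/dev/null
}

preflight() {
  vault_file_present || { echo "vault file missing or empty" >&2; return 1; }
  inventory_reachable || { echo "inventory hosts unreachable" >&2; return 1; }
  proxy_net_exists || { echo "overlay network proxy-net not found" >&2; return 1; }
  echo "preflight ok"
}
```

Run `preflight` and fix the first reported failure before deploying. The checks do not confirm that the individual secrets are defined inside the vault file.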
Warning
Never hardcode tokens or passwords in compose files, playbooks, or helper scripts. Use Vault variables and rotate credentials if any plaintext secret was committed.
Deployment order
- Exporters on Swarm nodes (node-exporter, cAdvisor).
- Dozzle agent on Swarm managers.
- Watchtower stack (traefik-kop, Prometheus, Grafana, Uptime Kuma, Dozzle, Authentik outpost, Loki, Promtail).
- Post-deploy verification and dashboard bootstrap.
Deploy commands
Run from ansible/:
ansible-playbook -i inventory/hosts.ini playbooks/monitoring/deploy_swarm_monitoring.yml
Target only Swarm exporters:
ansible-playbook -i inventory/hosts.ini playbooks/monitoring/deploy_swarm_monitoring.yml --tags swarm
Target only Watchtower stack:
ansible-playbook -i inventory/hosts.ini playbooks/monitoring/deploy_swarm_monitoring.yml --tags watchtower
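The three invocations above differ only in the tag, so a small wrapper can keep them consistent. This is a sketch under assumptions: `deploy_cmd` is a hypothetical helper, and the `no-outpost` option assumes the playbook reads monitoring_enable_authentik_outpost as a boolean. The variable is passed as JSON because a plain `key=false` extra var reaches Ansible as a string.

```shell
# Print the ansible-playbook command for a scope; nothing is executed.
# usage: deploy_cmd all|swarm|watchtower [no-outpost]
deploy_cmd() {
  cmd="ansible-playbook -i inventory/hosts.ini playbooks/monitoring/deploy_swarm_monitoring.yml"
  case "${1:-}" in
    all) ;;
    swarm|watchtower) cmd="$cmd --tags $1" ;;
    *) echo "usage: deploy_cmd all|swarm|watchtower [no-outpost]" >&2; return 2 ;;
  esac
  # Optional bootstrap mode while the Authentik token is pending.
  if [ "${2:-}" = "no-outpost" ]; then
    cmd="$cmd -e '{\"monitoring_enable_authentik_outpost\": false}'"
  fi
  printf '%s\n' "$cmd"
}
```

Review the printed command, then run it from ansible/, for example with `eval "$(deploy_cmd watchtower no-outpost)"`.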
Service-by-service onboarding checks
traefik-kop
- Verify the service starts and can reach the Redis endpoint 10.0.0.151:6379.
- Verify that route updates are reflected in external Traefik's routing behavior.
Prometheus
- Verify readiness endpoint:
curl -fsS http://10.0.0.200:9091/-/ready
- Verify targets include expected managers, workers, and Watchtower node-exporter.
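Right after a deploy, Prometheus may need a few seconds before the readiness endpoint above answers. A minimal sketch of a bounded retry follows; `retry` and `wait_for_prometheus` are hypothetical helper names, and the attempt count and delay are arbitrary.

```shell
# Retry a command until it succeeds or the attempts run out.
# usage: retry <attempts> <delay_seconds> <command...>
retry() {
  attempts=$1
  delay=$2
  shift 2
  n=1
  while ! "$@"; do
    if [ "$n" -ge "$attempts" ]; then
      echo "gave up after $n attempts: $*" >&2
      return 1
    fi
    n=$((n + 1))
    sleep "$delay"
  done
}

# Poll the readiness endpoint used in the check above.
wait_for_prometheus() {
  retry 10 3 curl -fsS -o /dev/null http://10.0.0.200:9091/-/ready
}
```

The same `retry` wrapper works for the other curl checks in this section.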
Grafana
- Verify HTTPS route at configured domain.
- Confirm login with admin user and vault-provided password.
- Add data sources:
  - Prometheus: http://prometheus:9090
  - Loki: http://loki:3100
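If you prefer not to click through the UI, the two data sources can also be created through Grafana's HTTP API (POST /api/datasources). This is a sketch, not the stack's provisioning method: GRAFANA_URL and GRAFANA_ADMIN_PASSWORD are hypothetical environment variables, and the payload builder does no JSON escaping, so it only suits simple names and URLs like the ones above.

```shell
# Build the JSON body for POST /api/datasources.
# usage: datasource_payload <name> <type> <url>
datasource_payload() {
  printf '{"name":"%s","type":"%s","url":"%s","access":"proxy"}\n' "$1" "$2" "$3"
}

# Create one data source using basic auth as the admin user.
# Export GRAFANA_ADMIN_PASSWORD from Vault; never write it into this file.
add_datasource() {
  datasource_payload "$1" "$2" "$3" |
    curl -fsS -u "admin:${GRAFANA_ADMIN_PASSWORD:?set from Vault}" \
      -H 'Content-Type: application/json' \
      -X POST --data @- \
      "${GRAFANA_URL:?set to the Grafana base URL}/api/datasources"
}

# Example:
# add_datasource Prometheus prometheus http://prometheus:9090
# add_datasource Loki loki http://loki:3100
```

Passing the password with `-u` makes it visible in the local process list while curl runs, so use this only on a trusted operator host.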
Uptime Kuma
- Verify HTTPS route and UI load.
- Add core checks for:
  - External Traefik endpoint
  - Watchtower host health
  - Swarm manager API reachability
node-exporter and cAdvisor
- Verify metrics endpoints are reachable from each node.
- Confirm Prometheus scrape status is up for all exporters.
- Verify local Watchtower cAdvisor endpoint:
curl -fsS http://10.0.0.200:18080/metrics | head
Dozzle and Authentik outpost
- Verify Dozzle HTTPS route.
- Verify Authentik outpost endpoint routing under /outpost.goauthentik.io/.
- Verify forward-auth middleware is attached and blocking unauthenticated access.
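One way to confirm that unauthenticated access is blocked is to request the Dozzle route without credentials and look at the status code. The sketch below assumes a redirect or a 401/403 means the middleware intercepted the request and a 2xx means Dozzle answered directly. DOZZLE_URL and both function names are hypothetical.

```shell
# Classify the HTTP status returned to a request that carries no credentials.
classify_unauthenticated_status() {
  case "$1" in
    301|302|303|307|308|401|403) echo "blocked" ;;
    2??) echo "exposed" ;;
    *) echo "unknown" ;;
  esac
}

# Fetch only the status code; redirects are not followed.
check_dozzle_auth() {
  code=$(curl -s -o /dev/null -w '%{http_code}' \
    "${DOZZLE_URL:?set to the Dozzle HTTPS URL}")
  echo "$code $(classify_unauthenticated_status "$code")"
}
```

An `exposed` result means the route should be taken offline until the middleware label is fixed.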
Loki and Promtail
- Verify Loki API health via container logs and ingestion behavior.
- Verify Promtail discovers Docker logs and labels streams by project/service.
blackbox-exporter (network and endpoint probes)
- Verify Blackbox exporter is reachable:
curl -fsS http://10.0.0.200:9115/metrics | head
- Verify Prometheus shows probe targets in blackbox-probes job.
- Add probe targets through monitoring_probe_targets in group vars.
Day-1 troubleshooting
Authentik outpost restart loop
- Verify the token is present in the rendered .env in the stack directory.
- Confirm the token matches the active Authentik outpost token in Authentik admin.
- Confirm Traefik middleware label references the same outpost service.
- Check container logs:
docker logs authentik-outpost-dozzle --tail 200
- Reconcile stack after token correction:
ansible-playbook -i inventory/hosts.ini playbooks/monitoring/deploy_swarm_monitoring.yml --tags watchtower
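The first and fourth checks above can be scripted without printing the secret. This is a minimal sketch: the helper names are hypothetical, and both the .env location and the key name (AUTHENTIK_TOKEN is the variable the outpost container reads) are assumptions about how the stack is rendered.

```shell
# Succeeds if KEY=<non-empty value> appears in the env file.
# The value itself is never printed.
# usage: env_key_is_set <file> <KEY>
env_key_is_set() {
  grep -Eq "^$2=[\"']?[^\"'[:space:]]" "$1"
}

# Number of times Docker has restarted the outpost container.
outpost_restart_count() {
  docker inspect -f '{{.RestartCount}}' authentik-outpost-dozzle
}

# Example (path and key name are assumptions):
# env_key_is_set /opt/stacks/<stack>/.env AUTHENTIK_TOKEN || echo "token missing"
# outpost_restart_count
```

A restart count that keeps climbing after the token is corrected points at something other than the token, so return to the container logs.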
Backlog item: Authentik token pending
- Keep monitoring_enable_authentik_outpost=false while token is unavailable.
- Do not expose Dozzle publicly without Authentik forward-auth.
- Re-enable outpost after token handoff and re-run watchtower tag.
Prometheus missing targets
- Confirm inventory contains correct node IPs and groups.
- Re-run deployment to re-render scrape config.
- Query target API and inspect dropped targets.
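The last step above can be done with curl and jq against the Prometheus targets API. A sketch, assuming jq is installed on the operator host and Prometheus is reachable on the address used elsewhere in this runbook. The function names are hypothetical.

```shell
# Read a /api/v1/targets response on stdin and print "job<TAB>address"
# for every target that relabeling dropped.
list_dropped_targets() {
  jq -r '.data.droppedTargets[] | .discoveredLabels | "\(.job)\t\(.__address__)"'
}

# Print "job<TAB>instance<TAB>lastError" for active targets that are not up.
list_unhealthy_targets() {
  jq -r '.data.activeTargets[] | select(.health != "up") | "\(.labels.job)\t\(.labels.instance)\t\(.lastError)"'
}

# Example:
# curl -fsS 'http://10.0.0.200:9091/api/v1/targets?state=dropped' | list_dropped_targets
# curl -fsS 'http://10.0.0.200:9091/api/v1/targets?state=active' | list_unhealthy_targets
```

A node that appears in neither list was never rendered into the scrape config, which points back at the inventory.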
Blackbox probes failing
- Confirm target is reachable from Watchtower network path.
- Confirm probe module matches target protocol (icmp, tcp_connect, http_2xx).
- Confirm Prometheus relabeling routes probes to watchtower_ip:9115.
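To separate exporter problems from Prometheus relabeling problems, a probe can be run directly against blackbox-exporter's /probe endpoint. A minimal sketch follows, with hypothetical helper names. It assumes the exporter address used in the onboarding check and that the module you pass exists in the exporter config.

```shell
# Succeeds if probe output on stdin reports probe_success 1.
probe_succeeded() {
  grep -Eq '^probe_success 1(\.0+)?$'
}

# Run one probe against the exporter, bypassing Prometheus.
# usage: probe <module> <target>
probe() {
  curl -fsS -G 'http://10.0.0.200:9115/probe' \
    --data-urlencode "module=$1" \
    --data-urlencode "target=$2"
}

# Example (target is a placeholder):
# probe http_2xx https://example.org | probe_succeeded && echo "probe ok"
```

If the direct probe succeeds but Prometheus still shows a failure, the relabeling step is the likely culprit.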
Dozzle cannot see remote logs
- Confirm dozzle-agent service is healthy on manager nodes.
- Confirm remote agent endpoints and ports are reachable.
- Confirm Docker socket mount is present and read-only where expected.
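The socket-mount check can be read from `docker inspect` on the node that runs the agent task. A sketch with hypothetical helper names. It assumes the socket is bind-mounted from /var/run/docker.sock and that read-only is the expected mode for this container.

```shell
# Print each mount of a container as "<source> <rw>", one per line.
# usage: container_mounts <container-id-or-name>
container_mounts() {
  docker inspect -f '{{range .Mounts}}{{println .Source .RW}}{{end}}' "$1"
}

# Read "<source> <rw>" lines on stdin; succeed only if the Docker socket
# is mounted and not writable.
socket_mounted_read_only() {
  grep -qxF '/var/run/docker.sock false'
}

# Example (find the container ID with `docker ps` on the manager node):
# container_mounts <container-id> | socket_mounted_read_only || echo "socket missing or writable"
```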
Self-healing model
Self-healing is implemented as scheduled reconciliation, not ad-hoc manual edits.
Current helper script status
- ansible/scripts/pi_pull_updates.sh is retained as a helper and now expects configurable environment variables instead of embedded credentials.
- ansible/scripts/pi_init.sh is optional for operator bootstrap and is not required for monitoring stack reconciliation.
Recommended execution pattern
- Use ansible-pull to sync and apply ansible/playbooks/self-heal/watchtower.yml.
- Run through a scheduler (prefer systemd timer for reliability and observability).
- Keep logs in a persistent path and alert on repeated failures.
Example manual run:
REPO_URL=git@git.castaldifamily.com:nathan/homelab.git \
PLAYBOOK_PATH=ansible/playbooks/self-heal/watchtower.yml \
/home/chester/homelab/ansible/scripts/pi_pull_updates.sh
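To run the same command on a schedule, the systemd timer recommendation can be sketched as a pair of units. Everything here is an assumption except the script path and the two environment variables taken from the manual run: the unit names, the hourly cadence, and the jitter are placeholders. The functions only print unit text, and installation is shown in comments.

```shell
# Print a oneshot service unit that runs the helper script.
render_service_unit() {
  cat <<'EOF'
[Unit]
Description=Watchtower self-heal reconciliation (sketch)
Wants=network-online.target
After=network-online.target

[Service]
Type=oneshot
Environment=REPO_URL=git@git.castaldifamily.com:nathan/homelab.git
Environment=PLAYBOOK_PATH=ansible/playbooks/self-heal/watchtower.yml
ExecStart=/home/chester/homelab/ansible/scripts/pi_pull_updates.sh
EOF
}

# Print the timer unit that triggers the service.
render_timer_unit() {
  cat <<'EOF'
[Unit]
Description=Schedule Watchtower self-heal reconciliation (sketch)

[Timer]
OnCalendar=hourly
Persistent=true
RandomizedDelaySec=300

[Install]
WantedBy=timers.target
EOF
}

# Install (unit names are hypothetical):
# render_service_unit | sudo tee /etc/systemd/system/watchtower-self-heal.service >/dev/null
# render_timer_unit | sudo tee /etc/systemd/system/watchtower-self-heal.timer >/dev/null
# sudo systemctl daemon-reload
# sudo systemctl enable --now watchtower-self-heal.timer
```

Without a `User=` line the service runs as root, so add one if the script and its SSH key belong to an unprivileged account. Script output goes to the journal (`journalctl -u watchtower-self-heal.service`), which is persistent only if journald is configured for persistent storage.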
Important
If your repository is private, use SSH deploy keys or vault-backed secret injection. Do not place long-lived personal access tokens in script files.
Idempotency and rollback
- Re-running deployment playbooks is expected and safe; desired state is reconciled.
- Keep stack definitions in Git and avoid manual edits in /opt/stacks.
- Rollback method:
  - Revert the offending commit in Git.
  - Re-run deployment playbook.
  - Validate endpoints and target health.
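The rollback steps above can be rehearsed with a dry-run wrapper before anything is changed. A sketch under assumptions: `run` and `rollback` are hypothetical helpers, the working directory is ansible/ inside the repository, and the `git push` step is included on the assumption that scheduled ansible-pull runs read from the remote. The final check only covers Prometheus readiness, so extend it with the other endpoint checks as needed.

```shell
# Run a command, or only print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

# usage: rollback <commit-sha>
rollback() {
  [ -n "${1:-}" ] || { echo "usage: rollback <commit-sha>" >&2; return 2; }
  # 1. Revert the offending commit and publish the revert.
  run git revert --no-edit "$1" &&
  run git push &&
  # 2. Reconcile to the reverted state.
  run ansible-playbook -i inventory/hosts.ini playbooks/monitoring/deploy_swarm_monitoring.yml &&
  # 3. Validate at least one endpoint.
  run curl -fsS -o /dev/null http://10.0.0.200:9091/-/ready
}
```

`DRY_RUN=1 rollback <commit-sha>` prints the four commands without executing them.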
Operational safety rules
- Do not run services as root unless technically required and documented.
- Avoid broad host mounts unless required for telemetry collection.
- Keep exposed admin ports behind Traefik and authentication middleware.
- Validate health and auth behavior before declaring changes complete.