Watchtower monitoring onboarding and self-healing runbook

Purpose

This runbook is the operator path for deploying, validating, and maintaining the full Watchtower monitoring stack.

It covers:

  • Monitoring stack onboarding (all services).
  • Integration points between services and external Traefik.
  • Day-1 troubleshooting, including Authentik outpost restart loops.
  • Self-healing execution with safe, repeatable reconciliation.

Scope

The canonical Watchtower monitoring scope is:

  • traefik-kop
  • Prometheus
  • Grafana
  • Uptime Kuma
  • node-exporter
  • watchtower-cadvisor
  • Dozzle
  • Authentik outpost for Dozzle
  • Loki
  • Promtail
  • blackbox-exporter

Architecture summary

  • External Traefik ingress runs on 10.0.0.151 and is not migrated into Swarm.
  • Swarm exporters run on Swarm nodes.
  • Watchtower hosts aggregation, storage, visualization, and logging services.
  • Traefik labels are used for HTTPS-routed UIs (Grafana, Dozzle, Uptime Kuma).

Prerequisites

  1. Inventory groups are defined and reachable: swarm_managers, swarm_workers, swarm_hosts, and watchtower.
  2. Docker is installed on all target nodes.
  3. Overlay network proxy-net exists for Swarm workloads.
  4. Vault file exists at ansible/group_vars/vault/all.yml or equivalent secrets are provided through secure environment variables.
  5. Required secrets are present:
    • vault_grafana_admin_password
    • vault_authentik_outpost_dozzle_token
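A pre-flight check for the two required secrets could look like the sketch below. It assumes the secrets are top-level keys in the vault file and that the vault password is available to `ansible-vault`; the function name `missing_secrets` is hypothetical.

```shell
# Required secret names, taken from the prerequisites list above.
REQUIRED_SECRETS="vault_grafana_admin_password vault_authentik_outpost_dozzle_token"

# Read decrypted vault YAML on stdin and print each required key that is missing.
# Only top-level key names are inspected; values are never echoed.
missing_secrets() {
  keys=$(sed -n 's/^\([A-Za-z_][A-Za-z0-9_]*\):.*/\1/p')
  for name in $REQUIRED_SECRETS; do
    printf '%s\n' "$keys" | grep -qx "$name" || printf '%s\n' "$name"
  done
}

# Usage, run from ansible/ (empty output means both secrets are defined):
#   ansible-vault view group_vars/vault/all.yml | missing_secrets
```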

If the Authentik token is not yet available, set monitoring_enable_authentik_outpost=false for the bootstrap deployment and keep Dozzle private until token onboarding is complete.
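One way to apply that override for a single run, without editing group vars, is an extra var on the command line. This is a sketch: the wrapper function and the `ANSIBLE_PLAYBOOK` override are hypothetical conveniences, and it assumes the playbook reads the variable as a boolean. The JSON form of `-e` is used because a `key=value` extra var would arrive as the string "false" rather than a boolean.

```shell
# Run the bootstrap deployment with the Authentik outpost disabled.
# ANSIBLE_PLAYBOOK is a hypothetical override; set it to "echo" to print
# the arguments instead of running the playbook.
deploy_without_outpost() {
  "${ANSIBLE_PLAYBOOK:-ansible-playbook}" \
    -i inventory/hosts.ini \
    playbooks/monitoring/deploy_swarm_monitoring.yml \
    -e '{"monitoring_enable_authentik_outpost": false}'
}

# Usage, run from ansible/:
#   deploy_without_outpost
```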

Warning

Never hardcode tokens or passwords in compose files, playbooks, or helper scripts. Use Vault variables and rotate credentials if any plaintext secret was committed.

Deployment order

  1. Exporters on Swarm nodes (node-exporter, cAdvisor).
  2. Dozzle agent on Swarm managers.
  3. Watchtower stack (traefik-kop, Prometheus, Grafana, Uptime Kuma, Dozzle, Authentik outpost, Loki, Promtail).
  4. Post-deploy verification and dashboard bootstrap.

Deploy commands

Run from ansible/:

ansible-playbook -i inventory/hosts.ini playbooks/monitoring/deploy_swarm_monitoring.yml

Target only Swarm exporters:

ansible-playbook -i inventory/hosts.ini playbooks/monitoring/deploy_swarm_monitoring.yml --tags swarm

Target only Watchtower stack:

ansible-playbook -i inventory/hosts.ini playbooks/monitoring/deploy_swarm_monitoring.yml --tags watchtower

Service-by-service onboarding checks

traefik-kop

  • Verify the service starts and can reach the Redis endpoint at 10.0.0.151:6379.
  • Verify that route updates published by traefik-kop are picked up by the external Traefik instance.

Prometheus

  • Verify readiness endpoint:
curl -fsS http://10.0.0.200:9091/-/ready
  • Verify targets include expected managers, workers, and Watchtower node-exporter.

Grafana

  • Verify HTTPS route at configured domain.
  • Confirm login with admin user and vault-provided password.
  • Add data sources:
    • Prometheus: http://prometheus:9090
    • Loki: http://loki:3100

Uptime Kuma

  • Verify HTTPS route and UI load.
  • Add core checks for:
    • External Traefik endpoint
    • Watchtower host health
    • Swarm manager API reachability

node-exporter and cAdvisor

  • Verify metrics endpoints are reachable from each node.
  • Confirm Prometheus scrape status is up for all exporters.
  • Verify local Watchtower cAdvisor endpoint:
curl -fsS http://10.0.0.200:18080/metrics | head

Dozzle and Authentik outpost

  • Verify Dozzle HTTPS route.
  • Verify Authentik outpost endpoint routing under /outpost.goauthentik.io/.
  • Verify forward-auth middleware is attached and blocking unauthenticated access.
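The forward-auth check can be approximated from the command line by requesting the Dozzle route without credentials and looking only at the status code. This is a sketch: `DOZZLE_URL` is a placeholder for the configured HTTPS domain, and the assumption is that the middleware answers unauthenticated requests with a login redirect (302) or a 401/403.

```shell
# Classify the HTTP status returned to an unauthenticated request.
auth_verdict() {
  case "$1" in
    302|401|403) echo "protected" ;;  # login redirect or explicit denial
    2??)         echo "EXPOSED" ;;    # content served without authentication
    *)           echo "unknown" ;;    # inspect manually (DNS, TLS, routing)
  esac
}

# Fetch only the status code; redirects are not followed and no cookies are sent.
check_dozzle_auth() {
  code=$(curl -s -o /dev/null -w '%{http_code}' "${DOZZLE_URL:?set DOZZLE_URL}")
  echo "HTTP $code: $(auth_verdict "$code")"
}

# Usage:
#   DOZZLE_URL=https://<dozzle-domain>/ check_dozzle_auth
```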

Loki and Promtail

  • Verify Loki health from its container logs and by confirming that new log lines are being ingested.
  • Verify Promtail discovers Docker logs and labels streams by project/service.

blackbox-exporter (network and endpoint probes)

  • Verify Blackbox exporter is reachable:
curl -fsS http://10.0.0.200:9115/metrics | head
  • Verify Prometheus shows probe targets in blackbox-probes job.
  • Add probe targets through monitoring_probe_targets in group vars.
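The individual curl checks in this section can be combined into one pass, for example as a post-deploy smoke test. The sketch below uses only the endpoints already listed above; the `FETCH` override is a hypothetical hook that lets the loop be exercised without network access.

```shell
# Endpoints taken from the service checks above.
ENDPOINTS="http://10.0.0.200:9091/-/ready
http://10.0.0.200:18080/metrics
http://10.0.0.200:9115/metrics"

# Probe each endpoint, print OK or FAIL per URL, and return non-zero if any failed.
check_endpoints() {
  failed=0
  for url in $ENDPOINTS; do
    # FETCH is intentionally unquoted so the default expands to command + flags.
    if ${FETCH:-curl -fsS --max-time 5 -o /dev/null} "$url"; then
      echo "OK   $url"
    else
      echo "FAIL $url"
      failed=1
    fi
  done
  return "$failed"
}

# Usage:
#   check_endpoints || echo "at least one monitoring endpoint is down"
```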

Day-1 troubleshooting

Authentik outpost restart loop

  1. Verify the token is present in the rendered .env file in the stack directory.
  2. Confirm the token matches the active outpost token shown in the Authentik admin interface.
  3. Confirm the Traefik middleware label references the same outpost service.
  4. Check container logs:
docker logs authentik-outpost-dozzle --tail 200
  5. Reconcile stack after token correction:
ansible-playbook -i inventory/hosts.ini playbooks/monitoring/deploy_swarm_monitoring.yml --tags watchtower
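Steps 1 and 4 can be checked without printing the token. In the sketch below, the key name `AUTHENTIK_TOKEN` and the `.env` path are assumptions; substitute whatever variable the rendered compose file actually reads. The container name is the one used in the log command above.

```shell
# Succeed if KEY is set to a non-empty value in an env file.
# Args: file path, key name. grep -q keeps the secret value off the terminal.
env_has_value() {
  grep -Eq "^$2=.+" "$1"
}

# Show restart count and current state; a count that keeps climbing
# between runs confirms a restart loop.
outpost_restarts() {
  docker inspect -f '{{.RestartCount}} restarts, state={{.State.Status}}' authentik-outpost-dozzle
}

# Usage (path and key name are placeholders):
#   env_has_value /path/to/stack/.env AUTHENTIK_TOKEN && echo "token present"
#   outpost_restarts
```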

Backlog item: Authentik token pending

  1. Keep monitoring_enable_authentik_outpost=false while token is unavailable.
  2. Do not expose Dozzle publicly without Authentik forward-auth.
  3. Re-enable the outpost after token handoff and re-run the deployment with the watchtower tag.

Prometheus missing targets

  1. Confirm inventory contains correct node IPs and groups.
  2. Re-run deployment to re-render scrape config.
  3. Query target API and inspect dropped targets.
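For step 3, the Prometheus targets API (`/api/v1/targets`) reports the health of every active target. The sketch below counts targets by health with grep so it has no extra dependencies; it is a rough tally, and `jq` is a better fit where installed.

```shell
# Count targets whose health matches $1 (up, down, or unknown)
# in the targets API JSON read from stdin.
count_health() {
  grep -oE "\"health\"[[:space:]]*:[[:space:]]*\"$1\"" | wc -l | tr -d ' '
}

# Usage against the Prometheus endpoint from this runbook:
#   curl -fsS http://10.0.0.200:9091/api/v1/targets | count_health down
# Dropped targets are listed separately:
#   curl -fsS 'http://10.0.0.200:9091/api/v1/targets?state=dropped'
```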

Blackbox probes failing

  1. Confirm target is reachable from Watchtower network path.
  2. Confirm probe module matches target protocol (icmp, tcp_connect, http_2xx).
  3. Confirm Prometheus relabeling routes probes to watchtower_ip:9115.
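To separate exporter problems from Prometheus relabeling problems, a probe can be requested directly from the exporter's `/probe` endpoint. This is a sketch: the target URL is a placeholder, and the module name must exist in the deployed blackbox configuration.

```shell
# Read blackbox /probe output on stdin; succeed only if probe_success is 1.
probe_ok() {
  grep -Eq '^probe_success 1$'
}

# Usage: probe a target through the exporter, bypassing Prometheus.
#   curl -fsS 'http://10.0.0.200:9115/probe?module=http_2xx&target=https://example.com' \
#     | probe_ok && echo "probe succeeded"
# Appending &debug=true to the query returns the exporter's probe log.
```

If the direct probe succeeds but the Prometheus job still shows failures, the relabeling in step 3 is the more likely cause.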

Dozzle cannot see remote logs

  1. Confirm dozzle-agent service is healthy on manager nodes.
  2. Confirm remote agent endpoints and ports are reachable.
  3. Confirm Docker socket mount is present and read-only where expected.

Self-healing model

Self-healing is implemented as scheduled reconciliation, not ad-hoc manual edits.

Current helper script status

  • ansible/scripts/pi_pull_updates.sh is retained as a helper and now reads its configuration from environment variables instead of embedded credentials.
  • ansible/scripts/pi_init.sh is optional for operator bootstrap and is not required for monitoring stack reconciliation.

Reconciliation steps:

  1. Use ansible-pull to sync and apply ansible/playbooks/self-heal/watchtower.yml.
  2. Run it through a scheduler (prefer a systemd timer for reliability and observability).
  3. Keep logs in a persistent path and alert on repeated failures.

Example manual run:

REPO_URL=git@git.castaldifamily.com:nathan/homelab.git \
PLAYBOOK_PATH=ansible/playbooks/self-heal/watchtower.yml \
/home/chester/homelab/ansible/scripts/pi_pull_updates.sh
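The scheduler recommendation above could be met with a systemd service and timer pair that wraps the same manual run. The sketch below only prints the unit text; the unit name `watchtower-self-heal`, the `chester` service user, and the 30-minute interval are assumptions to adjust.

```shell
# Print a oneshot service that performs one reconciliation run.
render_service() {
  cat <<'EOF'
[Unit]
Description=Watchtower self-heal reconciliation
Wants=network-online.target
After=network-online.target

[Service]
Type=oneshot
User=chester
Environment=REPO_URL=git@git.castaldifamily.com:nathan/homelab.git
Environment=PLAYBOOK_PATH=ansible/playbooks/self-heal/watchtower.yml
ExecStart=/home/chester/homelab/ansible/scripts/pi_pull_updates.sh
EOF
}

# Print a timer that fires every 30 minutes and catches up after downtime.
render_timer() {
  cat <<'EOF'
[Unit]
Description=Schedule Watchtower self-heal reconciliation

[Timer]
OnCalendar=*:0/30
Persistent=true

[Install]
WantedBy=timers.target
EOF
}

# Install (requires root):
#   render_service | sudo tee /etc/systemd/system/watchtower-self-heal.service
#   render_timer   | sudo tee /etc/systemd/system/watchtower-self-heal.timer
#   sudo systemctl daemon-reload
#   sudo systemctl enable --now watchtower-self-heal.timer
```

With this layout, run output lands in the journal (`journalctl -u watchtower-self-heal.service`), which gives a persistent log to alert on.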

Important

If your repository is private, use SSH deploy keys or vault-backed secret injection. Do not place long-lived personal access tokens in script files.

Idempotency and rollback

  • Re-running deployment playbooks is expected and safe; desired state is reconciled.
  • Keep stack definitions in Git and avoid manual edits in /opt/stacks.
  • Rollback method:
    1. Revert the offending commit in Git.
    2. Re-run deployment playbook.
    3. Validate endpoints and target health.
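The rollback steps can be sketched as a single function. It assumes it is run from ansible/ inside the Git checkout and uses the Prometheus readiness endpoint as a first validation; the function name and the `DRY_RUN` switch are hypothetical.

```shell
# Revert one commit, redeploy, and run a basic readiness check.
# Set DRY_RUN=1 to print the commands instead of executing them.
rollback() {
  sha=${1:?usage: rollback <commit-sha>}
  run() { if [ -n "${DRY_RUN:-}" ]; then echo "$*"; else "$@"; fi; }
  run git revert --no-edit "$sha" &&
    run ansible-playbook -i inventory/hosts.ini playbooks/monitoring/deploy_swarm_monitoring.yml &&
    run curl -fsS http://10.0.0.200:9091/-/ready
}

# Usage:
#   DRY_RUN=1 rollback <commit-sha>   # review the plan first
#   rollback <commit-sha>
```

The chain stops at the first failing step, so a failed revert never triggers a redeploy.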

Operational safety rules

  • Do not run services as root unless technically required and documented.
  • Avoid broad host mounts unless required for telemetry collection.
  • Keep exposed admin ports behind Traefik and authentication middleware.
  • Validate health and auth behavior before declaring changes complete.