# Watchtower monitoring onboarding and self-healing runbook
## Purpose
This runbook is the operator path for deploying, validating, and maintaining the full
Watchtower monitoring stack.
It covers:
- Monitoring stack onboarding (all services).
- Integration points between services and external Traefik.
- Day-1 troubleshooting, including Authentik outpost restart loops.
- Self-healing execution with safe, repeatable reconciliation.
## Scope
The canonical Watchtower monitoring scope is:
- traefik-kop
- Prometheus
- Grafana
- Uptime Kuma
- node-exporter
- watchtower-cadvisor
- Dozzle
- Authentik outpost for Dozzle
- Loki
- Promtail
- blackbox-exporter
## Architecture summary
- External Traefik ingress runs on `10.0.0.151` and is not migrated into Swarm.
- Swarm exporters run on Swarm nodes.
- Watchtower hosts aggregation, storage, visualization, and logging services.
- Traefik labels are used for HTTPS-routed UIs (Grafana, Dozzle, Uptime Kuma).
## Prerequisites
1. Inventory groups are defined and reachable: `swarm_managers`, `swarm_workers`,
`swarm_hosts`, and `watchtower`.
2. Docker is installed on all target nodes.
3. Overlay network `proxy-net` exists for Swarm workloads.
4. Vault file exists at `ansible/group_vars/vault/all.yml` or equivalent secrets are
provided through secure environment variables.
5. Required secrets are present:
   - `vault_grafana_admin_password`
   - `vault_authentik_outpost_dozzle_token`

If the Authentik token is not available yet, set `monitoring_enable_authentik_outpost=false`
for the bootstrap deployment and keep Dozzle private until token onboarding is complete.
> [!WARNING]
> Never hardcode tokens or passwords in compose files, playbooks, or helper scripts.
> Use Vault variables and rotate credentials if any plaintext secret was committed.
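
The first three prerequisites can be spot-checked before a deploy. The following is a minimal preflight sketch, not something that exists in the repository: the `check` and `preflight` helper names and the `INVENTORY` variable are assumptions for illustration, and the `docker network inspect` call assumes the script runs from `ansible/` on a Swarm manager where `proxy-net` is visible.

```bash
#!/usr/bin/env bash
# Preflight sketch for the prerequisites above (hypothetical helper, not in the repo).
INVENTORY="${INVENTORY:-inventory/hosts.ini}"

# check <description> <command...>: run a command quietly and report ok/FAIL.
check() {
  local desc="$1"
  shift
  if "$@" >/dev/null 2>&1; then
    echo "ok   ${desc}"
  else
    echo "FAIL ${desc}"
    return 1
  fi
}

preflight() {
  local rc=0 group
  # Prerequisite 1: every inventory group answers an Ansible ping.
  for group in swarm_managers swarm_workers swarm_hosts watchtower; do
    check "inventory group ${group} reachable" \
      ansible "$group" -i "$INVENTORY" -m ping || rc=1
  done
  # Prerequisite 2: Docker responds on all target nodes.
  check "docker installed on target nodes" \
    ansible "swarm_hosts:watchtower" -i "$INVENTORY" -m command -a "docker --version" || rc=1
  # Prerequisite 3: the overlay network exists.
  check "overlay network proxy-net exists" \
    docker network inspect proxy-net || rc=1
  return "$rc"
}
```

After sourcing the file, run `preflight`; a non-zero exit status means at least one check printed `FAIL`. Secrets are deliberately not checked here, because they may come from either the vault file or environment variables.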
## Deployment order
1. Exporters on Swarm nodes (`node-exporter`, `cAdvisor`).
2. Dozzle agent on Swarm managers.
3. Watchtower stack (`traefik-kop`, Prometheus, Grafana, Uptime Kuma, Dozzle,
Authentik outpost, Loki, Promtail).
4. Post-deploy verification and dashboard bootstrap.
## Deploy commands
Run from `ansible/`:
```bash
ansible-playbook -i inventory/hosts.ini playbooks/monitoring/deploy_swarm_monitoring.yml
```
Target only Swarm exporters:
```bash
ansible-playbook -i inventory/hosts.ini playbooks/monitoring/deploy_swarm_monitoring.yml --tags swarm
```
Target only Watchtower stack:
```bash
ansible-playbook -i inventory/hosts.ini playbooks/monitoring/deploy_swarm_monitoring.yml --tags watchtower
```
## Service-by-service onboarding checks
### traefik-kop
- Verify service starts and can reach Redis endpoint `10.0.0.151:6379`.
- Verify that route updates published through Redis take effect on the external Traefik instance.
### Prometheus
- Verify readiness endpoint:
```bash
curl -fsS http://10.0.0.200:9091/-/ready
```
- Verify targets include expected managers, workers, and Watchtower node-exporter.
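
Target coverage can also be checked from the command line through the Prometheus HTTP API. The sketch below counts active targets by health state from the `/api/v1/targets` response, using the same address as the readiness check. The `summarize_target_health` name is hypothetical, and the `grep`-based parsing assumes the compact JSON that Prometheus normally returns; `jq` is a more robust choice where it is installed.

```bash
# Count active targets by health state from /api/v1/targets JSON on stdin.
# Dropped targets carry no "health" field, so only active targets are counted.
summarize_target_health() {
  grep -o '"health":"[a-z]*"' | sort | uniq -c
}

# Example invocation against the Watchtower Prometheus instance:
#   curl -fsS http://10.0.0.200:9091/api/v1/targets | summarize_target_health
```

Any `down` or `unknown` count above zero points at a node or exporter that needs attention.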
### Grafana
- Verify HTTPS route at configured domain.
- Confirm login with admin user and vault-provided password.
- Add data sources:
  - Prometheus: `http://prometheus:9090`
  - Loki: `http://loki:3100`
### Uptime Kuma
- Verify HTTPS route and UI load.
- Add core checks for:
  - External Traefik endpoint
  - Watchtower host health
  - Swarm manager API reachability
### node-exporter and cAdvisor
- Verify metrics endpoints are reachable from each node.
- Confirm Prometheus scrape status is `up` for all exporters.
- Verify local Watchtower cAdvisor endpoint:
```bash
curl -fsS http://10.0.0.200:18080/metrics | head
```
### Dozzle and Authentik outpost
- Verify Dozzle HTTPS route.
- Verify Authentik outpost endpoint routing under `/outpost.goauthentik.io/`.
- Verify forward-auth middleware is attached and blocking unauthenticated access.
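
One way to confirm that forward-auth blocks anonymous access is to send a request without a session and look at the status code. The sketch below is an illustration only: `classify_auth_status` is a hypothetical helper, `dozzle.example.com` stands in for the configured domain, and the mapping assumes the middleware answers unauthenticated requests with a redirect or a 401/403 rather than serving the page.

```bash
# Map the HTTP status of an unauthenticated request to a rough verdict.
classify_auth_status() {
  case "$1" in
    302|303|307|401|403) echo "protected" ;;   # redirected to login or denied
    2??) echo "OPEN" ;;                        # page served without authentication
    *) echo "inconclusive" ;;                  # routing or backend problem, check Traefik
  esac
}

# Example (placeholder domain):
#   code="$(curl -s -o /dev/null -w '%{http_code}' https://dozzle.example.com/)"
#   classify_auth_status "$code"
```

A `protected` verdict only shows that something intercepted the request; confirm in a browser that the redirect actually lands on the Authentik login flow.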
### Loki and Promtail
- Verify Loki health by checking its container logs for errors and confirming that new log lines are being ingested.
- Verify Promtail discovers Docker logs and labels streams by project/service.
### blackbox-exporter (network and endpoint probes)
- Verify Blackbox exporter is reachable:
```bash
curl -fsS http://10.0.0.200:9115/metrics | head
```
- Verify Prometheus shows probe targets in `blackbox-probes` job.
- Add probe targets through `monitoring_probe_targets` in group vars.
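
The individual endpoint checks in this section can be combined into a single pass. The sketch below reuses only the three Watchtower URLs already shown above; the `sweep_endpoints` name and the `CURL` override variable are assumptions added for illustration.

```bash
# Probe each URL and print one status line per endpoint.
# CURL can be overridden (for example in tests); it defaults to curl.
sweep_endpoints() {
  local failed=0 url
  for url in "$@"; do
    if "${CURL:-curl}" -fsS --max-time 5 -o /dev/null "$url"; then
      echo "ok   ${url}"
    else
      echo "FAIL ${url}"
      failed=1
    fi
  done
  return "$failed"
}

# Example:
#   sweep_endpoints \
#     http://10.0.0.200:9091/-/ready \
#     http://10.0.0.200:18080/metrics \
#     http://10.0.0.200:9115/metrics
```

The function returns non-zero if any endpoint fails, so it can gate a later step in a script.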
## Day-1 troubleshooting
### Authentik outpost restart loop
1. Verify that the token is present in the rendered `.env` file in the stack directory.
1. Confirm token matches active Authentik outpost token in Authentik admin.
1. Confirm Traefik middleware label references the same outpost service.
1. Check container logs:
```bash
docker logs authentik-outpost-dozzle --tail 200
```
1. Reconcile stack after token correction:
```bash
ansible-playbook -i inventory/hosts.ini playbooks/monitoring/deploy_swarm_monitoring.yml --tags watchtower
```
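The token check in the first step can be done without printing the secret. A minimal sketch follows, under stated assumptions: the `.env` path is a placeholder, the `AUTHENTIK_TOKEN` variable name should be matched to whatever the playbook actually renders, and `token_is_set` is a hypothetical helper.

```bash
# Succeed if VAR is assigned a non-empty value in FILE; the value is never echoed.
# Accepts both VAR=value and VAR="value"; rejects VAR= and VAR="".
token_is_set() {
  local file="$1" var="$2"
  grep -Eq "^${var}=[\"']?[^\"'[:space:]]" "$file"
}

# Example (placeholder path for the rendered stack directory):
#   token_is_set /opt/stacks/<stack-dir>/.env AUTHENTIK_TOKEN && echo "token present"
#
# A climbing restart count confirms the loop is still active:
#   docker inspect --format '{{.RestartCount}}' authentik-outpost-dozzle
```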
### Backlog item: Authentik token pending
1. Keep `monitoring_enable_authentik_outpost=false` while token is unavailable.
1. Do not expose Dozzle publicly without Authentik forward-auth.
1. Re-enable the outpost after token handoff and re-run the playbook with the `watchtower` tag.
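
The toggle can be passed on the command line instead of editing group vars. The wrapper below is a sketch: `deploy_watchtower` and the `ANSIBLE_PLAYBOOK` override are hypothetical names, and it is meant to be run from `ansible/`. It passes the variable as JSON because `key=value` extra vars reach Ansible as strings, which can matter if the playbook tests the variable without a `bool` filter.

```bash
# Deploy the Watchtower stack with the Authentik outpost toggled on or off.
# ANSIBLE_PLAYBOOK can be overridden (for example with `echo` to preview the command).
deploy_watchtower() {
  local outpost_enabled="${1:-false}"
  case "$outpost_enabled" in
    true|false) ;;
    *) echo "usage: deploy_watchtower [true|false]" >&2; return 2 ;;
  esac
  "${ANSIBLE_PLAYBOOK:-ansible-playbook}" -i inventory/hosts.ini \
    playbooks/monitoring/deploy_swarm_monitoring.yml \
    --tags watchtower \
    -e "{\"monitoring_enable_authentik_outpost\": ${outpost_enabled}}"
}
```

Run `deploy_watchtower false` for the bootstrap deployment, then `deploy_watchtower true` once the token is stored in Vault.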
### Prometheus missing targets
1. Confirm inventory contains correct node IPs and groups.
2. Re-run deployment to re-render scrape config.
3. Query the Prometheus targets API (`/api/v1/targets`) and inspect dropped targets.
### Blackbox probes failing
1. Confirm the target is reachable over the network from the Watchtower host.
1. Confirm probe module matches target protocol (`icmp`, `tcp_connect`, `http_2xx`).
1. Confirm Prometheus relabeling routes probes to `watchtower_ip:9115`.
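
Calling the exporter's `/probe` endpoint directly helps separate exporter or network problems from Prometheus relabeling problems. The following sketch assumes the exporter address used earlier in this runbook; `probe_succeeded` is a hypothetical helper, and the target port in the example is a placeholder.

```bash
# Read blackbox-exporter /probe output on stdin; succeed only if probe_success is 1.
probe_succeeded() {
  grep -Eq '^probe_success 1$'
}

# Example: probe the external Traefik host with the tcp_connect module
# (replace the port with the one the service actually listens on):
#   curl -fsS 'http://10.0.0.200:9115/probe?module=tcp_connect&target=10.0.0.151:443' \
#     | probe_succeeded && echo "probe ok"
```

If the direct probe succeeds while Prometheus still reports failures, the relabeling configuration is the more likely culprit. Appending `&debug=true` to the probe URL returns the exporter's log for that probe.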
### Dozzle cannot see remote logs
1. Confirm `dozzle-agent` service is healthy on manager nodes.
2. Confirm remote agent endpoints and ports are reachable.
3. Confirm Docker socket mount is present and read-only where expected.
## Self-healing model
Self-healing is implemented as scheduled reconciliation, not ad-hoc manual edits.
### Current helper script status
- `ansible/scripts/pi_pull_updates.sh` is retained as a helper and now reads its settings from
environment variables instead of embedded credentials.
- `ansible/scripts/pi_init.sh` is optional for operator bootstrap and is not
required for monitoring stack reconciliation.
### Recommended execution pattern
1. Use `ansible-pull` to sync and apply `ansible/playbooks/self-heal/watchtower.yml`.
2. Run through a scheduler (prefer `systemd` timer for reliability and observability).
3. Keep logs in a persistent path and alert on repeated failures.
Example manual run:
```bash
REPO_URL=git@git.castaldifamily.com:nathan/homelab.git \
PLAYBOOK_PATH=ansible/playbooks/self-heal/watchtower.yml \
/home/chester/homelab/ansible/scripts/pi_pull_updates.sh
```
> [!IMPORTANT]
> If your repository is private, use SSH deploy keys or vault-backed secret injection.
> Do not place long-lived personal access tokens in script files.
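
The "alert on repeated failures" step of the recommended pattern can be sketched as a small state file that counts consecutive failed runs. Everything here is an assumption for illustration: the `record_result` helper, the state and log file paths, and the threshold of three.

```bash
# record_result <state_file> <exit_code> <threshold>
# Tracks consecutive failures in state_file; prints an alert line and returns 1
# once the count reaches the threshold. A successful run resets the count.
record_result() {
  local state_file="$1" rc="$2" threshold="$3" count=0
  if [ "$rc" -eq 0 ]; then
    echo 0 > "$state_file"
    return 0
  fi
  if [ -f "$state_file" ]; then
    count="$(cat "$state_file")"
  fi
  count=$((count + 1))
  echo "$count" > "$state_file"
  if [ "$count" -ge "$threshold" ]; then
    echo "ALERT: ${count} consecutive self-heal failures"
    return 1
  fi
  return 0
}

# Example wrapper for a scheduled run (placeholder paths):
#   /home/chester/homelab/ansible/scripts/pi_pull_updates.sh >> /var/log/watchtower-self-heal.log 2>&1
#   record_result /var/lib/watchtower-self-heal/failures "$?" 3
```

When the wrapper is run from a `systemd` service, a non-zero return marks the unit as failed, which gives existing monitoring something to alert on.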
## Idempotency and rollback
- Re-running deployment playbooks is expected and safe; desired state is reconciled.
- Keep stack definitions in Git and avoid manual edits in `/opt/stacks`.
- Rollback method:
  1. Revert the offending commit in Git.
  2. Re-run deployment playbook.
  3. Validate endpoints and target health.
## Operational safety rules
- Do not run services as root unless technically required and documented.
- Avoid broad host mounts unless required for telemetry collection.
- Keep exposed admin ports behind Traefik and authentication middleware.
- Validate health and auth behavior before declaring changes complete.