# Watchtower monitoring onboarding and self-healing runbook

## Purpose

This runbook is the operator path for deploying, validating, and maintaining the full
Watchtower monitoring stack.

It covers:

- Monitoring stack onboarding (all services).
- Integration points between services and external Traefik.
- Day-1 troubleshooting, including Authentik outpost restart loops.
- Self-healing execution with safe, repeatable reconciliation.

## Scope

The canonical Watchtower monitoring scope is:

- traefik-kop
- Prometheus
- Grafana
- Uptime Kuma
- node-exporter
- watchtower-cadvisor
- Dozzle
- Authentik outpost for Dozzle
- Loki
- Promtail
- blackbox-exporter

## Architecture summary

- External Traefik ingress runs on `10.0.0.151` and is not migrated into Swarm.
- Swarm exporters run on Swarm nodes.
- Watchtower hosts aggregation, storage, visualization, and logging services.
- Traefik labels are used for HTTPS-routed UIs (Grafana, Dozzle, Uptime Kuma).

## Prerequisites

1. Inventory groups are defined and reachable: `swarm_managers`, `swarm_workers`,
   `swarm_hosts`, and `watchtower`.
2. Docker is installed on all target nodes.
3. Overlay network `proxy-net` exists for Swarm workloads.
4. Vault file exists at `ansible/group_vars/vault/all.yml` or equivalent secrets are
   provided through secure environment variables.
5. Required secrets are present:
   - `vault_grafana_admin_password`
   - `vault_authentik_outpost_dozzle_token`

If the Authentik token is not available yet, set `monitoring_enable_authentik_outpost=false`
for the bootstrap deployment and keep Dozzle private until token onboarding is complete.

> [!WARNING]
> Never hardcode tokens or passwords in compose files, playbooks, or helper scripts.
> Use Vault variables and rotate credentials if any plaintext secret was committed.
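
Prerequisite 5 can be checked before a deploy without printing any secret value. The following is a minimal sketch, not part of the repository: `missing_secrets` is a hypothetical helper, and it assumes both variables are defined as top-level keys in the decrypted vault YAML.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read decrypted vault YAML on stdin and print the name
# of every required secret that is missing or has nothing after the colon.
missing_secrets() {
  local yaml key
  yaml="$(cat)"
  for key in vault_grafana_admin_password vault_authentik_outpost_dozzle_token; do
    # Present means: the key starts a line and some text follows on that line.
    if ! grep -Eq "^${key}:[[:space:]]*[^[:space:]]" <<<"$yaml"; then
      printf '%s\n' "$key"
    fi
  done
}
```

From the repository root, `ansible-vault view ansible/group_vars/vault/all.yml | missing_secrets` prints nothing when both keys are set. Only key names are ever echoed, which keeps the check safe to run in a shared terminal.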

## Deployment order

1. Exporters on Swarm nodes (`node-exporter`, `cAdvisor`).
2. Dozzle agent on Swarm managers.
3. Watchtower stack (`traefik-kop`, Prometheus, Grafana, Uptime Kuma, Dozzle,
   Authentik outpost, Loki, Promtail).
4. Post-deploy verification and dashboard bootstrap.

## Deploy commands

Run from `ansible/`:

```bash
ansible-playbook -i inventory/hosts.ini playbooks/monitoring/deploy_swarm_monitoring.yml
```

Target only Swarm exporters:

```bash
ansible-playbook -i inventory/hosts.ini playbooks/monitoring/deploy_swarm_monitoring.yml --tags swarm
```

Target only Watchtower stack:

```bash
ansible-playbook -i inventory/hosts.ini playbooks/monitoring/deploy_swarm_monitoring.yml --tags watchtower
```

## Service-by-service onboarding checks

### traefik-kop

- Verify the service starts and can reach the Redis endpoint `10.0.0.151:6379`.
- Verify that route updates are reflected in external Traefik's behavior.
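
The Redis reachability check above can be approximated with a plain TCP connect test. This is a sketch under the assumption that it runs on the Watchtower host with bash and coreutils `timeout` available; `tcp_open` is a hypothetical helper name. It confirms only that the port accepts connections, not that Redis authentication or traefik-kop's writes succeed.

```bash
#!/usr/bin/env bash
# Return 0 if a TCP connection to host:port succeeds within 3 seconds.
# Uses bash's built-in /dev/tcp redirection, so no Redis client is needed.
tcp_open() {
  local host="$1" port="$2"
  # Inside the child shell, $0 is the host and $1 is the port.
  timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}
```

For example, `tcp_open 10.0.0.151 6379 && echo "Redis port reachable"`.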

### Prometheus

- Verify readiness endpoint:

  ```bash
  curl -fsS http://10.0.0.200:9091/-/ready
  ```

- Verify targets include expected managers, workers, and Watchtower node-exporter.
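
To review the target list without opening the UI, the Prometheus HTTP API can be queried directly. The sketch below assumes `jq` is installed on the machine running the check; `list_targets` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Read the JSON body of Prometheus' /api/v1/targets on stdin and print one
# line per active target: "<job> <instance> <health>". Requires jq.
list_targets() {
  jq -r '.data.activeTargets[] | "\(.labels.job) \(.labels.instance) \(.health)"'
}
```

Running `curl -fsS http://10.0.0.200:9091/api/v1/targets | list_targets | sort` gives a compact list to compare against the inventory; any line that does not end in `up` needs attention.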

### Grafana

- Verify the HTTPS route at the configured domain.
- Confirm login with the admin user and the vault-provided password.
- Add data sources:
  - Prometheus: `http://prometheus:9090`
  - Loki: `http://loki:3100`
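
If the data sources are added through Grafana's HTTP API rather than the UI, the calls could look roughly like this. It is a sketch, not the repository's bootstrap method: `GRAFANA_URL` and `GRAFANA_ADMIN_PASSWORD` are assumed environment variables, the admin username is assumed to be `admin`, and the helper names are hypothetical. The payload builder does no JSON escaping, so it suits only simple names and URLs like the two above.

```bash
#!/usr/bin/env bash
# Print the JSON body for Grafana's POST /api/datasources endpoint.
# Arguments: display name, data source type, URL as seen from Grafana.
datasource_payload() {
  printf '{"name":"%s","type":"%s","url":"%s","access":"proxy"}\n' "$1" "$2" "$3"
}

# POST one data source using basic auth.
# Assumption: GRAFANA_URL and GRAFANA_ADMIN_PASSWORD are exported beforehand.
add_datasource() {
  curl -fsS -u "admin:${GRAFANA_ADMIN_PASSWORD}" \
    -H 'Content-Type: application/json' \
    -d "$(datasource_payload "$@")" \
    "${GRAFANA_URL}/api/datasources"
}
```

Usage would be `add_datasource Prometheus prometheus http://prometheus:9090` followed by `add_datasource Loki loki http://loki:3100`. Export the password from a secret source rather than typing it on the command line.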

### Uptime Kuma

- Verify HTTPS route and UI load.
- Add core checks for:
  - External Traefik endpoint
  - Watchtower host health
  - Swarm manager API reachability

### node-exporter and cAdvisor

- Verify metrics endpoints are reachable from each node.
- Confirm Prometheus scrape status is `up` for all exporters.
- Verify local Watchtower cAdvisor endpoint:

  ```bash
  curl -fsS http://10.0.0.200:18080/metrics | head
  ```
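
Checking every node by hand gets tedious, so the per-node reachability check can be looped. This is a sketch with a hypothetical helper, `check_metrics`; the host list and port are supplied by the operator and are not taken from the inventory.

```bash
#!/usr/bin/env bash
# Probe http://<host>:<port>/metrics for each host and print OK or FAIL.
# Arguments: port, then one or more hosts. Returns non-zero if any host fails.
check_metrics() {
  local port="$1" host rc=0
  shift
  for host in "$@"; do
    if curl -fsS --max-time 5 -o /dev/null "http://${host}:${port}/metrics"; then
      echo "OK   ${host}:${port}"
    else
      echo "FAIL ${host}:${port}"
      rc=1
    fi
  done
  return "$rc"
}
```

For example, `check_metrics 9100 <manager-ip> <worker-ip>` would cover node-exporter, assuming it listens on its default port 9100 in this deployment. The non-zero return code makes the function usable in a scheduled check.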

### Dozzle and Authentik outpost

- Verify the Dozzle HTTPS route.
- Verify Authentik outpost endpoint routing under `/outpost.goauthentik.io/`.
- Verify the forward-auth middleware is attached and blocks unauthenticated access.
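
The last check above can be scripted by requesting the Dozzle URL with no session and looking at the status code. This is a sketch with hypothetical helper names; `DOZZLE_URL` in the usage line is an assumed variable holding the configured domain. Which code appears (a redirect to the login flow, or 401/403) depends on how the Authentik provider is configured, so the classifier accepts any of them.

```bash
#!/usr/bin/env bash
# Classify the HTTP status returned to an unauthenticated request.
# A redirect or a 401/403 suggests forward-auth is enforcing access;
# anything else (notably 200) suggests the UI is exposed without login.
auth_enforced() {
  case "$1" in
    301|302|303|307|308|401|403) return 0 ;;
    *) return 1 ;;
  esac
}

# Fetch only the status code, without following redirects or sending cookies.
unauth_status() {
  curl -sS -o /dev/null -w '%{http_code}' --max-time 10 "$1"
}
```

Usage: `auth_enforced "$(unauth_status "$DOZZLE_URL")" || echo "Dozzle is reachable without login"`.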

### Loki and Promtail

- Verify Loki API health via container logs and ingestion behavior.
- Verify Promtail discovers Docker logs and labels streams by project/service.

### blackbox-exporter (network and endpoint probes)

- Verify Blackbox exporter is reachable:

  ```bash
  curl -fsS http://10.0.0.200:9115/metrics | head
  ```

- Verify Prometheus shows probe targets in `blackbox-probes` job.
- Add probe targets through `monitoring_probe_targets` in group vars.

## Day-1 troubleshooting

### Authentik outpost restart loop

1. Verify that the token is present in the rendered `.env` file in the stack directory.
1. Confirm the token matches the active outpost token shown in the Authentik admin interface.
1. Confirm the Traefik middleware label references the same outpost service.
1. Check container logs:

   ```bash
   docker logs authentik-outpost-dozzle --tail 200
   ```

1. Reconcile the stack after correcting the token:

   ```bash
   ansible-playbook -i inventory/hosts.ini playbooks/monitoring/deploy_swarm_monitoring.yml --tags watchtower
   ```
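
The first step and the restart loop itself can be confirmed without exposing the token. The sketch below uses hypothetical helper names; the env file path and the key name passed to `env_has_value` depend on how the stack template renders `.env` and are assumptions here.

```bash
#!/usr/bin/env bash
# Return 0 if KEY is set to a non-empty value in an env file.
# Nothing is printed, so the token never reaches the terminal or a log.
env_has_value() {
  local file="$1" key="$2"
  grep -Eq "^${key}=.+" "$file"
}

# Print how many times Docker has restarted the named container.
restart_count() {
  docker inspect --format '{{.RestartCount}}' "$1"
}
```

For example, `env_has_value "$STACK_DIR/.env" AUTHENTIK_TOKEN` (where `STACK_DIR` and the key name are placeholders) followed by `restart_count authentik-outpost-dozzle`. A count that keeps climbing between runs confirms the loop is still active.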

### Backlog item: Authentik token pending

1. Keep `monitoring_enable_authentik_outpost=false` while the token is unavailable.
1. Do not expose Dozzle publicly without Authentik forward-auth.
1. Re-enable the outpost after token handoff and re-run the deployment with the `watchtower` tag.
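
If the flag is toggled on the command line rather than in group vars, note that Ansible treats `-e key=value` extra vars as strings, so a JSON payload is the safer way to pass a real boolean. Whether the string form would also work depends on how the playbook evaluates the variable, which is not shown here. `outpost_extra_vars` is a hypothetical helper.

```bash
#!/usr/bin/env bash
# Print a JSON extra-vars document so the flag reaches Ansible as a boolean.
# Argument: true or false.
outpost_extra_vars() {
  case "$1" in
    true|false)
      printf '{"monitoring_enable_authentik_outpost": %s}\n' "$1"
      ;;
    *)
      echo "usage: outpost_extra_vars true|false" >&2
      return 2
      ;;
  esac
}
```

A bootstrap run would then be `ansible-playbook -i inventory/hosts.ini playbooks/monitoring/deploy_swarm_monitoring.yml --tags watchtower -e "$(outpost_extra_vars false)"`, with `true` substituted after token handoff.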

### Prometheus missing targets

1. Confirm the inventory contains the correct node IPs and groups.
2. Re-run the deployment to re-render the scrape config.
3. Query the targets API and inspect dropped targets.
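
Step 3 might look like the following. It is a sketch that assumes `jq` is installed; `dropped_targets` is a hypothetical helper name. Dropped targets are those Prometheus discovered but discarded during relabeling, so an expected node appearing here points at the relabel rules rather than the inventory.

```bash
#!/usr/bin/env bash
# Read /api/v1/targets JSON on stdin and print "<job> <address>" for each
# target that was discovered but dropped during relabeling. Requires jq.
dropped_targets() {
  jq -r '.data.droppedTargets[]
         | "\(.discoveredLabels.job) \(.discoveredLabels.__address__)"'
}
```

Usage: `curl -fsS 'http://10.0.0.200:9091/api/v1/targets?state=dropped' | dropped_targets`.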

### Blackbox probes failing

1. Confirm the target is reachable from the Watchtower network path.
1. Confirm the probe module matches the target protocol (`icmp`, `tcp_connect`, `http_2xx`).
1. Confirm Prometheus relabeling routes probes to `watchtower_ip:9115`.
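
To separate exporter problems from Prometheus relabeling problems, a probe can be triggered by hand against the exporter's `/probe` endpoint. This sketch uses hypothetical helper names and assumes the module names listed above are defined in the exporter's configuration.

```bash
#!/usr/bin/env bash
# Ask blackbox-exporter to run one probe and print the raw metrics.
# Arguments: module name, target (URL, host, or host:port depending on module).
run_probe() {
  curl -fsS --get \
    --data-urlencode "module=$1" \
    --data-urlencode "target=$2" \
    "http://10.0.0.200:9115/probe"
}

# Return 0 if the probe output on stdin reports success.
probe_ok() {
  grep -q '^probe_success 1$'
}
```

For example, `run_probe tcp_connect 10.0.0.151:6379 | probe_ok && echo "probe passes"`. If the manual probe passes while the `blackbox-probes` job fails, the relabeling in step 3 is the more likely culprit.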

### Dozzle cannot see remote logs

1. Confirm the `dozzle-agent` service is healthy on manager nodes.
2. Confirm remote agent endpoints and ports are reachable.
3. Confirm the Docker socket mount is present and read-only where expected.
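
Step 3 can be read from `docker inspect` output. The sketch below assumes `jq` is installed and that the socket is mounted at `/var/run/docker.sock` inside the container; `socket_mount_mode` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Read the output of `docker inspect --format '{{json .Mounts}}' <container>`
# on stdin and print "ro" or "rw" for the Docker socket mount.
# Prints nothing if the socket is not mounted. Requires jq.
socket_mount_mode() {
  jq -r '.[] | select(.Destination == "/var/run/docker.sock")
         | if .RW then "rw" else "ro" end'
}
```

Usage: `docker inspect --format '{{json .Mounts}}' <container-id> | socket_mount_mode`, where the container ID for a Swarm task can be found with `docker ps --filter name=dozzle-agent -q` on the node running it.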

## Self-healing model

Self-healing is implemented as scheduled reconciliation, not ad-hoc manual edits.

### Current helper script status

- `ansible/scripts/pi_pull_updates.sh` is retained as a helper and now expects
  configurable environment variables instead of embedded credentials.
- `ansible/scripts/pi_init.sh` is optional for operator bootstrap and is not
  required for monitoring stack reconciliation.

### Recommended execution pattern

1. Use `ansible-pull` to sync and apply `ansible/playbooks/self-heal/watchtower.yml`.
2. Run through a scheduler (prefer `systemd` timer for reliability and observability).
3. Keep logs in a persistent path and alert on repeated failures.

Example manual run:

```bash
REPO_URL=git@git.castaldifamily.com:nathan/homelab.git \
PLAYBOOK_PATH=ansible/playbooks/self-heal/watchtower.yml \
/home/chester/homelab/ansible/scripts/pi_pull_updates.sh
```

> [!IMPORTANT]
> If your repository is private, use SSH deploy keys or vault-backed secret injection.
> Do not place long-lived personal access tokens in script files.
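
The scheduler in step 2 of the pattern could be set up with a oneshot service plus a timer. The following is a sketch under several assumptions: the unit name `watchtower-self-heal`, the 30-minute interval, and `User=chester` (inferred from the script path above) are illustrative and not taken from the repository.

```bash
#!/usr/bin/env bash
# Write a oneshot service and a timer that triggers it every 30 minutes.
# Argument: target directory (normally /etc/systemd/system).
write_self_heal_units() {
  local dir="$1"
  cat > "${dir}/watchtower-self-heal.service" <<'EOF'
[Unit]
Description=Reconcile Watchtower monitoring stack
Wants=network-online.target
After=network-online.target

[Service]
Type=oneshot
# Assumption: the account that owns the checkout and the SSH deploy key.
User=chester
Environment=REPO_URL=git@git.castaldifamily.com:nathan/homelab.git
Environment=PLAYBOOK_PATH=ansible/playbooks/self-heal/watchtower.yml
ExecStart=/home/chester/homelab/ansible/scripts/pi_pull_updates.sh
EOF
  cat > "${dir}/watchtower-self-heal.timer" <<'EOF'
[Unit]
Description=Schedule Watchtower self-heal

[Timer]
# Every 30 minutes; Persistent catches up on a run missed while powered off.
OnCalendar=*:0/30
Persistent=true

[Install]
WantedBy=timers.target
EOF
}
```

After running `write_self_heal_units /etc/systemd/system` as root, `systemctl daemon-reload` and `systemctl enable --now watchtower-self-heal.timer` activate the schedule. Output lands in the journal (`journalctl -u watchtower-self-heal.service`), which covers the persistent-log part of step 3 when journald storage is persistent; alerting on repeated failures is left out of this sketch.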

## Idempotency and rollback

- Re-running deployment playbooks is expected and safe; desired state is reconciled.
- Keep stack definitions in Git and avoid manual edits in `/opt/stacks`.
- Rollback method:
  1. Revert the offending commit in Git.
  2. Re-run the deployment playbook.
  3. Validate endpoints and target health.

## Operational safety rules

- Do not run services as root unless technically required and documented.
- Avoid broad host mounts unless required for telemetry collection.
- Keep exposed admin ports behind Traefik and authentication middleware.
- Validate health and auth behavior before declaring changes complete.