Ansible Monitoring Roles
Overview
This directory contains modular Ansible roles for deploying a complete observability stack across a Docker Swarm cluster. The architecture follows the separation of concerns principle, with each role handling a specific monitoring component.
Architecture
┌──────────────────────────────────────────────────────────────────┐
│                     WATCHTOWER (Controller)                      │
│ ┌──────────────┐ ┌──────────────┐ ┌────────┐ ┌─────────────┐     │
│ │  Prometheus  │ │   Grafana    │ │  Loki  │ │ Uptime Kuma │     │
│ │ (Metrics DB) │ │ (Dashboards) │ │ (Logs) │ │  (Health)   │     │
│ └──────────────┘ └──────────────┘ └────────┘ └─────────────┘     │
│        ▲                ▲             ▲                          │
└────────┼────────────────┼─────────────┼──────────────────────────┘
         │ Scrape         │ Query       │ Push
         │                │             │
┌────────┴────────────────┴─────────────┴──────────────────────────┐
│                          SWARM CLUSTER                           │
│ ┌──────────────────┐ ┌──────────────────┐                        │
│ │  Manager Node 1  │ │  Worker Node 1   │                        │
│ │ ┌──────────────┐ │ │ ┌──────────────┐ │ [... more]             │
│ │ │node-exporter │ │ │ │node-exporter │ │                        │
│ │ │ (Host CPU,   │ │ │ │ (Host CPU,   │ │                        │
│ │ │  RAM, Disk)  │ │ │ │  RAM, Disk)  │ │                        │
│ │ └──────────────┘ │ │ └──────────────┘ │                        │
│ │ ┌──────────────┐ │ │ ┌──────────────┐ │                        │
│ │ │   cAdvisor   │ │ │ │   cAdvisor   │ │                        │
│ │ │ (Container   │ │ │ │ (Container   │ │                        │
│ │ │  Metrics)    │ │ │ │  Metrics)    │ │                        │
│ │ └──────────────┘ │ │ └──────────────┘ │                        │
│ └──────────────────┘ └──────────────────┘                        │
└──────────────────────────────────────────────────────────────────┘
Roles
1. swarm_node_exporter
Purpose: Deploys Prometheus node-exporter on each swarm node to collect host-level metrics.
Metrics Collected:
- CPU usage (per-core and aggregate)
- Memory usage (total, available, cached)
- Disk I/O and space
- Network traffic
- System load averages
Configuration:
- Port: 9100 (default)
- Network: Host mode (for full system visibility)
- Security: Read-only filesystem, dropped capabilities
Files:
- `defaults/main.yml`: Configurable variables
- `tasks/main.yml`: Deployment logic
2. swarm_cadvisor
Purpose: Deploys cAdvisor (Container Advisor) on each node to collect container-level resource usage.
Metrics Collected:
- Per-container CPU usage
- Per-container memory usage
- Container network I/O
- Container disk I/O
- Container restart counts
Configuration:
- Port: 8080 (default)
- Requires: Privileged mode (for cgroup access)
Why cAdvisor + node-exporter?
- node-exporter: "The host used 80% CPU" (host-level aggregates)
- cAdvisor: "Container X used 60% of that CPU" (per-container breakdown)
3. monitoring_stack
Purpose: Deploys the complete monitoring infrastructure on Watchtower (controller node).
Components:
- Prometheus: Metrics time-series database with service discovery
- Grafana: Visualization and dashboarding
- Loki: Log aggregation and indexing
- Promtail: Log shipper (sends logs from Docker to Loki)
- Uptime Kuma: HTTP/TCP health monitoring
- Dozzle: Real-time Docker log viewer
- traefik-kop: Traefik configuration sync
Key Features:
- Dynamic Target Discovery: Prometheus scrape configs are generated from Ansible inventory
- Alert Rules: Pre-configured alerts for CPU, memory, disk, and node availability
- Security: Dozzle protected by Authentik SSO
- Retention: Configurable data retention policies
Configuration:
- `defaults/main.yml`: Ports, domains, retention periods
- `templates/prometheus.yml.j2`: Scrape configuration with inventory loop
- `templates/alert-rules.yml.j2`: Alerting rules
- `templates/loki-config.yml.j2`: Log retention and indexing
- `templates/docker-compose.yml.j2`: Complete stack definition
Usage
Deploy Complete Monitoring Stack
cd /home/chester/homelab/ansible
ansible-playbook -i inventory/hosts.ini playbooks/monitoring/deploy_swarm_monitoring.yml
Deploy Only to Swarm Nodes
ansible-playbook -i inventory/hosts.ini playbooks/monitoring/deploy_swarm_monitoring.yml --tags swarm
Deploy Only Watchtower Stack
ansible-playbook -i inventory/hosts.ini playbooks/monitoring/deploy_swarm_monitoring.yml --tags watchtower
Update Prometheus Configuration
ansible-playbook -i inventory/hosts.ini playbooks/monitoring/deploy_swarm_monitoring.yml --tags watchtower
Key Concepts
1. Idempotency
All roles are idempotent—running the playbook multiple times produces the same result. This is achieved by:
- Using the `docker_container` module with `state: started` (not `state: restarted`)
- Using handlers for configuration changes
- Checking for existing resources before creation
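One way to spot-check idempotency is to run the playbook twice and confirm that the second run reports no changes. The sketch below is a suggested workflow rather than part of the roles: the helper name `recap_is_idempotent` is hypothetical, and it only inspects the `changed=N` counters that Ansible prints in its PLAY RECAP.

```shell
#!/bin/sh
# Hypothetical helper: read ansible-playbook output on stdin and succeed
# only if every PLAY RECAP line reports changed=0.
# Lines without a changed= counter are ignored.
recap_is_idempotent() {
    awk '
        BEGIN { bad = 0 }
        match($0, /changed=[0-9]+/) {
            # Skip the 8 characters of "changed=" to get the number.
            n = substr($0, RSTART + 8, RLENGTH - 8)
            if (n + 0 > 0) bad = 1
        }
        END { exit bad }
    '
}

# Usage sketch (run after an initial deployment has already succeeded):
#   ansible-playbook -i inventory/hosts.ini \
#     playbooks/monitoring/deploy_swarm_monitoring.yml | recap_is_idempotent \
#     && echo "second run made no changes"
```

A non-zero exit status means at least one host still reported changes on the repeat run, which usually points at a task that is not idempotent.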
2. Service Discovery
Instead of hardcoding IP addresses, Prometheus discovers targets dynamically:
# Static approach (bad - manual updates required)
targets: ['10.0.0.211:9100', '10.0.0.212:9100']
# Dynamic approach (good - auto-scales with inventory)
{% for host in groups['swarm_managers'] %}
- '{{ hostvars[host].ansible_host }}:9100'
{% endfor %}
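To see what the template loop above expands to, the same logic can be mimicked in plain shell. This is only an illustration: `render_targets` is a hypothetical helper, and in the real role the addresses come from `ansible_host` in the inventory rather than from command-line arguments.

```shell
#!/bin/sh
# Hypothetical helper mirroring the Jinja2 loop: print one quoted
# "address:port" list entry per host address.
# Usage: render_targets PORT ADDRESS...
render_targets() {
    port=$1
    shift
    for addr in "$@"; do
        printf '%s\n' "- '${addr}:${port}'"
    done
}

# Example with the two addresses from the static snippet.
render_targets 9100 10.0.0.211 10.0.0.212
```

Adding a third address to the argument list adds a third entry, which is the same effect that adding a host to the inventory group has on the rendered Prometheus configuration.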
3. Security Hardening
- Read-only filesystems: Exporters can't modify system files
- Dropped capabilities: Containers run with minimal permissions
- No new privileges: Prevents privilege escalation
- SSO integration: Dozzle protected by Authentik
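For reference, the first three bullets map onto standard Docker CLI flags. The roles themselves use the `docker_container` module, so the sketch below is an assumed command-line equivalent rather than the roles' actual task definitions; `hardening_flags` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: print the docker run flags that correspond to the
# first three hardening bullets.
hardening_flags() {
    printf '%s ' --read-only                        # read-only root filesystem
    printf '%s ' --cap-drop=ALL                     # drop all Linux capabilities
    printf '%s\n' --security-opt=no-new-privileges  # block privilege escalation
}

# Usage sketch. The image name and mounts follow the upstream
# node-exporter documentation and are assumptions, not this role's defaults:
#   docker run -d --name node-exporter --network host --pid host \
#     $(hardening_flags) -v /:/host:ro,rslave \
#     prom/node-exporter --path.rootfs=/host
```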
4. Desired State
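Docker Compose defines the desired state, and Docker reconciles the actual state toward it (restarting a crashed container relies on the restart policy set for that service):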
- Actual State: "Container X crashed"
- Desired State: "Container X should be running"
- Reconciliation: Docker restarts container X
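The comparison behind reconciliation can be sketched in a few lines of shell. This is a conceptual illustration only, not how Docker implements it; `missing_containers` is a hypothetical helper and the container names are examples.

```shell
#!/bin/sh
# Conceptual sketch: print every desired container name that does not
# appear in the list of running names.
# Usage: missing_containers "DESIRED NAMES" "RUNNING NAMES"
missing_containers() {
    for want in $1; do
        found=no
        for have in $2; do
            if [ "$want" = "$have" ]; then
                found=yes
            fi
        done
        if [ "$found" = no ]; then
            printf '%s\n' "$want"
        fi
    done
}

# Example: grafana has crashed, so it is the only name reported.
missing_containers "prometheus grafana loki" "prometheus loki"
```

On a real host the second argument could come from `docker ps --format '{{.Names}}'`, and the names printed would be the ones that need to be started again.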
Troubleshooting
Exporter Not Reachable
# Check if exporters are running
ansible swarm_hosts -i inventory/hosts.ini -a "docker ps | grep -E 'node-exporter|cadvisor'"
# Test from Watchtower
curl http://10.0.0.211:9100/metrics
curl http://10.0.0.211:8080/metrics
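When several nodes need checking, the two `curl` calls above can be wrapped in a loop. The following is a minimal sketch that assumes the default ports listed earlier (9100 and 8080); `check_exporters` and the overridable `PROBE` variable are hypothetical names introduced for this example.

```shell
#!/bin/sh
# Hypothetical helper: probe the node-exporter (9100) and cAdvisor (8080)
# metrics endpoints on each given address and print one status line per
# endpoint. PROBE can be overridden, for example for a dry run.
PROBE=${PROBE:-curl -fsS -o /dev/null --max-time 5}

check_exporters() {
    for addr in "$@"; do
        for port in 9100 8080; do
            url="http://${addr}:${port}/metrics"
            # $PROBE is intentionally unquoted so its options are split.
            if $PROBE "$url" 2>/dev/null; then
                printf 'OK   %s\n' "$url"
            else
                printf 'DOWN %s\n' "$url"
            fi
        done
    done
}

# Usage: check_exporters 10.0.0.211 10.0.0.212
```

Running it from Watchtower tests the same network path that Prometheus uses when scraping, so a `DOWN` line here usually matches a target shown as down in Prometheus.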
Prometheus Shows Target Down
- Check firewall rules
- Verify exporter is running: `docker ps`
- Check exporter logs: `docker logs node-exporter`
- Test connectivity: `curl http://<ip>:9100/metrics`
Grafana Can't Connect to Prometheus
Grafana runs inside Docker, so use Docker DNS:
- ✅ Data source URL: `http://prometheus:9090`
- ❌ Don't use: `http://localhost:9090` (inside the Grafana container, `localhost` refers to the container itself)
Loki Not Receiving Logs
- Check Promtail is running: `docker ps | grep promtail`
- Check Promtail logs: `docker logs promtail`
- Verify Loki connectivity: `curl http://localhost:3100/ready`
Maintenance
Add New Swarm Node
- Add node to `inventory/hosts.ini` under `[swarm_managers]` or `[swarm_workers]`
- Run the playbook: `ansible-playbook ... deploy_swarm_monitoring.yml`
- Prometheus will automatically discover the new node
Update Monitoring Stack
# Pull latest images and restart
ansible-playbook -i inventory/hosts.ini playbooks/monitoring/deploy_swarm_monitoring.yml
View Current Configuration
# Prometheus config
cat /opt/stacks/watchtower/prometheus-config/prometheus.yml
# Alert rules
cat /opt/stacks/watchtower/prometheus-config/alerts/homelab.yml
# Docker Compose
cat /opt/stacks/watchtower/docker-compose.yml
Recommended Grafana Dashboards
Import these dashboards by ID in Grafana:
| ID | Name | Purpose |
|---|---|---|
| 1860 | Node Exporter Full | Complete host metrics |
| 893 | Docker & System Monitoring | Container resource usage |
| 13639 | Loki Dashboard | Log exploration |
| 14282 | cAdvisor | Detailed container metrics |
Best Practices
- Never hardcode secrets: Use `ansible-vault` or environment variables
- Use labels extensively: Makes filtering in Prometheus/Loki easier
- Set resource limits: Prevent monitoring from consuming excessive resources
- Test before deploying: Use `--check` mode to preview changes
- Version control everything: Commit all configuration changes