# Ansible Monitoring Roles

## Overview

This directory contains modular Ansible roles for deploying a complete observability stack across a Docker Swarm cluster. The architecture follows the **separation of concerns** principle, with each role handling a specific monitoring component.

## Architecture

```
┌─────────────────────────────────────────────────────────────────┐
│                     WATCHTOWER (Controller)                     │
│  ┌──────────────┐  ┌────────────┐  ┌───────┐  ┌──────────────┐  │
│  │  Prometheus  │  │  Grafana   │  │ Loki  │  │ Uptime Kuma  │  │
│  │ (Metrics DB) │  │(Dashboards)│  │(Logs) │  │   (Health)   │  │
│  └──────────────┘  └────────────┘  └───────┘  └──────────────┘  │
│          ▲               ▲             ▲                        │
└──────────┼───────────────┼─────────────┼────────────────────────┘
           │ Scrape        │ Query       │ Push
           │               │             │
┌──────────┴───────────────┴─────────────┴────────────────────────┐
│                          SWARM CLUSTER                          │
│  ┌──────────────────┐  ┌──────────────────┐                     │
│  │  Manager Node 1  │  │  Worker Node 1   │                     │
│  │ ┌──────────────┐ │  │ ┌──────────────┐ │  [... more]         │
│  │ │node-exporter │ │  │ │node-exporter │ │                     │
│  │ │ (Host CPU,   │ │  │ │ (Host CPU,   │ │                     │
│  │ │  RAM, Disk)  │ │  │ │  RAM, Disk)  │ │                     │
│  │ └──────────────┘ │  │ └──────────────┘ │                     │
│  │ ┌──────────────┐ │  │ ┌──────────────┐ │                     │
│  │ │   cAdvisor   │ │  │ │   cAdvisor   │ │                     │
│  │ │ (Container   │ │  │ │ (Container   │ │                     │
│  │ │  Metrics)    │ │  │ │  Metrics)    │ │                     │
│  │ └──────────────┘ │  │ └──────────────┘ │                     │
│  └──────────────────┘  └──────────────────┘                     │
└─────────────────────────────────────────────────────────────────┘
```

## Roles

### 1. `swarm_node_exporter`

**Purpose:** Deploys Prometheus node-exporter on each swarm node to collect host-level metrics.
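
As a rough illustration of what such a deployment task could look like, the sketch below uses the `community.docker.docker_container` module with the settings this README lists for the role (port 9100, host networking, read-only filesystem, dropped capabilities). It is not copied from `tasks/main.yml`: the variable `node_exporter_image`, the image tag, the restart policy, and the mount layout are assumptions made for the example.

```yaml
# Hypothetical sketch only; the real role's tasks/main.yml may differ.
- name: Run node-exporter on the swarm node
  community.docker.docker_container:
    name: node-exporter
    # node_exporter_image is an assumed variable name; pin a version in practice.
    image: "{{ node_exporter_image | default('prom/node-exporter:latest') }}"
    state: started              # converge to "running" without forcing a restart
    restart_policy: unless-stopped
    network_mode: host          # exporter listens directly on the node's port 9100
    pid_mode: host
    read_only: true             # read-only root filesystem
    cap_drop:
      - ALL                     # drop all Linux capabilities
    security_opts:
      - "no-new-privileges:true"
    volumes:
      - "/:/host:ro,rslave"     # expose the host filesystem read-only
    command:
      - "--path.rootfs=/host"
      - "--web.listen-address=:9100"
```

Because the task declares `state: started` rather than restarting unconditionally, rerunning it against a node where the container already matches this definition should report no change.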

**Metrics Collected:**
- CPU usage (per-core and aggregate)
- Memory usage (total, available, cached)
- Disk I/O and space
- Network traffic
- System load averages

**Configuration:**
- Port: 9100 (default)
- Network: Host mode (for full system visibility)
- Security: Read-only filesystem, dropped capabilities

**Files:**
- `defaults/main.yml`: Configurable variables
- `tasks/main.yml`: Deployment logic

### 2. `swarm_cadvisor`

**Purpose:** Deploys cAdvisor (Container Advisor) on each node to collect container-level resource usage.

**Metrics Collected:**
- Per-container CPU usage
- Per-container memory usage
- Container network I/O
- Container disk I/O
- Container restart counts

**Configuration:**
- Port: 8080 (default)
- Requires: Privileged mode (for cgroup access)

**Why cAdvisor + node-exporter?**
- node-exporter: "The host used 80% CPU" (host-level aggregates)
- cAdvisor: "Container X used 60% of that CPU" (per-container breakdown)

### 3. `monitoring_stack`

**Purpose:** Deploys the complete monitoring infrastructure on Watchtower (controller node).

**Components:**
- **Prometheus:** Metrics time-series database with service discovery
- **Grafana:** Visualization and dashboarding
- **Loki:** Log aggregation and indexing
- **Promtail:** Log shipper (sends logs from Docker to Loki)
- **Uptime Kuma:** HTTP/TCP health monitoring
- **Dozzle:** Real-time Docker log viewer
- **traefik-kop:** Traefik configuration sync

**Key Features:**
- **Dynamic Target Discovery:** Prometheus scrape configs are generated from Ansible inventory
- **Alert Rules:** Pre-configured alerts for CPU, memory, disk, and node availability
- **Security:** Dozzle protected by Authentik SSO
- **Retention:** Configurable data retention policies

**Configuration:**
- `defaults/main.yml`: Ports, domains, retention periods
- `templates/prometheus.yml.j2`: Scrape configuration with inventory loop
- `templates/alert-rules.yml.j2`: Alerting rules
- `templates/loki-config.yml.j2`: Log retention and indexing
- `templates/docker-compose.yml.j2`: Complete stack definition

## Usage

### Deploy Complete Monitoring Stack

```bash
cd /home/chester/homelab/ansible
ansible-playbook -i inventory/hosts.ini playbooks/monitoring/deploy_swarm_monitoring.yml
```

### Deploy Only to Swarm Nodes

```bash
ansible-playbook -i inventory/hosts.ini playbooks/monitoring/deploy_swarm_monitoring.yml --tags swarm
```

### Deploy Only Watchtower Stack

```bash
ansible-playbook -i inventory/hosts.ini playbooks/monitoring/deploy_swarm_monitoring.yml --tags watchtower
```

### Update Prometheus Configuration

The Prometheus configuration is part of the Watchtower stack, so the same tag applies:

```bash
ansible-playbook -i inventory/hosts.ini playbooks/monitoring/deploy_swarm_monitoring.yml --tags watchtower
```

## Key Concepts

### 1. Idempotency

All roles are **idempotent**: running the playbook multiple times produces the same result. This is achieved by:
- Using the `docker_container` module with `state: started` (not `state: restarted`)
- Using handlers for configuration changes
- Checking for existing resources before creation

### 2. Service Discovery

Instead of hardcoding IP addresses, the Prometheus target list is rendered from the Ansible inventory each time the playbook runs:

```yaml
# Static approach (bad - manual updates required)
targets: ['10.0.0.211:9100', '10.0.0.212:9100']

# Dynamic approach (good - auto-scales with inventory)
targets:
{% for host in groups['swarm_managers'] %}
  - '{{ hostvars[host].ansible_host }}:9100'
{% endfor %}
```

### 3. Security Hardening

- **Read-only filesystems:** Exporters can't modify system files
- **Dropped capabilities:** Containers run with minimal permissions
- **No new privileges:** Prevents privilege escalation
- **SSO integration:** Dozzle protected by Authentik

### 4. Desired State

Docker Compose defines the **desired state**. Docker continuously reconciles:
- **Actual State:** "Container X crashed"
- **Desired State:** "Container X should be running"
- **Reconciliation:** Docker restarts container X

## Troubleshooting

### Exporter Not Reachable

```bash
# Check if exporters are running (the shell module is needed for the pipe)
ansible swarm_hosts -i inventory/hosts.ini -m shell -a "docker ps | grep -E 'node-exporter|cadvisor'"

# Test from Watchtower
curl http://10.0.0.211:9100/metrics
curl http://10.0.0.211:8080/metrics
```

### Prometheus Shows Target Down

1. Check firewall rules
2. Verify exporter is running: `docker ps`
3. Check exporter logs: `docker logs node-exporter`
4. Test connectivity: `curl http://<node-ip>:9100/metrics`

### Grafana Can't Connect to Prometheus

Grafana runs inside Docker, so use Docker DNS:
- ✅ Data source URL: `http://prometheus:9090`
- ❌ Don't use: `http://localhost:9090`

### Loki Not Receiving Logs

1. Check Promtail is running: `docker ps | grep promtail`
2. Check Promtail logs: `docker logs promtail`
3. Verify Loki connectivity: `curl http://localhost:3100/ready`

## Maintenance

### Add New Swarm Node

1. Add node to `inventory/hosts.ini` under `[swarm_managers]` or `[swarm_workers]`
2. Run the playbook: `ansible-playbook ... deploy_swarm_monitoring.yml`
3. Prometheus picks up the new node once the playbook has regenerated its scrape configuration

### Update Monitoring Stack

```bash
# Pull latest images and restart
ansible-playbook -i inventory/hosts.ini playbooks/monitoring/deploy_swarm_monitoring.yml
```

### View Current Configuration

```bash
# Prometheus config
cat /opt/stacks/watchtower/prometheus-config/prometheus.yml

# Alert rules
cat /opt/stacks/watchtower/prometheus-config/alerts/homelab.yml

# Docker Compose
cat /opt/stacks/watchtower/docker-compose.yml
```

## Recommended Grafana Dashboards

Import these dashboards by ID in Grafana:

| ID | Name | Purpose |
|----|------|---------|
| 1860 | Node Exporter Full | Complete host metrics |
| 893 | Docker & System Monitoring | Container resource usage |
| 13639 | Loki Dashboard | Log exploration |
| 14282 | cAdvisor | Detailed container metrics |

## Best Practices

1. **Never hardcode secrets:** Use `ansible-vault` or environment variables
2. **Use labels extensively:** Makes filtering in Prometheus/Loki easier
3. **Set resource limits:** Prevent monitoring from consuming excessive resources
4. **Test before deploying:** Use `--check` mode to preview changes
5. **Version control everything:** Commit all configuration changes

## Further Reading

- [Prometheus Documentation](https://prometheus.io/docs/)
- [Grafana Loki](https://grafana.com/docs/loki/latest/)
- [Ansible Best Practices](https://docs.ansible.com/ansible/latest/user_guide/playbooks_best_practices.html)
- [Docker Swarm Monitoring](https://docs.docker.com/engine/swarm/swarm-tutorial/)