Ansible Monitoring Roles

Overview

This directory contains modular Ansible roles for deploying a complete observability stack across a Docker Swarm cluster. The architecture follows the separation of concerns principle, with each role handling a specific monitoring component.

Architecture

┌──────────────────────────────────────────────────────────────┐
│                    WATCHTOWER (Controller)                   │
│ ┌──────────────┐ ┌──────────────┐ ┌───────┐ ┌──────────────┐ │
│ │ Prometheus   │ │ Grafana      │ │ Loki  │ │ Uptime Kuma  │ │
│ │ (Metrics DB) │ │ (Dashboards) │ │(Logs) │ │ (Health)     │ │
│ └──────────────┘ └──────────────┘ └───────┘ └──────────────┘ │
│         ▲              ▲            ▲                        │
└─────────┼──────────────┼────────────┼────────────────────────┘
          │ Scrape       │ Query      │ Push
          │              │            │
┌─────────┴──────────────┴────────────┴────────────────────────┐
│                        SWARM CLUSTER                         │
│  ┌──────────────────┐  ┌──────────────────┐                  │
│  │ Manager Node 1   │  │ Worker Node 1    │                  │
│  │ ┌──────────────┐ │  │ ┌──────────────┐ │    [... more]    │
│  │ │node-exporter │ │  │ │node-exporter │ │                  │
│  │ │  (Host CPU,  │ │  │ │  (Host CPU,  │ │                  │
│  │ │   RAM, Disk) │ │  │ │   RAM, Disk) │ │                  │
│  │ └──────────────┘ │  │ └──────────────┘ │                  │
│  │ ┌──────────────┐ │  │ ┌──────────────┐ │                  │
│  │ │  cAdvisor    │ │  │ │  cAdvisor    │ │                  │
│  │ │ (Container   │ │  │ │ (Container   │ │                  │
│  │ │  Metrics)    │ │  │ │  Metrics)    │ │                  │
│  │ └──────────────┘ │  │ └──────────────┘ │                  │
│  └──────────────────┘  └──────────────────┘                  │
└──────────────────────────────────────────────────────────────┘

Roles

1. swarm_node_exporter

Purpose: Deploys Prometheus node-exporter on each swarm node to collect host-level metrics.

Metrics Collected:

  • CPU usage (per-core and aggregate)
  • Memory usage (total, available, cached)
  • Disk I/O and space
  • Network traffic
  • System load averages

Configuration:

  • Port: 9100 (default)
  • Network: Host mode (for full system visibility)
  • Security: Read-only filesystem, dropped capabilities

Files:

  • defaults/main.yml: Configurable variables
  • tasks/main.yml: Deployment logic

2. swarm_cadvisor

Purpose: Deploys cAdvisor (Container Advisor) on each node to collect container-level resource usage.

Metrics Collected:

  • Per-container CPU usage
  • Per-container memory usage
  • Container network I/O
  • Container disk I/O
  • Container restart counts

Configuration:

  • Port: 8080 (default)
  • Requires: Privileged mode (for cgroup access)

Why cAdvisor + node-exporter?

  • node-exporter: "The host used 80% CPU" (host-level aggregates)
  • cAdvisor: "Container X used 60% of that CPU" (per-container breakdown)
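
One way to see this split on a live node is to compare the metric names each exporter publishes. The sketch below assumes both endpoints serve the Prometheus text format on the default ports listed above; metric_names is a hypothetical helper, not part of the roles.

```shell
# Hypothetical helper: list the distinct metric names found in
# Prometheus text-format output read from stdin.
metric_names() {
  # 1. Drop HELP/TYPE comment lines and blank lines.
  # 2. Keep only the metric name (everything before labels or the value).
  # 3. Collapse repeated series into one line per name.
  grep -Ev '^(#|$)' |
    sed -E 's/[{ ].*$//' |
    sort -u
}
```

For example, `curl -s http://10.0.0.211:9100/metrics | metric_names | grep '^node_cpu'` should list host-level series such as node_cpu_seconds_total, while the same pipeline against port 8080 with `grep '^container_cpu'` should list per-container series such as container_cpu_usage_seconds_total.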

3. monitoring_stack

Purpose: Deploys the complete monitoring infrastructure on Watchtower (controller node).

Components:

  • Prometheus: Metrics time-series database with service discovery
  • Grafana: Visualization and dashboarding
  • Loki: Log aggregation and indexing
  • Promtail: Log shipper (sends logs from Docker to Loki)
  • Uptime Kuma: HTTP/TCP health monitoring
  • Dozzle: Real-time Docker log viewer
  • traefik-kop: Traefik configuration sync

Key Features:

  • Dynamic Target Discovery: Prometheus scrape configs are generated from Ansible inventory
  • Alert Rules: Pre-configured alerts for CPU, memory, disk, and node availability
  • Security: Dozzle protected by Authentik SSO
  • Retention: Configurable data retention policies

Configuration:

  • defaults/main.yml: Ports, domains, retention periods
  • templates/prometheus.yml.j2: Scrape configuration with inventory loop
  • templates/alert-rules.yml.j2: Alerting rules
  • templates/loki-config.yml.j2: Log retention and indexing
  • templates/docker-compose.yml.j2: Complete stack definition

Usage

Deploy Complete Monitoring Stack

cd /home/chester/homelab/ansible
ansible-playbook -i inventory/hosts.ini playbooks/monitoring/deploy_swarm_monitoring.yml

Deploy Only to Swarm Nodes

ansible-playbook -i inventory/hosts.ini playbooks/monitoring/deploy_swarm_monitoring.yml --tags swarm

Deploy Only Watchtower Stack

ansible-playbook -i inventory/hosts.ini playbooks/monitoring/deploy_swarm_monitoring.yml --tags watchtower

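Update Prometheus Configuration

After editing templates/prometheus.yml.j2 or the inventory, re-run the Watchtower tag to re-render and apply the configuration:

ansible-playbook -i inventory/hosts.ini playbooks/monitoring/deploy_swarm_monitoring.yml --tags watchtower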

Key Concepts

1. Idempotency

All roles are idempotent: running the playbook multiple times produces the same result. This is achieved by:

  • Using the docker_container module with state: started instead of forcing a restart on every run (restart: true)
  • Using handlers so services restart only when their configuration changes
  • Checking for existing resources before creation
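
A quick way to confirm idempotency in practice is to run the playbook twice and inspect the PLAY RECAP of the second run, which should report changed=0 for every host. The helper below is a minimal sketch: recap_is_idempotent is a hypothetical name, and it assumes the default recap format printed by ansible-playbook.

```shell
# Hypothetical helper: succeed only if no host in the playbook output
# read from stdin reports a non-zero "changed" count.
recap_is_idempotent() {
  if grep -Eq 'changed=[1-9]'; then
    echo "not idempotent: at least one host reported changes" >&2
    return 1
  fi
  echo "idempotent: every host reported changed=0"
}
```

On the second run, pipe the output through it: `ansible-playbook -i inventory/hosts.ini playbooks/monitoring/deploy_swarm_monitoring.yml | recap_is_idempotent`. The helper only looks for non-zero counts, so it is meaningful only for a run that completed and printed a recap.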

2. Service Discovery

Instead of hardcoding IP addresses, the playbook renders the Prometheus scrape targets from the Ansible inventory on every run:

# Static approach (bad - manual updates required)
targets: ['10.0.0.211:9100', '10.0.0.212:9100']

# Dynamic approach (good - follows the inventory)
targets:
{% for host in groups['swarm_managers'] %}
  - '{{ hostvars[host].ansible_host }}:9100'
{% endfor %}
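
After a run, the rendered file can be checked against the hosts you expect to be scraped. This is a minimal sketch: missing_targets is a hypothetical helper, and it assumes node-exporter targets appear in the rendered file as host:9100, as in the template loop above.

```shell
# Hypothetical helper: print every host whose node-exporter target
# (host:9100) is absent from the rendered Prometheus config.
# Usage: missing_targets <config-file> <host> [<host> ...]
missing_targets() {
  config="$1"
  shift
  for host in "$@"; do
    # -F matches the address literally, so dots are not regex wildcards.
    grep -Fq "${host}:9100" "$config" || echo "$host"
  done
}
```

For example, `missing_targets /opt/stacks/watchtower/prometheus-config/prometheus.yml 10.0.0.211 10.0.0.212` prints nothing when both hosts were rendered into the file.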

3. Security Hardening

  • Read-only filesystems: Exporters can't modify system files
  • Dropped capabilities: Containers run with minimal permissions
  • No new privileges: Prevents privilege escalation
  • SSO integration: Dozzle protected by Authentik

4. Desired State

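The Docker Compose file defines the desired state, and Docker reconciles the running containers against it (automatic restarts depend on each service declaring a restart policy):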

  • Actual State: "Container X crashed"
  • Desired State: "Container X should be running"
  • Reconciliation: Docker restarts container X
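
Whether Docker actually performs that restart depends on the restart policy of the container. The sketch below reads the policy with docker inspect; check_restart_policy is a hypothetical helper, and the container name in the usage line is an assumption about how the stack names its services.

```shell
# Hypothetical helper: report whether Docker will restart the named
# container on its own after a crash.
check_restart_policy() {
  # Ask the Docker engine for the configured restart policy name.
  policy="$(docker inspect --format '{{.HostConfig.RestartPolicy.Name}}' "$1")" || return 2
  case "$policy" in
    always|unless-stopped|on-failure)
      echo "$1: restart policy '$policy'"
      ;;
    *)
      # "no" or an empty value means a crashed container stays down.
      echo "$1: no restart policy, Docker will not restart it"
      return 1
      ;;
  esac
}
```

For example, `check_restart_policy grafana` reports the policy that the running Grafana container was started with.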

Troubleshooting

Exporter Not Reachable

# Check if exporters are running (the pipe requires the shell module;
# the default command module does not interpret "|")
ansible swarm_hosts -i inventory/hosts.ini -m shell -a "docker ps | grep -E 'node-exporter|cadvisor'"

# Test from Watchtower
curl http://10.0.0.211:9100/metrics
curl http://10.0.0.211:8080/metrics

Prometheus Shows Target Down

  1. Check firewall rules
  2. Verify exporter is running: docker ps
  3. Check exporter logs: docker logs node-exporter
  4. Test connectivity: curl http://<ip>:9100/metrics
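
The connectivity test in step 4 can be repeated across several nodes and both exporter ports in one pass. This is a minimal sketch assuming the default ports 9100 and 8080 from the role descriptions; check_exporters is a hypothetical helper.

```shell
# Hypothetical helper: probe the node-exporter (9100) and cAdvisor (8080)
# metrics endpoints on every host passed as an argument.
check_exporters() {
  rc=0
  for host in "$@"; do
    for port in 9100 8080; do
      # -f turns HTTP errors into a non-zero exit; --max-time bounds
      # the wait when a firewall silently drops the connection.
      if curl -fsS --max-time 5 -o /dev/null "http://${host}:${port}/metrics"; then
        echo "OK   ${host}:${port}"
      else
        echo "DOWN ${host}:${port}"
        rc=1
      fi
    done
  done
  return "$rc"
}
```

Run from Watchtower, `check_exporters 10.0.0.211 10.0.0.212` prints one line per endpoint and returns non-zero if any of them is unreachable.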

Grafana Can't Connect to Prometheus

Grafana runs inside Docker, so use Docker DNS:

  • Data source URL: http://prometheus:9090
  • Don't use: http://localhost:9090

Loki Not Receiving Logs

  1. Check Promtail is running: docker ps | grep promtail
  2. Check Promtail logs: docker logs promtail
  3. Verify Loki connectivity: curl http://localhost:3100/ready
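
Loki may need a short time after a restart before its readiness endpoint answers, so a single curl can give a misleading failure. The sketch below polls the endpoint from step 3 a few times; wait_for_loki and the LOKI_POLL_INTERVAL variable are hypothetical names, and the retry count and interval are arbitrary defaults.

```shell
# Hypothetical helper: poll Loki's readiness endpoint until it answers
# successfully, or give up after a fixed number of attempts.
# Usage: wait_for_loki [url] [attempts]
wait_for_loki() {
  url="${1:-http://localhost:3100/ready}"
  attempts="${2:-10}"
  i=1
  while [ "$i" -le "$attempts" ]; do
    if curl -fsS --max-time 5 -o /dev/null "$url"; then
      echo "Loki ready after ${i} attempt(s)"
      return 0
    fi
    i=$((i + 1))
    # Pause between attempts (seconds).
    sleep "${LOKI_POLL_INTERVAL:-3}"
  done
  echo "Loki not ready after ${attempts} attempt(s)" >&2
  return 1
}
```

If this keeps failing while the container is running, the Loki logs are the next place to look before investigating Promtail.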

Maintenance

Add New Swarm Node

  1. Add node to inventory/hosts.ini under [swarm_managers] or [swarm_workers]
  2. Run the playbook: ansible-playbook ... deploy_swarm_monitoring.yml
  3. The re-rendered Prometheus configuration picks up the new node; no manual target edits are needed

Update Monitoring Stack

# Pull latest images and restart
ansible-playbook -i inventory/hosts.ini playbooks/monitoring/deploy_swarm_monitoring.yml

View Current Configuration

# Prometheus config
cat /opt/stacks/watchtower/prometheus-config/prometheus.yml

# Alert rules
cat /opt/stacks/watchtower/prometheus-config/alerts/homelab.yml

# Docker Compose
cat /opt/stacks/watchtower/docker-compose.yml

Grafana Dashboards

Import these dashboards by ID in Grafana:

ID      Name                         Purpose
1860    Node Exporter Full           Complete host metrics
893     Docker & System Monitoring   Container resource usage
13639   Loki Dashboard               Log exploration
14282   cAdvisor                     Detailed container metrics

Best Practices

  1. Never hardcode secrets: Use ansible-vault or environment variables
  2. Use labels extensively: Makes filtering in Prometheus/Loki easier
  3. Set resource limits: Prevent monitoring from consuming excessive resources
  4. Test before deploying: Use --check mode to preview changes
  5. Version control everything: Commit all configuration changes
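
Practice 4 can be wrapped in a small helper so the dry run is always invoked the same way. This is a sketch under assumptions: preview_monitoring and the ANSIBLE_PLAYBOOK override are hypothetical, and it must be run from the directory that contains inventory/ and playbooks/.

```shell
# Hypothetical wrapper: dry-run the monitoring playbook and show the
# changes it would make. Extra arguments (e.g. --tags) are passed through.
ANSIBLE_PLAYBOOK="${ANSIBLE_PLAYBOOK:-ansible-playbook}"

preview_monitoring() {
  # --check reports what would change without applying it;
  # --diff prints the differences for templated files.
  "$ANSIBLE_PLAYBOOK" -i inventory/hosts.ini \
    playbooks/monitoring/deploy_swarm_monitoring.yml \
    --check --diff "$@"
}
```

For example, `preview_monitoring --tags watchtower` previews only the Watchtower stack. Check mode is an approximation: tasks that do not support it are skipped, so a clean preview does not guarantee a clean deployment.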

Further Reading