This directory contains modular Ansible roles for deploying a complete observability stack across a Docker Swarm cluster. The architecture follows the separation of concerns principle, with each role handling a specific monitoring component.

Architecture

┌─────────────────────────────────────────────────────────────────┐
│                     WATCHTOWER (Controller)                     │
│  ┌──────────────┐ ┌──────────────┐ ┌────────┐ ┌──────────────┐  │
│  │ Prometheus   │ │ Grafana      │ │ Loki   │ │ Uptime Kuma  │  │
│  │ (Metrics DB) │ │ (Dashboards) │ │ (Logs) │ │ (Health)     │  │
│  └──────────────┘ └──────────────┘ └────────┘ └──────────────┘  │
│         ▲                ▲             ▲                        │
└─────────┼────────────────┼─────────────┼────────────────────────┘
          │ Scrape         │ Query       │ Push
          │                │             │
┌─────────┴────────────────┴─────────────┴────────────────────────┐
│                          SWARM CLUSTER                          │
│  ┌──────────────────┐  ┌──────────────────┐                     │
│  │ Manager Node 1   │  │ Worker Node 1    │                     │
│  │ ┌──────────────┐ │  │ ┌──────────────┐ │   [... more]        │
│  │ │node-exporter │ │  │ │node-exporter │ │                     │
│  │ │  (Host CPU,  │ │  │ │  (Host CPU,  │ │                     │
│  │ │   RAM, Disk) │ │  │ │   RAM, Disk) │ │                     │
│  │ └──────────────┘ │  │ └──────────────┘ │                     │
│  │ ┌──────────────┐ │  │ ┌──────────────┐ │                     │
│  │ │  cAdvisor    │ │  │ │  cAdvisor    │ │                     │
│  │ │ (Container   │ │  │ │ (Container   │ │                     │
│  │ │  Metrics)    │ │  │ │  Metrics)    │ │                     │
│  │ └──────────────┘ │  │ └──────────────┘ │                     │
│  └──────────────────┘  └──────────────────┘                     │
└─────────────────────────────────────────────────────────────────┘

Roles

1. `swarm_node_exporter`

Purpose: Deploys Prometheus node-exporter on each swarm node to collect host-level metrics.

Metrics Collected:

CPU usage (per-core and aggregate)
Memory usage (total, available, cached)
Disk I/O and space
Network traffic
System load averages

Configuration:

Port: 9100 (default)
Network: Host mode (for full system visibility)
Security: Read-only filesystem, dropped capabilities

Files:

defaults/main.yml: Configurable variables
tasks/main.yml: Deployment logic
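
As a rough illustration of what the deployment logic in tasks/main.yml might look like, the sketch below runs node-exporter with the community.docker.docker_container module. The image tag, the variable name node_exporter_port, and the mount layout are assumptions made for this example, not the role's actual contents.

```yaml
# Hypothetical sketch of tasks/main.yml; variable names and image tag are assumptions.
- name: Run node-exporter on the swarm node
  community.docker.docker_container:
    name: node-exporter
    image: prom/node-exporter:latest      # pin a specific version in practice
    state: started
    restart_policy: unless-stopped
    network_mode: host                    # exporter listens directly on the host network
    pid_mode: host
    volumes:
      - /:/host:ro,rslave                 # host filesystem, mounted read-only
    command:
      - --path.rootfs=/host
      - "--web.listen-address=:{{ node_exporter_port | default(9100) }}"
```

With host networking there is no port mapping to publish; the listen address flag alone decides which port the exporter binds to.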

2. `swarm_cadvisor`

Purpose: Deploys cAdvisor (Container Advisor) on each node to collect container-level resource usage.

Metrics Collected:

Per-container CPU usage
Per-container memory usage
Container network I/O
Container disk I/O
Container restarts (inferred from changes in container_start_time_seconds; cAdvisor exposes no direct restart counter)

Configuration:

Port: 8080 (default)
Requires: Privileged mode (for cgroup access)
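
A minimal sketch of how the role could start cAdvisor, assuming the upstream gcr.io/cadvisor/cadvisor image and two hypothetical variables (cadvisor_version, cadvisor_port) that are not taken from this repository. The mounts mirror the ones suggested in the cAdvisor project's own run instructions.

```yaml
# Hypothetical sketch; cadvisor_version and cadvisor_port are illustrative variable names.
- name: Run cAdvisor on the swarm node
  community.docker.docker_container:
    name: cadvisor
    image: "gcr.io/cadvisor/cadvisor:{{ cadvisor_version }}"
    state: started
    restart_policy: unless-stopped
    privileged: true                      # needed for cgroup access
    published_ports:
      - "{{ cadvisor_port | default(8080) }}:8080"
    devices:
      - /dev/kmsg:/dev/kmsg
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
      - /dev/disk/:/dev/disk:ro
```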

Why cAdvisor + node-exporter?

node-exporter: "The host used 80% CPU" (host-level aggregates)
cAdvisor: "Container X used 60% of that CPU" (per-container breakdown)
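
To make the distinction concrete, the two views can be expressed as Prometheus recording rules. This is only a sketch: the group and rule names are invented for illustration, and the 5-minute rate window is an arbitrary choice.

```yaml
groups:
  - name: cpu_breakdown_examples          # hypothetical group name
    rules:
      # Host view (node-exporter): fraction of CPU time that was not idle, per host.
      - record: instance:node_cpu_utilisation:ratio
        expr: '1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))'
      # Container view (cAdvisor): CPU cores consumed by each named container.
      - record: container:cpu_usage_cores:rate5m
        expr: 'sum by (instance, name) (rate(container_cpu_usage_seconds_total{name!=""}[5m]))'
```

The first rule answers "how busy is the host", the second "which container is responsible".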

3. `monitoring_stack`

Purpose: Deploys the complete monitoring infrastructure on Watchtower (controller node).

Components:

Prometheus: Metrics time-series database with service discovery
Grafana: Visualization and dashboarding
Loki: Log aggregation and indexing
Promtail: Log shipper (sends logs from Docker to Loki)
Uptime Kuma: HTTP/TCP health monitoring
Dozzle: Real-time Docker log viewer
traefik-kop: Publishes Traefik routing configuration from container labels to a Traefik instance on another host (via Redis)

Key Features:

Dynamic Target Discovery: Prometheus scrape configs are generated from Ansible inventory
Alert Rules: Pre-configured alerts for CPU, memory, disk, and node availability
Security: Dozzle protected by Authentik SSO
Retention: Configurable data retention policies
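
For orientation, alert rules of the kind described above might look like the following once rendered. The thresholds, durations, group name, and the job label value node-exporter are assumptions, not values taken from this repository's templates.

```yaml
groups:
  - name: homelab_example_alerts          # hypothetical group name
    rules:
      - alert: NodeDown
        expr: 'up{job="node-exporter"} == 0'
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} is unreachable"
      - alert: HighMemoryUsage
        expr: '(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) > 0.90'
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Memory usage above 90% on {{ $labels.instance }}"
      - alert: DiskSpaceLow
        expr: 'node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"} < 0.10'
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Less than 10% disk space left on {{ $labels.instance }}"
```

Note that Prometheus annotation templates use the same double-brace syntax as Jinja2, so inside an Ansible .j2 template they generally need to be wrapped in {% raw %} ... {% endraw %} to reach the rendered file unchanged.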

Configuration:

defaults/main.yml: Ports, domains, retention periods
templates/prometheus.yml.j2: Scrape configuration with inventory loop
templates/alert-rules.yml.j2: Alerting rules
templates/loki-config.yml.j2: Log retention and indexing
templates/docker-compose.yml.j2: Complete stack definition
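
The fragment below sketches how two of these components could be wired together in the rendered Compose file. It is deliberately incomplete, and the image tags, the 30d retention value, and the volume names are assumptions for illustration.

```yaml
# Illustrative excerpt of a rendered docker-compose.yml; values are assumptions.
services:
  prometheus:
    image: prom/prometheus:latest
    command:
      - --config.file=/etc/prometheus/prometheus.yml
      - --storage.tsdb.path=/prometheus
      - --storage.tsdb.retention.time=30d   # assumed retention period
    volumes:
      - ./prometheus-config:/etc/prometheus:ro
      - prometheus-data:/prometheus
    ports:
      - "9090:9090"

  grafana:
    image: grafana/grafana:latest
    depends_on:
      - prometheus
    volumes:
      - grafana-data:/var/lib/grafana
    ports:
      - "3000:3000"

volumes:
  prometheus-data:
  grafana-data:
```

Overriding command replaces the image's default arguments, which is why the config file and storage path are restated next to the retention flag.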

Usage

Deploy Complete Monitoring Stack

cd /home/chester/homelab/ansible
ansible-playbook -i inventory/hosts.ini playbooks/monitoring/deploy_swarm_monitoring.yml

Deploy Only to Swarm Nodes

ansible-playbook -i inventory/hosts.ini playbooks/monitoring/deploy_swarm_monitoring.yml --tags swarm

Deploy Only Watchtower Stack

ansible-playbook -i inventory/hosts.ini playbooks/monitoring/deploy_swarm_monitoring.yml --tags watchtower

Update Prometheus Configuration

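The scrape configuration is rendered from templates/prometheus.yml.j2 by the monitoring_stack role, so re-running the Watchtower-tagged tasks regenerates it:

ansible-playbook -i inventory/hosts.ini playbooks/monitoring/deploy_swarm_monitoring.yml --tags watchtower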
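
The --tags filters above only work if the playbook tags its plays accordingly. A plausible layout is sketched here; the play names and the watchtower host pattern are assumptions, while the role names and the swarm_managers and swarm_workers groups come from this document.

```yaml
# Hypothetical structure of deploy_swarm_monitoring.yml.
- name: Deploy exporters to swarm nodes
  hosts: swarm_managers:swarm_workers
  become: true
  tags: [swarm]
  roles:
    - swarm_node_exporter
    - swarm_cadvisor

- name: Deploy monitoring stack to Watchtower
  hosts: watchtower                       # assumed inventory name for the controller
  become: true
  tags: [watchtower]
  roles:
    - monitoring_stack
```

Tagging at the play level means every task inside the listed roles inherits the tag, so no per-task tags are needed.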

Key Concepts

1. Idempotency

All roles are idempotent: running the playbook multiple times produces the same result. This is achieved by:

Using the docker_container module with state: started, which leaves an already-running container untouched (rather than forcing restart: true on every run)
Using handlers for configuration changes
Checking for existing resources before creation
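
The handler pattern mentioned above can be sketched as follows: the template task reports "changed" only when the rendered file differs, and the handler runs only when notified. The handler name and the use of docker compose restart are assumptions; the destination path matches the one listed under View Current Configuration.

```yaml
# tasks/main.yml (sketch)
- name: Render Prometheus configuration
  ansible.builtin.template:
    src: prometheus.yml.j2
    dest: /opt/stacks/watchtower/prometheus-config/prometheus.yml
    mode: "0644"
  notify: Restart Prometheus

# handlers/main.yml (sketch)
- name: Restart Prometheus
  ansible.builtin.command:
    cmd: docker compose restart prometheus   # assumes the Compose service is named "prometheus"
    chdir: /opt/stacks/watchtower
```

On a second run with an unchanged inventory, the template task reports "ok" and Prometheus is not restarted.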

2. Service Discovery

Instead of hardcoding IP addresses, Prometheus discovers targets dynamically:

# Static approach (bad - manual updates required)
targets: ['10.0.0.211:9100', '10.0.0.212:9100']

# Dynamic approach (good - auto-scales with inventory)
targets:
{% for host in groups['swarm_managers'] %}
  - '{{ hostvars[host].ansible_host }}:9100'
{% endfor %}
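
For illustration, if the swarm_managers group contained two hosts whose ansible_host values were 10.0.0.211 and 10.0.0.212, a scrape job built around that loop would render to roughly the following. The job name and scrape interval are assumptions.

```yaml
scrape_configs:
  - job_name: node-exporter               # assumed job name
    scrape_interval: 15s                  # assumed interval
    static_configs:
      - targets:
          - '10.0.0.211:9100'
          - '10.0.0.212:9100'
```

Adding a third host to the inventory group and re-running the playbook would add a third entry to this list without touching the template.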

3. Security Hardening

Read-only filesystems: Exporters can't modify system files
Dropped capabilities: Containers run with minimal permissions
No new privileges: Prevents privilege escalation
SSO integration: Dozzle protected by Authentik
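
The first three items map directly onto parameters of the community.docker.docker_container module. The task below is a stripped-down sketch that shows only those parameters; the container name and image tag are placeholders.

```yaml
# Sketch of the hardening-related parameters only; not a complete task.
- name: Run an exporter with a hardened container configuration
  community.docker.docker_container:
    name: node-exporter
    image: prom/node-exporter:latest
    state: started
    read_only: true                       # container root filesystem is read-only
    cap_drop:
      - ALL                               # drop every Linux capability
    security_opts:
      - "no-new-privileges:true"          # processes cannot gain privileges via setuid binaries
```

These options do not apply cleanly to cAdvisor, which this document describes as requiring privileged mode.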

4. Desired State

The Compose file declares the desired state. Docker then moves the actual state back toward it, for example by restarting a crashed container when a restart policy is set:

Actual State: "Container X crashed"
Desired State: "Container X should be running"
Reconciliation: Docker restarts container X
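
In Compose terms, this behaviour comes from the restart policy on each service. The fragment below is a sketch with assumed values; the health check uses Prometheus's /-/healthy endpoint and the wget binary shipped in the prom/prometheus image.

```yaml
services:
  prometheus:
    image: prom/prometheus:latest
    restart: unless-stopped               # restart after a crash or daemon restart
    healthcheck:
      # Health status is reported in `docker ps`; outside Swarm services,
      # Docker does not restart a container merely for being unhealthy.
      test: ["CMD", "wget", "-q", "--spider", "http://localhost:9090/-/healthy"]
      interval: 30s
      timeout: 5s
      retries: 3
```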

Troubleshooting

Exporter Not Reachable

# Check if exporters are running
ansible swarm_hosts -i inventory/hosts.ini -m shell -a "docker ps | grep -E 'node-exporter|cadvisor'"

# Test from Watchtower
curl http://10.0.0.211:9100/metrics
curl http://10.0.0.211:8080/metrics
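
The same connectivity test can be run against every node at once with a small playbook. This is a sketch: the watchtower host pattern is an assumption, and it relies on each inventory host defining ansible_host and belonging to the swarm_hosts group used above.

```yaml
# Hypothetical check playbook; run it with the same inventory as the deployment.
- name: Verify exporter endpoints from Watchtower
  hosts: watchtower                       # assumed inventory name for the controller
  gather_facts: false
  tasks:
    - name: Check node-exporter (9100) and cAdvisor (8080) on every swarm node
      ansible.builtin.uri:
        url: "http://{{ hostvars[item.0].ansible_host }}:{{ item.1 }}/metrics"
        status_code: 200
        timeout: 5
      loop: "{{ groups['swarm_hosts'] | product([9100, 8080]) | list }}"
      loop_control:
        label: "{{ item.0 }}:{{ item.1 }}"
```

Because the play targets the controller, the HTTP requests originate from Watchtower, which is the path Prometheus itself uses when scraping.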

Prometheus Shows Target Down

Check firewall rules
Verify exporter is running: docker ps
Check exporter logs: docker logs node-exporter
Test connectivity: curl http://<ip>:9100/metrics

Grafana Can't Connect to Prometheus

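Grafana runs in its own container, where localhost refers to the Grafana container itself rather than to Prometheus. Use the Prometheus service name, which Docker DNS resolves on the shared network: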

✅ Data source URL: http://prometheus:9090
❌ Don't use: http://localhost:9090
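
If the data source is provisioned from a file rather than configured in the UI, the same URL applies. A minimal provisioning file could look like this; the data source name and the choice to make it the default are assumptions.

```yaml
# Grafana data source provisioning file (placed under provisioning/datasources/).
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy                         # Grafana's backend makes the request, so Docker DNS applies
    url: http://prometheus:9090
    isDefault: true
```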

Loki Not Receiving Logs

Check Promtail is running: docker ps | grep promtail
Check Promtail logs: docker logs promtail
Verify Loki connectivity: curl http://localhost:3100/ready
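
If all three checks pass, the Promtail client URL is the next thing to inspect. The sketch below shows a minimal Promtail configuration using Docker service discovery; it assumes Loki's Compose service is named loki and that the Docker socket is mounted into the Promtail container.

```yaml
server:
  http_listen_port: 9080

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://loki:3100/loki/api/v1/push   # assumes a Compose service named "loki"

scrape_configs:
  - job_name: docker
    docker_sd_configs:
      - host: unix:///var/run/docker.sock
        refresh_interval: 5s
    relabel_configs:
      # Container names arrive with a leading slash; strip it for the label.
      - source_labels: ['__meta_docker_container_name']
        regex: '/(.*)'
        target_label: container
```

As with Grafana, a client URL pointing at localhost would resolve to the Promtail container itself rather than to Loki.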

Maintenance

Add New Swarm Node

Add node to inventory/hosts.ini under [swarm_managers] or [swarm_workers]
Run the playbook: ansible-playbook ... deploy_swarm_monitoring.yml
The playbook regenerates the Prometheus scrape configuration from the inventory, so the new node is picked up without manual edits

Update Monitoring Stack

# Pull latest images and restart
ansible-playbook -i inventory/hosts.ini playbooks/monitoring/deploy_swarm_monitoring.yml

View Current Configuration

# Prometheus config
cat /opt/stacks/watchtower/prometheus-config/prometheus.yml

# Alert rules
cat /opt/stacks/watchtower/prometheus-config/alerts/homelab.yml

# Docker Compose
cat /opt/stacks/watchtower/docker-compose.yml

Recommended Grafana Dashboards

Import these dashboards by ID in Grafana:

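| ID    | Name                       | Purpose                    |
|-------|----------------------------|----------------------------|
| 1860  | Node Exporter Full         | Complete host metrics      |
| 893   | Docker & System Monitoring | Container resource usage   |
| 13639 | Loki Dashboard             | Log exploration            |
| 14282 | cAdvisor                   | Detailed container metrics |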

Best Practices

Never hardcode secrets: Use ansible-vault or environment variables
Use labels extensively: Makes filtering in Prometheus/Loki easier
Set resource limits: Prevent monitoring from consuming excessive resources
Test before deploying: Use --check mode to preview changes
Version control everything: Commit all configuration changes
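
Several of these practices can be seen together in a single Compose service definition. The fragment is a sketch: the label key, the limit values, and the GRAFANA_ADMIN_PASSWORD variable name are assumptions, while GF_SECURITY_ADMIN_PASSWORD is Grafana's own environment variable.

```yaml
services:
  grafana:
    image: grafana/grafana:latest
    environment:
      # Supplied from the environment (or a vault-backed variable), never hardcoded.
      GF_SECURITY_ADMIN_PASSWORD: ${GRAFANA_ADMIN_PASSWORD}
    labels:
      com.example.stack: monitoring       # hypothetical label for filtering
    deploy:
      resources:
        limits:
          cpus: "0.50"                    # assumed limit
          memory: 512M                    # assumed limit
```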
