# Ansible Monitoring Roles
## Overview
This directory contains modular Ansible roles for deploying a complete observability stack across a Docker Swarm cluster. The architecture follows the **separation of concerns** principle, with each role handling a specific monitoring component.
## Architecture
```
┌───────────────────────────────────────────────────────────────┐
│                    WATCHTOWER (Controller)                    │
│ ┌──────────────┐ ┌──────────────┐ ┌────────┐ ┌─────────────┐  │
│ │  Prometheus  │ │   Grafana    │ │  Loki  │ │ Uptime Kuma │  │
│ │ (Metrics DB) │ │ (Dashboards) │ │ (Logs) │ │  (Health)   │  │
│ └──────────────┘ └──────────────┘ └────────┘ └─────────────┘  │
│         ▲                ▲             ▲                      │
└─────────┼────────────────┼─────────────┼──────────────────────┘
          │ Scrape         │ Query       │ Push
          │                │             │
┌─────────┴────────────────┴─────────────┴──────────────────────┐
│                         SWARM CLUSTER                         │
│ ┌──────────────────┐ ┌──────────────────┐                     │
│ │  Manager Node 1  │ │  Worker Node 1   │                     │
│ │ ┌──────────────┐ │ │ ┌──────────────┐ │  [... more]         │
│ │ │node-exporter │ │ │ │node-exporter │ │                     │
│ │ │ (Host CPU,   │ │ │ │ (Host CPU,   │ │                     │
│ │ │  RAM, Disk)  │ │ │ │  RAM, Disk)  │ │                     │
│ │ └──────────────┘ │ │ └──────────────┘ │                     │
│ │ ┌──────────────┐ │ │ ┌──────────────┐ │                     │
│ │ │   cAdvisor   │ │ │ │   cAdvisor   │ │                     │
│ │ │ (Container   │ │ │ │ (Container   │ │                     │
│ │ │  Metrics)    │ │ │ │  Metrics)    │ │                     │
│ │ └──────────────┘ │ │ └──────────────┘ │                     │
│ └──────────────────┘ └──────────────────┘                     │
└───────────────────────────────────────────────────────────────┘
```
## Roles
### 1. `swarm_node_exporter`
**Purpose:** Deploys Prometheus node-exporter on each swarm node to collect host-level metrics.
**Metrics Collected:**
- CPU usage (per-core and aggregate)
- Memory usage (total, available, cached)
- Disk I/O and space
- Network traffic
- System load averages
**Configuration:**
- Port: 9100 (default)
- Network: Host mode (for full system visibility)
- Security: Read-only filesystem, dropped capabilities
**Files:**
- `defaults/main.yml`: Configurable variables
- `tasks/main.yml`: Deployment logic
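
As a rough illustration of how the configuration above could map onto a task, here is a minimal sketch using the `community.docker.docker_container` module. The variable names `node_exporter_version` and `node_exporter_port` are hypothetical and may not match the role's actual `defaults/main.yml`; the mounts and flags follow the commonly documented way of running node-exporter in a container.

```yaml
# Hypothetical sketch of tasks/main.yml; variable names are assumptions.
- name: Run node-exporter on the swarm node
  community.docker.docker_container:
    name: node-exporter
    image: "prom/node-exporter:{{ node_exporter_version | default('latest') }}"
    state: started                 # create/start only if needed
    restart_policy: unless-stopped
    network_mode: host             # host mode for full system visibility
    pid_mode: host
    read_only: true                # read-only container filesystem
    cap_drop:
      - ALL                        # drop all Linux capabilities
    security_opts:
      - "no-new-privileges:true"
    volumes:
      - "/:/host:ro,rslave"        # host root, mounted read-only
    command:
      - "--path.rootfs=/host"
      - "--web.listen-address=:{{ node_exporter_port | default(9100) }}"
```

With host networking there is no port mapping; the exporter listens directly on the node's port 9100.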
### 2. `swarm_cadvisor`
**Purpose:** Deploys cAdvisor (Container Advisor) on each node to collect container-level resource usage.
**Metrics Collected:**
- Per-container CPU usage
- Per-container memory usage
- Container network I/O
- Container disk I/O
- Container restart counts
**Configuration:**
- Port: 8080 (default)
- Requires: Privileged mode (for cgroup access)
**Why cAdvisor + node-exporter?**
- node-exporter: "The host used 80% CPU" (host-level aggregates)
- cAdvisor: "Container X used 60% of that CPU" (per-container breakdown)
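
A comparable sketch for cAdvisor, again assuming the `community.docker.docker_container` module. `cadvisor_image` and `cadvisor_port` are hypothetical variables, and the read-only mounts are the ones cAdvisor's own documentation usually lists; the real role may differ.

```yaml
# Hypothetical sketch; cadvisor_image and cadvisor_port are assumed variables.
- name: Run cAdvisor on the swarm node
  community.docker.docker_container:
    name: cadvisor
    image: "{{ cadvisor_image }}"  # e.g. a pinned gcr.io/cadvisor/cadvisor tag
    state: started
    restart_policy: unless-stopped
    privileged: true               # needed for cgroup access
    published_ports:
      - "{{ cadvisor_port | default(8080) }}:8080"
    volumes:
      - "/:/rootfs:ro"
      - "/var/run:/var/run:ro"
      - "/sys:/sys:ro"
      - "/var/lib/docker/:/var/lib/docker:ro"
```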
### 3. `monitoring_stack`
**Purpose:** Deploys the complete monitoring infrastructure on Watchtower (controller node).
**Components:**
- **Prometheus:** Metrics time-series database with service discovery
- **Grafana:** Visualization and dashboarding
- **Loki:** Log aggregation and indexing
- **Promtail:** Log shipper (sends logs from Docker to Loki)
- **Uptime Kuma:** HTTP/TCP health monitoring
- **Dozzle:** Real-time Docker log viewer
- **traefik-kop:** Traefik configuration sync
**Key Features:**
- **Dynamic Target Discovery:** Prometheus scrape configs are generated from Ansible inventory
- **Alert Rules:** Pre-configured alerts for CPU, memory, disk, and node availability
- **Security:** Dozzle protected by Authentik SSO
- **Retention:** Configurable data retention policies
**Configuration:**
- `defaults/main.yml`: Ports, domains, retention periods
- `templates/prometheus.yml.j2`: Scrape configuration with inventory loop
- `templates/alert-rules.yml.j2`: Alerting rules
- `templates/loki-config.yml.j2`: Log retention and indexing
- `templates/docker-compose.yml.j2`: Complete stack definition
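
To give a sense of what the pre-configured alerts might look like once rendered, here is a sketch of a Prometheus rule file covering node availability, CPU, memory, and disk. The group name, job label, thresholds, and durations are all assumptions, not values taken from `alert-rules.yml.j2`.

```yaml
# Illustrative rendered rule file; names and thresholds are assumptions.
groups:
  - name: homelab
    rules:
      - alert: NodeDown
        expr: 'up{job="node-exporter"} == 0'
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} is unreachable"
      - alert: HighCpuUsage
        expr: '100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 90'
        for: 10m
        labels:
          severity: warning
      - alert: HighMemoryUsage
        expr: '(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 90'
        for: 10m
        labels:
          severity: warning
      - alert: LowDiskSpace
        expr: '(node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}) * 100 < 10'
        for: 15m
        labels:
          severity: warning
```

Because Prometheus annotations use the same `{{ ... }}` delimiters as Jinja2, a templated version of this file would need those expressions wrapped in `{% raw %}` blocks.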
## Usage
### Deploy Complete Monitoring Stack
```bash
cd /home/chester/homelab/ansible
ansible-playbook -i inventory/hosts.ini playbooks/monitoring/deploy_swarm_monitoring.yml
```
### Deploy Only to Swarm Nodes
```bash
ansible-playbook -i inventory/hosts.ini playbooks/monitoring/deploy_swarm_monitoring.yml --tags swarm
```
### Deploy Only Watchtower Stack
```bash
ansible-playbook -i inventory/hosts.ini playbooks/monitoring/deploy_swarm_monitoring.yml --tags watchtower
```
### Update Prometheus Configuration
```bash
ansible-playbook -i inventory/hosts.ini playbooks/monitoring/deploy_swarm_monitoring.yml --tags watchtower
```
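
The `--tags` filters above presumably select plays inside `deploy_swarm_monitoring.yml`. A minimal sketch of how such a playbook could be laid out follows; the `watchtower` host group and the use of `become` are assumptions, while the role names and tags come from this README.

```yaml
# Hypothetical layout of deploy_swarm_monitoring.yml.
- name: Deploy exporters to swarm nodes
  hosts: swarm_hosts
  become: true
  tags: [swarm]
  roles:
    - swarm_node_exporter
    - swarm_cadvisor

- name: Deploy monitoring stack to Watchtower
  hosts: watchtower              # assumed inventory group name
  become: true
  tags: [watchtower]
  roles:
    - monitoring_stack
```

Tagging at the play level means every task in the play's roles inherits the tag, so `--tags swarm` skips the Watchtower play entirely.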
## Key Concepts
### 1. Idempotency
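All roles are **idempotent**: running the playbook multiple times produces the same result as running it once. This is achieved by:
- Using the `docker_container` module with `state: started`, which leaves an unchanged, already-running container alone (rather than setting `restart: true`, which would restart it on every run)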
- Using handlers for configuration changes
- Checking for existing resources before creation
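
The handler pattern from the list above can be sketched as follows. The destination path matches the one shown under "View Current Configuration"; the handler name and the container name `prometheus` are assumptions.

```yaml
# tasks/main.yml (sketch): reports "changed" only when the rendered file differs
- name: Render Prometheus configuration
  ansible.builtin.template:
    src: prometheus.yml.j2
    dest: /opt/stacks/watchtower/prometheus-config/prometheus.yml
    mode: "0644"
  notify: Restart prometheus

# handlers/main.yml (sketch): runs once at the end of the play, and only if notified
- name: Restart prometheus
  community.docker.docker_container:
    name: prometheus             # assumed container name
    state: started
    restart: true
```

On a second run with an unchanged inventory, the template task reports `ok`, the handler is never notified, and Prometheus keeps running undisturbed.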
### 2. Service Discovery
Instead of hardcoding IP addresses, the Prometheus scrape targets are rendered from the Ansible inventory each time the playbook runs:
```yaml
# Static approach (bad: manual updates required)
targets: ['10.0.0.211:9100', '10.0.0.212:9100']

# Dynamic approach (good: scales with the inventory)
targets:
{% for host in groups['swarm_managers'] %}
  - '{{ hostvars[host].ansible_host }}:9100'
{% endfor %}
```
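
For context, a fuller excerpt of what `templates/prometheus.yml.j2` might contain is sketched below. The job names, the scrape interval, and the choice to loop over both `swarm_managers` and `swarm_workers` are assumptions; only the inventory loop itself is taken from this README.

```yaml
# Hypothetical excerpt of templates/prometheus.yml.j2
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: node-exporter
    static_configs:
      - targets:
{% for host in groups['swarm_managers'] + groups['swarm_workers'] %}
          - '{{ hostvars[host].ansible_host }}:9100'
{% endfor %}

  - job_name: cadvisor
    static_configs:
      - targets:
{% for host in groups['swarm_managers'] + groups['swarm_workers'] %}
          - '{{ hostvars[host].ansible_host }}:8080'
{% endfor %}
```

The Jinja2 tags sit at column 0 because Ansible's `template` module trims the newline after a block tag by default but does not strip leading whitespace before it.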
### 3. Security Hardening
- **Read-only filesystems:** Exporters can't modify system files
- **Dropped capabilities:** Containers run with minimal permissions
- **No new privileges:** Prevents privilege escalation
- **SSO integration:** Dozzle protected by Authentik
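
In Compose syntax, the first three bullets correspond to service-level keys. A minimal sketch, assuming a service named `node-exporter`; the roles may set the equivalent options through Ansible module parameters instead.

```yaml
# Illustrative Compose fragment; the service name is an assumption.
services:
  node-exporter:
    image: prom/node-exporter
    read_only: true              # read-only filesystem
    cap_drop:
      - ALL                      # dropped capabilities
    security_opt:
      - "no-new-privileges:true" # no privilege escalation
```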
### 4. Desired State
The Docker Compose file declares the **desired state**, and Docker reconciles the running containers toward it (for a crashed container, through the service's restart policy):
- **Actual State:** "Container X crashed"
- **Desired State:** "Container X should be running"
- **Reconciliation:** Docker restarts container X
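
In a Compose file, this reconciliation is typically requested with a restart policy. A minimal sketch, assuming a `prometheus` service; the actual policy in `docker-compose.yml.j2` may differ.

```yaml
# Illustrative Compose fragment; service name and policy are assumptions.
services:
  prometheus:
    image: prom/prometheus
    restart: unless-stopped      # restart after a crash, but not after a manual stop
```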
## Troubleshooting
### Exporter Not Reachable
```bash
# Check if exporters are running
ansible swarm_hosts -i inventory/hosts.ini -m shell -a "docker ps | grep -E 'node-exporter|cadvisor'"
# Test from Watchtower
curl http://10.0.0.211:9100/metrics
curl http://10.0.0.211:8080/metrics
```
### Prometheus Shows Target Down
1. Check firewall rules
2. Verify exporter is running: `docker ps`
3. Check exporter logs: `docker logs node-exporter`
4. Test connectivity: `curl http://<ip>:9100/metrics`
### Grafana Can't Connect to Prometheus
Grafana runs inside Docker, where `localhost` refers to the Grafana container itself, so address Prometheus by its Docker DNS name:
- ✅ Data source URL: `http://prometheus:9090`
- ❌ Don't use: `http://localhost:9090`
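
If the data sources are provisioned from a file rather than configured in the UI, the same URL goes into Grafana's data source provisioning format. A sketch, assuming the Loki service is reachable as `loki` on port 3100 on the same Docker network:

```yaml
# Illustrative Grafana provisioning file (provisioning/datasources/*.yml).
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy                # Grafana's backend makes the request
    url: http://prometheus:9090
    isDefault: true
  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100        # assumed service name
```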
### Loki Not Receiving Logs
1. Check Promtail is running: `docker ps | grep promtail`
2. Check Promtail logs: `docker logs promtail`
3. Verify Loki connectivity: `curl http://localhost:3100/ready`
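
If all three checks pass, the Promtail configuration is the next thing to inspect, in particular the `clients` URL. The sketch below shows a minimal Promtail config using Docker service discovery; the service name `loki`, the listen port, and the positions path are assumptions about this stack.

```yaml
# Illustrative Promtail config; hostnames and paths are assumptions.
server:
  http_listen_port: 9080

positions:
  filename: /tmp/positions.yaml  # where Promtail records read offsets

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: docker
    docker_sd_configs:
      - host: unix:///var/run/docker.sock
        refresh_interval: 5s
    relabel_configs:
      # Docker reports names as "/name"; strip the slash into a "container" label
      - source_labels: ['__meta_docker_container_name']
        regex: '/(.*)'
        target_label: container
```

As with Grafana, `localhost` inside the Promtail container would not reach Loki, so the push URL should use the Docker DNS name.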
## Maintenance
### Add New Swarm Node
1. Add node to `inventory/hosts.ini` under `[swarm_managers]` or `[swarm_workers]`
2. Run the playbook: `ansible-playbook ... deploy_swarm_monitoring.yml`
3. The playbook regenerates `prometheus.yml` from the inventory, so Prometheus starts scraping the new node without manual edits
### Update Monitoring Stack
```bash
# Pull latest images and restart
ansible-playbook -i inventory/hosts.ini playbooks/monitoring/deploy_swarm_monitoring.yml
```
### View Current Configuration
```bash
# Prometheus config
cat /opt/stacks/watchtower/prometheus-config/prometheus.yml
# Alert rules
cat /opt/stacks/watchtower/prometheus-config/alerts/homelab.yml
# Docker Compose
cat /opt/stacks/watchtower/docker-compose.yml
```
## Recommended Grafana Dashboards
Import these dashboards by ID in Grafana:
| ID | Name | Purpose |
|----|------|---------|
| 1860 | Node Exporter Full | Complete host metrics |
| 893 | Docker & System Monitoring | Container resource usage |
| 13639 | Loki Dashboard | Log exploration |
| 14282 | cAdvisor | Detailed container metrics |
## Best Practices
1. **Never hardcode secrets:** Use `ansible-vault` or environment variables
2. **Use labels extensively:** Makes filtering in Prometheus/Loki easier
3. **Set resource limits:** Prevent monitoring from consuming excessive resources
4. **Test before deploying:** Use `--check` mode to preview changes
5. **Version control everything:** Commit all configuration changes
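
Practices 2 and 3 can be expressed directly in a Compose service definition. A sketch with hypothetical values; the limits and label keys are placeholders rather than settings from this repository.

```yaml
# Illustrative Compose fragment; limits and labels are placeholder values.
services:
  grafana:
    image: grafana/grafana
    mem_limit: 512m              # cap memory so monitoring cannot starve the host
    cpus: 0.5                    # at most half a CPU core
    labels:
      homelab.stack: monitoring  # hypothetical label for filtering
```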
## Further Reading
- [Prometheus Documentation](https://prometheus.io/docs/)
- [Grafana Loki](https://grafana.com/docs/loki/latest/)
- [Ansible Best Practices](https://docs.ansible.com/ansible/latest/user_guide/playbooks_best_practices.html)
- [Docker Swarm Monitoring](https://docs.docker.com/engine/swarm/swarm-tutorial/)