# Ansible Monitoring Roles

## Overview

This directory contains modular Ansible roles for deploying a complete observability stack across a Docker Swarm cluster. The architecture follows the **separation of concerns** principle, with each role handling a specific monitoring component.

## Architecture

```
┌───────────────────────────────────────────────────────────────┐
│                    WATCHTOWER (Controller)                    │
│ ┌──────────────┐ ┌──────────────┐ ┌────────┐ ┌──────────────┐ │
│ │ Prometheus   │ │   Grafana    │ │  Loki  │ │ Uptime Kuma  │ │
│ │ (Metrics DB) │ │ (Dashboards) │ │ (Logs) │ │   (Health)   │ │
│ └──────────────┘ └──────────────┘ └────────┘ └──────────────┘ │
│        ▲                ▲             ▲                       │
└────────┼────────────────┼─────────────┼───────────────────────┘
         │ Scrape         │ Query       │ Push
         │                │             │
┌────────┴────────────────┴─────────────┴───────────────────────┐
│                         SWARM CLUSTER                         │
│ ┌──────────────────┐ ┌──────────────────┐                     │
│ │  Manager Node 1  │ │  Worker Node 1   │                     │
│ │ ┌──────────────┐ │ │ ┌──────────────┐ │ [... more]          │
│ │ │node-exporter │ │ │ │node-exporter │ │                     │
│ │ │ (Host CPU,   │ │ │ │ (Host CPU,   │ │                     │
│ │ │  RAM, Disk)  │ │ │ │  RAM, Disk)  │ │                     │
│ │ └──────────────┘ │ │ └──────────────┘ │                     │
│ │ ┌──────────────┐ │ │ ┌──────────────┐ │                     │
│ │ │   cAdvisor   │ │ │ │   cAdvisor   │ │                     │
│ │ │  (Container  │ │ │ │  (Container  │ │                     │
│ │ │   Metrics)   │ │ │ │   Metrics)   │ │                     │
│ │ └──────────────┘ │ │ └──────────────┘ │                     │
│ └──────────────────┘ └──────────────────┘                     │
└───────────────────────────────────────────────────────────────┘
```

## Roles

### 1. `swarm_node_exporter`

**Purpose:** Deploys Prometheus node-exporter on each swarm node to collect host-level metrics.

**Metrics Collected:**
- CPU usage (per-core and aggregate)
- Memory usage (total, available, cached)
- Disk I/O and space
- Network traffic
- System load averages

**Configuration:**
- Port: 9100 (default)
- Network: Host mode (for full system visibility)
- Security: Read-only filesystem, dropped capabilities

**Files:**
- `defaults/main.yml`: Configurable variables
- `tasks/main.yml`: Deployment logic

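The configuration listed above could translate into a task along the following lines. This is a minimal sketch, not the role's actual `tasks/main.yml`: the variable names `node_exporter_image` and `node_exporter_port`, the container name, and the mount layout are assumptions made for illustration.

```yaml
# Hypothetical excerpt of tasks/main.yml; variable names are illustrative.
- name: Run node-exporter on the host network
  community.docker.docker_container:
    name: node-exporter
    image: "{{ node_exporter_image }}"   # assumed variable, e.g. a pinned prom/node-exporter tag
    state: started
    restart_policy: unless-stopped
    network_mode: host                   # host mode: exporter listens directly on the node
    read_only: true                      # container filesystem is read-only
    cap_drop:
      - ALL                              # drop all Linux capabilities
    volumes:
      - /proc:/host/proc:ro              # host views are mounted read-only
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - --path.procfs=/host/proc
      - --path.sysfs=/host/sys
      - --path.rootfs=/rootfs
      - "--web.listen-address=:{{ node_exporter_port | default(9100) }}"
```

Because the task uses `state: started`, re-running it leaves an already running, unchanged container alone.
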
### 2. `swarm_cadvisor`

**Purpose:** Deploys cAdvisor (Container Advisor) on each node to collect container-level resource usage.

**Metrics Collected:**
- Per-container CPU usage
- Per-container memory usage
- Container network I/O
- Container disk I/O
- Container restart counts

**Configuration:**
- Port: 8080 (default)
- Requires: Privileged mode (for cgroup access)

**Why cAdvisor + node-exporter?**
- node-exporter: "The host used 80% CPU" (host-level aggregates)
- cAdvisor: "Container X used 60% of that CPU" (per-container breakdown)

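A cAdvisor deployment task might look roughly like this. It is a sketch under assumptions: `cadvisor_image` and `cadvisor_port` are hypothetical variable names, and the read-only mounts follow the layout commonly used for cAdvisor rather than anything confirmed by this role.

```yaml
# Hypothetical excerpt of the swarm_cadvisor tasks; names are illustrative.
- name: Run cAdvisor with access to cgroups and the Docker runtime
  community.docker.docker_container:
    name: cadvisor
    image: "{{ cadvisor_image }}"        # assumed variable, e.g. a pinned gcr.io/cadvisor/cadvisor tag
    state: started
    restart_policy: unless-stopped
    privileged: true                     # required for cgroup access, as noted above
    published_ports:
      - "{{ cadvisor_port | default(8080) }}:8080"
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
      - /dev/disk/:/dev/disk:ro
```
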
### 3. `monitoring_stack`

**Purpose:** Deploys the complete monitoring infrastructure on Watchtower (controller node).

**Components:**
- **Prometheus:** Metrics time-series database with service discovery
- **Grafana:** Visualization and dashboarding
- **Loki:** Log aggregation and indexing
- **Promtail:** Log shipper (sends logs from Docker to Loki)
- **Uptime Kuma:** HTTP/TCP health monitoring
- **Dozzle:** Real-time Docker log viewer
- **traefik-kop:** Traefik configuration sync

**Key Features:**
- **Dynamic Target Discovery:** Prometheus scrape configs are generated from Ansible inventory
- **Alert Rules:** Pre-configured alerts for CPU, memory, disk, and node availability
- **Security:** Dozzle protected by Authentik SSO
- **Retention:** Configurable data retention policies

**Configuration:**
- `defaults/main.yml`: Ports, domains, retention periods
- `templates/prometheus.yml.j2`: Scrape configuration with inventory loop
- `templates/alert-rules.yml.j2`: Alerting rules
- `templates/loki-config.yml.j2`: Log retention and indexing
- `templates/docker-compose.yml.j2`: Complete stack definition

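To give a feel for the alert rules, here is a hedged sketch of what part of `templates/alert-rules.yml.j2` could contain. The group name, job label, thresholds, and durations are assumptions, not the role's actual values. One detail worth noting: because the file is rendered by Jinja2, Prometheus's own `{{ ... }}` template expressions need to be wrapped in `{% raw %}` so Ansible leaves them untouched.

```yaml
# Hypothetical alert rules; thresholds and labels are illustrative assumptions.
groups:
  - name: homelab
    rules:
      - alert: NodeDown
        expr: up{job="node-exporter"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "{% raw %}{{ $labels.instance }}{% endraw %} is unreachable"

      - alert: HighCpuUsage
        # Percentage of non-idle CPU time, averaged per instance over 5 minutes
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 90
        for: 10m
        labels:
          severity: warning

      - alert: LowDiskSpace
        # Less than 10% of the filesystem is still available
        expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 < 10
        for: 10m
        labels:
          severity: warning
```
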
## Usage

### Deploy Complete Monitoring Stack

```bash
cd /home/chester/homelab/ansible
ansible-playbook -i inventory/hosts.ini playbooks/monitoring/deploy_swarm_monitoring.yml
```

### Deploy Only to Swarm Nodes

```bash
ansible-playbook -i inventory/hosts.ini playbooks/monitoring/deploy_swarm_monitoring.yml --tags swarm
```

### Deploy Only Watchtower Stack

```bash
ansible-playbook -i inventory/hosts.ini playbooks/monitoring/deploy_swarm_monitoring.yml --tags watchtower
```

### Update Prometheus Configuration

```bash
ansible-playbook -i inventory/hosts.ini playbooks/monitoring/deploy_swarm_monitoring.yml --tags watchtower
```

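The `--tags` options above imply a playbook in which each play carries a tag. A plausible layout is sketched below; the play names, the `become` setting, and the `watchtower` host pattern are assumptions, since the playbook itself is not shown here.

```yaml
# Hypothetical layout of playbooks/monitoring/deploy_swarm_monitoring.yml.
- name: Deploy exporters to every swarm node
  hosts: swarm_hosts
  become: true
  tags: [swarm]
  roles:
    - swarm_node_exporter
    - swarm_cadvisor

- name: Deploy the monitoring stack to the controller
  hosts: watchtower            # assumed inventory name for the controller
  become: true
  tags: [watchtower]
  roles:
    - monitoring_stack
```

With this structure, `--tags swarm` runs only the first play and `--tags watchtower` only the second.
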
## Key Concepts

### 1. Idempotency

All roles are **idempotent**: running the playbook multiple times produces the same result. This is achieved by:
- Using `docker_container` module with `state: started` (not `state: restarted`)
- Using handlers for configuration changes
- Checking for existing resources before creation

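The handler pattern can be sketched as follows. The task and handler names and the `prometheus` container name are assumptions; the destination path is the one listed under "View Current Configuration". The handler fires only when the rendered file actually changes, so a second run with no changes restarts nothing.

```yaml
# tasks/main.yml (hypothetical excerpt)
- name: Render Prometheus configuration
  ansible.builtin.template:
    src: prometheus.yml.j2
    dest: /opt/stacks/watchtower/prometheus-config/prometheus.yml
    mode: "0644"
  notify: Restart Prometheus     # queued only if the file content changed

# handlers/main.yml (hypothetical excerpt)
- name: Restart Prometheus
  community.docker.docker_container:
    name: prometheus             # assumed container name
    state: started
    restart: true                # force a restart of the existing container
```
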
### 2. Service Discovery

Instead of hardcoding IP addresses, the Prometheus scrape targets are rendered from the Ansible inventory each time the playbook runs:

```yaml
# Static approach (bad: every new node requires a manual edit)
static_configs:
  - targets: ['10.0.0.211:9100', '10.0.0.212:9100']

# Dynamic approach (good: targets follow the inventory)
static_configs:
  - targets:
{% for host in groups['swarm_managers'] %}
      - '{{ hostvars[host].ansible_host }}:9100'
{% endfor %}
```

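Extending the same loop to workers and to cAdvisor might look like the sketch below. The job names are assumptions, and the ports are the defaults listed for the two exporter roles; the real `templates/prometheus.yml.j2` may be organized differently.

```yaml
# Hypothetical excerpt of templates/prometheus.yml.j2.
scrape_configs:
  - job_name: node-exporter
    static_configs:
      - targets:
{% for host in groups['swarm_managers'] + groups['swarm_workers'] %}
          - '{{ hostvars[host].ansible_host }}:9100'
{% endfor %}

  - job_name: cadvisor
    static_configs:
      - targets:
{% for host in groups['swarm_managers'] + groups['swarm_workers'] %}
          - '{{ hostvars[host].ansible_host }}:8080'
{% endfor %}
```

Ansible's `template` module trims the newline after block tags by default, so the `{% for %}` and `{% endfor %}` lines should leave no blank lines in the rendered file.
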
### 3. Security Hardening

- **Read-only filesystems:** Exporters can't modify system files
- **Dropped capabilities:** Containers run with minimal permissions
- **No new privileges:** Prevents privilege escalation
- **SSO integration:** Dozzle protected by Authentik

### 4. Desired State

Docker Compose defines the **desired state**. Docker continuously reconciles:
- **Actual State:** "Container X crashed"
- **Desired State:** "Container X should be running"
- **Reconciliation:** Docker restarts container X

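With plain Docker Compose, this behavior is typically expressed through a restart policy. The fragment below is a sketch only: the service definition is illustrative and is not taken from the role's `templates/docker-compose.yml.j2`.

```yaml
# Hypothetical excerpt of a Compose file; the service shown is illustrative.
services:
  prometheus:
    image: prom/prometheus
    # Desired state: keep the container running unless it was stopped on purpose.
    # Docker restarts it after a crash or a daemon restart.
    restart: unless-stopped
```
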
## Troubleshooting

### Exporter Not Reachable

```bash
# Check if exporters are running (the shell module is needed because of the pipe)
ansible swarm_hosts -i inventory/hosts.ini -m shell -a "docker ps | grep -E 'node-exporter|cadvisor'"

# Test from Watchtower
curl http://10.0.0.211:9100/metrics
curl http://10.0.0.211:8080/metrics
```

### Prometheus Shows Target Down

1. Check firewall rules
2. Verify exporter is running: `docker ps`
3. Check exporter logs: `docker logs node-exporter`
4. Test connectivity: `curl http://<ip>:9100/metrics`

### Grafana Can't Connect to Prometheus

Grafana runs inside Docker, so use Docker DNS:
- ✅ Data source URL: `http://prometheus:9090`
- ❌ Don't use: `http://localhost:9090`

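The data source can also be provisioned from a file instead of being entered by hand. The following is a minimal sketch using Grafana's provisioning format; the file location and the `loki` service name are assumptions, while `http://prometheus:9090` is the URL given above.

```yaml
# Hypothetical provisioning file, for example mounted into the Grafana container
# under /etc/grafana/provisioning/datasources/.
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy                  # Grafana's backend makes the request, so Docker DNS applies
    url: http://prometheus:9090
    isDefault: true
  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100          # assumed service name for Loki
```
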
### Loki Not Receiving Logs

1. Check Promtail is running: `docker ps | grep promtail`
2. Check Promtail logs: `docker logs promtail`
3. Verify Loki connectivity: `curl http://localhost:3100/ready`

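If all three checks pass, the Promtail configuration itself is worth comparing against a known-good shape. The sketch below shows one common way to ship Docker logs to Loki; the `loki` hostname, the positions path, and the relabeling are assumptions, not this stack's actual Promtail settings.

```yaml
# Hypothetical Promtail configuration; hostnames and paths are illustrative.
server:
  http_listen_port: 9080

positions:
  filename: /tmp/positions.yaml    # where Promtail records how far it has read

clients:
  - url: http://loki:3100/loki/api/v1/push   # the push endpoint must be reachable from Promtail

scrape_configs:
  - job_name: docker
    docker_sd_configs:
      - host: unix:///var/run/docker.sock    # requires the Docker socket to be mounted
        refresh_interval: 5s
    relabel_configs:
      # Container names arrive with a leading slash; strip it for a cleaner label
      - source_labels: ['__meta_docker_container_name']
        regex: '/(.*)'
        target_label: 'container'
```
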
## Maintenance

### Add New Swarm Node

1. Add node to `inventory/hosts.ini` under `[swarm_managers]` or `[swarm_workers]`
2. Run the playbook: `ansible-playbook ... deploy_swarm_monitoring.yml`
3. The regenerated scrape configuration includes the new node, so Prometheus starts scraping it

### Update Monitoring Stack

```bash
# Pull latest images and restart
ansible-playbook -i inventory/hosts.ini playbooks/monitoring/deploy_swarm_monitoring.yml
```

### View Current Configuration

```bash
# Prometheus config
cat /opt/stacks/watchtower/prometheus-config/prometheus.yml

# Alert rules
cat /opt/stacks/watchtower/prometheus-config/alerts/homelab.yml

# Docker Compose
cat /opt/stacks/watchtower/docker-compose.yml
```

## Recommended Grafana Dashboards

Import these dashboards by ID in Grafana:

| ID | Name | Purpose |
|----|------|---------|
| 1860 | Node Exporter Full | Complete host metrics |
| 893 | Docker & System Monitoring | Container resource usage |
| 13639 | Loki Dashboard | Log exploration |
| 14282 | cAdvisor | Detailed container metrics |

## Best Practices

1. **Never hardcode secrets:** Use `ansible-vault` or environment variables
2. **Use labels extensively:** Makes filtering in Prometheus/Loki easier
3. **Set resource limits:** Prevent monitoring from consuming excessive resources
4. **Test before deploying:** Use `--check` mode to preview changes
5. **Version control everything:** Commit all configuration changes

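The first three practices can be combined in a single Compose service definition. This is an illustrative sketch: `vault_grafana_admin_password` is a hypothetical variable assumed to live in an `ansible-vault` encrypted vars file, and the label and limit values are placeholders rather than recommendations.

```yaml
# Hypothetical excerpt of a templated Compose file; values are illustrative.
services:
  grafana:
    image: grafana/grafana
    environment:
      # Secret comes from a vaulted variable instead of being written in the template
      GF_SECURITY_ADMIN_PASSWORD: "{{ vault_grafana_admin_password }}"
    labels:
      homelab.stack: monitoring    # assumed label, usable for filtering
    mem_limit: 512m                # cap memory so monitoring cannot starve the host
    cpus: 0.5                      # cap CPU at half a core
```
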
## Further Reading

- [Prometheus Documentation](https://prometheus.io/docs/)
- [Grafana Loki](https://grafana.com/docs/loki/latest/)
- [Ansible Best Practices](https://docs.ansible.com/ansible/latest/user_guide/playbooks_best_practices.html)
- [Docker Swarm Tutorial](https://docs.docker.com/engine/swarm/swarm-tutorial/)