# Ansible Monitoring Roles

## Overview

This directory contains modular Ansible roles for deploying a complete observability stack across a Docker Swarm cluster. The architecture follows the **separation of concerns** principle, with each role handling a specific monitoring component.

## Architecture

```
┌────────────────────────────────────────────────────────────────┐
│                    WATCHTOWER (Controller)                     │
│  ┌──────────────┐ ┌──────────────┐ ┌───────┐ ┌──────────────┐  │
│  │ Prometheus   │ │ Grafana      │ │ Loki  │ │ Uptime Kuma  │  │
│  │ (Metrics DB) │ │ (Dashboards) │ │(Logs) │ │ (Health)     │  │
│  └──────────────┘ └──────────────┘ └───────┘ └──────────────┘  │
│         ▲                ▲             ▲                       │
└─────────┼────────────────┼─────────────┼───────────────────────┘
          │ Scrape         │ Query       │ Push
          │                │             │
┌─────────┴────────────────┴─────────────┴───────────────────────┐
│                         SWARM CLUSTER                          │
│  ┌──────────────────┐  ┌──────────────────┐                    │
│  │ Manager Node 1   │  │ Worker Node 1    │                    │
│  │ ┌──────────────┐ │  │ ┌──────────────┐ │   [... more]       │
│  │ │node-exporter │ │  │ │node-exporter │ │                    │
│  │ │  (Host CPU,  │ │  │ │  (Host CPU,  │ │                    │
│  │ │   RAM, Disk) │ │  │ │   RAM, Disk) │ │                    │
│  │ └──────────────┘ │  │ └──────────────┘ │                    │
│  │ ┌──────────────┐ │  │ ┌──────────────┐ │                    │
│  │ │  cAdvisor    │ │  │ │  cAdvisor    │ │                    │
│  │ │ (Container   │ │  │ │ (Container   │ │                    │
│  │ │  Metrics)    │ │  │ │  Metrics)    │ │                    │
│  │ └──────────────┘ │  │ └──────────────┘ │                    │
│  └──────────────────┘  └──────────────────┘                    │
└────────────────────────────────────────────────────────────────┘
```

## Roles

### 1. `swarm_node_exporter`

**Purpose:** Deploys Prometheus node-exporter on each swarm node to collect host-level metrics.

**Metrics Collected:**
- CPU usage (per-core and aggregate)
- Memory usage (total, available, cached)
- Disk I/O and space
- Network traffic
- System load averages

**Configuration:**
- Port: 9100 (default)
- Network: Host mode (for full system visibility)
- Security: Read-only filesystem, dropped capabilities

**Files:**
- `defaults/main.yml`: Configurable variables
- `tasks/main.yml`: Deployment logic
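
As a rough sketch of what the deployment logic in `tasks/main.yml` might look like, the task below starts node-exporter with the settings listed above. The image tag, the `node_exporter_port` variable, and the mount layout are assumptions for illustration and are not taken from the role itself.

```yaml
# Hypothetical sketch of tasks/main.yml for swarm_node_exporter
- name: Run node-exporter on the host network
  community.docker.docker_container:
    name: node-exporter
    image: prom/node-exporter:latest   # assumption: pin a specific tag in practice
    state: started                     # no restart if the container is already running
    restart_policy: unless-stopped
    network_mode: host                 # exporter listens directly on the host (port 9100)
    pid_mode: host
    read_only: true                    # read-only container filesystem
    cap_drop:
      - ALL                            # dropped capabilities
    volumes:
      - /:/host:ro,rslave              # host filesystem, mounted read-only
    command:
      - --path.rootfs=/host
      - "--web.listen-address=:{{ node_exporter_port | default(9100) }}"
```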

### 2. `swarm_cadvisor`

**Purpose:** Deploys cAdvisor (Container Advisor) on each node to collect container-level resource usage.

**Metrics Collected:**
- Per-container CPU usage
- Per-container memory usage
- Container network I/O
- Container disk I/O
- Container start times (restarts show up as changes in `container_start_time_seconds`)

**Configuration:**
- Port: 8080 (default)
- Requires: Privileged mode (for cgroup access)

**Why cAdvisor + node-exporter?**
- node-exporter: "The host used 80% CPU" (host-level aggregates)
- cAdvisor: "Container X used 60% of that CPU" (per-container breakdown)
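
A matching sketch for cAdvisor could look like the task below. The bind mounts follow the run example in the upstream cAdvisor documentation; the container name and image tag are assumptions, and the role's actual task may differ.

```yaml
# Hypothetical sketch of a cAdvisor deployment task
- name: Run cAdvisor on each node
  community.docker.docker_container:
    name: cadvisor
    image: gcr.io/cadvisor/cadvisor:latest   # assumption: pin a specific tag in practice
    state: started
    restart_policy: unless-stopped
    privileged: true                         # required for cgroup access
    published_ports:
      - "8080:8080"                          # default port scraped by Prometheus
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
      - /dev/disk/:/dev/disk:ro
```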

### 3. `monitoring_stack`

**Purpose:** Deploys the complete monitoring infrastructure on Watchtower (controller node).

**Components:**
- **Prometheus:** Metrics time-series database with service discovery
- **Grafana:** Visualization and dashboarding
- **Loki:** Log aggregation and indexing
- **Promtail:** Log shipper (sends logs from Docker to Loki)
- **Uptime Kuma:** HTTP/TCP health monitoring
- **Dozzle:** Real-time Docker log viewer
- **traefik-kop:** Syncs Traefik routing configuration from this host's container labels to the central Traefik instance

**Key Features:**
- **Dynamic Target Discovery:** Prometheus scrape configs are generated from Ansible inventory
- **Alert Rules:** Pre-configured alerts for CPU, memory, disk, and node availability
- **Security:** Dozzle protected by Authentik SSO
- **Retention:** Configurable data retention policies

**Configuration:**
- `defaults/main.yml`: Ports, domains, retention periods
- `templates/prometheus.yml.j2`: Scrape configuration with inventory loop
- `templates/alert-rules.yml.j2`: Alerting rules
- `templates/loki-config.yml.j2`: Log retention and indexing
- `templates/docker-compose.yml.j2`: Complete stack definition
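
For illustration, a rendered rules file covering two of the alert types above might look like the following. The alert names, thresholds, durations, and the `node-exporter` job label are assumptions, not the values in `alert-rules.yml.j2`. Because Prometheus and Jinja2 both use `{{ }}`, annotation templates inside a `.j2` file usually need to be wrapped in `{% raw %}` and `{% endraw %}`.

```yaml
# Sketch of rendered alert rules; names and thresholds are illustrative
groups:
  - name: homelab
    rules:
      - alert: NodeDown
        expr: up{job="node-exporter"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} is unreachable"

      - alert: HighCpuUsage
        # 100 minus the idle percentage, averaged per instance over 5 minutes
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 90
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "CPU above 90% on {{ $labels.instance }}"
```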

## Usage

### Deploy Complete Monitoring Stack

```bash
cd /home/chester/homelab/ansible
ansible-playbook -i inventory/hosts.ini playbooks/monitoring/deploy_swarm_monitoring.yml
```

### Deploy Only to Swarm Nodes

```bash
ansible-playbook -i inventory/hosts.ini playbooks/monitoring/deploy_swarm_monitoring.yml --tags swarm
```

### Deploy Only Watchtower Stack

```bash
ansible-playbook -i inventory/hosts.ini playbooks/monitoring/deploy_swarm_monitoring.yml --tags watchtower
```

### Update Prometheus Configuration

```bash
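# prometheus.yml is rendered by the monitoring_stack role, so re-running the Watchtower play regenerates it
ansible-playbook -i inventory/hosts.ini playbooks/monitoring/deploy_swarm_monitoring.yml --tags watchtower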
```

## Key Concepts

### 1. Idempotency

All roles are **idempotent**: running the playbook multiple times produces the same result. This is achieved by:
- Using the `docker_container` module with `state: started` and without forcing `restart: true`, so a running container is left alone
- Using handlers so services restart only when their configuration changes
- Checking for existing resources before creating them
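
The handler pattern can be sketched roughly as follows. The destination path matches the one used elsewhere in this README; the handler name and the `prometheus` container name are assumptions.

```yaml
# tasks/main.yml (sketch): the template task reports "changed" only when the content differs
- name: Render Prometheus configuration
  ansible.builtin.template:
    src: prometheus.yml.j2
    dest: /opt/stacks/watchtower/prometheus-config/prometheus.yml
    mode: "0644"
  notify: Restart prometheus

# handlers/main.yml (sketch): runs once at the end of the play, and only if notified
- name: Restart prometheus
  ansible.builtin.command: docker restart prometheus   # assumption: container name
```

On a second run with no inventory or template changes, the template task reports `ok`, the handler is never notified, and Prometheus keeps running undisturbed.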

### 2. Service Discovery

Instead of hardcoding IP addresses, the Prometheus scrape targets are rendered from the Ansible inventory at deploy time:

```yaml
# Static approach (bad - manual updates required)
targets: ['10.0.0.211:9100', '10.0.0.212:9100']

# Dynamic approach (good - auto-scales with inventory)
targets:
{% for host in groups['swarm_managers'] + groups['swarm_workers'] %}
  - '{{ hostvars[host].ansible_host }}:9100'
{% endfor %}
```
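
A slightly fuller sketch of how `templates/prometheus.yml.j2` could apply the same loop to both exporters is shown here. The job names, the `node` label, the scrape interval, and the combination of the `swarm_managers` and `swarm_workers` groups are assumptions; the real template may be organized differently.

```yaml
global:
  scrape_interval: 15s   # assumption

scrape_configs:
  # Host-level metrics from node-exporter
  - job_name: node-exporter
    static_configs:
{% for host in groups['swarm_managers'] + groups['swarm_workers'] %}
      - targets: ['{{ hostvars[host].ansible_host }}:9100']
        labels:
          node: '{{ host }}'
{% endfor %}

  # Per-container metrics from cAdvisor
  - job_name: cadvisor
    static_configs:
{% for host in groups['swarm_managers'] + groups['swarm_workers'] %}
      - targets: ['{{ hostvars[host].ansible_host }}:8080']
        labels:
          node: '{{ host }}'
{% endfor %}
```

Labeling each target with its inventory hostname keeps dashboards readable even if a node's IP address changes.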

### 3. Security Hardening

- **Read-only filesystems:** Exporters can't modify system files
- **Dropped capabilities:** Containers run with minimal permissions
- **No new privileges:** Prevents privilege escalation
- **SSO integration:** Dozzle protected by Authentik
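
In Compose syntax, the first three measures map to the keys below. This is a minimal sketch; the service name `example-exporter` is hypothetical, and which services in the stack actually carry these settings is not shown here.

```yaml
# Sketch of a hardened service definition (service name is hypothetical)
services:
  example-exporter:
    image: prom/node-exporter:latest
    read_only: true                 # read-only root filesystem
    cap_drop:
      - ALL                         # drop all Linux capabilities
    security_opt:
      - no-new-privileges:true      # processes cannot gain privileges via setuid binaries
```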

### 4. Desired State

The Docker Compose file defines the **desired state**. When a service has a `restart` policy, the Docker daemon reconciles the actual state against it:
- **Actual State:** "Container X crashed"
- **Desired State:** "Container X should be running"
- **Reconciliation:** Docker restarts container X
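
A minimal sketch of a service that declares this behavior, assuming a `prometheus` service and illustrative health check timings:

```yaml
# Sketch: restart policy and health check for one service (values are assumptions)
services:
  prometheus:
    image: prom/prometheus:latest
    restart: unless-stopped          # the daemon restarts the container if it exits
    healthcheck:
      # Reports health status only; plain Docker does not restart unhealthy containers
      test: ["CMD", "wget", "-q", "--spider", "http://localhost:9090/-/healthy"]
      interval: 30s
      timeout: 5s
      retries: 3
```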

## Troubleshooting

### Exporter Not Reachable

```bash
# Check if exporters are running
ansible swarm_hosts -i inventory/hosts.ini -m shell -a "docker ps | grep -E 'node-exporter|cadvisor'"

# Test from Watchtower
curl http://10.0.0.211:9100/metrics
curl http://10.0.0.211:8080/metrics
```

### Prometheus Shows Target Down

1. Check firewall rules
2. Verify exporter is running: `docker ps`
3. Check exporter logs: `docker logs node-exporter`
4. Test connectivity: `curl http://<ip>:9100/metrics`
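
The connectivity test can also be automated. The play below is a hypothetical sketch that requests `/metrics` on both exporter ports for every swarm node; the `watchtower` host name and the two group names are assumptions about the inventory.

```yaml
# Hypothetical check play: fails for any node whose exporter does not answer with HTTP 200
- name: Verify exporters are reachable from Watchtower
  hosts: watchtower            # assumption: inventory name of the controller
  gather_facts: false
  tasks:
    - name: Request /metrics from every swarm node
      ansible.builtin.uri:
        url: "http://{{ hostvars[item.0].ansible_host }}:{{ item.1 }}/metrics"
        status_code: 200
        timeout: 5
      # Pair every node with both exporter ports (9100 and 8080)
      loop: "{{ (groups['swarm_managers'] + groups['swarm_workers']) | product([9100, 8080]) | list }}"
      loop_control:
        label: "{{ item.0 }}:{{ item.1 }}"
```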

### Grafana Can't Connect to Prometheus

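Grafana runs inside Docker, where `localhost` refers to the Grafana container itself. Use the Prometheus service name, which Docker DNS resolves on the shared network: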
- ✅ Data source URL: `http://prometheus:9090`
- ❌ Don't use: `http://localhost:9090`
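
If the data sources are provisioned from a file rather than configured by hand, the same URL appears there. This sketch assumes the Compose services are named `prometheus` and `loki` and that Loki listens on port 3100; the file location is also an assumption.

```yaml
# Sketch of a Grafana provisioning file, e.g. provisioning/datasources/datasources.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy                 # Grafana's backend makes the request
    url: http://prometheus:9090   # Compose service name, resolved by Docker DNS
    isDefault: true
  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100
```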

### Loki Not Receiving Logs

1. Check Promtail is running: `docker ps | grep promtail`
2. Check Promtail logs: `docker logs promtail`
3. Verify Loki connectivity: `curl http://localhost:3100/ready`
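
When Promtail is running but no logs arrive, the push URL and the Docker discovery settings in its configuration are worth comparing against a known-good baseline. The following is a minimal sketch, assuming a Compose service named `loki`, the Docker socket mounted into the Promtail container, and an illustrative positions path.

```yaml
# Sketch of a minimal Promtail configuration for Docker logs (values are assumptions)
server:
  http_listen_port: 9080
  grpc_listen_port: 0

positions:
  filename: /tmp/positions.yaml        # where Promtail records how far it has read

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: docker
    docker_sd_configs:
      - host: unix:///var/run/docker.sock
        refresh_interval: 5s
    relabel_configs:
      # Container names are reported as "/name"; strip the leading slash
      - source_labels: ['__meta_docker_container_name']
        regex: '/(.*)'
        target_label: 'container'
```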

## Maintenance

### Add New Swarm Node

1. Add node to `inventory/hosts.ini` under `[swarm_managers]` or `[swarm_workers]`
2. Run the playbook: `ansible-playbook ... deploy_swarm_monitoring.yml`
3. The regenerated scrape configuration includes the new node, so Prometheus starts scraping it after the run

### Update Monitoring Stack

```bash
# Pull latest images and restart
ansible-playbook -i inventory/hosts.ini playbooks/monitoring/deploy_swarm_monitoring.yml
```

### View Current Configuration

```bash
# Prometheus config
cat /opt/stacks/watchtower/prometheus-config/prometheus.yml

# Alert rules
cat /opt/stacks/watchtower/prometheus-config/alerts/homelab.yml

# Docker Compose
cat /opt/stacks/watchtower/docker-compose.yml
```

## Recommended Grafana Dashboards

Import these dashboards by ID in Grafana:

| ID | Name | Purpose |
|----|------|---------|
| 1860 | Node Exporter Full | Complete host metrics |
| 893 | Docker & System Monitoring | Container resource usage |
| 13639 | Loki Dashboard | Log exploration |
| 14282 | cAdvisor | Detailed container metrics |

## Best Practices

1. **Never hardcode secrets:** Use `ansible-vault` or environment variables
2. **Use labels extensively:** Makes filtering in Prometheus/Loki easier
3. **Set resource limits:** Prevent monitoring from consuming excessive resources
4. **Test before deploying:** Use `--check` mode to preview changes
5. **Version control everything:** Commit all configuration changes
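
The first practice can be sketched as an indirection through a vaulted variable. All variable names and file locations here are hypothetical; `GF_SECURITY_ADMIN_PASSWORD` is Grafana's standard environment variable for the admin password.

```yaml
# group_vars/all/vault.yml (sketch), encrypted with `ansible-vault encrypt`
vault_grafana_admin_password: "change-me"

# defaults/main.yml (sketch): reference the vaulted value instead of a literal
grafana_admin_password: "{{ vault_grafana_admin_password }}"

# templates/docker-compose.yml.j2 (excerpt, sketch)
services:
  grafana:
    environment:
      GF_SECURITY_ADMIN_PASSWORD: "{{ grafana_admin_password }}"
```

With this layout, only the encrypted vault file holds the secret, and the rest of the repository can be committed in plain text.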

## Further Reading

- [Prometheus Documentation](https://prometheus.io/docs/)
- [Grafana Loki](https://grafana.com/docs/loki/latest/)
- [Ansible Best Practices](https://docs.ansible.com/ansible/latest/user_guide/playbooks_best_practices.html)
- [Docker Swarm Tutorial](https://docs.docker.com/engine/swarm/swarm-tutorial/)