# Ansible quality gates

This document defines the quality standards, review checklist, and validation workflow for all Ansible code in this repository.

## Philosophy

Quality gates progress through three enforcement tiers:

- **Tier 1 (Advisory):** Visible via lint warnings; not blocking. Baseline cleanup phase.
- **Tier 2 (Mandatory — current):** Must pass for swarm-impacting changes. CI enforces.
- **Tier 3 (Fully blocking):** All rules enforced on every commit. Target: Phase 4 of the roadmap.

**Idempotency controls are Tier 2 (mandatory now) for all stack-impacting changes.**
This means `changed_when` on `command`/`shell` tasks, Swarm manager-state assertions,
secret preflight asserts, bind-mount path asserts, and validate-only mode
(`stack_validate_only=true`) support are required, not advisory.

## Linting

### Configuration

The repository includes [.ansible-lint](../../.ansible-lint) configuration that enforces:

* **Moderate profile** — Balanced between permissive and strict
* **Advisory rules** — No blocking on known patterns (e.g., raw commands in bootstrap playbooks)
* **Warnings** — Experimental syntax and risky permissions are flagged but not blocked

### Running lint checks

```bash
# Lint all playbooks and roles
cd /home/chester/homelab/ansible
ansible-lint

# Lint specific playbook
ansible-lint playbooks/onboarding/generic_host.yml

# Lint entire role
ansible-lint roles/monitoring_stack/
```

### Installing ansible-lint

```bash
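# On control node (Ubuntu/Debian). Recent releases block system-wide
# `pip install` (PEP 668), so install into an isolated environment with pipx.
sudo apt-get update
sudo apt-get install -y pipx
pipx ensurepath   # adds ~/.local/bin to PATH; open a new shell afterwards
pipx install ansible-lint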

# Verify installation
ansible-lint --version
```

## Quality checklist

Use this checklist when creating or reviewing playbooks and roles:

### Security

* [ ] **No SSH bypasses** — `StrictHostKeyChecking=no` is forbidden
* [ ] **Host key checking enabled** — `ansible.cfg` must have `host_key_checking = True`
* [ ] **Secrets vaulted** — No plaintext passwords in defaults, vars, or playbooks
* [ ] **Secrets validated** — Roles requiring secrets include `assert` tasks to fail fast
* [ ] **File permissions explicit** — All `file`, `copy`, `template` tasks specify `mode`
* [ ] **No root by default** — Use `become: true` only when necessary
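
Several of these items can be spot-checked mechanically before review. The following is a minimal shell sketch, not a script that exists in the repository: the function names are hypothetical, and it assumes vaulted variable files are named `vault.yml` (for example `group_vars/all/vault.yml`). It relies only on the fact that `ansible-vault` encrypted files begin with the `$ANSIBLE_VAULT;` header.

```bash
#!/usr/bin/env bash
# Hypothetical helpers for the security checklist; not part of the repository.
# Assumption: vaulted variable files are named vault.yml.

# Print every file under DIR that disables SSH host key checking, whether
# written as an SSH option (-o StrictHostKeyChecking=no) or in
# ssh_config style (StrictHostKeyChecking no).
find_ssh_bypasses() {
  grep -rliE 'StrictHostKeyChecking[[:space:]=]+no' "$1" 2>/dev/null || true
}

# Succeed only if the config file sets host_key_checking = True (any case).
host_key_checking_enabled() {
  grep -qiE '^[[:space:]]*host_key_checking[[:space:]]*=[[:space:]]*true[[:space:]]*$' "$1"
}

# Print every vault.yml under DIR whose first line lacks the vault header.
find_unencrypted_vaults() {
  local f first
  while IFS= read -r -d '' f; do
    first=""
    IFS= read -r first < "$f" || true
    # Files encrypted by ansible-vault start with "$ANSIBLE_VAULT;".
    [[ "$first" == '$ANSIBLE_VAULT;'* ]] || printf '%s\n' "$f"
  done < <(find "$1" -type f -name 'vault.yml' -print0)
}

# Example, from the ansible/ directory:
#   [ -z "$(find_ssh_bypasses .)" ] || echo "SSH bypass found" >&2
#   host_key_checking_enabled ansible.cfg || echo "host_key_checking not enabled" >&2
#   find_unencrypted_vaults group_vars
```

These checks complement review rather than replace it; for instance, a plaintext secret in role defaults would not be caught by the vault-header test.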

### Idempotency

* [x] **Changed semantics** — All `command`/`shell` tasks include `changed_when` (**mandatory**)
* [x] **Error handling** — All `command`/`shell` tasks include `failed_when` or `ignore_errors` (**mandatory**)
* [x] **Check mode safe** — Playbooks can run with `--check` without errors (**mandatory**)
* [x] **Replay safe** — Running twice produces no changes on second run (**mandatory**; PR evidence required)
* [x] **Manager assertion** — Swarm manager checks use exact equality (`== 'active|true'`), not substring search (**mandatory**)
* [x] **Absent idempotency** — Stack removal checks existence first; no false `changed` when already absent (**mandatory**)
* [x] **Validate-only mode** — All stack deploy playbooks support `stack_validate_only=true` (**mandatory**)
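
The manager-assertion and absent-idempotency rules both come down to exact matching. The shell sketch below illustrates that logic outside Ansible; the function names are hypothetical, and it assumes the state string comes from `docker info --format '{{.Swarm.LocalNodeState}}|{{.Swarm.ControlAvailable}}'` and the stack list from `docker stack ls --format '{{.Name}}'`. In the playbooks themselves the same checks would be Ansible conditions.

```bash
#!/usr/bin/env bash
# Sketch of exact-match checks; function names are hypothetical.
# Assumed inputs:
#   docker info --format '{{.Swarm.LocalNodeState}}|{{.Swarm.ControlAvailable}}'
#   docker stack ls --format '{{.Name}}'

# Succeed only when the node is an active Swarm manager.
# Exact equality: a substring test for "active" would also match "inactive".
is_swarm_manager() {
  [[ "$1" == "active|true" ]]
}

# Succeed only if STACK appears as a whole line in the names read from stdin.
# -F fixed string, -x whole-line match, so "web" does not match "webapp".
stack_exists() {
  grep -Fxq -- "$1"
}

# Decide the removal action for STACK from the stack names on stdin,
# reporting "changed" only when the stack actually exists.
plan_stack_removal() {
  local stack="$1"
  if stack_exists "$stack"; then
    echo "changed: remove $stack"
  else
    echo "ok: $stack already absent"
  fi
}

# Example, on a manager node:
#   state=$(docker info --format '{{.Swarm.LocalNodeState}}|{{.Swarm.ControlAvailable}}')
#   is_swarm_manager "$state" || { echo "not an active manager" >&2; exit 1; }
#   docker stack ls --format '{{.Name}}' | plan_stack_removal monitoring
```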

### Modularity

* [ ] **Roles over monoliths** — Multi-task logic belongs in roles, not massive playbooks
* [ ] **Builtin modules first** — Prefer `ansible.builtin.*` over `command`/`shell`/`raw`
* [ ] **Bootstrap exception** — `raw` commands are acceptable only for pre-Python tasks
* [ ] **Variables separated** — Environment-specific values live in `group_vars`, not role defaults

### Maintainability

* [ ] **Task names descriptive** — Each task has a clear, action-oriented name
* [ ] **Tags applied** — Logical grouping with tags (e.g., `setup`, `security`, `monitoring`)
* [ ] **Documentation inline** — Complex logic includes comments explaining "why"
* [ ] **Handlers for services** — Service restarts use handlers, not inline tasks

## Mandatory pre-deploy gate (effective now — blocking for all stack changes)

> [!IMPORTANT]
> All steps below MUST pass before merging any pull request that touches
> `ansible/templates/stacks/`, `ansible/playbooks/docker/deploy_*.yml`,
> or `ansible/roles/swarm_stack_deploy/`.
> The Gitea CI workflow (`.gitea/workflows/stack-idempotency.yml`) runs
> stages 1–3 automatically on every PR. The two-run idempotency proof
> (step 7 below) must be performed manually and included as PR evidence.

For any swarm-impacting change, all checks below must pass before deployment:

```bash
cd /home/chester/homelab/ansible

# 1) Inventory parse gate
ansible-inventory -i inventory/hosts.ini --graph

# 2) Connectivity gate
ansible -i inventory/hosts.ini swarm_hosts -m ping

# 3) Swarm control-plane gate
ansible -i inventory/hosts.ini swarm_managers -m shell -a "docker info 2>/dev/null | grep -E 'Swarm:|Is Manager:'"

# 4) Playbook syntax gate
ansible-playbook -i inventory/hosts.ini playbooks/your-playbook.yml --syntax-check

# 5) Control node sanity gate
ansible-playbook -i inventory/hosts.ini playbooks/preflight/validate_control_node.yml

# 6) Validate-only preflight (no Swarm mutations — mandatory for stack changes)
ansible-playbook -i inventory/hosts.ini playbooks/docker/deploy_<service>.yml \
  -e "stack_validate_only=true" \
  --vault-password-file .vault_pass

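# 7) TWO-RUN IDEMPOTENCY PROOF (required for stack PRs; attach output as evidence)
# pipefail makes each pipeline return ansible-playbook's exit status,
# so a failed run is not hidden behind tee's success.
set -o pipefail

# Run 1: apply desired state
ansible-playbook -i inventory/hosts.ini playbooks/docker/deploy_<service>.yml \
  --vault-password-file .vault_pass \
  2>&1 | tee /tmp/run1.log

# Run 2: replay; its PLAY RECAP MUST report changed=0
ansible-playbook -i inventory/hosts.ini playbooks/docker/deploy_<service>.yml \
  --vault-password-file .vault_pass \
  2>&1 | tee /tmp/run2.log

# Verify: every PLAY RECAP host line from run 2 must show changed=0, failed=0
# and unreachable=0. The recap counts every task, not only deploy/reconcile tasks.
if ! grep -qE ' : ok=[0-9]+' /tmp/run2.log; then
  echo 'IDEMPOTENCY UNKNOWN: no PLAY RECAP in run 2'
elif grep -E ' : ok=[0-9]+.*(changed|failed|unreachable)=[1-9]' /tmp/run2.log; then
  echo 'IDEMPOTENCY FAIL'
else
  echo 'IDEMPOTENCY PASS'
fi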
```
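
Gates 1 through 6 can also be chained so that the first failure stops the sequence. The sketch below is a hypothetical wrapper, not a script in the repository; it reuses the commands listed above, assumes it is run from the `ansible/` directory, and leaves the two-run proof (step 7) manual.

```bash
#!/usr/bin/env bash
# Hypothetical wrapper around gates 1-6; not part of the repository.
# Run from the ansible/ directory. Step 7 (two-run proof) stays manual.

# run_gate NAME CMD [ARGS...]: run CMD, report PASS/FAIL, return CMD's status.
run_gate() {
  local name="$1" rc
  shift
  "$@" && rc=0 || rc=$?
  if [ "$rc" -eq 0 ]; then
    echo "PASS: $name"
  else
    echo "FAIL: $name (exit $rc)" >&2
  fi
  return "$rc"
}

# run_all_gates PLAYBOOK: run the gates in order; && stops at the first failure.
run_all_gates() {
  local inv=inventory/hosts.ini playbook="$1"
  run_gate "inventory parse" ansible-inventory -i "$inv" --graph &&
    run_gate "connectivity" ansible -i "$inv" swarm_hosts -m ping &&
    run_gate "swarm control plane" ansible -i "$inv" swarm_managers -m shell \
      -a "docker info 2>/dev/null | grep -E 'Swarm:|Is Manager:'" &&
    run_gate "syntax" ansible-playbook -i "$inv" "$playbook" --syntax-check &&
    run_gate "control node" ansible-playbook -i "$inv" playbooks/preflight/validate_control_node.yml &&
    run_gate "validate-only" ansible-playbook -i "$inv" "$playbook" \
      -e "stack_validate_only=true" --vault-password-file .vault_pass
}

# Example:
#   run_all_gates playbooks/docker/deploy_<service>.yml
```

`run_gate` captures the exit status with `&& rc=0 || rc=$?` so it still prints the FAIL line when the calling shell uses `set -e`.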

## PR evidence pack (required for stack-impacting changes)

For any PR that modifies a stack template, deploy playbook, or the
`swarm_stack_deploy` role, attach the following to the PR description:

````markdown
### Idempotency evidence

**Stack:** <service>
**Date:** YYYY-MM-DD
**Operator:** @username

**Run 1 summary:**
```
PLAY RECAP ***
swarm-manager-1 : ok=N  changed=N  ...
```

**Run 2 summary (must show changed=0 for stack tasks):**
```
PLAY RECAP ***
swarm-manager-1 : ok=N  changed=0  ...
```

**Validate-only passed:** yes/no  
**Lint passed:** yes/no (CI enforced)  
**Syntax check passed:** yes/no (CI enforced)  
````

> [!IMPORTANT]
> A PR that cannot demonstrate changed=0 on the second run MUST NOT be merged.
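
The run summaries can be cut straight from the `tee` logs produced by the two-run proof. A minimal sketch, assuming the default stdout callback (a line starting with `PLAY RECAP`, then one `host : ok=N changed=N ...` line per host); `extract_recap` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Sketch: print the PLAY RECAP section of an ansible-playbook log.
# Assumes the default stdout callback; extract_recap is hypothetical.

extract_recap() {
  # Start at the "PLAY RECAP" header, keep the per-host lines,
  # and stop at the first blank line after the recap.
  awk '/^PLAY RECAP/            { inrecap = 1; print; next }
       inrecap && / : ok=[0-9]+/ { print; next }
       inrecap && NF == 0        { inrecap = 0 }' "$1"
}

# Example:
#   extract_recap /tmp/run1.log   # paste under "Run 1 summary"
#   extract_recap /tmp/run2.log   # paste under "Run 2 summary"
```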


## Syntax checks

Before committing changes, always run the syntax check and the control node preflight:

```bash
cd /home/chester/homelab/ansible

# Check specific playbook
ansible-playbook -i inventory/hosts.ini playbooks/your-playbook.yml --syntax-check

# Preflight validation (control node sanity)
ansible-playbook -i inventory/hosts.ini playbooks/preflight/validate_control_node.yml
```

## Idempotency testing

Playbooks that modify system state should be tested for idempotency. For stack-impacting changes this is mandatory and follows the two-run proof in the pre-deploy gate:

```bash
# Run playbook twice; second run should report "changed=0"
ansible-playbook -i inventory/hosts.ini playbooks/your-playbook.yml
ansible-playbook -i inventory/hosts.ini playbooks/your-playbook.yml
```
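
The pass/fail decision on the second run can be wrapped in a reusable function for any playbook, not only stack deploys. This is a sketch under the same assumption about the default callback's recap format; `check_replay` is a hypothetical name. It returns 2 when no recap is present, so a run that crashed before the recap is never reported as a pass.

```bash
#!/usr/bin/env bash
# Sketch: judge a replay log by its PLAY RECAP; check_replay is hypothetical.
# Returns 0 (pass), 1 (non-idempotent or failing hosts), 2 (no recap found).

check_replay() {
  local log="$1"
  if ! grep -qE ' : ok=[0-9]+' "$log"; then
    echo "IDEMPOTENCY UNKNOWN: no PLAY RECAP in $log" >&2
    return 2
  fi
  # Print offending host lines: any non-zero changed/failed/unreachable count.
  if grep -E ' : ok=[0-9]+.*(changed|failed|unreachable)=[1-9]' "$log"; then
    echo "IDEMPOTENCY FAIL"
    return 1
  fi
  echo "IDEMPOTENCY PASS"
}

# Example:
#   ansible-playbook -i inventory/hosts.ini playbooks/your-playbook.yml > /tmp/replay.log 2>&1
#   check_replay /tmp/replay.log
```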

## Review process

### Pre-commit (developer)

1. Run inventory parse gate and connectivity gate
2. Run syntax check on modified playbooks
3. Run ansible-lint on modified playbooks/roles (**Tier 2: mandatory for stack files**)
4. For stack changes, run validate-only preflight
5. For stack changes, run idempotency proof (two-run) and collect evidence
6. Ensure required secrets are provided via vault (no plaintext defaults)
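
Steps 2 and 3 only need the files a branch actually touches. A shell sketch of that selection, assuming paths arrive repository-relative (as `git diff --name-only` prints them) and that the PR targets `origin/main`; both function names and the branch name are assumptions for illustration.

```bash
#!/usr/bin/env bash
# Sketch of pre-commit file selection; function names are hypothetical.
# Reads repository-relative paths on stdin and prints them relative to
# ansible/, ready to use from that directory.

# Changed YAML under playbooks/ or roles/: candidates for ansible-lint.
changed_ansible_files() {
  grep -E '^ansible/(playbooks|roles)/.*\.ya?ml$' | sed 's#^ansible/##'
}

# Changed playbooks only: candidates for --syntax-check.
changed_playbooks() {
  grep -E '^ansible/playbooks/.*\.ya?ml$' | sed 's#^ansible/##'
}

# Example, assuming the PR targets origin/main:
#   cd ansible
#   git diff --name-only origin/main... | changed_ansible_files | xargs -r ansible-lint
#   git diff --name-only origin/main... | changed_playbooks | while read -r pb; do
#     ansible-playbook -i inventory/hosts.ini "$pb" --syntax-check
#   done
```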

### Pre-merge (reviewer)

1. Verify security checklist items are addressed
2. Spot-check modularity (no 500+ line playbooks)
3. Confirm environment-specific values are in inventory, not defaults
4. Confirm no root-level duplicate Ansible directories were introduced
5. **For stack changes: verify PR evidence pack is attached and shows changed=0 on second run**
6. For critical changes (security, networking), require idempotency proof
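
The modularity spot-check can start from a list of oversized files. A minimal sketch; `oversized_playbooks` is a hypothetical helper, and its 500-line default mirrors the checklist item above, flagging files with 500 or more lines.

```bash
#!/usr/bin/env bash
# Sketch: list YAML files at or above a line-count threshold.
# oversized_playbooks is hypothetical; defaults: DIR=playbooks, LIMIT=500.

oversized_playbooks() {
  local dir="${1:-playbooks}" limit="${2:-500}" f n
  while IFS= read -r -d '' f; do
    # Arithmetic expansion strips any padding some wc builds add.
    n=$(( $(wc -l < "$f") ))
    if [ "$n" -ge "$limit" ]; then
      printf '%s\t%s\n' "$n" "$f"
    fi
  done < <(find "$dir" -type f \( -name '*.yml' -o -name '*.yaml' \) -print0)
}

# Example, from the ansible/ directory:
#   oversized_playbooks playbooks | sort -rn
```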

### Periodic review

* **Weekly:** Triage Critical/High findings from drift reports
* **Biweekly:** Run preflight validation suite
* **Monthly:** Generate fresh standards-drift audit and review trends

## Roadmap

As baseline quality improves, the repository will:

1. **Phase 1 (current):** Mandatory idempotency gate for stack changes. Lint advisory for
   non-stack playbooks. Gitea CI blocks stack PRs on lint + syntax + preflight failures.
   `no-changed-when` promoted from skip to warn (visible everywhere).
2. **Phase 2 (3 months):** Mandatory lint for all new/modified playbooks.
   `no-changed-when` moved to blocking; bootstrap exceptions suppressed inline with
   `# noqa: no-changed-when` on specific tasks.
3. **Phase 3 (6 months):** Full baseline coverage, stricter profile. All remaining
   idempotency violations resolved. Two-run check automated in CI for eligible stacks.
4. **Phase 4 (12 months):** Fully blocking CI on every commit. Molecule/integration
   tests for multi-node Swarm scenarios.

## References

* [Ansible Best Practices](https://docs.ansible.com/ansible/latest/tips_tricks/ansible_tips_tricks.html)
* [ansible-lint documentation](https://ansible-lint.readthedocs.io/)
* [environment-constraints.md](./environment-constraints.md) — Infrastructure-specific rules
* [naming-conventions.md](./naming-conventions.md) — File and variable naming standards