# Ansible quality gates
This document defines the quality standards, review checklist, and validation workflow for all Ansible code in this repository.
## Philosophy
Quality gates progress through three enforcement tiers:
- **Tier 1 (Advisory):** Visible via lint warnings; not blocking. Baseline cleanup phase.
- **Tier 2 (Mandatory — current):** Must pass for swarm-impacting changes. CI enforces.
- **Tier 3 (Fully blocking):** All rules enforced on every commit. Target: Phase 4 of the roadmap.

**Idempotency controls are Tier 2 (mandatory now) for all stack-impacting changes.**
That is, `changed_when`, manager-state assertions, secret preflight asserts,
bind-mount path asserts, and validate-only mode support are required, not advisory.
## Linting
### Configuration
The repository includes [.ansible-lint](../../.ansible-lint) configuration that enforces:
* **Moderate profile** — Balanced between permissive and strict
* **Advisory rules** — No blocking on known patterns (e.g., raw commands in bootstrap playbooks)
* **Warnings** — Experimental syntax and risky permissions are flagged but not blocked
### Running lint checks
```bash
# Lint all playbooks and roles
cd /home/chester/homelab/ansible
ansible-lint
# Lint specific playbook
ansible-lint playbooks/onboarding/generic_host.yml
# Lint entire role
ansible-lint roles/monitoring_stack/
```
### Installing ansible-lint
```bash
# On control node (Ubuntu/Debian)
sudo apt-get update
sudo apt-get install -y python3-pip
pip3 install ansible-lint
# Verify installation
ansible-lint --version
```
## Quality checklist
Use this checklist when creating or reviewing playbooks and roles:
### Security
* [ ] **No SSH bypasses** — `StrictHostKeyChecking=no` is forbidden
* [ ] **Host key checking enabled** — `ansible.cfg` must have `host_key_checking = True`
* [ ] **Secrets vaulted** — No plaintext passwords in defaults, vars, or playbooks
* [ ] **Secrets validated** — Roles requiring secrets include `assert` tasks to fail fast
* [ ] **File permissions explicit** — All `file`, `copy`, `template` tasks specify `mode`
* [ ] **No root by default** — Use `become: true` only when necessary
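
The first two items above can be checked mechanically with `grep`. The following is a minimal sketch, not part of this repository: the function name `check_ssh_hardening` is hypothetical, and it assumes `ansible.cfg` sits at the top of the directory passed in.

```bash
#!/usr/bin/env bash
# Hypothetical helper: not part of this repository.
# Usage: check_ssh_hardening <ansible-dir>
check_ssh_hardening() {
  local dir="$1" status=0

  # Rule: StrictHostKeyChecking=no is forbidden anywhere in the tree.
  # Matching lines are printed so the offender is easy to locate.
  if grep -rIn --exclude-dir=.git 'StrictHostKeyChecking=no' "$dir"; then
    echo "FAIL: StrictHostKeyChecking=no found" >&2
    status=1
  fi

  # Rule: ansible.cfg must set host_key_checking = True.
  if ! grep -Eiq '^[[:space:]]*host_key_checking[[:space:]]*=[[:space:]]*true[[:space:]]*$' \
      "$dir/ansible.cfg" 2>/dev/null; then
    echo "FAIL: host_key_checking = True missing from ansible.cfg" >&2
    status=1
  fi

  return "$status"
}
```

Point it at the Ansible tree only (for example `check_ssh_hardening /home/chester/homelab/ansible`); documentation that quotes the forbidden option, such as this file, would otherwise trigger a false positive.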
### Idempotency
* [x] **Changed semantics** — All `command`/`shell` tasks include `changed_when` (**mandatory**)
* [x] **Error handling** — All `command`/`shell` tasks include `failed_when` or `ignore_errors` (**mandatory**)
* [x] **Check mode safe** — Playbooks can run with `--check` without errors (**mandatory**)
* [x] **Replay safe** — Running twice produces no changes on second run (**mandatory**; PR evidence required)
* [x] **Manager assertion** — Swarm manager checks use exact equality (`== 'active|true'`), not substring search (**mandatory**)
* [x] **Absent idempotency** — Stack removal checks existence first; no false `changed` when already absent (**mandatory**)
* [x] **Validate-only mode** — All stack deploy playbooks support `stack_validate_only=true` (**mandatory**)
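
The manager assertion item is easiest to justify with a counterexample: a substring search for `active` also matches `inactive`. The sketch below contrasts the two approaches. The function names are hypothetical, and the `docker info --format` string in the trailing comment is an assumption about how the `active|true` value is produced, not something taken from this repository's roles.

```bash
#!/usr/bin/env bash
# Hypothetical helpers illustrating exact vs. substring matching.

# Exact equality: only "active|true" passes.
is_active_manager() {
  [ "$1" = 'active|true' ]
}

# Substring search (the anti-pattern): "inactive|true" also passes.
is_active_manager_substring() {
  case "$1" in
    *active*) return 0 ;;
    *) return 1 ;;
  esac
}

# Assumed way to obtain the state string on a Swarm node:
#   state="$(docker info --format '{{.Swarm.LocalNodeState}}|{{.Swarm.ControlAvailable}}')"
#   is_active_manager "$state" || { echo "not an active manager: $state" >&2; exit 1; }
```

With these definitions, `is_active_manager 'inactive|true'` fails as it should, while `is_active_manager_substring 'inactive|true'` wrongly succeeds.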
### Modularity
* [ ] **Roles over monoliths** — Multi-task logic belongs in roles, not massive playbooks
* [ ] **Builtin modules first** — Prefer `ansible.builtin.*` over `command`/`shell`/`raw`
* [ ] **Bootstrap exception** — `raw` commands are acceptable only for pre-Python tasks
* [ ] **Variables separated** — Environment-specific values live in `group_vars`, not role defaults
### Maintainability
* [ ] **Task names descriptive** — Each task has a clear, action-oriented name
* [ ] **Tags applied** — Logical grouping with tags (e.g., `setup`, `security`, `monitoring`)
* [ ] **Documentation inline** — Complex logic includes comments explaining "why"
* [ ] **Handlers for services** — Service restarts use handlers, not inline tasks
## Mandatory pre-deploy gate (effective now — blocking for all stack changes)
> [!IMPORTANT]
> All steps below MUST pass before merging any pull request that touches
> `ansible/templates/stacks/`, `ansible/playbooks/docker/deploy_*.yml`,
> or `ansible/roles/swarm_stack_deploy/`.
> The Gitea CI workflow (`.gitea/workflows/stack-idempotency.yml`) runs
> stages 1–3 automatically on every PR. The two-run idempotency proof
> (step 7 below) must be performed manually and included as PR evidence.

For any swarm-impacting change, all checks below must pass before deployment:
```bash
cd /home/chester/homelab/ansible
# 1) Inventory parse gate
ansible-inventory -i inventory/hosts.ini --graph
# 2) Connectivity gate
ansible -i inventory/hosts.ini swarm_hosts -m ping
# 3) Swarm control-plane gate
ansible -i inventory/hosts.ini swarm_managers -m shell -a "docker info 2>/dev/null | grep -E 'Swarm:|Is Manager:'"
# 4) Playbook syntax gate
ansible-playbook -i inventory/hosts.ini playbooks/your-playbook.yml --syntax-check
# 5) Control node sanity gate
ansible-playbook -i inventory/hosts.ini playbooks/preflight/validate_control_node.yml
# 6) Validate-only preflight (no Swarm mutations — mandatory for stack changes)
ansible-playbook -i inventory/hosts.ini playbooks/docker/deploy_<service>.yml \
-e "stack_validate_only=true" \
--vault-password-file .vault_pass
# 7) TWO-RUN IDEMPOTENCY PROOF (required for stack PRs — attach output as evidence)
# Run 1: apply desired state
ansible-playbook -i inventory/hosts.ini playbooks/docker/deploy_<service>.yml \
--vault-password-file .vault_pass \
2>&1 | tee /tmp/run1.log
# Run 2: replay — MUST report changed=0 for stack tasks
ansible-playbook -i inventory/hosts.ini playbooks/docker/deploy_<service>.yml \
--vault-password-file .vault_pass \
2>&1 | tee /tmp/run2.log
# Verify: second run must show changed=0 for deploy/reconcile tasks
grep -E 'changed=[^0]' /tmp/run2.log && echo 'IDEMPOTENCY FAIL' || echo 'IDEMPOTENCY PASS'
```
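The gates above are ordered so that cheap checks fail before expensive ones. One way to keep that ordering when scripting them is a small wrapper that reports each gate and stops at the first failure. This is a sketch only: `run_gate` and `run_predeploy_gates` are hypothetical names, and the second function covers just gates 1, 2, and 4 using the commands shown above.

```bash
#!/usr/bin/env bash
# Hypothetical wrapper: runs one gate, prints PASS/FAIL, returns the gate's exit status.
run_gate() {
  local name="$1"; shift
  if "$@"; then
    echo "PASS: $name"
  else
    local rc=$?
    echo "FAIL: $name (exit $rc)" >&2
    return "$rc"
  fi
}

# Runs a subset of the gates in order; `&&` stops the chain at the first failure.
# Usage: run_predeploy_gates playbooks/your-playbook.yml
run_predeploy_gates() {
  local inv="inventory/hosts.ini"
  run_gate "inventory parse" ansible-inventory -i "$inv" --graph &&
  run_gate "connectivity" ansible -i "$inv" swarm_hosts -m ping &&
  run_gate "playbook syntax" ansible-playbook -i "$inv" "$1" --syntax-check
}
```

Because each gate's exit status is passed through unchanged, the wrapper can be dropped into a CI step or a local pre-commit script without hiding which command failed.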
## PR evidence pack (required for stack-impacting changes)
For any PR that modifies a stack template, deploy playbook, or the
`swarm_stack_deploy` role, attach the following to the PR description:
````
### Idempotency evidence
**Stack:** <service>
**Date:** YYYY-MM-DD
**Operator:** @username
**Run 1 summary:**
```
PLAY RECAP ***
swarm-manager-1 : ok=N changed=N ...
```
**Run 2 summary (must show changed=0 for stack tasks):**
```
PLAY RECAP ***
swarm-manager-1 : ok=N changed=0 ...
```
**Validate-only passed:** yes/no
**Lint passed:** yes/no (CI enforced)
**Syntax check passed:** yes/no (CI enforced)
````
> [!IMPORTANT]
> A PR that cannot demonstrate changed=0 on the second run MUST NOT be merged.
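
The run summaries in the evidence pack can be pulled straight from the `tee` logs rather than copied by hand. A minimal sketch, assuming the log contains the standard `PLAY RECAP` banner followed by one line per host; `extract_recap` is a hypothetical name.

```bash
#!/usr/bin/env bash
# Hypothetical helper: prints the PLAY RECAP block of an ansible-playbook log.
# Usage: extract_recap /tmp/run2.log
extract_recap() {
  # Start printing at the "PLAY RECAP" banner; stop at the next blank line or end of file.
  awk '/^PLAY RECAP/ {on=1} on && NF==0 {exit} on {print}' "$1"
}
```

For example, `extract_recap /tmp/run1.log` and `extract_recap /tmp/run2.log` produce the two blocks to paste under the run 1 and run 2 summaries.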
## Syntax validation
Before committing changes, always run syntax checks:
```bash
cd /home/chester/homelab/ansible
# Check specific playbook
ansible-playbook -i inventory/hosts.ini playbooks/your-playbook.yml --syntax-check
# Preflight validation (control node sanity)
ansible-playbook -i inventory/hosts.ini playbooks/preflight/validate_control_node.yml
```
## Idempotency testing
High-risk playbooks (those modifying system state) should be tested for idempotency:
```bash
# Run playbook twice; second run should report "changed=0"
ansible-playbook -i inventory/hosts.ini playbooks/your-playbook.yml
ansible-playbook -i inventory/hosts.ini playbooks/your-playbook.yml
```
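Reading the recap by eye is error-prone when several hosts are involved. The sketch below sums every `changed=N` counter in a saved log and fails unless the total is zero. The function names are hypothetical, and the log is assumed to come from `ansible-playbook ... 2>&1 | tee <file>`. It is stricter than a per-task check: any change on any host counts as a failure.

```bash
#!/usr/bin/env bash
# Hypothetical helper: sums every changed=N counter found in an ansible-playbook log.
# Usage: recap_changed_total /tmp/run2.log   (prints the sum)
recap_changed_total() {
  grep -oE 'changed=[0-9]+' "$1" | awk -F= '{sum += $2} END {print sum + 0}'
}

# Succeeds only when the log has a PLAY RECAP and the replay changed nothing.
assert_idempotent() {
  local total
  # Guard: a log without a recap (crashed or empty run) must not count as a pass.
  if ! grep -q '^PLAY RECAP' "$1"; then
    echo "IDEMPOTENCY FAIL: no PLAY RECAP in $1" >&2
    return 2
  fi
  total="$(recap_changed_total "$1")"
  if [ "$total" -eq 0 ]; then
    echo "IDEMPOTENCY PASS"
  else
    echo "IDEMPOTENCY FAIL: changed=$total" >&2
    return 1
  fi
}
```

Typical use is to save the second run with `tee /tmp/run2.log` and then call `assert_idempotent /tmp/run2.log`.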
## Review process
### Pre-commit (developer)
1. Run inventory parse gate and connectivity gate
2. Run syntax check on modified playbooks
3. Run ansible-lint on modified playbooks/roles (**Tier 2: mandatory for stack files**)
4. For stack changes, run validate-only preflight
5. For stack changes, run idempotency proof (two-run) and collect evidence
6. Ensure required secrets are provided via vault (no plaintext defaults)
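
Step 3 only needs the files touched by the change. One possible way to narrow the lint run is to filter a list of changed paths down to YAML files under `playbooks/` and `roles/`. This is a sketch: `filter_lintable` is a hypothetical name, and the paths are assumed to be relative to the `ansible/` directory.

```bash
#!/usr/bin/env bash
# Hypothetical helper: reads file paths on stdin and keeps only YAML files
# under playbooks/ or roles/ (paths relative to the ansible/ directory).
filter_lintable() {
  # `|| true` keeps "no matching files" from being reported as an error.
  grep -E '^(playbooks|roles)/.*\.ya?ml$' || true
}
```

From inside the `ansible/` directory it could be combined with git as `git diff --name-only --relative --diff-filter=d origin/main...HEAD | filter_lintable | xargs -r ansible-lint`, where `origin/main` is an assumed base branch and `xargs -r` (GNU) skips the lint call when nothing matches.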
### Pre-merge (reviewer)
1. Verify security checklist items are addressed
2. Spot-check modularity (no 500+ line playbooks)
3. Confirm environment-specific values are in inventory, not defaults
4. Confirm no root-level duplicate Ansible directories were introduced
5. **For stack changes: verify PR evidence pack is attached and shows changed=0 on second run**
6. For critical changes (security, networking), require idempotency proof
### Recurring checks
* **Weekly:** Triage Critical/High findings from drift reports
* **Biweekly:** Run preflight validation suite
* **Monthly:** Generate fresh standards-drift audit and review trends
## Roadmap
As baseline quality improves, enforcement will tighten in four phases:
1. **Phase 1 (current):** Mandatory idempotency gate for stack changes. Lint advisory for
non-stack playbooks. Gitea CI blocks stack PRs on lint + syntax + preflight failures.
`no-changed-when` promoted from skip to warn (visible everywhere).
2. **Phase 2 (3 months):** Mandatory lint for all new/modified playbooks.
`no-changed-when` moved to blocking; bootstrap exceptions suppressed inline with
`# noqa: no-changed-when` on specific tasks.
3. **Phase 3 (6 months):** Full baseline coverage, stricter profile. All remaining
idempotency violations resolved. Two-run check automated in CI for eligible stacks.
4. **Phase 4 (12 months):** Fully blocking CI on every commit. Molecule/integration
tests for multi-node Swarm scenarios.
## References
* [Ansible Best Practices](https://docs.ansible.com/ansible/latest/tips_tricks/ansible_tips_tricks.html)
* [ansible-lint documentation](https://ansible-lint.readthedocs.io/)
* [environment-constraints.md](./environment-constraints.md) — Infrastructure-specific rules
* [naming-conventions.md](./naming-conventions.md) — File and variable naming standards