Ansible quality gates

This document defines the quality standards, review checklist, and validation workflow for all Ansible code in this repository.

Philosophy

Quality gates progress through three enforcement tiers:

  • Tier 1 (Advisory): Visible via lint warnings; not blocking. Baseline cleanup phase.
  • Tier 2 (Mandatory — current): Must pass for swarm-impacting changes. CI enforces.
  • Tier 3 (Fully blocking): All rules enforced on every commit. Target: Phase 4 of the roadmap.

Idempotency controls are Tier 2 (mandatory now) for all stack-impacting changes: changed_when, manager-state assertions, secret preflight asserts, bind-mount path asserts, and validate-only mode support are required, not advisory.

Linting

Configuration

The repository includes .ansible-lint configuration that enforces:

  • Moderate profile — Balanced between permissive and strict
  • Advisory rules — No blocking on known patterns (e.g., raw commands in bootstrap playbooks)
  • Warnings — Experimental syntax and risky permissions are flagged but not blocked

Running lint checks

# Lint all playbooks and roles
cd /home/chester/homelab/ansible
ansible-lint

# Lint specific playbook
ansible-lint playbooks/onboarding/generic_host.yml

# Lint entire role
ansible-lint roles/monitoring_stack/

Installing ansible-lint

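# On control node (Ubuntu/Debian)
sudo apt-get update
sudo apt-get install -y python3-pip
pip3 install ansible-lint
# If pip refuses with "externally-managed-environment" (PEP 668, newer Debian/Ubuntu),
# install into an isolated environment instead, e.g. with pipx:
#   sudo apt-get install -y pipx && pipx install ansible-lint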

# Verify installation
ansible-lint --version

Quality checklist

Use this checklist when creating or reviewing playbooks and roles:

Security

  • No SSH bypasses: StrictHostKeyChecking=no is forbidden
  • Host key checking enabled: ansible.cfg must have host_key_checking = True
  • Secrets vaulted — No plaintext passwords in defaults, vars, or playbooks
  • Secrets validated — Roles requiring secrets include assert tasks to fail fast
  • File permissions explicit — All file, copy, template tasks specify mode
  • No root by default — Use become: true only when necessary
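
The first two security items can be checked mechanically. The following is a minimal sketch, not a script that exists in this repository: the function name check_ssh_policy is hypothetical, and it assumes the directory passed in contains ansible.cfg at its top level. Only YAML, cfg, and ini files are scanned, so documentation that quotes the forbidden option is not flagged.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the repository): verify the SSH host key
# policy for the Ansible tree rooted at $1.
check_ssh_policy() {
    local root="$1" status=0

    # Any literal StrictHostKeyChecking=no in config or YAML is a violation.
    if grep -rIFq --exclude-dir=.git \
            --include='*.yml' --include='*.yaml' \
            --include='*.cfg' --include='*.ini' \
            'StrictHostKeyChecking=no' "$root"; then
        echo "FAIL: StrictHostKeyChecking=no found"
        status=1
    fi

    # ansible.cfg must explicitly set host_key_checking = True.
    if ! grep -Eiq '^[[:space:]]*host_key_checking[[:space:]]*=[[:space:]]*True[[:space:]]*$' \
            "$root/ansible.cfg" 2>/dev/null; then
        echo "FAIL: host_key_checking = True missing from ansible.cfg"
        status=1
    fi

    [ "$status" -eq 0 ] && echo "PASS"
    return "$status"
}

# Example (assumed path): check_ssh_policy /home/chester/homelab/ansible
```

The check is intentionally narrow: it looks only for the literal option string and the explicit setting, so it complements ansible-lint and review rather than replacing them.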

Idempotency

  • Changed semantics — All command/shell tasks include changed_when (mandatory)
  • Error handling — All command/shell tasks include failed_when or ignore_errors (mandatory)
  • Check mode safe — Playbooks can run with --check without errors (mandatory)
  • Replay safe — Running twice produces no changes on second run (mandatory; PR evidence required)
  • Manager assertion — Swarm manager checks use exact equality (== 'active|true'), not substring search (mandatory)
  • Absent idempotency — Stack removal checks existence first; no false changed when already absent (mandatory)
  • Validate-only mode — All stack deploy playbooks support stack_validate_only=true (mandatory)
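
Two of these rules, exact-equality manager checks and absent idempotency, come down to exact string matching. The sketch below illustrates both in shell. The function names are hypothetical, and the docker info format string is an assumption about how the 'active|true' value is produced; the repository's role may gather it differently.

```shell
#!/usr/bin/env bash
# Assumption: the combined value is "<LocalNodeState>|<ControlAvailable>".
manager_state() {
    docker info --format '{{.Swarm.LocalNodeState}}|{{.Swarm.ControlAvailable}}' 2>/dev/null
}

# Exact equality: only "active|true" passes.
# A substring search for "active" would also accept "inactive|false".
is_active_manager() {
    [ "$1" = "active|true" ]
}

# Reads stack names on stdin, one per line; succeeds only on an exact,
# whole-line match, so "monitor" does not match "monitoring".
stack_listed() {
    grep -Fxq -- "$1"
}

# Remove a stack only if it exists; report "ok" (no change) when already absent.
remove_stack_if_present() {
    if docker stack ls --format '{{.Name}}' | stack_listed "$1"; then
        docker stack rm "$1" && echo "changed"
    else
        echo "ok"
    fi
}

# Example:
#   is_active_manager "$(manager_state)" || { echo "not an active manager" >&2; exit 1; }
```

In a role, the "changed"/"ok" output is the kind of signal a changed_when expression can key on, so a removal that finds nothing to remove does not report a change.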

Modularity

  • Roles over monoliths — Multi-task logic belongs in roles, not massive playbooks
  • Builtin modules first — Prefer ansible.builtin.* over command/shell/raw
  • Bootstrap exception: raw commands are acceptable only for pre-Python tasks
  • Variables separated — Environment-specific values live in group_vars, not role defaults

Maintainability

  • Task names descriptive — Each task has a clear, action-oriented name
  • Tags applied — Logical grouping with tags (e.g., setup, security, monitoring)
  • Documentation inline — Complex logic includes comments explaining "why"
  • Handlers for services — Service restarts use handlers, not inline tasks

Mandatory pre-deploy gate (effective now — blocking for all stack changes)

Important

All steps below MUST pass before merging any pull request that touches ansible/templates/stacks/, ansible/playbooks/docker/deploy_*.yml, or ansible/roles/swarm_stack_deploy/. The Gitea CI workflow (.gitea/workflows/stack-idempotency.yml) runs stages 1-3 automatically on every PR. The two-run idempotency proof (step 7 below) must be performed manually and included as PR evidence.
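
Whether a change falls under this gate depends only on the paths it touches. As a rough sketch, the three path patterns named above can be encoded as a shell case statement. The function names are hypothetical, and the example git range is an assumption about the branch layout.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds if a repo-relative path is stack-impacting.
is_stack_impacting() {
    case "$1" in
        ansible/templates/stacks/*)            return 0 ;;
        ansible/playbooks/docker/deploy_*.yml) return 0 ;;
        ansible/roles/swarm_stack_deploy/*)    return 0 ;;
        *)                                     return 1 ;;
    esac
}

# Reads changed paths on stdin; succeeds if any one of them is stack-impacting.
any_stack_impacting() {
    local path
    while IFS= read -r path; do
        is_stack_impacting "$path" && return 0
    done
    return 1
}

# Example (assumes the PR targets origin/main):
#   git diff --name-only origin/main...HEAD | any_stack_impacting \
#       && echo "full pre-deploy gate required"
```

In a case pattern `*` also matches `/`, so the deploy_*.yml pattern errs on the side of including more paths, which is the safer direction for a gate.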

For any swarm-impacting change, all checks below must pass before deployment:

cd /home/chester/homelab/ansible

# 1) Inventory parse gate
ansible-inventory -i inventory/hosts.ini --graph

# 2) Connectivity gate
ansible -i inventory/hosts.ini swarm_hosts -m ping

# 3) Swarm control-plane gate
ansible -i inventory/hosts.ini swarm_managers -m shell -a "docker info 2>/dev/null | grep -E 'Swarm:|Is Manager:'"

# 4) Playbook syntax gate
ansible-playbook -i inventory/hosts.ini playbooks/your-playbook.yml --syntax-check

# 5) Control node sanity gate
ansible-playbook -i inventory/hosts.ini playbooks/preflight/validate_control_node.yml

# 6) Validate-only preflight (no Swarm mutations — mandatory for stack changes)
ansible-playbook -i inventory/hosts.ini playbooks/docker/deploy_<service>.yml \
  -e "stack_validate_only=true" \
  --vault-password-file .vault_pass

# 7) TWO-RUN IDEMPOTENCY PROOF (required for stack PRs — attach output as evidence)
# Run 1: apply desired state
ansible-playbook -i inventory/hosts.ini playbooks/docker/deploy_<service>.yml \
  --vault-password-file .vault_pass \
  2>&1 | tee /tmp/run1.log

# Run 2: replay — MUST report changed=0 for stack tasks
ansible-playbook -i inventory/hosts.ini playbooks/docker/deploy_<service>.yml \
  --vault-password-file .vault_pass \
  2>&1 | tee /tmp/run2.log

# Verify: second run must show changed=0 for deploy/reconcile tasks
grep -E 'changed=[^0]' /tmp/run2.log && echo 'IDEMPOTENCY FAIL' || echo 'IDEMPOTENCY PASS'
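
The verification step can also be written as a small function that names the offending hosts and refuses to pass a log that contains no recap at all (for example, when the run aborted early). This is a sketch under the assumption that the log holds standard ansible-playbook output with a PLAY RECAP section; the function names are hypothetical.

```shell
#!/usr/bin/env bash
# Print the hosts whose PLAY RECAP line reports a non-zero changed count.
changed_hosts() {
    awk '
        /^PLAY RECAP/ { in_recap = 1; next }
        in_recap && match($0, /changed=[0-9]+/) {
            # "changed=" is 8 characters long; the digits follow it.
            count = substr($0, RSTART + 8, RLENGTH - 8)
            if (count + 0 > 0) print $1
        }
    ' "$1"
}

# Pass only if a recap exists and every host reports changed=0.
assert_idempotent() {
    local hosts
    if ! grep -q '^PLAY RECAP' "$1"; then
        echo "IDEMPOTENCY FAIL: no PLAY RECAP found"
        return 1
    fi
    hosts=$(changed_hosts "$1")
    if [ -n "$hosts" ]; then
        printf 'IDEMPOTENCY FAIL: %s\n' "$(printf '%s' "$hosts" | tr '\n' ' ')"
        return 1
    fi
    echo "IDEMPOTENCY PASS"
}

# Example: assert_idempotent /tmp/run2.log
```

Unlike a one-line grep, this reads counts only from the recap section and returns a non-zero exit status on failure, which makes it usable as a scripted gate.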

PR evidence pack (required for stack-impacting changes)

For any PR that modifies a stack template, deploy playbook, or the swarm_stack_deploy role, attach the following to the PR description:

### Idempotency evidence

**Stack:** <service>
**Date:** YYYY-MM-DD
**Operator:** @username

**Run 1 summary:**

PLAY RECAP *** swarm-manager-1 : ok=N changed=N ...


**Run 2 summary (must show changed=0 for stack tasks):**

PLAY RECAP *** swarm-manager-1 : ok=N changed=0 ...


**Validate-only passed:** yes/no  
**Lint passed:** yes/no (CI enforced)  
**Syntax check passed:** yes/no (CI enforced)  

Important

A PR that cannot demonstrate changed=0 on the second run MUST NOT be merged.
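
The recap sections of the evidence pack can be filled in from the two run logs rather than copied by hand. The following is a sketch only: recap_of and evidence_pack are hypothetical names, the logs are assumed to be the tee output of the two runs, and the Operator line and the yes/no lines are left for the author to add.

```shell
#!/usr/bin/env bash
# Print the PLAY RECAP header and the non-empty lines that follow it.
recap_of() {
    awk '/^PLAY RECAP/ { found = 1 } found && NF' "$1"
}

# Render the run summaries of the evidence pack as Markdown on stdout.
evidence_pack() {
    local stack="$1" run1="$2" run2="$3"
    printf '### Idempotency evidence\n\n'
    printf '**Stack:** %s\n' "$stack"
    printf '**Date:** %s\n\n' "$(date +%F)"
    printf '**Run 1 summary:**\n\n%s\n\n' "$(recap_of "$run1")"
    printf '**Run 2 summary (must show changed=0 for stack tasks):**\n\n%s\n' "$(recap_of "$run2")"
}

# Example: evidence_pack <service> /tmp/run1.log /tmp/run2.log
```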

Before committing changes, always run syntax checks:

cd /home/chester/homelab/ansible

# Check specific playbook
ansible-playbook -i inventory/hosts.ini playbooks/your-playbook.yml --syntax-check

# Preflight validation (control node sanity)
ansible-playbook -i inventory/hosts.ini playbooks/preflight/validate_control_node.yml

Idempotency testing

High-risk playbooks (those modifying system state) should be tested for idempotency:

# Run playbook twice; second run should report "changed=0"
ansible-playbook -i inventory/hosts.ini playbooks/your-playbook.yml
ansible-playbook -i inventory/hosts.ini playbooks/your-playbook.yml

Review process

Pre-commit (developer)

  1. Run inventory parse gate and connectivity gate
  2. Run syntax check on modified playbooks
  3. Run ansible-lint on modified playbooks/roles (Tier 2: mandatory for stack files)
  4. For stack changes, run validate-only preflight
  5. For stack changes, run idempotency proof (two-run) and collect evidence
  6. Ensure required secrets are provided via vault (no plaintext defaults)
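
The developer steps above lend themselves to a single fail-fast wrapper. The sketch below shows one way to chain them. The run_gates name is hypothetical, and the commands in the example comment are the ones already listed in this document, with the placeholder playbook path left as-is.

```shell
#!/usr/bin/env bash
# Run each gate command in order and stop at the first one that fails.
run_gates() {
    local gate
    for gate in "$@"; do
        echo "==> $gate"
        if ! bash -c "$gate"; then
            echo "GATE FAILED: $gate" >&2
            return 1
        fi
    done
    echo "ALL GATES PASSED"
}

# Example, run from the Ansible directory:
#   run_gates \
#     "ansible-inventory -i inventory/hosts.ini --graph" \
#     "ansible -i inventory/hosts.ini swarm_hosts -m ping" \
#     "ansible-playbook -i inventory/hosts.ini playbooks/your-playbook.yml --syntax-check" \
#     "ansible-lint playbooks/your-playbook.yml"
```

Stopping at the first failure keeps the output short and mirrors the ordering of the steps: there is little value in linting a playbook whose inventory does not parse.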

Pre-merge (reviewer)

  1. Verify security checklist items are addressed
  2. Spot-check modularity (no 500+ line playbooks)
  3. Confirm environment-specific values are in inventory, not defaults
  4. Confirm no root-level duplicate Ansible directories were introduced
  5. For stack changes: verify PR evidence pack is attached and shows changed=0 on second run
  6. For critical changes (security, networking), require idempotency proof

Recurring checks

  • Weekly: Triage Critical/High findings from drift reports
  • Biweekly: Run preflight validation suite
  • Monthly: Generate fresh standards-drift audit and review trends

Roadmap

As baseline quality improves, enforcement will tighten in four phases:

  1. Phase 1 (current): Mandatory idempotency gate for stack changes. Lint advisory for non-stack playbooks. Gitea CI blocks stack PRs on lint + syntax + preflight failures. no-changed-when promoted from skip to warn (visible everywhere).
  2. Phase 2 (3 months): Mandatory lint for all new/modified playbooks. no-changed-when moved to blocking; bootstrap exceptions suppressed inline with # noqa: no-changed-when on specific tasks.
  3. Phase 3 (6 months): Full baseline coverage, stricter profile. All remaining idempotency violations resolved. Two-run check automated in CI for eligible stacks.
  4. Phase 4 (12 months): Fully blocking CI on every commit. Molecule/integration tests for multi-node Swarm scenarios.

References