diff --git a/.ansible/.lock b/.ansible/.lock deleted file mode 100644 index e69de29..0000000 diff --git a/ansible/.ansible-lint b/ansible/.ansible-lint deleted file mode 100644 index 7a89805..0000000 --- a/ansible/.ansible-lint +++ /dev/null @@ -1,31 +0,0 @@ ---- -# .ansible-lint - Architecture Enforcement Configuration -# This ensures idempotency, security, and best practices. - -# Use the 'safety' profile to enforce strict security and reliability rules -profile: safety - -# Stop the build if these rules are violated -strict: true - -# Rules to explicitly enforce or ignore -warn_list: - - experimental # Notify me of experimental features but don't fail - - name[casing] # Warning only for task name capitalization - -skip_list: - - yaml[line-length] # Homelab scripts often have long strings/URLs - -# Exclude these paths from linting -exclude_paths: - - .cache/ - - .git/ - - archive/ # Legacy reference files - - roles/external/ # Don't lint roles downloaded from Galaxy - -# Enable offline mode for airgapped environments -offline: false - -# Enable FQCN enforcement (Fully Qualified Collection Names) -# e.g., ansible.builtin.copy instead of just 'copy' -# This is now enforced by the 'safety' profile by default diff --git a/ansible/.ansible-standards.md b/ansible/.ansible-standards.md deleted file mode 100644 index 9a27014..0000000 --- a/ansible/.ansible-standards.md +++ /dev/null @@ -1,25 +0,0 @@ -# Ansible Architectural Standards v1.0 ---- -metadata: - role: Lead Ansible Architect - enforcement: Strict - idempotency: Required - vault_encryption: Required ---- - -## 1. Project Philosophy -- **Agentless Execution:** Rely on SSH and Python 3. -- **Desired State:** Tasks must define the *result*, not the *command* (e.g., use `apt`, not `shell: apt install`). -- **Failure Domains:** Use `block/rescue` for all destructive or system-level changes (updates, partitioning). - -## 2. 
Technical Specs -- **Connection:** SSH via ED25519 keys; `ansible_user` must have passwordless sudo or Vault-stored credentials. -- **Variables:** - `defaults/main.yml`: Default values (lowest priority). - - `vars/main.yml`: Role-specific constants. - - `group_vars/`: Environment-specific overrides. -- **Naming:** Kebab-case for files (`web-server.yml`), snake_case for variables (`web_server_port`). - -## 3. Maintenance Logic -- **Serial Execution:** `serial: 1` for hypervisor/cluster nodes. -- **Reboot Strategy:** Always check for `/var/run/reboot-required` before initiating a `reboot` task. -- **Service Verification:** Post-task loops must verify that critical services (e.g., `pveproxy`) are `started`. \ No newline at end of file diff --git a/ansible/DEVELOPMENT-SETUP.md b/ansible/DEVELOPMENT-SETUP.md deleted file mode 100644 index 830fbe2..0000000 --- a/ansible/DEVELOPMENT-SETUP.md +++ /dev/null @@ -1,69 +0,0 @@ -# Development Setup Manifest -**Version:** 1.0 -**Target Environment:** Ansible Control Node & Local Workstation - -This document outlines the software and configurations required to develop, lint, and execute Ansible playbooks within this ecosystem. - ---- - -## 1. CLI Tools (The Engine Room) - -These tools must be installed on your **Control Node**. If you are developing locally, they should also be installed on your workstation. - -| Tool | Function | Purpose | Cost | -| :--- | :--- | :--- | :--- | -| **Ansible-Core** | Execution Engine | Processes YAML playbooks and manages SSH connections. | Free | -| **Ansible-Lint** | Static Analysis | Validates code against best practices and idempotency rules. | Free | -| **Molecule** | Testing Framework | Runs playbooks against temporary containers to verify roles. | Free | -| **Ansible-Vault** | Secret Management | Encrypts sensitive data (passwords/API keys) at rest. | Free | -| **Proxmoxer** | Python API Library | Allows Ansible to communicate with the Proxmox VE API. 
| Free | -| **ssh-pass** | Auth Utility | Enables password-based login during the initial key-copy phase. | Free | - -### Installation Command (Debian/Ubuntu) -```bash -sudo apt update && sudo apt install -y ansible ansible-lint sshpass python3-pip -pip3 install proxmoxer --break-system-packages -``` - ---- - -## 2. VSCode Extensions (The Cockpit) - -For the best development experience, install these extensions in Visual Studio Code. - -### **Ansible (by Red Hat)** -* **What it does:** Provides syntax highlighting, jinja2 auto-completion, and direct linting integration. -* **Why you want it:** It catches "broken" YAML and missing parameters while you type. -* **Cost:** Free - -### **YAML (by Red Hat)** -* **What it does:** Validates the structure of `.yml` and `.yaml` files. -* **Why you want it:** Ansible is hypersensitive to indentation; this extension prevents 90% of syntax errors. -* **Cost:** Free - -### **GitLens (by GitKraken)** -* **What it does:** Provides "blame" annotations and repository heatmaps. -* **Why you want it:** Crucial for tracking *why* a system configuration was changed three months ago. -* **Cost:** Free Core (Pro features available via sub) - -### **Remote - SSH (by Microsoft)** -* **What it does:** Connects VSCode directly to your Control Node over SSH. -* **Why you want it:** Allows you to code on your main PC but use the environment/tools installed on the Control Node. -* **Cost:** Free - ---- - -## 3. Configuration Files - -To ensure the tools above work correctly, the following files should exist in your project root: - -1. **`.ansible-lint`**: Defines the "Safety" profile to enforce architecture standards. -2. **`ansible.cfg`**: Configures default inventory paths and SSH behavior. -3. **`.ssh/id_ed25519`**: The private key used for node authentication. - ---- - -## 4. 
LLM Context Hook -When using an LLM to generate code for this project, provide the following context to ensure compatibility: - -> "My environment uses **Ansible-Core** with the **Proxmoxer** API library. I enforce standards via **Ansible-Lint** using the **safety** profile. All playbooks must pass these checks and use **ED25519** keys for authentication." \ No newline at end of file diff --git a/ansible/QUICK-REFERENCE.md b/ansible/QUICK-REFERENCE.md deleted file mode 100644 index 6e2e1e0..0000000 --- a/ansible/QUICK-REFERENCE.md +++ /dev/null @@ -1,332 +0,0 @@ -# Ansible Quick Reference - -## Current Environment Status - -**Last Validated:** April 13, 2026 -**Status:** 🟢 OPERATIONAL -**Managed Nodes:** 4 (Watchtower, Heimdall, Waldorf, PVE01) - ---- - -## Quick Commands - -### Health Checks - -```bash -# Basic connectivity test -ansible all -m ping - -# Full environment validation -./validate-environment.sh - -# Check Ansible version -ansible --version - -# List all managed hosts -ansible-inventory --graph -``` - -### Ad-Hoc Commands - -```bash -# Execute command on all nodes -ansible all -m command -a "uptime" - -# Execute with privilege escalation -ansible all -m command -a "whoami" --become - -# Check disk space on all nodes -ansible all -m shell -a "df -h /" - -# Gather facts from specific group -ansible docker_nodes -m setup -``` - -### Playbook Operations - -```bash -# Syntax check -ansible-playbook playbooks/test-connection.yml --syntax-check - -# Dry run (check mode) -ansible-playbook playbooks/test-connection.yml --check - -# Execute playbook -ansible-playbook playbooks/test-connection.yml - -# Run with verbose output -ansible-playbook playbooks/test-connection.yml -vvv - -# Limit to specific hosts -ansible-playbook playbooks/test-connection.yml --limit heimdall -``` - -### Ansible Vault Operations - -```bash -# View encrypted file -ansible-vault view inventory/group_vars/all/vault.yml - -# Edit encrypted file -ansible-vault edit 
inventory/group_vars/all/vault.yml - -# Encrypt a file -ansible-vault encrypt path/to/file.yml - -# Decrypt a file -ansible-vault decrypt path/to/file.yml - -# Change vault password -ansible-vault rekey inventory/group_vars/all/vault.yml -``` - -### Linting & Quality - -```bash -# Lint specific playbook -ansible-lint playbooks/test-connection.yml - -# Lint all playbooks -ansible-lint playbooks/*.yml - -# Lint with strict mode -ansible-lint --strict playbooks/ - -# Show configuration -ansible-config list -ansible-config dump --only-changed -``` - ---- - -## Inventory Groups - -The inventory is organized into hardware and functional groups: - -### Hardware Groups -- **control_plane** - Watchtower (Ansible control node) -- **docker_nodes** - Heimdall, Waldorf -- **physical_servers** - Heimdall, Waldorf -- **raspberry_pi** - Watchtower -- **proxmox_cluster** - PVE01 - -### Functional Groups -- **core_services** - Heimdall (Komodo, Gitea, Traefik) -- **media_services** - Waldorf (Plex, Tunarr) -- **nfs_clients** - Heimdall, Waldorf - -### Targeting Examples - -```bash -# All Docker hosts -ansible docker_nodes -m ping - -# Only physical servers -ansible physical_servers -m command -a "lsblk" - -# Just the control plane -ansible control_plane -m setup - -# NFS clients only -ansible nfs_clients -m shell -a "df -h /mnt/appdata" -``` - ---- - -## Files & Directories - -``` -ansible/ -├── ansible.cfg # Main configuration -├── inventory/ -│ ├── hosts.ini # Node definitions -│ └── host_vars/ # Per-host variables -├── group_vars/ -│ └── all.yml # Global variables -├── vault/ -│ └── .vault_pass # Vault password (gitignored) -├── playbooks/ -│ ├── test-connection.yml # Basic connectivity test -│ ├── gather-node-facts.yml # System discovery -│ ├── quick-facts.yml # Rapid diagnostics -│ ├── onboard-nodes.yml # Node initialization -│ └── onboard-proxmox.yml # Proxmox setup -├── roles/ -│ └── 
proxmox_post_install/ # Custom role -└── validate-environment.sh # Health check script -``` - ---- - -## Configuration Highlights - -### ansible.cfg Key Settings - -- **Inventory:** `inventory/hosts.ini` -- **SSH Key:** `~/.ssh/id_ed25519` -- **Host Key Checking:** Disabled (homelab trusted network) -- **Vault Password:** `vault/.vault_pass` -- **Forks:** 5 (parallel execution limit) -- **Fact Caching:** Enabled (JSON, 1 hour TTL) -- **Privilege Escalation:** sudo (passwordless) - -### Security Configuration - -- ED25519 SSH keys (modern, fast, secure) -- Ansible Vault for secrets (AES256 encryption) -- Vault password file permissions: 600 (owner read/write only) -- No passwords in inventory files -- StrictHostKeyChecking disabled (acceptable for isolated homelab) - ---- - -## Common Workflows - -### Adding a New Managed Node - -1. **Generate and copy SSH key:** -```bash -ssh-copy-id -i ~/.ssh/id_ed25519.pub user@new-node-ip -``` - -2. **Test connectivity:** -```bash -ssh -i ~/.ssh/id_ed25519 user@new-node-ip "hostname" -``` - -3. **Add to inventory** (`inventory/hosts.ini`): -```ini -[docker_nodes] -new_node ansible_host=10.0.0.XXX ansible_user=user -``` - -4. **Verify:** -```bash -ansible new_node -m ping -``` - -### Creating a New Playbook - -1. **Create file in playbooks/ directory:** -```yaml ---- -- name: My New Playbook - hosts: all - gather_facts: true - - tasks: - - name: Example task - ansible.builtin.debug: - msg: "Hello from {{ inventory_hostname }}" -``` - -2. **Validate syntax:** -```bash -ansible-playbook playbooks/my-playbook.yml --syntax-check -``` - -3. **Lint the playbook:** -```bash -ansible-lint playbooks/my-playbook.yml -``` - -4. **Test in check mode:** -```bash -ansible-playbook playbooks/my-playbook.yml --check -``` - -5. 
**Execute:** -```bash -ansible-playbook playbooks/my-playbook.yml -``` - -### Troubleshooting Connection Issues - -```bash -# Verbose SSH debugging -ansible node_name -m ping -vvvv - -# Test raw connectivity (bypasses Python) -ansible node_name -m raw -a "echo test" - -# Check SSH key authentication -ssh -vvv -i ~/.ssh/id_ed25519 user@node-ip - -# Verify inventory parsing -ansible-inventory --host node_name - -# Test privilege escalation -ansible node_name -m command -a "whoami" --become -vv -``` - ---- - -## Integration Points - -### VSCode Remote Development - -1. Open VSCode -2. Install "Remote - SSH" extension -3. Connect to Watchtower: `chester@10.0.0.200` -4. Open folder: `/home/chester/homelab/ansible` -5. Install extensions on remote: - - Ansible (by Red Hat) - - YAML (by Red Hat) - -### Git Workflow - -```bash -# Check status -git status - -# Add changed playbooks -git add playbooks/ - -# Commit with descriptive message -git commit -m "feat(ansible): add system maintenance playbook" - -# Push to Gitea -git push origin main -``` - ---- - -## Performance Tips - -- **Use fact caching** (already enabled) to avoid re-gathering system info -- **Limit playbook scope** with `--limit` flag when testing -- **Increase forks** for large inventories (currently 5) -- **Use pipelining** (already enabled) for faster SSH operations -- **Disable gathering** for simple tasks: `gather_facts: false` - ---- - -## Security Best Practices - -✅ **Already Implemented:** -- SSH key-based authentication (no passwords) -- Ansible Vault for sensitive data -- Vault password file secured (600 permissions) -- Passwordless sudo configured safely - -⚠️ **Recommendations:** -- Rotate SSH keys annually -- Audit Vault contents quarterly -- Review ansible.log for suspicious activity -- Limit Ansible user privileges where possible - ---- - -## Next Steps - -1. **Create comprehensive validation playbook** (validate-connectivity.yml) -2. **Build Docker stack deployment role** -3. 
**Implement automated system updates playbook** -4. **Set up Molecule for role testing** -5. **Integrate with Komodo for CI/CD automation** - ---- - -**Document Version:** 1.0 -**Last Updated:** April 13, 2026 -**Maintained By:** FrankGPT (Ansible Architect) diff --git a/ansible/README.md b/ansible/README.md deleted file mode 100644 index 3538290..0000000 --- a/ansible/README.md +++ /dev/null @@ -1,47 +0,0 @@ -# Ansible Infrastructure Automation - -This directory contains the Ansible automation framework for homelab infrastructure management. - -## 📁 Directory Structure - -``` -ansible/ -├── .ansible-lint # Linting rules (enforces safety & best practices) -├── .ansible-standards.md # Architectural standards and conventions -├── DEVELOPMENT-SETUP.md # Control node setup requirements -├── README.md # This file -└── archive/ # ⚠️ REFERENCE ONLY - Legacy implementation -``` - ---- - -## ⚠️ Important: Archive Directory - -**The `archive/` directory contains the previous iteration of the Ansible infrastructure.** - -- **Purpose:** Reference and migration source only -- **Status:** Not actively maintained -- **Action:** Do NOT execute playbooks or use configurations directly from `archive/` -- **Migration Status:** In progress - components are being refactored into the new structure - ---- - -## 🚀 Getting Started - -### Prerequisites - -Refer to [DEVELOPMENT-SETUP.md](DEVELOPMENT-SETUP.md) for: -- Required CLI tools (ansible-core, ansible-lint, proxmoxer) -- VSCode extensions (recommended for development) -- SSH key generation and vault configuration - -### Control Node Setup - -Watchtower (10.0.0.200) is the designated Ansible control node for this lab. 
- ---- - -## 📚 Additional Resources - -- **Standards:** See [.ansible-standards.md](.ansible-standards.md) for architectural requirements -- **Legacy Documentation:** Available in `archive/documentation/` for historical reference diff --git a/ansible/ansible.cfg b/ansible/ansible.cfg deleted file mode 100644 index a516b11..0000000 --- a/ansible/ansible.cfg +++ /dev/null @@ -1,36 +0,0 @@ -[defaults] -# Inventory configuration -inventory = inventory/hosts.ini -host_key_checking = False -deprecation_warnings = False -interpreter_python = auto_silent - -# Paths (relative to this ansible/ directory) -roles_path = ./roles:~/.ansible/roles:/usr/share/ansible/roles - -# Vault configuration -vault_password_file = vault/.vault_pass - -# Performance tuning -forks = 5 -timeout = 30 -gathering = smart -fact_caching = jsonfile -fact_caching_connection = /tmp/ansible_facts -fact_caching_timeout = 3600 - -# Callbacks for better output -callbacks_enabled = timer, profile_tasks - -# Logging -log_path = ansible.log - -[privilege_escalation] -become = False -become_method = sudo -become_user = root -become_ask_pass = False - -[ssh_connection] -ssh_args = -o ControlMaster=auto -o ControlPersist=60s -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -pipelining = True diff --git a/ansible/archive/.ansible-lint b/ansible/archive/.ansible-lint deleted file mode 100644 index 6925153..0000000 --- a/ansible/archive/.ansible-lint +++ /dev/null @@ -1,49 +0,0 @@ ---- -# Ansible Lint Configuration -# Enforces quality standards for playbooks and roles -# Documentation: https://ansible-lint.readthedocs.io/ - -# Exclude paths from linting -exclude_paths: - - .cache/ - - .git/ - - outputs/ - - scripts/ - -# Enable offline mode (do not check for latest Ansible version) -offline: true - -# Skip specific rules (with justification) -skip_list: - - 'yaml[line-length]' # Advisory: Many legitimate cases exceed 160 chars - - 'name[casing]' # Advisory: Emoji and stylistic choices in task names - 
# NOTE: no-changed-when removed from skip_list — now enforced as a warning - # (warn_list below). Stack playbooks and the swarm_stack_deploy role MUST - # be fully compliant. Bootstrap playbooks with legitimate raw/command use - # may suppress per-task with: # noqa: no-changed-when - - 'command-instead-of-module' # Advisory: Some Proxmox/specialized commands lack modules - - 'var-naming[no-role-prefix]' # Advisory: swarm_stack_deploy intentionally exposes a - # short 'stack_*' public API namespace. Renaming to 'swarm_stack_deploy_*' would be a - # breaking change for all callers. Suppress globally; revisit in Phase 3 refactor. - -# Warn on specific rules (advisory, not blocking) -warn_list: - - 'experimental' # Flag new/experimental syntax for review - - 'jinja[spacing]' # Encourage spacing in templates - - 'risky-file-permissions' # Flag overly permissive file modes - - 'no-changed-when' # Promoted from skip: visible on all command/shell tasks missing changed_when - # NEXT PHASE: move to blocking by removing from warn_list entirely - -# Additional quality checks -kinds: - - playbook: "playbooks/**/*.yml" - - tasks: "roles/*/tasks/**/*.yml" - - vars: "group_vars/**/*.yml" - - defaults: "roles/*/defaults/**/*.yml" - - handlers: "roles/*/handlers/**/*.yml" - -# Profile to use (min, basic, moderate, safety, shared, production) -profile: moderate - -# Treat warnings as errors (disable initially until baseline is clean) -# strict: false diff --git a/ansible/archive/.gitignore b/ansible/archive/.gitignore deleted file mode 100644 index f024627..0000000 --- a/ansible/archive/.gitignore +++ /dev/null @@ -1,27 +0,0 @@ -# Python Virtual Environment -.venv/ -venv/ -__pycache__/ -*.pyc - -# Ansible Runtime -*.retry -.ansible/ - -# IDE -.vscode/ -.idea/ - -# Secrets (never commit!) 
-group_vars/*/vault.yml -host_vars/*/vault.yml -*.vault -.vault_pass -outputs/**/containers.yml -outputs/**/env_keys/ -outputs/**/compose_files/ - -# Temporary Files -*.log -*.tmp -.DS_Store diff --git a/ansible/archive/.vault_pass b/ansible/archive/.vault_pass deleted file mode 100644 index 61a2038..0000000 --- a/ansible/archive/.vault_pass +++ /dev/null @@ -1 +0,0 @@ -Promci*1 diff --git a/ansible/archive/.yamllint b/ansible/archive/.yamllint deleted file mode 100644 index 26c4331..0000000 --- a/ansible/archive/.yamllint +++ /dev/null @@ -1,37 +0,0 @@ ---- -# yamllint configuration for Ansible project -# Aligned with .ansible-lint skip-list rationale. -# 'yaml[line-length]' is advisory: Jinja2 templates and Traefik labels -# routinely exceed 80 chars and wrapping them reduces readability. -# -# Rules below also satisfy ansible-lint's required yamllint constraints: -# comments.min-spaces-from-content: 1 -# comments-indentation: false -# braces.max-spaces-inside: 1 -# octal-values.forbid-implicit-octal: true -# octal-values.forbid-explicit-octal: true - -extends: default - -rules: - # Allow up to 160 chars — matches the rationale in .ansible-lint: - # "Many legitimate cases exceed 160 chars" (Traefik labels, Jinja2 expressions) - line-length: - max: 160 - level: warning - - # Docker Compose / Swarm stack files do not use YAML document start markers. - # Ansible playbooks do. Make this a warning rather than an error so stack - # templates are not penalised while playbooks are still encouraged to use ---. 
- document-start: - level: warning - - # Required by ansible-lint compatibility rules: - comments: - min-spaces-from-content: 1 - comments-indentation: false - braces: - max-spaces-inside: 1 - octal-values: - forbid-implicit-octal: true - forbid-explicit-octal: true diff --git a/ansible/archive/ansible.cfg b/ansible/archive/ansible.cfg deleted file mode 100644 index a82ecb3..0000000 --- a/ansible/archive/ansible.cfg +++ /dev/null @@ -1,12 +0,0 @@ -[defaults] -inventory = inventory/hosts.ini -host_key_checking = True -deprecation_warnings = False -interpreter_python = auto_silent -vault_password_file = .vault_pass - -# Paths (relative to this ansible/ directory) -roles_path = ./roles - -# Show task timing and profiling -callbacks_enabled = timer, profile_tasks diff --git a/ansible/archive/documentation/README.md b/ansible/archive/documentation/README.md deleted file mode 100644 index bbe7c04..0000000 --- a/ansible/archive/documentation/README.md +++ /dev/null @@ -1,70 +0,0 @@ -# Ansible Documentation - -This folder contains **Ansible-specific** technical documentation for the homelab automation framework. 
- -## Documentation Organization - -The homelab uses a **domain-based separation** for documentation: - -### Ansible-Specific Documentation (This Folder) - -Documentation about **how Ansible works** in this homelab: - -- **[ansible-knowledge/](ansible-knowledge/)** — Ansible syntax, YAML/Jinja2 reference, technical constraints -- **[playbooks/](playbooks/)** — Operational guides for running specific playbooks -- **[playbooks/README.md](playbooks/README.md)** — Playbook runbook index, including Watchtower monitoring onboarding and self-healing -- **[standards/ansible-quality-gates.md](standards/ansible-quality-gates.md)** — Ansible linting rules, security checklist, review workflow - -### Homelab-Wide Documentation (Root `/documentation/`) - -Documentation about **what the homelab allows** and architectural decisions: - -- **[/documentation/architecture/](../../documentation/architecture/)** — Architectural contracts (control-plane, compute-plane, networking, storage, access-identity) -- **[/documentation/standards/](../../documentation/standards/)** — Homelab-wide standards (naming conventions, environment constraints, architecture decisions) -- **[/documentation/policies/](../../documentation/policies/)** — Operational policies (networking policy, etc.) -- **[/documentation/handover.md](../../documentation/handover.md)** — Primary project handover document - -## Quick Reference - -### When Troubleshooting Ansible Issues - -1. **Syntax errors?** → [ansible-knowledge/ansible-syntax.md](ansible-knowledge/ansible-syntax.md) -2. **Playbook not working?** → [playbooks/README.md](playbooks/README.md) for operational guides -3. **Monitoring stack onboarding?** → [playbooks/watchtower-monitoring-onboarding.md](playbooks/watchtower-monitoring-onboarding.md) -4. **Linting failures?** → [standards/ansible-quality-gates.md](standards/ansible-quality-gates.md) - -### When Designing Infrastructure - -1. 
**What services can run where?** → [/documentation/architecture/compute-plane.md](../../documentation/architecture/compute-plane.md) -2. **Network topology?** → [/documentation/architecture/networking.md](../../documentation/architecture/networking.md) -3. **Storage architecture?** → [/documentation/architecture/storage.md](../../documentation/architecture/storage.md) -4. **Naming conventions?** → [/documentation/standards/naming-conventions.md](../../documentation/standards/naming-conventions.md) - -## Files in This Folder - -```text -ansible/documentation/ -├── README.md # You are here -├── ansible-knowledge/ # Ansible syntax and technical reference -│ └── ansible-syntax.md -├── playbooks/ # Operational guides for playbooks -│ ├── README.md -│ ├── manage_docker_environment.md -│ ├── mount_nfs_shares.md -│ ├── onboard_new_host.md -│ ├── onboard-ansible-secrets.md -│ └── watchtower-monitoring-onboarding.md -├── reports/ # Analysis and audit reports -│ └── prompt-analysis-2026-01-09.md -└── standards/ # Ansible-specific standards - └── ansible-quality-gates.md -``` - -## Contributing - -When adding new documentation: - -- **Ansible-specific content** (syntax, modules, playbook operations) → Add to this folder -- **Homelab-wide content** (architecture, contracts, policies) → Add to `/documentation/` at the repository root - -If unsure, ask: "Is this about how Ansible works, or about what the homelab architecture allows?" diff --git a/ansible/archive/documentation/ansible-knowledge/ansible-syntax.md b/ansible/archive/documentation/ansible-knowledge/ansible-syntax.md deleted file mode 100644 index a329b79..0000000 --- a/ansible/archive/documentation/ansible-knowledge/ansible-syntax.md +++ /dev/null @@ -1,86 +0,0 @@ -# Ansible Syntax Documentation - -## 1. 
Overview - -Ansible syntax defines the formal structure and permitted constructs for authoring Ansible playbooks, roles, tasks, and related configuration files. This document is the canonical reference for Ansible syntax. It supersedes all other interpretations and is immutable. - -## 2. Syntax - -### 2.1 Formal Rules - -- Ansible configuration files are written in YAML format. All files must conform to YAML 1.2 specification. -- Indentation is strictly enforced. Only spaces are permitted; tabs are prohibited. -- Key-value pairs must be separated by a colon and a space (`key: value`). -- Lists are denoted by a hyphen followed by a space (`- item`). -- Boolean values must be expressed as `true` or `false` (lowercase, unquoted). -- Strings may be unquoted or quoted using single (`'`) or double (`"`) quotes. Quoting is required if the string contains special characters, leading/trailing whitespace, or YAML-reserved words. -- Comments begin with a hash (`#`) and are ignored by the parser. -- Playbooks must begin with a list of plays. Each play is a YAML dictionary. -- Each play must define at minimum the `hosts` key. -- Tasks within plays are defined under the `tasks` key as a list. -- Modules are invoked as dictionary keys within a task, with module arguments as subkeys. -- Variable interpolation uses the Jinja2 syntax: `{{ variable_name }}`. -- Block constructs (`block`, `rescue`, `always`) must be defined as lists under their respective keys. -- Conditionals use the `when` key with a valid expression. -- Loops use the `loop` or legacy `with_*` constructs. -- Roles are included using the `roles` key as a list. -- Handlers are defined under the `handlers` key as a list. -- Tags are assigned using the `tags` key as a list. - -### 2.2 Constraints - -- All YAML files must be valid and parseable; syntax errors result in execution failure. -- Indentation must be consistent throughout the file; mixing spaces and tabs is strictly prohibited. 
-- Dictionary keys must be unique within their scope. -- Reserved words (e.g., `hosts`, `tasks`, `vars`, `roles`, `handlers`, `tags`) must not be used as variable names. -- Variable names must begin with a letter and may contain letters, numbers, and underscores only. -- Jinja2 expressions must be syntactically valid and properly closed. -- Only supported modules and plugins may be invoked; unknown modules result in failure. -- All constructs must be defined in the correct context (e.g., `tasks` only within plays or roles). -- File extensions: - - Playbooks: `.yml` or `.yaml` - - Inventory: `.ini`, `.yml`, `.yaml` - - Variable files: `.yml`, `.yaml` -- All files must use UTF-8 encoding. - -### 2.3 Valid and Invalid Constructs - -- Valid: - - Properly indented YAML with correct key-value structure. - - Use of supported Ansible keywords and modules. - - Jinja2 variable interpolation within strings. -- Invalid: - - Use of tabs for indentation. - - Duplicate keys within the same dictionary. - - Unclosed or malformed Jinja2 expressions. - - Use of unsupported or misspelled modules. - - Mixing YAML and JSON syntax within the same file. - -## 3. Best Practices - -### 3.1 Required Practices - -- Use consistent two-space indentation for all YAML files. -- Explicitly quote strings containing special characters or reserved words. -- Define all variables in dedicated variable files or under the `vars` key. -- Use descriptive names for plays, tasks, and variables. -- Validate YAML syntax before execution. - -### 3.2 Prohibited Practices - -- Do not use tabs for indentation. -- Do not use reserved Ansible keywords as variable names. -- Do not mix YAML and JSON syntax. -- Do not define duplicate keys within the same dictionary. - -### 3.3 Rationale - -- Consistent indentation and quoting prevent parsing errors and ensure predictable execution. -- Reserved keywords are protected to avoid namespace collisions and undefined behavior. - -## 4. 
Non-Goals / Explicit Exclusions - -- This document does not cover Ansible module functionality, plugin development, or execution semantics. -- This document does not provide tutorials, usage examples, or workflow guidance. -- This document does not address inventory file structure beyond syntax constraints. -- Any information not explicitly stated herein is undefined and not governed by this document. diff --git a/ansible/archive/documentation/contracts/AccessIdentity.md b/ansible/archive/documentation/contracts/AccessIdentity.md deleted file mode 100644 index ddbc373..0000000 --- a/ansible/archive/documentation/contracts/AccessIdentity.md +++ /dev/null @@ -1,56 +0,0 @@ -## ✅ **Point 5 – Access & Identity – FINAL** - -### **Role** - -* Defines how operators, admins, and services authenticate and access the homelab -* Covers remote access, SSO/identity, password/MFA policy, and onboarding/offboarding - ---- - -### **Remote access methods** - -* Supported: Omada VPN, Tailscale, VS Code Tunnel, SSH (as needed) -* Operator-only: all remote access methods -* End-user access: none (homelab is operator-managed only) -* Public-facing services: must be authenticated and proxied; no direct management UI exposure - ---- - -### **Identity & SSO** - -* Authentik is deployed and serves as the centralized SSO/identity provider for the homelab -* Operator/admin accounts are provisioned and managed via Authentik where possible; legacy per-service accounts should be migrated to SSO -* All new services must integrate with Authentik for authentication if supported -* Periodically review and update SSO integrations to ensure coverage and security - ---- - -### **Passwords, MFA, and secrets** - -* All admin/operator accounts must use strong, unique passwords -* MFA is required wherever supported (VPN, SSO, cloud, etc.) 
-* Credentials and secrets must be stored in a secure vault (e.g., Bitwarden, 1Password) - ---- - -### **Operational constraints / "never do this"** - -* Never expose management UIs (Proxmox, Watchtower, NAS, etc.) to the public internet -* Never share admin/operator credentials -* Never disable MFA on critical services -* All access changes must be documented and reviewed - ---- - -### **Onboarding/offboarding & change model** - -* Onboarding: create accounts, set up VPN/Tailscale, grant secrets vault access -* Offboarding: disable accounts, rotate credentials, audit access -* Changes to access policy require contract update - ---- - -### **Further considerations** - -* Exact VPN/Tailscale/SSO setup details, onboarding checklists, and secrets management procedures will live in a separate, detailed access/identity doc (to be referenced here) -* Access & identity contract should be reviewed at least annually or after major personnel/infra changes diff --git a/ansible/archive/documentation/contracts/ComputePlane.md b/ansible/archive/documentation/contracts/ComputePlane.md deleted file mode 100644 index 064b50c..0000000 --- a/ansible/archive/documentation/contracts/ComputePlane.md +++ /dev/null @@ -1,96 +0,0 @@ -## ✅ **Point 2 – Compute Plane (OptiPlex Proxmox Cluster) – FINAL** - -### **Role** - -* Cluster that runs all Docker Swarm workloads -* Separate from out-of-band control (Watchtower) -* Designed to tolerate loss of one physical node without losing quorum - ---- - -### **Physical hosts** - -* 3× Dell OptiPlex Micro 7010: pve01-pve03 -* Local NVMe only; no shared storage dependency -* Hosts sized with headroom; no aggressive CPU/RAM overcommit by default - ---- - -### **Proxmox cluster** - -* 3-node Proxmox VE cluster with Corosync over LAN -* Static IPs on all hosts -* vmbr0 = primary LAN bridge; VLAN-capable but unused initially -* Proxmox HA: **off** by default (may be added later via separate design) - ---- - -### **VM layout per host** - -* Each 
OptiPlex runs exactly 2× Ubuntu Server LTS VMs: - * 1× Swarm Manager VM - * 1× Swarm Worker VM -* No additional "misc" VMs on these hosts without an explicit architecture update - --- - -### **Swarm roles and placement** - -* Total: 3 managers, 3 workers (one of each per host) -* Managers hold Swarm Raft state and scheduling decisions -* Workers run application workloads -* Managers are schedulable only for light/infra tasks; no heavy or noisy apps -* Node labels and placement constraints enforce "apps → workers" by default - --- - -### **Resource allocation (initial)** - -* **Manager VM** - * 2 vCPU - * 4–6 GB RAM - * ~40 GB disk -* **Worker VM** - * 4–6 vCPU - * 16–24 GB RAM - * ≥100 GB disk - --- - -### **Storage model** - -* VM disks: local Proxmox storage (ZFS or LVM-thin), no shared VM disks -* Container data: bind-mounts inside VMs -* Swarm control plane and core workloads do **not** depend on shared storage -* Production data path: - * Primary: TerraMaster - * Backup: TerraMaster → Synology via rsync - * Offsite: Synology → cloud - --- - -### **Networking assumptions** - -* All Proxmox hosts and VMs attach to primary LAN via vmbr0 -* Compute plane runs on a flat LAN at baseline -* Detailed VLAN and IP design will live in a separate networking architecture document that this spec can reference - --- - -### **Operational constraints ("never do this")** - -* Do **not** run Docker workloads or Swarm nodes directly on Proxmox hosts -* Do **not** run heavy or stateful application stacks on manager VMs -* Do **not** introduce shared storage as a hard dependency for Swarm or cluster boot -* Do **not** use storage appliances (TerraMaster, Synology, etc.)
as Swarm managers or workers - --- - -### **Expansion and change model** - -* To add compute capacity: - * Add a new OptiPlex node to the Proxmox cluster - * Create at least one new Swarm Worker VM on that host - * Join the VM to Swarm with standard labels and constraints - * Gradually rebalance workloads; no redesign of existing nodes required -* Any change that alters manager count, enables Proxmox HA, or significantly changes storage/networking models requires an explicit architecture review and doc update diff --git a/ansible/archive/documentation/contracts/ControlPlane.md b/ansible/archive/documentation/contracts/ControlPlane.md deleted file mode 100644 index 6132fa4..0000000 --- a/ansible/archive/documentation/contracts/ControlPlane.md +++ /dev/null @@ -1,50 +0,0 @@ -## ✅ **Point 1 – Control Plane (“Watchtower”) – FINAL** - -### **Node** - -* **Raspberry Pi 5** -* OS: Raspberry Pi OS Lite (64-bit) - -### **Purpose** - -* Out-of-band control -* Automation authority -* Monitoring vantage point -* Recovery access when everything else is down - --- - -### **Allowed services (explicit)** - -* VS Code Tunnel -* Ansible controller -* Tailscale (always-on) -* **Uptime Kuma** - - * Single container - * Bound to Tailscale IP only - * No reverse proxy - * No public ports - * Outbound alerts only (email / Discord / etc.)
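The "bound to Tailscale IP only" and "no public ports" rules for Uptime Kuma could be checked before the container starts, roughly as sketched here. This assumes Tailscale node addresses fall inside `100.64.0.0/10` and that Kuma listens on port `3001`. The helper names `is_tailscale_ip` and `kuma_publish_arg` are hypothetical and are not part of this repository.

```bash
# Sketch only: helper names are illustrative, not taken from the repo.
# Assumption: Tailscale IPv4 addresses come from 100.64.0.0/10
# (100.64.0.0 through 100.127.255.255).

# is_tailscale_ip ADDR -> exit 0 if ADDR is a dotted-quad in 100.64.0.0/10.
is_tailscale_ip() {
  # Reject anything that is not digits and single dots.
  case "$1" in
    *[!0-9.]*|*..*|.*|*.) return 1 ;;
  esac
  # Split the address on dots into positional parameters.
  old_ifs=$IFS; IFS=.
  set -- $1
  IFS=$old_ifs
  [ "$#" -eq 4 ] || return 1
  for octet in "$@"; do
    [ "${#octet}" -le 3 ] && [ "$octet" -le 255 ] || return 1
  done
  [ "$1" -eq 100 ] && [ "$2" -ge 64 ] && [ "$2" -le 127 ]
}

# kuma_publish_arg ADDR -> prints a docker "-p" value bound to ADDR only.
# Port 3001 is assumed to be the Uptime Kuma listen port.
kuma_publish_arg() {
  if ! is_tailscale_ip "$1"; then
    echo "refusing to bind Uptime Kuma to non-Tailscale address: $1" >&2
    return 1
  fi
  printf '%s:3001:3001\n' "$1"
}
```

On Watchtower the address would likely come from `tailscale ip -4`, and the printed value would be passed to `docker run -p`. A wildcard bind such as `0.0.0.0` is rejected, which matches the contract's intent.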
- -### **Explicit exclusions** - -* No Traefik -* No Authentik -* No Swarm membership -* No shared storage -* No stateful apps beyond Kuma’s local data - -### **Security posture** - -* SSH key-only -* Non-root admin -* Firewall: SSH + Tailscale -* Consider SD → NAS image backups - -### **Operational contract** - -* If this node is down: changes pause, nothing breaks -* If everything else is down: this node is how you recover - --- diff --git a/ansible/archive/documentation/contracts/Handover-AnsibleEngineer.md b/ansible/archive/documentation/contracts/Handover-AnsibleEngineer.md deleted file mode 100644 index c245e94..0000000 --- a/ansible/archive/documentation/contracts/Handover-AnsibleEngineer.md +++ /dev/null @@ -1,55 +0,0 @@ -# Homelab Ansible Handover – v2 Architecture - -## Purpose - -This document summarizes the current homelab architecture and operational contracts. It is intended as a handover for an Ansible engineer to begin developing and maintaining infrastructure automation playbooks. - --- - -## Architecture Overview - -- **Control Plane:** Raspberry Pi 5 (“Watchtower”) – out-of-band management node. Runs Ansible controller, VS Code Tunnel, Tailscale, and Uptime Kuma. No production workloads or reverse proxies. -- **Compute Plane:** 3× Dell OptiPlex Micro 7010 running Proxmox. Each host runs: - - 1× Swarm Manager VM (control, light infra only) - - 1× Swarm Worker VM (all app workloads) -- **Networking:** Flat LAN (`10.0.0.0/24`), static IPs for infra, IoT/guest VLANs segregated. Future VLAN segmentation planned. -- **Storage:** TerraMaster (primary data), Synology (backup, cloud sync). Rsync and cloud sync jobs run daily. -- **Access & Identity:** Authentik SSO for operator/admin accounts. Remote access via Omada VPN, Tailscale, VS Code Tunnel. MFA and password vault required. - --- - -## Playbook Priorities & Expectations - -1. **Idempotency:** All playbooks must be safe to run repeatedly and should not cause drift or break contracts. -2.
**Contracts:** Reference the v2 contracts in `architecture/v2/contracts/` for allowed/forbidden services, node roles, and operational constraints. -3. **Inventory:** Maintain a clear, up-to-date inventory (hosts, groups, roles) reflecting the contracts. -4. **Separation of Concerns:** - - Control plane (Watchtower) is for automation, monitoring, and recovery only. - - Compute plane (Proxmox VMs) runs all application workloads. - - Never deploy workloads or Swarm nodes directly on Proxmox hosts or NAS devices. -5. **Access:** Use Authentik SSO for all supported services. Document and automate onboarding/offboarding where possible. -6. **Backups:** Automate and verify backup flows (TerraMaster → Synology → cloud). Never skip scheduled backups. -7. **Security:** Never expose management UIs to the public internet. Enforce MFA and strong password policies. - --- - -## Immediate Playbook Targets - -- Proxmox host and VM provisioning (with static IPs, labels, and roles) -- Docker Swarm cluster setup and node role enforcement -- NAS configuration and backup job automation -- Authentik SSO integration for new services -- Monitoring/alerting setup (Uptime Kuma, notifications) -- Access onboarding/offboarding automation - --- - -## Reference - -- Full contracts: `architecture/v2/contracts/` -- Planning docs: `architecture/v2/plans/` -- README: `architecture/v2/README.md` - --- - -**Contact the homelab owner for clarifications or to propose contract updates before making architectural changes.** diff --git a/ansible/archive/documentation/contracts/Networking.md b/ansible/archive/documentation/contracts/Networking.md deleted file mode 100644 index 672d356..0000000 --- a/ansible/archive/documentation/contracts/Networking.md +++ /dev/null @@ -1,69 +0,0 @@ -## ✅ **Point 3 – Networking – FINAL** - -### **Role** - -* Defines how all homelab components (control, compute, storage, users) connect and communicate -* Baseline: single-site, flat LAN for all core infra, with
best-practice VLANs and segmentation as future upgrades - ---- - -### **Baseline LAN** - -* Primary LAN: `10.0.0.0/24` (gateway: `10.0.0.2`) -* DHCP range: `10.0.0.50–10.0.0.150` -* Static infra: `.2–.10` (infra), `.10–.14` (Proxmox), `.200+` (homelab), `.249` (Synology), `.250` (TerraMaster) -* Key static IPs: - * Watchtower: `10.0.0.200` - * Proxmox hosts: `10.0.0.10–.14` - * Synology: `10.0.0.249` - * TerraMaster: `10.0.0.250` -* All core infra and homelab services live in the "main" VLAN -* IoT is segregated; guest WiFi VLAN exists but is unused - ---- - -### **Service exposure & remote access** - -* Most services are reverse-proxied via Traefik and exposed to the internet -* Tailscale is used for network ingress, not direct service exposure -* Operator remote access: Omada VPN, Tailscale, VS Code Tunnel; SSH/terminal access can be added as needed -* Management UIs (Proxmox, Watchtower, NAS) are not intentionally public, but most services are proxied - ---- - -### **Interconnection & segmentation** - -* Watchtower can reach all Proxmox hosts, Synology, and TerraMaster directly (no firewall blocks) -* Homelab is entirely in the "main" VLAN; IoT is isolated; guest VLAN is unused -* Segmentation exists for IoT, but not for homelab/infra yet; setup should be reviewed periodically - ---- - -### **Future VLAN model (intent)** - -* Follow best practices for small networks: - * mgmt: hypervisors, switches, Watchtower - * workloads: Swarm worker VMs, app traffic - * storage: NAS traffic - * users/guests: client devices -* All VLANs must be isolated except via explicit firewall rules -* Review and update segmentation as needs evolve - ---- - -### **Operational constraints / "never do this"** - -* Never bridge production and lab VLANs -* Never expose management VLAN or core infra directly to the internet -* Never allow IoT VLAN to reach core infra or management -* Never mix guest and production traffic without a firewall -* All changes to VLANs, firewall, or router config 
must be deliberate and documented - --- - -### **Further considerations** - -* Exact VLAN IDs, IP ranges, DHCP/DNS, and firewall rules will live in a separate, detailed networking doc (to be referenced here) -* Networking is single-site only; future multi-site/remote backup will require explicit design -* Router/firewall implementation details (e.g., Omada, OPNsense, UniFi) will be documented separately; this contract is vendor-neutral -* Review this contract and underlying network setup at least annually or after major infra changes diff --git a/ansible/archive/documentation/contracts/Storage.md b/ansible/archive/documentation/contracts/Storage.md deleted file mode 100644 index 70d0bb7..0000000 --- a/ansible/archive/documentation/contracts/Storage.md +++ /dev/null @@ -1,53 +0,0 @@ -## ✅ **Point 4 – Storage – FINAL** - -### **Role** - -* Defines how production and backup data is stored, protected, and accessed in the homelab -* Focuses on NAS devices (TerraMaster, Synology), backup flows, and operational rules - --- - -### **NAS device roles** - -* **TerraMaster**: primary production data store -* **Synology**: backup target for TerraMaster, staging for offsite/cloud -* Both: never run compute workloads or join Swarm - --- - -### **Data flows** - -* Production data written to TerraMaster -* Rsync from TerraMaster to Synology runs multiple times daily (staged for noon, repeats until 11pm) -* Synology uploads to cloud via daily cloud sync task -* VM/container data: backed up via app-level exports or VM snapshots (optional/TBD) - --- - -### **Backup policy** - -* Minimum: daily local backup (TerraMaster → Synology), daily offsite (Synology → cloud) -* Retention: at least 30 days for critical data -* Verification: periodic restore tests (cadence TBD) - --- - -### **Operational constraints / "never do this"** - -* Never run Docker/Swarm workloads on NAS -* Never use NAS as a dependency for Swarm control-plane health -* Never skip scheduled backups without
explicit, documented exception - ---- - -### **Expansion and change model** - -* Add new storage only by explicit design update -* Changes to backup cadence, retention, or offsite policy require contract update - ---- - -### **Further considerations** - -* Exact backup scripts, schedules, and cloud provider details will live in a separate, detailed storage/backup doc (to be referenced here) -* Storage contract should be reviewed at least annually or after major infra changes diff --git a/ansible/archive/documentation/playbooks/README.md b/ansible/archive/documentation/playbooks/README.md deleted file mode 100644 index 2ef1141..0000000 --- a/ansible/archive/documentation/playbooks/README.md +++ /dev/null @@ -1,19 +0,0 @@ -# Playbook operation guides - -This folder contains operator-facing guides for playbook execution. - -## Available runbooks - -- [Authentik deployment checklist](deploy-authentik.md) -- [Manage Docker environment](manage_docker_environment.md) -- [Mount NFS shares](mount_nfs_shares.md) -- [Onboard ansible secrets](onboard-ansible-secrets.md) -- [Onboard non-Proxmox host (new + existing)](onboard_new_host.md) -- [Watchtower monitoring onboarding and self-healing](watchtower-monitoring-onboarding.md) - -## Usage pattern - -1. Validate prerequisites in the runbook. -2. Run playbook commands exactly as documented. -3. Verify service health and access paths. -4. Record outcomes and follow rollback steps when needed. diff --git a/ansible/archive/documentation/playbooks/deploy-ansible-mcp-watchtower.md b/ansible/archive/documentation/playbooks/deploy-ansible-mcp-watchtower.md deleted file mode 100644 index 19238e9..0000000 --- a/ansible/archive/documentation/playbooks/deploy-ansible-mcp-watchtower.md +++ /dev/null @@ -1,137 +0,0 @@ -# Deploy Ansible MCP server on Watchtower - -## Purpose - -Deploy a custom Ansible MCP server on Watchtower so AI tools can query inventory, -validate syntax, and run allowlisted playbooks through guarded tool calls. 
- -## Scope - -- Host: `watchtower` inventory group -- Playbook: `ansible/playbooks/ai/deploy_ansible_mcp_watchtower.yml` -- Runtime path: `/opt/ansible-mcp` -- Service name: `ansible-mcp` -- State and logs: `/var/lib/ansible-mcp` - -## Features delivered - -- MCP tools: - - `health` - - `list_inventory` - - `validate_syntax` - - `run_playbook` - - `get_job_status` - - `cancel_job` -- Path guardrails for playbook execution (allowlisted directories only) -- Optional explicit playbook allowlist for high-trust execution scopes -- Write-mode guardrails: - - global write toggle - - explicit confirm gate for write actions -- Auth guardrail: - - bearer token required when `ANSIBLE_MCP_API_TOKEN` is configured -- Input guardrails: - - max `extra_vars` payload size - - blocked `extra_vars` key list -- Background run tracking with per-run logs and status records -- JSONL audit records at `/var/lib/ansible-mcp/audit/events.jsonl` - -## Prerequisites - -1. Watchtower host is reachable from control node. -2. Python 3 is installed on Watchtower. -3. Inventory contains a valid `watchtower` group. -4. Ansible control node has access to this repository at `/home/chester/homelab`. 
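The local half of the prerequisites above (repository checkout present, inventory contains a `watchtower` group) could be checked with a small preflight helper like the one sketched below. The function names `inventory_has_group` and `preflight_mcp` are hypothetical. The sketch assumes the inventory is an INI file with a plain `[watchtower]` or `[watchtower:children]` section header.

```bash
# Sketch only: function names are illustrative, not taken from the repo.

# inventory_has_group FILE GROUP -> exit 0 if FILE contains an INI section
# header for GROUP, either [GROUP] or [GROUP:children].
inventory_has_group() {
  [ -r "$1" ] || return 1
  grep -Eq "^\[$2(:children)?\][[:space:]]*$" "$1"
}

# preflight_mcp REPO_ROOT INVENTORY_RELPATH -> prints one line per failed
# check and exits non-zero if any check failed.
preflight_mcp() {
  repo="$1"; inv="$2"; status=0
  if [ ! -d "$repo" ]; then
    echo "MISSING repo checkout: $repo"
    status=1
  fi
  if ! inventory_has_group "$repo/$inv" watchtower; then
    echo "MISSING [watchtower] group in $repo/$inv"
    status=1
  fi
  return "$status"
}
```

A call such as `preflight_mcp /home/chester/homelab/ansible inventory/hosts.ini` would cover prerequisites 3 and 4. Reachability and Python on Watchtower (prerequisites 1 and 2) are not checked here. An ad-hoc `ansible watchtower -i inventory/hosts.ini -m ansible.builtin.ping` is one way to probe both, since the ping module needs a working Python interpreter on the target.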
- -## Deploy - -Run from `ansible/`: - -```bash -cd /home/chester/homelab/ansible -export ANSIBLE_MCP_API_TOKEN='set-a-strong-token-before-deploy' -ansible-playbook -i inventory/hosts.ini playbooks/ai/deploy_ansible_mcp_watchtower.yml -``` - -Validate only: - -```bash -cd /home/chester/homelab/ansible -ansible-playbook -i inventory/hosts.ini playbooks/ai/deploy_ansible_mcp_watchtower.yml --check -``` - -## Runtime configuration - -The playbook sets these environment variables in the systemd unit: - -- `ANSIBLE_MCP_REPO_ROOT=/home/chester/homelab/ansible` -- `ANSIBLE_MCP_INVENTORY=inventory/hosts.ini` -- `ANSIBLE_MCP_ALLOWED_PLAYBOOK_DIRS=playbooks` -- `ANSIBLE_MCP_ALLOWED_PLAYBOOKS=` (optional comma-separated explicit allowlist) -- `ANSIBLE_MCP_API_TOKEN=` (required for HTTP transport in current playbook) -- `ANSIBLE_MCP_ALLOW_WRITE=true` -- `ANSIBLE_MCP_REQUIRE_CONFIRM=true` -- `ANSIBLE_MCP_DEFAULT_TIMEOUT=900` -- `ANSIBLE_MCP_MAX_TIMEOUT=3600` -- `ANSIBLE_MCP_MAX_EXTRA_VARS_BYTES=16384` -- `ANSIBLE_MCP_BLOCKED_EXTRA_VARS_KEYS=ansible_password,ansible_become_password,vault_password` -- `ANSIBLE_MCP_STATE_DIR=/var/lib/ansible-mcp` -- `ANSIBLE_MCP_TRANSPORT=streamable-http` -- `ANSIBLE_MCP_HOST=0.0.0.0` -- `ANSIBLE_MCP_PORT=8449` - -## Verify - -```bash -# Service state -sudo systemctl status ansible-mcp --no-pager - -# Recent logs -sudo journalctl -u ansible-mcp -n 80 --no-pager - -# Listening port -ss -ltnp | grep 8449 -``` - -## Client connection example - -For MCP clients that support HTTP transport: - -```json -{ - "mcpServers": { - "ansible-watchtower": { - "type": "http", - "url": "http://10.0.0.200:8449/mcp", - "headers": { - "Authorization": "Bearer ${env:ANSIBLE_MCP_API_TOKEN}" - } - } - } -} -``` - -If you terminate TLS upstream (recommended), expose this endpoint through your -existing ingress and use an HTTPS URL. - -## Operational safety notes - -- Keep `ANSIBLE_MCP_REQUIRE_CONFIRM=true` in write mode. 
-- Keep `ANSIBLE_MCP_API_TOKEN` set and rotate it regularly. -- Prefer explicit `ANSIBLE_MCP_ALLOWED_PLAYBOOKS` over broad directory allowlists. -- Restrict `ANSIBLE_MCP_ALLOWED_PLAYBOOK_DIRS` to known-safe playbook roots. -- Do not grant broad filesystem access to the service user. -- Treat background run logs in `/var/lib/ansible-mcp/logs` as audit artifacts. - -## Rollback - -```bash -sudo systemctl disable --now ansible-mcp -sudo rm -f /etc/systemd/system/ansible-mcp.service -sudo systemctl daemon-reload -``` - -Optional cleanup: - -```bash -sudo rm -rf /opt/ansible-mcp /var/lib/ansible-mcp -``` diff --git a/ansible/archive/documentation/playbooks/deploy-authentik.md b/ansible/archive/documentation/playbooks/deploy-authentik.md deleted file mode 100644 index ceb2dfa..0000000 --- a/ansible/archive/documentation/playbooks/deploy-authentik.md +++ /dev/null @@ -1,606 +0,0 @@ -# Authentik deployment checklist - -## Purpose - -This runbook is the operator path for deploying, verifying, and handing off -Authentik as the homelab identity provider. - -It covers: - -- Preflight checks: secrets, Swarm state, storage, and network readiness. -- Deployment execution using the canonical Ansible playbook. -- Service convergence and health verification. -- Ingress and functional smoke tests against the live endpoint. -- Post-deploy hardening, evidence capture, and rollback guidance. -- Day-1 troubleshooting for common failure modes. - -## Scope - -- **Stack name:** `authentik` -- **Canonical playbook:** `ansible/playbooks/docker/deploy_authentik.yml` -- **Stack template:** `ansible/templates/stacks/authentik.stack.yml` -- **Target manager:** `swarm-manager-1` (`10.0.0.211`) -- **Public URL:** `https://sso.castaldifamily.com` -- **Data root:** `/mnt/homelab/apps/authentik` -- **Services deployed:** `authentik-postgres`, `authentik-redis`, `authentik-server`, `authentik-worker` - -> [!IMPORTANT] -> This stack uses **absolute bind mounts**. 
The deploy playbook requires all data -> directories to exist before deployment. If any path is missing, the preflight -> asserts will fail-safe and abort rather than bootstrap an empty installation -> over existing data. - --- - -## Deployment flow - -```mermaid -flowchart LR - preflight[Phase 1 — Preflight] --> validation[Phase 2 — Validation run] - validation --> deploy[Phase 3 — Deploy] - deploy --> convergence[Phase 4 — Convergence] - convergence --> ingress[Phase 5 — Ingress checks] - ingress --> handoff[Phase 6 — Handoff] - - classDef phase fill:#dbeafe,stroke:#3b82f6; - class preflight,validation,deploy,convergence,ingress,handoff phase -``` - --- - -## Phase 1 — Preflight checklist - -Complete all items in this phase before running any playbook command. - -### 1.1 Change window and ownership - -- [ ] Deployment owner is assigned. -- [ ] Rollback owner is assigned. -- [ ] Maintenance window is confirmed. -- [ ] No active cluster incidents in the latest Swarm audit - (`outputs/swarm_audit_*.md`). - -### 1.2 Control node readiness - -Run from the `ansible/` directory with the virtual environment active. - -```bash -# Confirm Python environment -source /home/chester/homelab/.venv/bin/activate - -# Confirm Ansible version (must be >= 2.18.0) -ansible --version - -# Confirm SSH access to all Swarm managers -ansible swarm_managers -i inventory/hosts.ini -m ping -``` - -- [ ] Ansible version is `2.18.0` or higher. -- [ ] All Swarm managers return `pong`. -- [ ] Vault password is available (`.vault_pass` file present or `ANSIBLE_VAULT_PASSWORD_FILE` set). - -### 1.3 Secrets readiness - -The deploy playbook asserts both values are defined, non-empty, and not -placeholder strings.
Verify them first: - -```bash -ansible -i inventory/hosts.ini localhost \ - -m ansible.builtin.debug \ - -a "msg={{ vault_authentik_secret_key | length }}" \ - -e "@group_vars/all.yml" \ - --vault-password-file .vault_pass -``` - -Repeat for `vault_authentik_postgres_password`. - -- [ ] `vault_authentik_secret_key` decrypts to a non-empty, non-placeholder value. -- [ ] `vault_authentik_postgres_password` decrypts to a non-empty, non-placeholder value. -- [ ] Neither value is any of: `change-me`, `changeme`, `your-random-secret`, `your-db-password`. - -### 1.4 Swarm cluster state - -```bash -# Confirm target manager is active and is control-plane -ssh chester@10.0.0.211 \ - "docker info --format '{{.Swarm.LocalNodeState}}|{{.Swarm.ControlAvailable}}'" -# Expected output: active|true - -# Confirm all managers are active -ansible swarm_managers -i inventory/hosts.ini \ - -m ansible.builtin.command \ - -a "docker info --format '{{.Swarm.LocalNodeState}}'" -``` - -- [ ] `swarm-manager-1` returns `active|true`. -- [ ] All three managers return `active`. -- [ ] No node shows `inactive`, `pending`, or `error`. - -### 1.5 External overlay network - -Authentik requires `proxy-net` to exist before stack deploy. - -```bash -ssh chester@10.0.0.211 \ - "docker network ls --filter name=proxy-net --format '{{.Name}}|{{.Driver}}|{{.Scope}}'" -# Expected: proxy-net|overlay|swarm -``` - -- [ ] `proxy-net` exists with `overlay` driver and `swarm` scope. - -> [!WARNING] -> If `proxy-net` is missing, create it before continuing: -> ```bash -> ssh chester@10.0.0.211 \ -> "docker network create --driver overlay --attachable proxy-net" -> ``` - -### 1.6 Persistent data paths - -All bind-mount paths must exist on `swarm-manager-1` **before** deploying. -The playbook will fail-safe if any are missing. 
- -```bash -ssh chester@10.0.0.211 "for d in \ - /mnt/homelab/apps/authentik \ - /mnt/homelab/apps/authentik/data \ - /mnt/homelab/apps/authentik/data/database \ - /mnt/homelab/apps/authentik/data/redis \ - /mnt/homelab/apps/authentik/data/media \ - /mnt/homelab/apps/authentik/data/config \ - /mnt/homelab/apps/authentik/data/blueprints; do - [ -d \"\$d\" ] && echo \"OK \$d\" || echo \"MISSING \$d\" -done" -``` - -- [ ] All 7 paths return `OK`. -- [ ] If any path is `MISSING`, create or restore from backup before proceeding. - -To create paths for a **fresh install** (no existing data to protect): - -```bash -ssh chester@10.0.0.211 "sudo mkdir -p \ - /mnt/homelab/apps/authentik/data/database \ - /mnt/homelab/apps/authentik/data/redis \ - /mnt/homelab/apps/authentik/data/media \ - /mnt/homelab/apps/authentik/data/config \ - /mnt/homelab/apps/authentik/data/blueprints" -``` - -> [!WARNING] -> Do not create missing paths if you are restoring an existing Authentik install. -> Restore from backup first to avoid initialising an empty database over -> pre-existing data. - --- - -## Phase 2 — Validation-only run - -Run the playbook in validation mode to confirm all asserts pass before -changing anything on the cluster. - -```bash -cd /home/chester/homelab/ansible - -ansible-playbook \ - -i inventory/hosts.ini \ - playbooks/docker/deploy_authentik.yml \ - -e "stack_validate_only=true" \ - --vault-password-file .vault_pass -``` - -- [ ] Playbook completes with `0` failed tasks. -- [ ] Secrets assertion tasks pass (no `FAILED` on assert blocks). -- [ ] Swarm manager state assertion passes. -- [ ] Data path assertions pass for all 7 required directories. - -**Stop here if any assert fails.** Diagnose using the -[Troubleshooting matrix](#troubleshooting-matrix) below, then re-run validation -before proceeding. - --- - -## Phase 3 — Deployment execution - -Run the standard deploy. All playbook output should be captured for the -evidence record.
- -```bash -cd /home/chester/homelab/ansible - -ansible-playbook \ - -i inventory/hosts.ini \ - playbooks/docker/deploy_authentik.yml \ - --vault-password-file .vault_pass \ - 2>&1 | tee ../outputs/authentik_deploy_$(date +%Y%m%dT%H%M%S).log -``` - -- [ ] Playbook completes without `FAILED` tasks. -- [ ] Deployment result block is printed confirming stack name, manager, and URL. -- [ ] Log file is saved to `outputs/` with a timestamp. - -**Expected deployment result output:** - -``` -"Authentik deployment complete." -"Stack : authentik" -"Manager : swarm-manager-1 (10.0.0.211)" -"URL : https://sso.castaldifamily.com" -"Data root : /mnt/homelab/apps/authentik" -"Services : authentik-postgres, authentik-redis, authentik-server, authentik-worker" -``` - --- - -## Phase 4 — Service convergence and health - -Verify that all four services are running, stable, and healthy. - -### 4.1 Service replica status - -```bash -ssh chester@10.0.0.211 \ - "docker service ls --filter label=com.docker.stack.namespace=authentik" -``` - -Expected replica counts: - -| Service | Expected | -| :--- | :---: | -| `authentik_authentik-postgres` | `1/1` | -| `authentik_authentik-redis` | `1/1` | -| `authentik_authentik-server` | `1/1` | -| `authentik_authentik-worker` | `1/1` | - -- [ ] All four services show `1/1` replicas. -- [ ] No service shows `0/1` or a failure count. - -### 4.2 Service placement - -All four services must be pinned to `swarm-manager-1`. - -```bash -ssh chester@10.0.0.211 \ - "docker service ps authentik_authentik-server --filter desired-state=running --format '{{.Node}} {{.CurrentState}}'" -# Expected: swarm-manager-1 Running ... -``` - -- [ ] `authentik-server` task is running on `swarm-manager-1`. -- [ ] `authentik-worker` task is running on `swarm-manager-1`. - -### 4.3 Container health checks - -```bash -# postgres health (pg_isready) -ssh chester@10.0.0.211 \ - "docker ps --filter name=authentik_authentik-postgres --format '{{.Status}}'" -# Expected: Up ...
(healthy) - -# redis health (redis-cli ping) -ssh chester@10.0.0.211 \ - "docker ps --filter name=authentik_authentik-redis --format '{{.Status}}'" -# Expected: Up ... (healthy) -``` - -- [ ] `authentik-postgres` container shows `(healthy)`. -- [ ] `authentik-redis` container shows `(healthy)`. - -### 4.4 Critical startup log checks - -```bash -# Check server startup for migration and database connectivity -ssh chester@10.0.0.211 \ - "docker service logs authentik_authentik-server --since 10m --no-task-ids 2>&1 | tail -40" - -# Check worker for job queue connectivity -ssh chester@10.0.0.211 \ - "docker service logs authentik_authentik-worker --since 10m --no-task-ids 2>&1 | tail -40" -``` - -- [ ] No `FATAL` or `ERROR` messages relating to database connection in server logs. -- [ ] No `FATAL` or `ERROR` messages relating to Redis connection in server or worker logs. -- [ ] Database migration messages complete without errors. -- [ ] No repeated container restart events (no `started 2+ times`). - -### 4.5 Resource limits in effect - -| Service | Memory limit | CPU limit | -| :--- | :---: | :---: | -| `authentik-postgres` | 1 G | 0.75 | -| `authentik-redis` | 512 M | 0.50 | -| `authentik-server` | 2 G | 1.0 | -| `authentik-worker` | 1 G | 0.75 | - -```bash -ssh chester@10.0.0.211 \ - "docker service inspect authentik_authentik-server \ - --format '{{.Spec.TaskTemplate.Resources.Limits.MemoryBytes}}'" -# Expected: 2147483648 (2 GB) -``` - -- [ ] Resource limits are present and match the table above. - --- - -## Phase 5 — Ingress and functional verification - -### 5.1 Traefik route registration - -Traefik routes are published via `traefik-kop`. Verify the route is active before -testing the public endpoint.
- -```bash -# Check Traefik router for the authentik rule -curl -fsS http://10.0.0.151:8080/api/http/routers/authentik@docker \ - | python3 -m json.tool | grep -E '"rule"|"status"' -# Expected: "rule": "Host(...sso.castaldifamily.com...)", "status": "enabled" -``` - -- [ ] Traefik router `authentik@docker` exists and is `enabled`. -- [ ] Router rule matches `Host('sso.castaldifamily.com')`. -- [ ] Middlewares include `security-headers@file` and `ratelimit-basic@file`. - -### 5.2 HTTPS endpoint reachability - -```bash -# TLS handshake and HTTP 200/302 response -curl -fsS -o /dev/null -w "%{http_code} %{ssl_verify_result}" \ - https://sso.castaldifamily.com -# Expected: 200 0 (or 302 0 for a redirect to login) -``` - -- [ ] curl returns HTTP `200` or `302`. -- [ ] `ssl_verify_result` is `0` (certificate valid). -- [ ] Response is not a Traefik 404 or 502. - -### 5.3 Login page load - -Open `https://sso.castaldifamily.com` in a browser. - -- [ ] Authentik login page loads without JavaScript errors. -- [ ] Page title includes "authentik" or "Sign in". -- [ ] No TLS certificate warning from the browser. - -### 5.4 Admin UI readiness (if initial deploy) - -Navigate to `https://sso.castaldifamily.com/if/flow/initial-setup/` - -- [ ] Initial setup flow is reachable on first-run bootstrap. -- [ ] Skip this step if the instance already existed; do not re-run initial setup - on an existing install. - --- - -## Phase 6 — Post-deploy handoff - -### 6.1 Monitoring integration - -Authentik is referenced as the SSO provider in `group_vars/all.yml`: - -```yaml -monitoring: - authentik_host: "https://sso.castaldifamily.com" -``` - -- [ ] Uptime Kuma has a monitor for `https://sso.castaldifamily.com`. -- [ ] Prometheus or health check system is alerting on `authentik_authentik-server` - replica count dropping below 1. - -### 6.2 Backup verification - -- [ ] `/mnt/homelab/apps/authentik/data/database` is included in backup scope.
-- [ ] A manual backup snapshot was taken before or immediately after deploy. -- [ ] Restore procedure is documented and tested (or explicitly deferred). - -### 6.3 Secret rotation awareness - -| Secret | Rotation procedure | -| :--- | :--- | -| `vault_authentik_secret_key` | Update vault → redeploy stack → running sessions are invalidated | -| `vault_authentik_postgres_password` | Update vault AND postgres user password → redeploy | - -- [ ] Rotation procedure is known to the deployment owner. - -### 6.4 Evidence capture - -```bash -# Save service state snapshot -ssh chester@10.0.0.211 \ - "docker service ls --filter label=com.docker.stack.namespace=authentik" \ - > ../outputs/authentik_service_snapshot_$(date +%Y%m%dT%H%M%S).txt -``` - -- [ ] Deploy log saved to `outputs/authentik_deploy_.log`. -- [ ] Service state snapshot saved to `outputs/authentik_service_snapshot_.txt`. -- [ ] Deployment timestamp and verification timestamp recorded in this checklist. - -### 6.5 Deployment sign-off - -| Field | Value | -| :--- | :--- | -| Deployment owner | | -| Deployment timestamp | | -| Verification timestamp | | -| Endpoint verified | `https://sso.castaldifamily.com` | -| Final status | ☐ GREEN — all phases passed | - --- - -## Rollback procedure - -If deployment fails or causes instability, remove the stack and preserve data. - -```bash -cd /home/chester/homelab/ansible - -ansible-playbook \ - -i inventory/hosts.ini \ - playbooks/docker/deploy_authentik.yml \ - -e "authentik_deploy_state=absent" \ - --vault-password-file .vault_pass -``` - -> [!WARNING] -> `authentik_deploy_state=absent` removes the **Swarm stack** (containers, -> services, configs) but does **not** delete the bind-mount data directories. -> Data at `/mnt/homelab/apps/authentik` is preserved for re-deploy or restore. - -- [ ] Stack removed cleanly (`docker stack ls` shows no `authentik` entry). -- [ ] Data directories still intact on `swarm-manager-1`.
-- [ ] Root cause identified before re-deploying. - ---- - -## Troubleshooting matrix - -### Validation assert fails: secrets not defined or placeholder - -**Symptom:** Playbook fails on `Assert vault_authentik_secret_key is defined` or -`Assert Authentik secrets are not placeholders`. - -**Check:** - -```bash -ansible -i inventory/hosts.ini localhost \ - -m ansible.builtin.debug \ - -a "var=vault_authentik_secret_key" \ - -e "@group_vars/all.yml" \ - --vault-password-file .vault_pass -``` - -**Fix:** Encrypt and store the correct value: - -```bash -ansible-vault encrypt_string 'YOUR-KEY' \ - --name 'vault_authentik_secret_key' \ - --vault-password-file .vault_pass -# Paste output into group_vars/vault/all.yml -``` - ---- - -### Validation assert fails: data paths missing - -**Symptom:** Playbook fails on `Assert required Authentik paths exist before deploy`. - -**Check:** - -```bash -ssh chester@10.0.0.211 "ls -la /mnt/homelab/apps/authentik/" -``` - -**Fix (fresh install only):** - -```bash -ssh chester@10.0.0.211 "sudo mkdir -p \ - /mnt/homelab/apps/authentik/data/{database,redis,media,config,blueprints}" -``` - -**Fix (existing install):** Restore from backup before creating directories. - ---- - -### Swarm assert fails: manager not active or not control plane - -**Symptom:** Playbook fails on `Assert target is an active Swarm manager`. - -**Check:** - -```bash -ssh chester@10.0.0.211 "docker info --format '{{.Swarm.LocalNodeState}}'" -``` - -**Fix:** Investigate Swarm manager health. Do not proceed until a healthy quorum -manager is the deploy target. - ---- - -### Services not converging to 1/1 - -**Symptom:** `docker service ls` shows `0/1` or a service cycles through restarts. - -**Check:** - -```bash -ssh chester@10.0.0.211 \ - "docker service ps authentik_authentik-server --no-trunc" -``` - -Look for failure reasons in the `Error` column. 
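When a service has many historical tasks, the `Error` column is easier to scan if only the failing tasks are shown. The sketch below assumes the task list is fetched with a `|`-separated `--format` string, the same delimiter style used in the preflight checks. The helper name `failed_tasks` is hypothetical.

```bash
# Sketch only: the helper name is illustrative, not taken from the repo.

# failed_tasks: reads "node|state|error" lines on stdin and prints only
# the lines whose error field is non-empty, as "node: error (state)".
failed_tasks() {
  awk -F'|' 'NF >= 3 && $3 != "" { printf "%s: %s (%s)\n", $1, $3, $2 }'
}

# Possible usage against the target manager (not executed here):
#   ssh chester@10.0.0.211 \
#     "docker service ps authentik_authentik-server --no-trunc \
#      --format '{{.Node}}|{{.CurrentState}}|{{.Error}}'" | failed_tasks
```

Tasks that are running normally have an empty error field and are dropped, so an empty result suggests the restart loop has stopped.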
- -**Common causes:** - -| Cause | Evidence in logs | Fix | -| :--- | :--- | :--- | -| Secret key mismatch | `cryptography error` or `key invalid` in server logs | Re-check vault value, redeploy | -| Postgres not healthy yet | `connection refused` in server logs | Wait for postgres `(healthy)`, then check server | -| Redis not reachable | `redis connection error` in server or worker logs | Confirm `authentik-redis` is `1/1` healthy first | -| Missing bind-mount path | `no such file or directory` in container start | Create path, redeploy | -| Insufficient memory | OOM kill in `docker service ps` error column | Check node resources, adjust limits if needed | - ---- - -### Traefik route not registered or 502 response - -**Symptom:** `curl https://sso.castaldifamily.com` returns `502 Bad Gateway` or -connection refused. - -**Check:** - -```bash -# Confirm traefik-kop is running (Swarm stack) -ssh chester@10.0.0.211 \ - "docker service ls --filter name=traefik-kop" - -# Check server is listening on port 9000 -ssh chester@10.0.0.211 \ - "docker service ps authentik_authentik-server --filter desired-state=running" -``` - -**Common causes:** - -- `traefik-kop` is not running → deploy monitoring stack first. -- `authentik-server` is not bound on port `9000` → check replica and restart. -- `edge_routing.swarm.bind_ip` is incorrect in `group_vars/all.yml` → verify - it resolves to an active Swarm node. -- Cloudflare DNS is not pointing to `10.0.0.151` → verify DNS record for - `sso.castaldifamily.com`. - ---- - -### Database migration errors on first boot - -**Symptom:** Server logs show migration errors or `relation does not exist`. - -**Check:** - -```bash -ssh chester@10.0.0.211 \ - "docker service logs authentik_authentik-server --since 5m 2>&1 | grep -i 'migrat\|error\|fatal'" -``` - -**Fix:** Migrations run automatically on startup. If they fail: - -1. Check postgres is `(healthy)` and accepting connections. -2.
Check `vault_authentik_postgres_password` in vault matches the running - postgres password. -3. Restart the server service to trigger a re-run: - -```bash -ssh chester@10.0.0.211 \ - "docker service update --force authentik_authentik-server" -``` - ---- - -## Reference - -| Resource | Location | -| :--- | :--- | -| Deploy playbook | `ansible/playbooks/docker/deploy_authentik.yml` | -| Stack template | `ansible/templates/stacks/authentik.stack.yml` | -| Shared variables | `ansible/group_vars/all.yml` | -| Vault secrets | `ansible/group_vars/vault/all.yml` | -| Authentik docs | | -| Authentik changelog | | -| Swarm cluster baseline | `outputs/swarm_audit_20260314T122134.md` | diff --git a/ansible/archive/documentation/playbooks/manage_docker_environment.md b/ansible/archive/documentation/playbooks/manage_docker_environment.md deleted file mode 100644 index f1e1a64..0000000 --- a/ansible/archive/documentation/playbooks/manage_docker_environment.md +++ /dev/null @@ -1,245 +0,0 @@ -# Docker Environment Management Playbook - -## Overview - -The `manage_docker_environment.yml` playbook provides comprehensive Docker management capabilities for your homelab, including installation, configuration, container management, health monitoring, and maintenance tasks. - -## Target Hosts - -- **Primary:** `docker_hosts` group (includes docker-01 at 10.0.0.251) -- Can be run against any host in the `ubuntu_lab` group - -## Features - -### 1. Docker Installation -- Installs Docker CE with all required components -- Includes Docker Compose plugin -- Installs Docker BuildKit -- Configures Docker service for auto-start - -### 2. Configuration Management -- Configures Docker daemon with logging limits -- Adds specified users to the docker group -- Sets up storage driver (overlay2) -- Creates custom Docker networks - -### 3. Container Management -- Lists all running containers -- Creates standard networks (backend, frontend) -- Provides container inventory - -### 4. 
Health Monitoring -- Checks Docker disk usage -- Identifies unhealthy containers -- Reports system status - -### 5. Maintenance & Cleanup -- Removes stopped containers -- Prunes unused images -- Cleans up unused volumes -- Removes orphaned networks - -### 6. Configuration Backup -- Backs up docker-compose files -- Creates timestamped copies in `/opt/docker-backups` - -## Usage - -### Basic Execution - -```bash -# Run all tasks -ansible-playbook -i inventory/hosts.ini playbooks/manage_docker_environment.yml - -# Check mode (dry run) -ansible-playbook -i inventory/hosts.ini playbooks/manage_docker_environment.yml --check - -# Run with specific tags -ansible-playbook -i inventory/hosts.ini playbooks/manage_docker_environment.yml --tags "health,monitoring" -``` - -### Available Tags - -| Tag | Description | -| :--- | :--- | -| `install` | Docker installation tasks | -| `setup` | Installation + configuration | -| `config` | Configuration management only | -| `containers` | Container management tasks | -| `management` | Container inventory and network setup | -| `health` | Health checks and monitoring | -| `monitoring` | Same as health | -| `maintenance` | Cleanup and pruning tasks | -| `cleanup` | Same as maintenance | -| `backup` | Configuration backup tasks | - -### Tag Combinations - -```bash -# Install and configure Docker (first run) -ansible-playbook -i inventory/hosts.ini playbooks/manage_docker_environment.yml --tags "install,config" - -# Daily health check -ansible-playbook -i inventory/hosts.ini playbooks/manage_docker_environment.yml --tags "health" - -# Weekly maintenance -ansible-playbook -i inventory/hosts.ini playbooks/manage_docker_environment.yml --tags "maintenance" \ - -e "docker_cleanup_enabled=true" - -# Full system audit -ansible-playbook -i inventory/hosts.ini playbooks/manage_docker_environment.yml --tags "containers,health" -``` - -## Configuration Variables - -### Docker Users - -```yaml -docker_users: - - chester - - additional_user -``` - 
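As a sketch, and assuming the playbook reads `docker_users` as a plain list variable, the list could also be overridden for a single run by passing JSON extra vars with `-e`. The helper name `docker_users_extra_vars` is hypothetical, and it assumes usernames contain no quotes or backslashes.

```shell
#!/bin/sh
# Hypothetical helper (assumption, not from the playbook): builds a JSON
# extra-vars string for the docker_users list from its arguments.
# Usernames are assumed to be simple (no quotes or backslashes).
docker_users_extra_vars() {
    json=''
    for user in "$@"; do
        # Append a comma only when the list already has an entry.
        json="${json:+$json,}\"$user\""
    done
    printf '{"docker_users":[%s]}\n' "$json"
}

# Example usage with the playbook path and tag documented above:
#   ansible-playbook -i inventory/hosts.ini \
#     playbooks/manage_docker_environment.yml --tags "config" \
#     -e "$(docker_users_extra_vars chester additional_user)"
```

Extra vars take the highest precedence in Ansible, so this overrides any value set in the playbook or in `group_vars/` for that run only.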
-### Daemon Configuration - -```yaml -docker_daemon_options: - log-driver: "json-file" - log-opts: - max-size: "10m" - max-file: "3" - storage-driver: "overlay2" - insecure-registries: - - "registry.local:5000" -``` - -### Cleanup Settings - -```yaml -# Enable cleanup tasks (default: false for safety) -docker_cleanup_enabled: true - -# Remove images older than X days -docker_cleanup_older_than_days: 30 -``` - -## Examples - -### First-Time Setup - -```bash -# Install Docker on new host -ansible-playbook -i inventory/hosts.ini playbooks/manage_docker_environment.yml \ - --limit docker-01 \ - --tags "install,config" -``` - -### Regular Maintenance Workflow - -```bash -# 1. Check health status -ansible-playbook -i inventory/hosts.ini playbooks/manage_docker_environment.yml \ - --tags "health" - -# 2. Review disk usage, then run cleanup if needed -ansible-playbook -i inventory/hosts.ini playbooks/manage_docker_environment.yml \ - --tags "maintenance" \ - -e "docker_cleanup_enabled=true" - -# 3. Backup configurations -ansible-playbook -i inventory/hosts.ini playbooks/manage_docker_environment.yml \ - --tags "backup" -``` - -### Add Custom Networks - -```yaml -# In the playbook or as extra vars: -docker_networks: - - name: web_tier - driver: bridge - - name: database_tier - driver: bridge - internal: true -``` - -## Safety Features - -- **Cleanup Disabled by Default:** Cleanup tasks require explicit enabling via `docker_cleanup_enabled=true` -- **Check Mode Compatible:** All tasks support `--check` for dry-run testing -- **Idempotent:** Can be run multiple times safely -- **Non-Destructive Monitoring:** Health checks don't modify system state - -## Prerequisites - -- Ubuntu/Debian-based system -- SSH access with sudo privileges -- Python 3 with pip available -- Internet connection for package downloads - -## Post-Execution - -After running the playbook: - -1. 
**Verify Docker installation:** - ```bash - ssh chester@10.0.0.251 "docker --version && docker compose version" - ``` - -2. **Test Docker without sudo:** - ```bash - ssh chester@10.0.0.251 "docker ps" - ``` - - > [!NOTE] - > Users may need to log out and back in for group membership changes to take effect. - -3. **Check Docker status:** - ```bash - ssh chester@10.0.0.251 "sudo systemctl status docker" - ``` - -## Troubleshooting - -### Docker service won't start - -```bash -# Check Docker daemon logs -ssh chester@10.0.0.251 "sudo journalctl -u docker -n 50" - -# Validate daemon.json syntax -ssh chester@10.0.0.251 "sudo cat /etc/docker/daemon.json | jq ." -``` - -### Permission denied errors - -```bash -# Verify group membership -ssh chester@10.0.0.251 "groups" - -# Force group update (requires re-login) -ssh chester@10.0.0.251 "newgrp docker" -``` - -### High disk usage - -```bash -# Run cleanup manually -ansible-playbook -i inventory/hosts.ini playbooks/manage_docker_environment.yml \ - --tags "maintenance" \ - -e "docker_cleanup_enabled=true" -``` - -## Integration with Other Playbooks - -This playbook works alongside: - -- [init_swarm_cluster.yml](../../playbooks/init_swarm_cluster.yml) - Run Docker setup first -- [bootstrap_ai_workstation.yml](../../playbooks/bootstrap_ai_workstation.yml) - Can install Docker as dependency - -## Next Steps - -1. **Deploy Applications:** Create docker-compose files in `/opt/docker/` -2. **Set Up Monitoring:** Integrate with Prometheus/Grafana -3. **Automate Backups:** Schedule regular configuration backups -4. 
**Container Orchestration:** Consider Swarm or K3s for multi-host deployments diff --git a/ansible/archive/documentation/playbooks/mount_nfs_shares.md b/ansible/archive/documentation/playbooks/mount_nfs_shares.md deleted file mode 100644 index f35458d..0000000 --- a/ansible/archive/documentation/playbooks/mount_nfs_shares.md +++ /dev/null @@ -1,347 +0,0 @@ -# Mount NFS Shares - -**Playbook:** `playbooks/storage/mount_nfs_shares.yml` -**Purpose:** Configure NFS client mounts on Docker Swarm nodes for persistent storage -**Target:** All Swarm nodes (managers + workers) - ---- - -## Overview - -This playbook configures NFS mounts from the TerraMaster NAS to Docker Swarm nodes, providing shared storage for application data and media files. It ensures all nodes have consistent access to centralized storage while maintaining the storage contract principle that NAS is not a dependency for Swarm control-plane operations. - ---- - -## Prerequisites - -### On TerraMaster NAS (10.0.0.250) - -* NFS service enabled -* Two NFS exports configured: - * `/Volume1/appdata`: Application data, configs, persistent volumes - * `/Volume2/media`: Media files (Plex, etc.) -* NFS permissions allow access from Swarm subnet (10.0.0.0/24) - -### On Swarm Nodes - -* Ubuntu 24.04 LTS (Noble) -* SSH access as `chester` user with sudo privileges -* Network connectivity to TerraMaster on port 2049 (NFS) - ---- - -## What It Does - -1. **Installs NFS client**: `nfs-common` package -2. **Creates mount points**: `/mnt/homelab` and `/mnt/media` -3. **Configures fstab**: Persistent mounts survive reboots -4. **Mounts shares immediately**: Makes storage available without reboot -5.
**Verifies accessibility**: Tests that mounts are readable - ---- - -## Usage - -### Run on all Swarm nodes - -```bash -cd /home/chester/homelab/ansible -ansible-playbook playbooks/storage/mount_nfs_shares.yml -``` - -### Run with specific tags - -```bash -# Only install packages and create directories -ansible-playbook playbooks/storage/mount_nfs_shares.yml --tags setup - -# Only update fstab (no mount action) -ansible-playbook playbooks/storage/mount_nfs_shares.yml --tags config - -# Mount without fstab changes (testing) -ansible-playbook playbooks/storage/mount_nfs_shares.yml --tags mount - -# Verify existing mounts -ansible-playbook playbooks/storage/mount_nfs_shares.yml --tags verify -``` - -### Limit to specific nodes - -```bash -# Only managers -ansible-playbook playbooks/storage/mount_nfs_shares.yml --limit swarm_managers - -# Only workers -ansible-playbook playbooks/storage/mount_nfs_shares.yml --limit swarm_workers - -# Single node -ansible-playbook playbooks/storage/mount_nfs_shares.yml --limit swarm-worker-1 -``` - ---- - -## Configuration - -### Variables - -Defined in the playbook (`vars` section): - -| Variable | Value | Description | -|----------|-------|-------------| -| `nfs_server` | `10.0.0.250` | TerraMaster NAS IP address | -| `nfs_mounts[0].src` | `/Volume1/appdata` | NFS export path for application data | -| `nfs_mounts[0].dest` | `/mnt/homelab` | Local mount point for app data | -| `nfs_mounts[1].src` | `/Volume2/media` | NFS export path for media | -| `nfs_mounts[1].dest` | `/mnt/media` | Local mount point for media | -| `nfs_mounts[*].opts` | `defaults` | Mount options | - -### Customizing Mount Options - -To change mount options (e.g., add `noatime` for performance): - -```yaml -nfs_mounts: - - src: "/Volume1/appdata" - dest: "/mnt/homelab" - opts: "defaults,noatime,rw" -``` - -Common NFS options: -- `noatime`: Don't update access times (performance) -- `hard`: Retry indefinitely if NFS server unavailable (default) -- `soft`:
Fail after timeout (risky for data integrity) -- `rsize=8192,wsize=8192`: Adjust read/write buffer sizes -- `nfsvers=4`: Force NFSv4 (recommended) - ---- - -## Using NFS Mounts in Docker - -### Method 1: Bind Mounts (Current Approach) - -**Docker Compose:** -```yaml -services: - app: - image: myapp:latest - volumes: - - /mnt/homelab/appdata/myapp:/data - - /mnt/media:/media:ro # Read-only for safety -``` - -**Pros:** -- Simple and transparent -- Easy to debug with standard Linux tools -- One mount serves all containers - -**Cons:** -- Services coupled to host filesystem paths -- Must ensure mount exists before container starts - ---- - -### Method 2: Docker NFS Volumes (Alternative) - -**Docker Compose:** -```yaml -volumes: - homelab_data: - driver: local - driver_opts: - type: nfs - o: addr=10.0.0.250,rw,nfsvers=4 - device: ":/Volume1/appdata" - - media: - driver: local - driver_opts: - type: nfs - o: addr=10.0.0.250,ro,nfsvers=4 - device: ":/Volume2/media" - -services: - app: - image: myapp:latest - volumes: - - homelab_data:/data - - media:/media:ro -``` - -**Pros:** -- Portable volume names (no hardcoded paths) -- Docker manages mount lifecycle -- Per-service isolation possible -- Automatic retry on NFS failure - -**Cons:** -- More complex configuration -- Harder to inspect with standard tools -- Must define volumes in every compose file - ---- - -### Recommendation - -**Use bind mounts (Method 1)** for now: -- You already have working fstab configuration -- Simpler to manage across 6 nodes -- Better visibility for troubleshooting -- Can switch to Docker volumes later if needed - ---- - -## Verification - -### Check mount status - -```bash -# On any Swarm node -df -h | grep mnt - -# Expected output: -# 10.0.0.250:/Volume1/appdata 500G 100G 400G 20% /mnt/homelab -# 10.0.0.250:/Volume2/media 2.0T 500G 1.5T 25% /mnt/media -``` - -### Test write access - -```bash -# On a Swarm node -sudo touch /mnt/homelab/test-write -ls -l /mnt/homelab/test-write -sudo rm
/mnt/homelab/test-write -``` - -### Check fstab persistence - -```bash -cat /etc/fstab | grep mnt -# Should show both NFS entries -``` - ---- - -## Troubleshooting - -### Mount fails with "Connection refused" - -**Cause:** NFS service not running or firewall blocking port 2049 - -**Solution:** -```bash -# Test NFS connectivity -showmount -e 10.0.0.250 - -# If fails, check TerraMaster NFS settings -``` - ---- - -### Mount fails with "Permission denied" - -**Cause:** NFS export permissions don't allow Swarm node IPs - -**Solution:** Update TerraMaster NFS export to allow `10.0.0.0/24` subnet - ---- - -### Mount succeeds but directory is empty - -**Cause:** Mounted wrong export path or path doesn't exist on NAS - -**Solution:** -```bash -# List available exports -showmount -e 10.0.0.250 -``` - ---- - -### Mount exists but containers can't write - -**Cause:** NFS mounted read-only or wrong permissions - -**Solution:** -```bash -# Check mount options -mount | grep "/mnt/homelab" - -# Remount with write permissions if needed -sudo mount -o remount,rw /mnt/homelab -``` - ---- - -### Stale NFS file handle errors - -**Cause:** NFS server restarted or export changed - -**Solution:** -```bash -# Unmount and remount -sudo umount -f /mnt/homelab -sudo mount -a -``` - ---- - -## Safety Considerations - -### Storage Contract Compliance - -✅ **Compliant:** -- Mounting NFS on all nodes for data access -- Using NAS for application data (not control-plane state) -- Swarm can operate if NFS is temporarily unavailable - -❌ **Violations to avoid:** -- Don't store Swarm raft data on NFS -- Don't run manager services that require NFS to stay healthy -- Don't use NFS for `/var/lib/docker` or other system paths - ---- - -### Backup Verification - -Per storage contract: -- Data on `/mnt/homelab` backed up via TerraMaster → Synology rsync -- Verify backup jobs are running: Check Synology logs -- Test restores periodically - ---- - -## Maintenance - -### Adding new NFS shares - -1.
Configure export on TerraMaster -2. Add entry to `nfs_mounts` list in playbook -3. Run playbook with `--tags setup,config,mount` - -### Removing NFS shares - -1. Unmount: `sudo umount /mnt/someshare` -2. Remove from `/etc/fstab` -3. Remove directory: `sudo rmdir /mnt/someshare` - ---- - -## Related Documentation - -- [Storage Contract](../contracts/storage.md): NAS roles and backup policy -- [Environment Constraints](../standards/environment-constraints.md): Network and hardware specs -- [Architecture Decisions](../../documentation/standards/architecture-decisions.md): ADR-003 (Watchtower role) - ---- - -## Tags Reference - -| Tag | Purpose | -|-----|---------| -| `setup` | Install packages, create directories | -| `packages` | Install NFS client only | -| `filesystem` | Create mount point directories only | -| `config` | Update fstab only | -| `fstab` | Alias for `config` | -| `mount` | Execute mount operations | -| `verify` | Test mounts and display status | diff --git a/ansible/archive/documentation/playbooks/onboard-ansible-secrets.md b/ansible/archive/documentation/playbooks/onboard-ansible-secrets.md deleted file mode 100644 index fd86ba9..0000000 --- a/ansible/archive/documentation/playbooks/onboard-ansible-secrets.md +++ /dev/null @@ -1,153 +0,0 @@ -# Ansible secrets onboarding playbook - -## Overview - -This guide onboards secret management for passwords, API keys, and tokens using -Ansible Vault. It defines a repeatable workflow for creating encrypted variable -files, loading them safely in playbooks, and consuming secrets with idempotent -Ansible modules. - -## What this establishes - -### 1. Standard secret file layout - -- `group_vars//vault.yml` for group-level secrets -- `host_vars//vault.yml` for host-level secrets -- Secret variable names with `_pass` or `_secret` suffixes - -### 2.
Encrypted-at-rest secret storage - -- Secrets are created and edited with `ansible-vault` -- Plaintext secrets are not committed to Git -- Existing ignore rules in [ansible/.gitignore](../../.gitignore) protect vault - files from accidental commits - -### 3. Safe secret consumption patterns - -- Use `ansible.builtin.template`, `ansible.builtin.copy`, and - `ansible.builtin.lineinfile` instead of ad-hoc shell commands -- Mark sensitive tasks with `no_log: true` -- Set explicit file ownership and mode for rendered secret files - -## Prerequisites - -- Ansible installed on the control node -- Access to [ansible.cfg](../../ansible.cfg) and your inventory -- A vault password strategy: - - Interactive prompt (`--ask-vault-pass`) for manual runs - - Password file (`--vault-password-file`) for controlled automation - -> [!IMPORTANT] -> Do not store vault passwords in repository files or plaintext notes. - -## Step-by-step onboarding - -### Step 1: Create vault files - -```bash -# Group-level secrets -ansible-vault create group_vars/docker/vault.yml - -# Host-level secrets -ansible-vault create host_vars/docker-01/vault.yml -``` - -### Step 2: Add secrets with naming conventions - -```yaml -# group_vars/docker/vault.yml -grafana_admin_pass: "replace-me" -watchtower_api_key_secret: "replace-me" -``` - -### Step 3: Reference secrets in playbooks or roles - -```yaml -# playbooks/example.yml -- name: Configure app secrets - hosts: docker_hosts - become: true - tasks: - - name: Render application environment file - ansible.builtin.template: - src: templates/app.env.j2 - dest: /opt/app/.env - owner: root - group: root - mode: "0600" - no_log: true -``` - -```jinja2 -# templates/app.env.j2 -GRAFANA_ADMIN_PASSWORD={{ grafana_admin_pass }} -WATCHTOWER_API_KEY={{ watchtower_api_key_secret }} -``` - -### Step 4: Run with vault decryption - -```bash -# Interactive -ansible-playbook -i inventory/hosts.ini playbooks/example.yml --ask-vault-pass - -# Automated (secured local file) 
-ansible-playbook -i inventory/hosts.ini playbooks/example.yml \ - --vault-password-file ~/.ansible/.vault-pass -``` - -### Step 5: Verify idempotency and secrecy - -```bash -# Syntax check -ansible-playbook -i inventory/hosts.ini playbooks/example.yml --syntax-check - -# Idempotency check (run twice; second run should be unchanged) -ansible-playbook -i inventory/hosts.ini playbooks/example.yml --ask-vault-pass -ansible-playbook -i inventory/hosts.ini playbooks/example.yml --ask-vault-pass -``` - -## Why module-first instead of shell - -- `ansible.builtin.template` and `ansible.builtin.copy` are idempotent and track - file diffs -- Explicit `owner`, `group`, and `mode` improve auditability -- `shell` can leak secrets into command history and logs if not handled - carefully -- Module output is safer to control with `no_log: true` - -## Security guardrails - -- Keep `no_log: true` on any task that reads, writes, or debugs secret values -- Never print secret variables with `ansible.builtin.debug` -- Scope secrets to the narrowest level possible (host before group when needed) -- Rotate credentials by updating vault values and re-running playbooks -- Prefer separate vault files per scope to limit blast radius - -## Troubleshooting - -### Decryption failed - -```bash -ansible-vault view group_vars/docker/vault.yml -``` - -Use the same vault password source used during file creation. - -### Variable is undefined - -- Confirm secret file path matches inventory group/host names -- Confirm variable names match exactly in templates and tasks -- Run with `-vv` and inspect which variable files loaded - -### Secret file committed by mistake - -1. Rotate affected credentials immediately -2. Remove file from tracking -3. 
Rewrite Git history if secrets were pushed to remote - -## Integration notes - -- Follow the quality checklist in - [Ansible quality gates](../standards/ansible-quality-gates.md) -- Keep naming aligned with - [Naming conventions](../standards/naming-conventions.md) diff --git a/ansible/archive/documentation/playbooks/onboard_new_host.md b/ansible/archive/documentation/playbooks/onboard_new_host.md deleted file mode 100644 index 07c60d9..0000000 --- a/ansible/archive/documentation/playbooks/onboard_new_host.md +++ /dev/null @@ -1,363 +0,0 @@ -# Non-Proxmox Host Onboarding Playbook - -## Overview - -The `playbooks/onboarding/generic_host.yml` playbook automates bootstrap for non-Proxmox hosts and supports two profiles: - -- `new`: full onboarding with security hardening. -- `existing`: safe onboarding for pre-existing production hosts (key setup, Python, sudo, packages; skips SSH hardening). - -Use `existing` for live systems like `10.0.0.151` (Traefik) and `10.0.0.251`. - -## What It Does - -### 1. Connectivity Test -- Verifies SSH connection to target host -- Uses raw commands (no Python required initially) -- Provides clear error messages if connection fails - -### 2. SSH Key Authentication -- Creates `.ssh` directory with correct permissions -- Copies your public SSH key to `authorized_keys` -- Validates passwordless SSH authentication - -### 3. Python & Prerequisites -- Installs Python3 if not present -- Installs `python3-apt` for Ansible module support -- Gathers system facts - -### 4. Passwordless Sudo -- Creates sudoers configuration for your user -- Validates sudo configuration syntax -- Tests passwordless sudo access - -### 5. Essential Packages -- Updates apt cache -- Installs essential tools (git, vim, curl, htop, etc.) - -### 6. Basic Security -- Disables root SSH login -- Disables password authentication (SSH keys only) -- Configures UFW firewall (allows SSH) - -### 7. 
Final Validation -- Tests complete passwordless authentication -- Displays comprehensive onboarding summary - -## Usage - -### Method 1: Existing production hosts (safe profile) - -```bash -ansible-playbook -i inventory/hosts.ini playbooks/onboarding/generic_host.yml \ - -e "target_host=docker_hosts" \ - -e "onboard_user=chester" \ - -e "onboarding_profile=existing" \ - -k -K -``` - -This is the recommended process for hosts that already run production workloads. - -### Method 2: Net-new host onboarding (full hardening) - -```bash -# Onboard a single host -ansible-playbook -i inventory/hosts.ini playbooks/onboarding/generic_host.yml \ - -e "target_host=docker-01" \ - -e "onboard_user=chester" \ - -e "onboarding_profile=new" \ - -k -K - -# -k: Prompt for SSH password -# -K: Prompt for sudo password -``` - -### Method 3: Using Environment Variables - -```bash -# Set credentials via environment -export ANSIBLE_SSH_PASS='your_password' -export ANSIBLE_BECOME_PASS='your_password' - -# Run playbook -ansible-playbook -i inventory/hosts.ini playbooks/onboarding/generic_host.yml \ - -e "target_host=docker-01" -``` - -### Method 4: Onboard Multiple Hosts - -```bash -# Add new hosts to inventory first -# Then onboard them all at once -ansible-playbook -i inventory/hosts.ini playbooks/onboarding/generic_host.yml \ - -e "target_host=new_servers" \ - -e "onboarding_profile=existing" \ - -k -K - -# Where 'new_servers' is a group in your inventory -``` - -## Required Variables - -| Variable | Description | Default | -| :--- | :--- | :--- | -| `target_host` | Host or group to onboard | `all` | -| `onboard_user` | Username for SSH/sudo | `chester` | -| `onboarding_profile` | `new` (harden) or `existing` (safe) | `new` | -| `onboard_password` | SSH and sudo password | From env or prompt | - -## Prerequisites - -### On Your Control Machine (jumpbox) -- SSH key pair exists (`~/.ssh/id_ed25519`) -- Ansible installed -- Network connectivity to target host - -### On Target Host -- SSH 
server running -- User account with sudo privileges -- Network connectivity from control machine - -## Step-by-Step First-Time Onboarding - -### Step 1: Add Host to Inventory - -```ini -# inventory/hosts.ini -[new_hosts] -new-server ansible_host=10.0.0.252 -``` - -### Step 2: Test Connectivity - -```bash -# Verify SSH access manually first -ssh chester@10.0.0.252 -``` - -### Step 3: Run Onboarding Playbook - -```bash -ansible-playbook -i inventory/hosts.ini playbooks/onboarding/generic_host.yml \ - -e "target_host=new-server" \ - -e "onboard_user=chester" \ - -e "onboarding_profile=new" \ - -k -K -``` - -### Step 4: Verify Passwordless Access - -```bash -# Test Ansible ping without password -ansible -i inventory/hosts.ini new-server -m ping - -# Test SSH without password -ssh chester@10.0.0.252 'sudo whoami' -``` - -## Tag-Based Execution - -Run specific sections only: - -```bash -# Test connectivity only -ansible-playbook -i inventory/hosts.ini playbooks/onboarding/generic_host.yml \ - -e "target_host=docker-01" \ - --tags "connectivity" \ - -k - -# Setup SSH keys only -ansible-playbook -i inventory/hosts.ini playbooks/onboarding/generic_host.yml \ - -e "target_host=docker-01" \ - --tags "ssh" \ - -k -K - -# Skip security hardening -ansible-playbook -i inventory/hosts.ini playbooks/onboarding/generic_host.yml \ - -e "target_host=docker-01" \ - --skip-tags "security" \ - -k -K -``` - -### Available Tags - -| Tag | Section | -| :--- | :--- | -| `connectivity` | Connection test | -| `test` | Connection test | -| `ssh` | SSH key setup | -| `setup` | All setup tasks | -| `python` | Python installation | -| `prerequisites` | Package prerequisites | -| `sudo` | Passwordless sudo | -| `packages` | Essential packages | -| `security` | Security hardening | -| `hardening` | Security hardening | -| `validate` | Final validation | -| `summary` | Onboarding summary | - -## Expected Output - -``` -PLAY [Onboard New Host to Ansible Management] ************************************ 
- -TASK [Test raw connection (no Python required)] ********************************** -ok: [docker-01] - -TASK [Display connection status] ************************************************* -ok: [docker-01] => { - "msg": "✅ Successfully connected to docker-01" -} - -... - -TASK [Display onboarding summary] ************************************************ -ok: [docker-01] => { - "msg": [ - "════════════════════════════════════════════════", - "✅ HOST ONBOARDING COMPLETE", - "════════════════════════════════════════════════", - "Host: docker-01 (waldorf)", - "IP: 10.0.0.251", - "OS: Ubuntu 24.04", - "Python: 3.12.3", - "SSH Key Auth: ✅ Enabled", - "Passwordless Sudo: ✅ Enabled", - "Ansible User: chester", - "════════════════════════════════════════════════" - ] -} -``` - -## Troubleshooting - -### SSH Connection Failed - -```bash -# Test manual SSH first -ssh chester@10.0.0.252 - -# Check SSH service on target -ssh chester@10.0.0.252 'sudo systemctl status sshd' - -# Verify firewall allows SSH -ssh chester@10.0.0.252 'sudo ufw status' -``` - -### Python Installation Failed - -```bash -# Manually install Python -ssh chester@10.0.0.252 'sudo apt-get update && sudo apt-get install -y python3' -``` - -### Sudo Password Prompt Still Appears - -```bash -# Check sudoers configuration -ssh chester@10.0.0.252 'sudo cat /etc/sudoers.d/chester' - -# Verify syntax -ssh chester@10.0.0.252 'sudo visudo -c' -``` - -### SSH Key Not Working After Setup - -```bash -# Check authorized_keys permissions -ssh chester@10.0.0.252 'ls -la ~/.ssh/authorized_keys' - -# Should be: -rw------- (600) - -# Check SSH config on target -ssh chester@10.0.0.252 'sudo grep -E "PubkeyAuthentication|PasswordAuthentication" /etc/ssh/sshd_config' -``` - -## Security Considerations - -### SSH Hardening Applied - -- ✅ Root login disabled -- ✅ Password authentication disabled (after key setup) -- ✅ SSH keys required for all access - -### Post-Onboarding Recommendations - -1.
**Review SSH Configuration** - ```bash - ssh chester@host 'sudo sshd -T | grep -E "permit|password|pubkey"' - ``` - -2. **Configure Firewall Rules** - ```bash - # Allow only required services - ssh chester@host 'sudo ufw allow 22/tcp && sudo ufw enable' - ``` - -3. **Enable Automatic Security Updates** - ```bash - ssh chester@host 'sudo apt-get install unattended-upgrades' - ``` - -4. **Set Up Fail2Ban** - ```bash - ssh chester@host 'sudo apt-get install fail2ban' - ``` - -## Integration with Other Playbooks - -After onboarding, you can run any playbook without passwords: - -```bash -# Install Docker -ansible-playbook -i inventory/hosts.ini playbooks/manage_docker_environment.yml \ - --limit new-server \ - --tags "install" - -# Configure networking -ansible-playbook -i inventory/hosts.ini playbooks/baseline_network_config.yml \ - --limit new-server -``` - -## Bulk Onboarding Workflow - -For onboarding multiple hosts at once: - -### 1. Create Temporary Inventory - -```ini -# inventory/new-hosts.ini -[pending_onboard] -server-01 ansible_host=10.0.0.101 -server-02 ansible_host=10.0.0.102 -server-03 ansible_host=10.0.0.103 - -[pending_onboard:vars] -ansible_user=chester -``` - -### 2. Run Onboarding - -```bash -ansible-playbook -i inventory/new-hosts.ini playbooks/onboarding/generic_host.yml \ - -e "target_host=pending_onboard" \ - -e "onboarding_profile=existing" \ - -k -K -``` - -### 3. Merge into Main Inventory - -After successful onboarding, add hosts to your main [inventory/hosts.ini](../../inventory/hosts.ini) file. - -## Next Steps - -After successful onboarding: - -1. **Assign to appropriate groups** in [inventory/hosts.ini](../../inventory/hosts.ini) -2. **Configure group_vars** for role-specific settings -3. **Run role-specific playbooks** (Docker, networking, etc.) -4. **Deploy monitoring exporter for standalone hosts** - ```bash - ansible-playbook -i inventory/hosts.ini playbooks/monitoring/deploy_swarm_monitoring.yml --tags docker-hosts - ``` -5. 
**Document host purpose** in your infrastructure documentation diff --git a/ansible/archive/documentation/playbooks/watchtower-monitoring-onboarding.md b/ansible/archive/documentation/playbooks/watchtower-monitoring-onboarding.md deleted file mode 100644 index bc418e7..0000000 --- a/ansible/archive/documentation/playbooks/watchtower-monitoring-onboarding.md +++ /dev/null @@ -1,236 +0,0 @@ -# Watchtower monitoring onboarding and self-healing runbook - -## Purpose - -This runbook is the operator path for deploying, validating, and maintaining the full -Watchtower monitoring stack. - -It covers: - -- Monitoring stack onboarding (all services). -- Integration points between services and external Traefik. -- Day-1 troubleshooting, including Authentik outpost restart loops. -- Self-healing execution with safe, repeatable reconciliation. - -## Scope - -The canonical Watchtower monitoring scope is: - -- traefik-kop -- Prometheus -- Grafana -- Uptime Kuma -- node-exporter -- watchtower-cadvisor -- Dozzle -- Authentik outpost for Dozzle -- Loki -- Promtail -- blackbox-exporter - -## Architecture summary - -- External Traefik ingress runs on `10.0.0.151` and is not migrated into Swarm. -- Swarm exporters run on Swarm nodes. -- Watchtower hosts aggregation, storage, visualization, and logging services. -- Traefik labels are used for HTTPS-routed UIs (Grafana, Dozzle, Uptime Kuma). - -## Prerequisites - -1. Inventory groups are defined and reachable: `swarm_managers`, `swarm_workers`, - `swarm_hosts`, and `watchtower`. -2. Docker is installed on all target nodes. -3. Overlay network `proxy-net` exists for Swarm workloads. -4. Vault file exists at `ansible/group_vars/vault/all.yml` or equivalent secrets are - provided through secure environment variables. -5. 
Required secrets are present: - - `vault_grafana_admin_password` - - `vault_authentik_outpost_dozzle_token` - -If Authentik token is not available yet, set `monitoring_enable_authentik_outpost=false` -for bootstrap deployment and keep Dozzle private until token onboarding is complete. - -> [!WARNING] -> Never hardcode tokens or passwords in compose files, playbooks, or helper scripts. -> Use Vault variables and rotate credentials if any plaintext secret was committed. - -## Deployment order - -1. Exporters on Swarm nodes (`node-exporter`, `cAdvisor`). -2. Dozzle agent on Swarm managers. -3. Watchtower stack (`traefik-kop`, Prometheus, Grafana, Uptime Kuma, Dozzle, - Authentik outpost, Loki, Promtail). -4. Post-deploy verification and dashboard bootstrap. - -## Deploy commands - -Run from `ansible/`: - -```bash -ansible-playbook -i inventory/hosts.ini playbooks/monitoring/deploy_swarm_monitoring.yml -``` - -Target only Swarm exporters: - -```bash -ansible-playbook -i inventory/hosts.ini playbooks/monitoring/deploy_swarm_monitoring.yml --tags swarm -``` - -Target only Watchtower stack: - -```bash -ansible-playbook -i inventory/hosts.ini playbooks/monitoring/deploy_swarm_monitoring.yml --tags watchtower -``` - -## Service-by-service onboarding checks - -### traefik-kop - -- Verify service starts and can reach Redis endpoint `10.0.0.151:6379`. -- Verify route updates are visible from external Traefik behavior. - -### Prometheus - -- Verify readiness endpoint: - -```bash -curl -fsS http://10.0.0.200:9091/-/ready -``` - -- Verify targets include expected managers, workers, and Watchtower node-exporter. - -### Grafana - -- Verify HTTPS route at configured domain. -- Confirm login with admin user and vault-provided password. -- Add data sources: - - Prometheus: `http://prometheus:9090` - - Loki: `http://loki:3100` - -### Uptime Kuma - -- Verify HTTPS route and UI load. 
-- Add core checks for: - - External Traefik endpoint - - Watchtower host health - - Swarm manager API reachability - -### node-exporter and cAdvisor - -- Verify metrics endpoints are reachable from each node. -- Confirm Prometheus scrape status is `up` for all exporters. -- Verify local Watchtower cAdvisor endpoint: - -```bash -curl -fsS http://10.0.0.200:18080/metrics | head -``` - -### Dozzle and Authentik outpost - -- Verify Dozzle HTTPS route. -- Verify Authentik outpost endpoint routing under `/outpost.goauthentik.io/`. -- Verify forward-auth middleware is attached and blocking unauthenticated access. - -### Loki and Promtail - -- Verify Loki API health via container logs and ingestion behavior. -- Verify Promtail discovers Docker logs and labels streams by project/service. - -### blackbox-exporter (network and endpoint probes) - -- Verify Blackbox exporter is reachable: - -```bash -curl -fsS http://10.0.0.200:9115/metrics | head -``` - -- Verify Prometheus shows probe targets in `blackbox-probes` job. -- Add probe targets through `monitoring_probe_targets` in group vars. - -## Day-1 troubleshooting - -### Authentik outpost restart loop - -1. Verify token presence in rendered `.env` for stack directory. -1. Confirm token matches active Authentik outpost token in Authentik admin. -1. Confirm Traefik middleware label references the same outpost service. -1. Check container logs: - -```bash -docker logs authentik-outpost-dozzle --tail 200 -``` - -1. Reconcile stack after token correction: - -```bash -ansible-playbook -i inventory/hosts.ini playbooks/monitoring/deploy_swarm_monitoring.yml --tags watchtower -``` - -### Backlog item: Authentik token pending - -1. Keep `monitoring_enable_authentik_outpost=false` while token is unavailable. -1. Do not expose Dozzle publicly without Authentik forward-auth. -1. Re-enable outpost after token handoff and re-run watchtower tag. - -### Prometheus missing targets - -1. Confirm inventory contains correct node IPs and groups. 
-2. Re-run deployment to re-render scrape config. -3. Query target API and inspect dropped targets. - -### Blackbox probes failing - -1. Confirm target is reachable from Watchtower network path. -1. Confirm probe module matches target protocol (`icmp`, `tcp_connect`, `http_2xx`). -1. Confirm Prometheus relabeling routes probes to `watchtower_ip:9115`. - -### Dozzle cannot see remote logs - -1. Confirm `dozzle-agent` service is healthy on manager nodes. -2. Confirm remote agent endpoints and ports are reachable. -3. Confirm Docker socket mount is present and read-only where expected. - -## Self-healing model - -Self-healing is implemented as scheduled reconciliation, not ad-hoc manual edits. - -### Current helper script status - -- `ansible/scripts/pi_pull_updates.sh` is retained as a helper and now expects - configurable environment variables instead of embedded credentials. -- `ansible/scripts/pi_init.sh` is optional for operator bootstrap and is not - required for monitoring stack reconciliation. - -### Recommended execution pattern - -1. Use `ansible-pull` to sync and apply `ansible/playbooks/self-heal/watchtower.yml`. -2. Run through a scheduler (prefer `systemd` timer for reliability and observability). -3. Keep logs in a persistent path and alert on repeated failures. - -Example manual run: - -```bash -REPO_URL=git@git.castaldifamily.com:nathan/homelab.git \ -PLAYBOOK_PATH=ansible/playbooks/self-heal/watchtower.yml \ -/home/chester/homelab/ansible/scripts/pi_pull_updates.sh -``` - -> [!IMPORTANT] -> If your repository is private, use SSH deploy keys or vault-backed secret injection. -> Do not place long-lived personal access tokens in script files. - -## Idempotency and rollback - -- Re-running deployment playbooks is expected and safe; desired state is reconciled. -- Keep stack definitions in Git and avoid manual edits in `/opt/stacks`. -- Rollback method: - 1. Revert the offending commit in Git. - 2. Re-run deployment playbook. - 3. 
Validate endpoints and target health. - -## Operational safety rules - -- Do not run services as root unless technically required and documented. -- Avoid broad host mounts unless required for telemetry collection. -- Keep exposed admin ports behind Traefik and authentication middleware. -- Validate health and auth behavior before declaring changes complete. diff --git a/ansible/archive/documentation/reports/prompt-analysis-2026-01-09.md b/ansible/archive/documentation/reports/prompt-analysis-2026-01-09.md deleted file mode 100644 index 5022895..0000000 --- a/ansible/archive/documentation/reports/prompt-analysis-2026-01-09.md +++ /dev/null @@ -1,648 +0,0 @@ ---- -title: "Prompt Repository Analysis Report" -date: "2026-01-09" -author: "FrankGPT v4" -type: "Analysis" ---- - -# Prompt Repository Analysis Report - -## Executive Summary - -Analyzed **26 prompt files** across the `.github/prompts/` directory. The repository contains a mix of production-ready, draft, and deprecated prompts with varying levels of sophistication. - -**Key Findings:** -- **Overlap Issues:** 7 prompts have significant overlap and can be converged -- **Deprecated Content:** 3 "OLD.*" prompts should be archived or removed -- **Draft Quality:** 4 draft prompts lack implementation detail -- **Top 5 Adjustments Needed:** See Section 4 for detailed recommendations - ---- - -## 1. 
Overlap Analysis: Convergence Opportunities - -### 1.1 Service Management Workflows (High Overlap) - -**Affected Prompts:** -- `service-new.prompt.md` -- `service-review.prompt.md` -- `service-standardize.prompt.md` -- `service-troubleshoot.prompt.md` -- `service-decommission.prompt.md` -- `service-migration.prompt.md` - -**Analysis:** -All six prompts share a common structure: -- Gated, step-by-step workflow -- Service-focused (Docker/Compose) -- Inventory integration (`.github/knowledge/inventory.md`) -- Explicit confirmation phrases -- Upstream documentation validation - -**Current Duplication:** -- **Pre-flight checks:** SSH validation, service discovery logic repeated 6 times -- **Inventory lookups:** Same RAG pattern in `service-new`, `service-review`, `service-standardize` -- **Gate structure:** Nearly identical gate format across all service prompts -- **Output format:** All produce Markdown reports with similar sections - -**Convergence Recommendation:** - -**Option A: Meta-Prompt Architecture (Recommended)** - -Create a single `service-workflow.meta.prompt.md` that defines: - -```yaml -# service-workflow.meta.prompt.md -workflows: - - name: new - gates: [0, 1, 2, 3, 4, 5] - phases: [validate_sources, plan, analyze, patch, verify] - - name: review - gates: [0, 1, 2, 3, 4] - phases: [discover, compare, report, patch, verify] - - name: standardize - gates: [0, 1, 2, 3, 4] - phases: [locate, assess_risk, propose, apply, bounce] -``` - -Then reduce individual prompts to: - -```markdown -# service-new.prompt.md ---- -extends: service-workflow.meta -workflow: new ---- -[Workflow-specific customizations only] -``` - -**Option B: Consolidate to Single File with Modes** - -Create `service-management.prompt.md` with mode flags: - -```markdown -# Usage -/service-management mode=new app=traefik -/service-management mode=review app=immich -``` - -**Impact:** -- **Reduction:** 6 files β†’ 1 meta-prompt + 6 lightweight configs (or 1 unified file) -- **Maintenance:** 
Single source of truth for gates, inventory logic, security checks -- **Risk:** Low if phased migration - ---- - -### 1.2 Session Management (Medium Overlap) - -**Affected Prompts:** -- `session-start.prompt.md` -- `session-end.prompt.md` -- `session-status.prompt.md` -- `OLD.session-start.prompt.md` -- `OLD.session-end.prompt.md` -- `OLD.session-status.prompt.md` - -**Analysis:** -- **OLD.* versions:** Clearly deprecated (no frontmatter, less structured) -- **Current versions:** All reference `SESSION_SNAPSHOT*.md` and perform RAG searches -- **Overlap:** All three prompts perform git status checks and snapshot retrieval - -**Convergence Recommendation:** - -**Create:** `session-lifecycle.prompt.md` - -```markdown -# session-lifecycle.prompt.md -modes: - - start: Load snapshot, check drift, present menu - - status: Quick realignment without full context - - end: Generate snapshot, git operations -``` - -**Impact:** -- **Reduction:** 6 files β†’ 1 unified prompt -- **Archive:** Move OLD.* to `.github/prompts/archive/` -- **Risk:** Very low, well-defined workflows - ---- - -### 1.3 Markdown Conversion (Low Overlap but Redundant) - -**Affected Prompts:** -- `md2htmlDARK.prompt.md` -- `md2htmlLIGHT.prompt.md` - -**Analysis:** -Both prompts are 90% identical, differing only in CSS color schemes. - -**Convergence Recommendation:** - -**Single Prompt with Parameter:** - -```markdown -# md2html.prompt.md -theme: ${input:theme} # Options: dark, light -``` - -**Impact:** -- **Reduction:** 2 files β†’ 1 file -- **Risk:** None - ---- - -### 1.4 Draft Prompts (Should Be Eliminated or Completed) - -**Affected Prompts:** -- `service-decommission.prompt.md` (draft) -- `service-migration.prompt.md` (draft) -- `security-hardening.prompt.md` (draft) -- `performance-tuning.prompt.md` (draft) - -**Analysis:** -All four are labeled "Draft" with generic checklists. 
They lack: -- Gate structure used in other prompts -- RAG integration -- Specific commands or validation steps -- Safety guardrails - -**Recommendation:** -Either: -1. **Complete them** using the pattern from `service-new.prompt.md` (gated workflow) -2. **Archive them** to `.github/prompts/drafts/` until needed -3. **Eliminate them** if not actively used - -**Impact:** -- Reduces "prompt noise" in main directory -- Sets quality bar for production prompts - ---- - -## 2. Summary of Convergence Opportunities - -| Prompt Group | Current Count | Proposed Count | Reduction | -| :--- | :---: | :---: | :---: | -| Service Management | 6 | 1 (+ 6 configs) | 83% code duplication | -| Session Lifecycle | 6 | 1 | 83% | -| Markdown HTML | 2 | 1 | 50% | -| Drafts | 4 | 0 (archived) | 100% | -| **Total Prompts** | **26** | **15–17** | **35–42% reduction** | - ---- - -## 3. Quality Tiers - -### Tier 1: Production-Ready (8 prompts) -These prompts have complete implementation, gate structure, and clear success criteria: - -1. βœ… `service-new.prompt.md` - Best-in-class structure -2. βœ… `service-review.prompt.md` - Comprehensive validation -3. βœ… `service-standardize.prompt.md` - Clear versioning logic -4. βœ… `service-troubleshoot.prompt.md` - OODA loop methodology -5. βœ… `sso-onboarding.prompt.md` - Authentik integration -6. βœ… `create-commit.msg.prompt.md` - RAG + Conventional Commits -7. βœ… `clean-git.prompt.md` - ReAct protocol, security checks -8. βœ… `generateVulnerabilitiesReport.prompt.md` - Structured output - -### Tier 2: Functional but Needs Polish (5 prompts) - -9. 🟑 `session-start.prompt.md` - Missing detailed menu structure -10. 🟑 `session-end.prompt.md` - Template fallback not defined -11. 🟑 `session-status.prompt.md` - Drift detection logic vague -12. 🟑 `reviewDockerCompose.prompt.md` - Good but lacks gates -13. 🟑 `ansible-tutor.prompt.md` - Too brief, needs examples - -### Tier 3: Draft/Incomplete (9 prompts) - -14. 
πŸ”΄ `service-decommission.prompt.md` - Generic checklist only -15. πŸ”΄ `service-migration.prompt.md` - Generic checklist only -16. πŸ”΄ `security-hardening.prompt.md` - Generic checklist only -17. πŸ”΄ `performance-tuning.prompt.md` - Generic checklist only -18. πŸ”΄ `create-readme.prompt.md` - Incomplete template -19. πŸ”΄ `doc-lint.prompt.md` - Phase 3 cut off mid-section -20. πŸ”΄ `md2htmlDARK.prompt.md` - Functional but unmaintained -21. πŸ”΄ `md2htmlLIGHT.prompt.md` - Duplicate -22. πŸ”΄ `README.md` - Outdated references - -### Tier 4: Deprecated (3 prompts) - -23. ⚫ `OLD.session-start.prompt.md` - Archive -24. ⚫ `OLD.session-end.prompt.md` - Archive -25. ⚫ `OLD.create-commit-msg.prompt.md` - Archive - ---- - -## 4. Top 5 Prompts Needing Adjustments - -### πŸ₯‡ Rank 1: `reviewDockerCompose.prompt.md` - -**Current State:** Functional mentor-led review prompt but lacks the safety gates present in newer prompts. - -**Issues:** -- No explicit confirmation gates (user can't stop workflow) -- No RAG integration with inventory or upstream docs -- Security audit logic not DRY (duplicates `generateVulnerabilitiesReport.prompt.md`) -- Missing rollback/recovery procedures - -**Impact Score:** 9/10 (Used for critical security audits) - -**Recommended Improvements:** - -1. **Add Gate Structure:** - ```markdown - ## Gate 0 β€” confirm target file - User must reply exactly: `REVIEW: ` - - ## Gate 1 β€” confirm findings - User must reply exactly: `CONFIRM FINDINGS: ` - - ## Gate 2 β€” apply patches (if requested) - User must reply exactly: `APPLY PATCHES: ` - ``` - -2. **Integrate with Vulnerability Report:** - ```markdown - ## Step 1 β€” Run Security Scan First - Before manual review, execute: - `/generateVulnerabilityReport` on the target file. - Reference its output to avoid duplicating security checks. - ``` - -3. **Add Inventory Cross-Check:** - ```markdown - ## Step 2 β€” Validate Against Inventory - Search `.github/knowledge/inventory.md` for the service. 
- Compare declared image version vs. upstream latest. - ``` - -4. **Define Rollback:** - ```markdown - ## Recovery Procedure - If changes break the service: - 1. `git checkout HEAD -- docker-compose.yml` - 2. `docker compose up -d` - ``` - ---- - -### πŸ₯ˆ Rank 2: `ansible-tutor.prompt.md` - -**Current State:** Minimal prompt with good intent but lacks examples and structure. - -**Issues:** -- Only ~15 lines (vs. 150+ in mature prompts) -- No gate structure for safety -- No examples of "good" vs. "bad" Ansible patterns -- Missing integration with existing playbooks in the repo -- No validation steps - -**Impact Score:** 8/10 (Critical for teaching correct Ansible patterns) - -**Recommended Improvements:** - -1. **Add Real-World Examples:** - ```markdown - ## Anti-Pattern Detection - - ### ❌ Bad: Shell Command Overuse - ```yaml - - name: Install Docker - shell: curl -fsSL get.docker.com | bash - ``` - - ### βœ… Good: Idempotent Module Use - ```yaml - - name: Install Docker - apt: - name: docker-ce - state: present - ``` - -2. **Integrate with Existing Repo:** - ```markdown - ## Step 1 β€” Scan Existing Playbooks - Before generating new code: - 1. Search workspace for `playbooks/*.yml` - 2. Extract patterns from `roles/*/tasks/main.yml` - 3. Align new code with existing style - ``` - -3. **Add Safety Gates:** - ```markdown - ## Gate 1 β€” Destructive Action Check - If the proposed task includes any of these modules: - - `shell` with `rm`, `dd`, `mkfs` - - `file` with `state: absent` on system paths - - STOP and require explicit confirmation: - User must reply: `I UNDERSTAND THE RISK: ` - ``` - -4. **Add Validation Workflow:** - ```markdown - ## Step 4 β€” Validation (Required) - 1. Run `ansible-playbook --syntax-check playbook.yml` - 2. Run `ansible-playbook --check playbook.yml` (dry-run) - 3. 
Provide copy/paste commands for user verification - ``` - ---- - -### πŸ₯‰ Rank 3: `session-status.prompt.md` - -**Current State:** Cognitive realignment prompt with vague drift detection logic. - -**Issues:** -- "Drift Check" criteria poorly defined -- No quantifiable metrics (how far off-track is "drift"?) -- Missing actionable output (no clear commands) -- Phase 3 output format not standardized - -**Impact Score:** 7/10 (Used frequently but output inconsistent) - -**Recommended Improvements:** - -1. **Define Drift Quantitatively:** - ```markdown - ## Phase 2: Drift Calculation - - Compute drift score: - - Active file NOT in snapshot "Files Changed": +2 drift - - Terminal command NOT in snapshot "Next Steps": +1 drift - - Open files > 5 and none in snapshot: +3 drift - - Drift Levels: - - 0-1: βœ… On track - - 2-3: ⚠️ Minor drift - - 4+: 🚨 Major drift (pruning required) - ``` - -2. **Standardize HUD Output:** - ```markdown - ## Phase 3: Heads-Up Display (HUD) - - ### Status Report - | Metric | Status | Action | - |:---|:---|:---| - | Drift Score | 4 🚨 | Pruning recommended | - | Last Snapshot | 2h ago | Recent | - | Active Task | Fix traefik labels | ⚠️ Not in snapshot | - | Blockers | None | - | - - ### Recommended Command - To realign, run: - ```bash - git checkout main - cd _thelab/core/web/traefik - ``` - ``` - -3. **Add Memory Compression:** - ```markdown - ## Phase 4: Context Compression (If Drift > 5) - Summarize current conversation in 3 bullets: - - What we tried - - What failed - - What's next - - Then clear terminal history to reduce cognitive load. - ``` - ---- - -### πŸ… Rank 4: Service Draft Prompts (Group) - -**Affected:** `service-decommission`, `service-migration`, `security-hardening`, `performance-tuning` - -**Current State:** All are generic checklists with no implementation logic. 
- -**Issues:** -- No gate structure -- No integration with existing tooling -- No validation steps -- No examples or commands - -**Impact Score:** 6/10 (Blocking future workflows) - -**Recommended Improvements:** - -**Template to Follow:** Use `service-new.prompt.md` as the gold standard. - -**Example: Complete `service-decommission.prompt.md`** - -```markdown ---- -description: "Guided, gated workflow for safely decommissioning a service." ---- - -# [ROLE] -You are a **DevOps SRE** acting as a **decomm specialist**. - -# [GOAL] -Safely retire a service by: -- Backing up all data and configs -- Validating no dependencies -- Removing from production -- Updating documentation - -# [INPUTS] -- Target service name: `${input:serviceName}` -- Backup destination: `${input:backupPath}` -- Inventory file path: `${input:inventoryFile}` - -# [WORKFLOW] - -## Gate 0 β€” select service for decommission -User must reply exactly: `DECOMMISSION: ` - -## Step 1 β€” dependency scan -Search all `docker-compose.yml` files for: -- Services with `depends_on: ` -- Networks shared with this service -- Volumes referenced by other services - -If dependencies found, STOP and list them. - -## Gate 1 β€” confirm no dependencies -User must reply exactly: `CONFIRM NO DEPS: ` - -## Step 2 β€” backup execution -1. Export service data: `docker compose cp :/data ./backup/` -2. Export configs: `docker compose config > backup/compose.yml` -3. Verify backup integrity - -## Gate 2 β€” confirm backup complete -User must reply exactly: `BACKUP VERIFIED: ` - -## Step 3 β€” removal -1. Stop service: `docker compose stop ` -2. Remove container: `docker compose rm ` -3. Remove from compose file -4. Remove from inventory - -## Step 4 β€” validation -1. `docker compose config` (syntax check) -2. `docker compose ps` (ensure service gone) -3. 
Check logs for errors in dependent services - -## Gate 3 β€” confirm clean removal -User must reply exactly: `REMOVAL CONFIRMED: ` - -## Step 5 β€” documentation update -Update: -- `.github/knowledge/inventory.md` (mark as decommissioned) -- `documentation/architecture/` (remove service from diagrams) -- `README.md` (if listed) -``` - ---- - -### πŸ… Rank 5: `doc-lint.prompt.md` - -**Current State:** Incomplete - Phase 3 report section is cut off. - -**Issues:** -- Output section truncated at line 50 (file continues to 61) -- Missing "Recommended Fixes" and "Low Priority" sections -- No auto-fix capability -- No integration with `style.markdown.md` validation - -**Impact Score:** 5/10 (Useful but incomplete) - -**Recommended Improvements:** - -1. **Complete the Report Structure:** - ```markdown - ### Phase 3: The Report - - #### πŸ”΄ Critical Errors (Must Fix) - - [Line 42] Missing language tag in code block - - [Line 105] Broken internal link: `./missing-file.md` - - #### 🟑 Recommended Improvements - - [Line 12] Use Sentence Case for heading - - [Line 67] Replace "e.g." with "for example" - - #### πŸ”΅ Low Priority / Style - - [Line 89] Consider adding more whitespace between sections - ``` - -2. **Add Auto-Fix Mode:** - ```markdown - ## Phase 4: Auto-Fix (Optional) - - If user replies exactly: `AUTO-FIX: ` - - Then apply these corrections: - - Add language tags to code blocks - - Convert headers to Sentence Case - - Remove trailing whitespace - - Fix relative links - ``` - -3. **Add Validation:** - ```markdown - ## Phase 5: Validation - - After fixes: - 1. Re-run lint - 2. Confirm 0 Critical Errors - 3. Generate pass/fail badge for README - ``` - ---- - -## 5. 
Implementation Roadmap - -### Phase 1: Immediate Cleanup (Week 1) -- [ ] Archive OLD.* prompts to `.github/prompts/archive/` -- [ ] Move draft prompts to `.github/prompts/drafts/` -- [ ] Converge `md2html` into single parameterized prompt -- [ ] Update `README.md` with accurate inventory - -### Phase 2: High-Impact Improvements (Weeks 2-3) -- [ ] Enhance `reviewDockerCompose.prompt.md` (Rank 1) -- [ ] Expand `ansible-tutor.prompt.md` (Rank 2) -- [ ] Fix `session-status.prompt.md` drift logic (Rank 3) -- [ ] Complete `doc-lint.prompt.md` (Rank 5) - -### Phase 3: Service Prompt Convergence (Week 4) -- [ ] Create `service-workflow.meta.prompt.md` -- [ ] Refactor 6 service prompts to use meta-prompt -- [ ] Test all workflows with real use cases - -### Phase 4: Draft Completion (Weeks 5-6) -- [ ] Complete `service-decommission.prompt.md` -- [ ] Complete `service-migration.prompt.md` -- [ ] Complete `security-hardening.prompt.md` -- [ ] Complete `performance-tuning.prompt.md` - ---- - -## 6. Metrics & Success Criteria - -### Baseline (Current State) -- **Total Prompts:** 26 -- **Production-Ready:** 8 (31%) -- **Code Duplication:** ~60% across service prompts -- **Deprecated Content:** 3 prompts - -### Target State (Post-Implementation) -- **Total Prompts:** 15-17 (-35%) -- **Production-Ready:** 15 (88%) -- **Code Duplication:** <20% -- **Deprecated Content:** 0 (archived) - -### Quality Gates -- ✅ All production prompts have gate structure -- ✅ All prompts have YAML frontmatter -- ✅ All prompts reference methodology (ReAct, CoT, etc.) -- ✅ All prompts include validation steps -- ✅ All prompts have rollback procedures - ---- - -## 7. Recommendations Summary - -### Critical Actions -1. **Converge service prompts** → Single meta-prompt pattern (saves ~800 lines of duplicate code) -2. **Fix `reviewDockerCompose.prompt.md`** → Add gates and integrate with vulnerability scanning -3.
**Expand `ansible-tutor.prompt.md`** → Add examples, safety checks, and validation - -### High Priority -4. **Archive deprecated prompts** → Clean up OLD.* files -5. **Complete `doc-lint.prompt.md`** → Finish truncated output section -6. **Standardize `session-status.prompt.md`** → Quantify drift detection - -### Medium Priority -7. **Converge `md2html` prompts** → Single parameterized version -8. **Complete draft prompts** → Follow `service-new.prompt.md` pattern - -### Low Priority -9. **Update README.md** → Reflect actual prompt inventory -10. **Add testing framework** → Validate prompts before deployment - ---- - -## 8. Conclusion - -The prompt repository has strong foundational patterns (gated workflows, RAG integration, safety guardrails) but suffers from: -- **Duplication:** 60% code overlap in service management prompts -- **Inconsistency:** 4 quality tiers, with 9 prompts still in the draft/incomplete tier -- **Maintenance Burden:** 26 prompts to update when patterns evolve - -**Recommended Strategy:** Phased convergence using meta-prompt architecture, starting with service management workflows (highest ROI). This reduces maintenance burden while preserving flexibility for specialized workflows.
- -**Estimated Effort:** -- Phase 1 (Cleanup): 2-4 hours -- Phase 2 (High-Impact): 8-12 hours -- Phase 3 (Convergence): 16-20 hours -- Phase 4 (Draft Completion): 12-16 hours -- **Total:** 38-52 hours over 6 weeks - ---- - -**Report Generated:** 2026-01-09 -**Methodology:** Static analysis + pattern detection + quality scoring -**Scope:** 26 prompt files in `.github/prompts/` -**Next Review:** 2026-02-09 (post-Phase 2 completion) diff --git a/ansible/archive/documentation/standards/ansible-quality-gates.md b/ansible/archive/documentation/standards/ansible-quality-gates.md deleted file mode 100644 index 449ddee..0000000 --- a/ansible/archive/documentation/standards/ansible-quality-gates.md +++ /dev/null @@ -1,240 +0,0 @@ -# Ansible quality gates - -This document defines the quality standards, review checklist, and validation workflow for all Ansible code in this repository. - -## Philosophy - -Quality gates progress through three enforcement tiers: - -- **Tier 1 (Advisory):** Visible via lint warnings; not blocking. Baseline cleanup phase. -- **Tier 2 (Mandatory, current):** Must pass for swarm-impacting changes. CI enforces. -- **Tier 3 (Fully blocking):** All rules enforced on every commit. Target: Phase 3 roadmap. - -**Idempotency controls are Tier 2 (mandatory now) for all stack-impacting changes.** -This means: `changed_when`, manager-state assertions, secret preflight asserts, -bind-mount path asserts, and validate-only mode support are required, not advisory.
- -## Linting - -### Configuration - -The repository includes [.ansible-lint](../../.ansible-lint) configuration that enforces: - -* **Moderate profile**: Balanced between permissive and strict -* **Advisory rules**: No blocking on known patterns (e.g., raw commands in bootstrap playbooks) -* **Warnings**: Experimental syntax and risky permissions are flagged but not blocked - -### Running lint checks - -```bash -# Lint all playbooks and roles -cd /home/chester/homelab/ansible -ansible-lint - -# Lint specific playbook -ansible-lint playbooks/onboarding/generic_host.yml - -# Lint entire role -ansible-lint roles/monitoring_stack/ -``` - -### Installing ansible-lint - -```bash -# On control node (Ubuntu/Debian) -sudo apt-get update -sudo apt-get install -y python3-pip -pip3 install ansible-lint - -# Verify installation -ansible-lint --version -``` - -## Quality checklist - -Use this checklist when creating or reviewing playbooks and roles: - -### Security - -* [ ] **No SSH bypasses**: `StrictHostKeyChecking=no` is forbidden -* [ ] **Host key checking enabled**: `ansible.cfg` must have `host_key_checking = True` -* [ ] **Secrets vaulted**: No plaintext passwords in defaults, vars, or playbooks -* [ ] **Secrets validated**: Roles requiring secrets include `assert` tasks to fail fast -* [ ] **File permissions explicit**: All `file`, `copy`, `template` tasks specify `mode` -* [ ] **No root by default**: Use `become: true` only when necessary - -### Idempotency - -* [x] **Changed semantics**: All `command`/`shell` tasks include `changed_when` (**mandatory**) -* [x] **Error handling**: All `command`/`shell` tasks include `failed_when` or `ignore_errors` (**mandatory**) -* [x] **Check mode safe**: Playbooks can run with `--check` without errors (**mandatory**) -* [x] **Replay safe**: Running twice produces no changes on second run (**mandatory**; PR evidence required) -* [x] **Manager assertion**: Swarm manager checks use exact
equality (`== 'active|true'`), not substring search (**mandatory**) -* [x] **Absent idempotency**: Stack removal checks existence first; no false `changed` when already absent (**mandatory**) -* [x] **Validate-only mode**: All stack deploy playbooks support `stack_validate_only=true` (**mandatory**) - -### Modularity - -* [ ] **Roles over monoliths**: Multi-task logic belongs in roles, not massive playbooks -* [ ] **Builtin modules first**: Prefer `ansible.builtin.*` over `command`/`shell`/`raw` -* [ ] **Bootstrap exception**: `raw` commands are acceptable only for pre-Python tasks -* [ ] **Variables separated**: Environment-specific values live in `group_vars`, not role defaults - -### Maintainability - -* [ ] **Task names descriptive**: Each task has a clear, action-oriented name -* [ ] **Tags applied**: Logical grouping with tags (e.g., `setup`, `security`, `monitoring`) -* [ ] **Documentation inline**: Complex logic includes comments explaining "why" -* [ ] **Handlers for services**: Service restarts use handlers, not inline tasks - -## Mandatory pre-deploy gate (effective now; blocking for all stack changes) - -> [!IMPORTANT] -> All steps below MUST pass before merging any pull request that touches -> `ansible/templates/stacks/`, `ansible/playbooks/docker/deploy_*.yml`, -> or `ansible/roles/swarm_stack_deploy/`. -> The Gitea CI workflow (`.gitea/workflows/stack-idempotency.yml`) runs -> stages 1–3 automatically on every PR. The two-run idempotency proof -> (step 7 below) must be performed manually and included as PR evidence.
- -For any swarm-impacting change, all checks below must pass before deployment: - -```bash -cd /home/chester/homelab/ansible - -# 1) Inventory parse gate -ansible-inventory -i inventory/hosts.ini --graph - -# 2) Connectivity gate -ansible -i inventory/hosts.ini swarm_hosts -m ping - -# 3) Swarm control-plane gate -ansible -i inventory/hosts.ini swarm_managers -m shell -a "docker info 2>/dev/null | grep -E 'Swarm:|Is Manager:'" - -# 4) Playbook syntax gate -ansible-playbook -i inventory/hosts.ini playbooks/your-playbook.yml --syntax-check - -# 5) Control node sanity gate -ansible-playbook -i inventory/hosts.ini playbooks/preflight/validate_control_node.yml - -# 6) Validate-only preflight (no Swarm mutations — mandatory for stack changes) -ansible-playbook -i inventory/hosts.ini playbooks/docker/deploy_.yml \ - -e "stack_validate_only=true" \ - --vault-password-file .vault_pass - -# 7) TWO-RUN IDEMPOTENCY PROOF (required for stack PRs — attach output as evidence) -# Run 1: apply desired state -ansible-playbook -i inventory/hosts.ini playbooks/docker/deploy_.yml \ - --vault-password-file .vault_pass \ - 2>&1 | tee /tmp/run1.log - -# Run 2: replay — MUST report changed=0 for stack tasks -ansible-playbook -i inventory/hosts.ini playbooks/docker/deploy_.yml \ - --vault-password-file .vault_pass \ - 2>&1 | tee /tmp/run2.log - -# Verify: second run must show changed=0 for deploy/reconcile tasks -grep -E 'changed=[^0]' /tmp/run2.log && echo 'IDEMPOTENCY FAIL' || echo 'IDEMPOTENCY PASS' -``` - -## PR evidence pack (required for stack-impacting changes) - -For any PR that modifies a stack template, deploy playbook, or the -`swarm_stack_deploy` role, attach the following to the PR description: - -``` -### Idempotency evidence - -**Stack:** -**Date:** YYYY-MM-DD -**Operator:** @username - -**Run 1 summary:** -``` -PLAY RECAP *** -swarm-manager-1 : ok=N changed=N ... 
-``` - -**Run 2 summary (must show changed=0 for stack tasks):** -``` -PLAY RECAP *** -swarm-manager-1 : ok=N changed=0 ... -``` - -**Validate-only passed:** yes/no -**Lint passed:** yes/no (CI enforced) -**Syntax check passed:** yes/no (CI enforced) -``` - -> [!IMPORTANT] -> A PR that cannot demonstrate changed=0 on the second run MUST NOT be merged. - - - -Before committing changes, always run syntax checks: - -```bash -cd /home/chester/homelab/ansible - -# Check specific playbook -ansible-playbook -i inventory/hosts.ini playbooks/your-playbook.yml --syntax-check - -# Preflight validation (control node sanity) -ansible-playbook -i inventory/hosts.ini playbooks/preflight/validate_control_node.yml -``` - -## Idempotency testing - -High-risk playbooks (those modifying system state) should be tested for idempotency: - -```bash -# Run playbook twice; second run should report "changed=0" -ansible-playbook -i inventory/hosts.ini playbooks/your-playbook.yml -ansible-playbook -i inventory/hosts.ini playbooks/your-playbook.yml -``` - -## Review process - -### Pre-commit (developer) - -1. Run inventory parse gate and connectivity gate -2. Run syntax check on modified playbooks - 3. Run ansible-lint on modified playbooks/roles (**Tier 2: mandatory for stack files**) - 4. For stack changes, run validate-only preflight - 5. For stack changes, run idempotency proof (two-run) and collect evidence - 6. Ensure required secrets are provided via vault (no plaintext defaults) - -### Pre-merge (reviewer) - - 1. Verify security checklist items are addressed - 2. Spot-check modularity (no 500+ line playbooks) - 3. Confirm environment-specific values are in inventory, not defaults - 4. Confirm no root-level duplicate Ansible directories were introduced - 5. **For stack changes: verify PR evidence pack is attached and shows changed=0 on second run** - 6. 
For critical changes (security, networking), require idempotency proof - -* **Weekly:** Triage Critical/High findings from drift reports -* **Biweekly:** Run preflight validation suite -* **Monthly:** Generate fresh standards-drift audit and review trends - -## Roadmap - -As baseline quality improves, the repository will: - -1. **Phase 1 (current):** Mandatory idempotency gate for stack changes. Lint advisory for - non-stack playbooks. Gitea CI blocks stack PRs on lint + syntax + preflight failures. - `no-changed-when` promoted from skip to warn (visible everywhere). -2. **Phase 2 (3 months):** Mandatory lint for all new/modified playbooks. - `no-changed-when` moved to blocking; bootstrap exceptions suppressed inline with - `# noqa: no-changed-when` on specific tasks. -3. **Phase 3 (6 months):** Full baseline coverage, stricter profile. All remaining - idempotency violations resolved. Two-run check automated in CI for eligible stacks. -4. **Phase 4 (12 months):** Fully blocking CI on every commit. Molecule/integration - tests for multi-node Swarm scenarios. - -## References - -* [Ansible Best Practices](https://docs.ansible.com/ansible/latest/tips_tricks/ansible_tips_tricks.html) -* [ansible-lint documentation](https://ansible-lint.readthedocs.io/) -* [environment-constraints.md](./environment-constraints.md) — Infrastructure-specific rules -* [naming-conventions.md](./naming-conventions.md) — File and variable naming standards diff --git a/ansible/archive/documentation/standards/environment-constraints.md b/ansible/archive/documentation/standards/environment-constraints.md deleted file mode 100644 index 8be63a0..0000000 --- a/ansible/archive/documentation/standards/environment-constraints.md +++ /dev/null @@ -1,151 +0,0 @@ -# Environment constraints - -**Date:** 2026-01-10 -**Status:** Living document -**Author:** Chester + FrankGPT - -## Purpose - -This document defines the hardware, software, and network constraints of the homelab environment. 
All playbooks and roles must respect these constraints. - ---- - -## Network topology - -> [!IMPORTANT] -> Current operational state is still a flat network on `10.0.0.0/24`. -> VLAN segmentation and target zone allocations in this document are migration targets, -> not fully applied runtime state. - -| Parameter | Value | -| :--- | :--- | -| Subnet | `10.0.0.0/24` | -| Gateway | `10.0.0.2` | -| Primary DNS | `10.0.0.2` | -| Secondary DNS | `8.8.8.8` | -| Domain | `local` (optional) | - -### IP allocation scheme - -| Range | Purpose | -| :--- | :--- | -| `10.0.0.1` | Reserved | -| `10.0.0.2` | Gateway / Primary DNS | -| `10.0.0.3 - 10.0.0.199` | DHCP / General devices | -| `10.0.0.200 - 10.0.0.209` | Proxmox hosts (physical) | -| `10.0.0.210 - 10.0.0.219` | Swarm managers (VMs) | -| `10.0.0.220 - 10.0.0.229` | Swarm workers (VMs) / legacy AI nodes during migration | -| `10.0.0.230 - 10.0.0.239` | AI workstations | -| `10.0.0.240 - 10.0.0.248` | Reserved / Future | -| `10.0.0.249 - 10.0.0.250` | NAS devices | -| `10.0.0.251 - 10.0.0.254` | Docker hosts / Misc | - ---- - -## Host categories - -### Proxmox cluster (physical) - -| Hostname | IP | Hardware | Notes | -| :--- | :---: | :--- | :--- | -| `pve01` | `10.0.0.201` | Lenovo SFF, 16 GB RAM, 512 GB NVMe | First node, 2× NICs | -| `pve02` | `10.0.0.202` | (future) | | -| `pve03` | `10.0.0.203` | (future) | | -| `pve04` | `10.0.0.204` | (future) | | -| `pve05` | `10.0.0.205` | (future) | | - -**Constraints:** -- Proxmox VE 8.x or 9.x -- `ansible_user=root` for provisioning -- Python 3 available at `/usr/bin/python3` - -### Swarm nodes (VMs on Proxmox) - -| Role | Hostname pattern | IP range | Specs | -| :--- | :--- | :--- | :--- | -| Manager | `swarm-manager-X` | `.211 - .215` | 4 GB RAM, 2 vCPU, 32 GB disk | -| Worker | `swarm-worker-X` | `.221 - .225` | 4 GB RAM, 2 vCPU, 32 GB disk | - -**Constraints:** -- Ubuntu 24.04 LTS (Noble) -- Docker CE installed via official repo -- `ansible_user=chester` - -### AI 
workstations (physical) - -| Hostname | IP | Hardware | Notes | -| :--- | :---: | :--- | :--- | -| `ai-lenovo` | `10.0.0.220` | Laptop, 12 GB GPU | Ubuntu Server | - -**Constraints:** -- Ubuntu Server (not Desktop) -- GPU drivers managed separately -- `ansible_user=chester` - -### Storage / NAS (appliances) - -| Hostname | IP | Product | Notes | -| :--- | :---: | :--- | :--- | -| `synology` | `10.0.0.249` | Synology NAS | Proprietary Linux, limited shell | -| `terramaster` | `10.0.0.250` | TerraMaster NAS | Proprietary Linux, limited shell | - -**Constraints:** -- **Caution required** — proprietary OS, not standard Ubuntu -- Use `ansible_scp_if_ssh=True` for Synology -- Avoid destructive commands; test in check mode first -- Limited Python support; prefer `raw` module when needed - -### Controller (watchtower) - -| Hostname | IP | Hardware | Notes | -| :--- | :---: | :--- | :--- | -| `localhost` | N/A | Raspberry Pi 5 | Ansible controller | - -**Constraints:** -- `ansible_connection=local` -- Runs all playbooks from this host -- ARM64 architecture (consider when building containers) - ---- - -## Software standards - -| Component | Version | Notes | -| :--- | :--- | :--- | -| Ansible | 2.15+ | Core automation | -| Python | 3.10+ | Required on all managed hosts | -| Docker CE | Latest stable | Swarm mode | -| Proxmox VE | 8.x or 9.x | Hypervisor | -| Ubuntu | 24.04 LTS | Guest OS for VMs | - ---- - -## Firewall / ports - -| Port | Protocol | Purpose | Required on | -| :---: | :---: | :--- | :--- | -| 22 | TCP | SSH | All hosts | -| 8006 | TCP | Proxmox GUI | Proxmox hosts | -| 2377 | TCP | Swarm cluster mgmt | Swarm nodes | -| 7946 | TCP/UDP | Swarm node comm | Swarm nodes | -| 4789 | UDP | Swarm overlay network | Swarm nodes | - ---- - -## Documentation mandate - -> [!IMPORTANT] -> **FrankGPT core principle:** Documentation is not optional. 
-> -> - Every decision must be recorded in `documentation/standards/` -> - Every playbook must have a header comment explaining usage -> - Every variable must be documented in defaults or group_vars -> - When in doubt, write it down - ---- - -## Change log - -| Date | Change | Author | -| :--- | :--- | :--- | -| 2026-01-10 | Initial creation | Chester + FrankGPT | diff --git a/ansible/archive/documentation/standards/naming-conventions.md b/ansible/archive/documentation/standards/naming-conventions.md deleted file mode 100644 index 133fc66..0000000 --- a/ansible/archive/documentation/standards/naming-conventions.md +++ /dev/null @@ -1,178 +0,0 @@ -# Naming conventions - -**Date:** 2026-01-10 -**Status:** Approved -**Author:** Chester + FrankGPT - -## Purpose - -Consistent naming reduces cognitive load, prevents errors, and makes the codebase navigable for future contributors (including future-you). - ---- - -## General principles - -1. **Be descriptive:** Names should explain *what* something is or *what* it does. -2. **Be consistent:** Once you pick a pattern, stick to it everywhere. -3. **Avoid abbreviations:** Write `network` not `net`, `manager` not `mgr` — unless the abbreviation is industry-standard (e.g., `vm`, `ip`, `ssh`). -4. **Use English:** All identifiers, comments, and documentation in English. 
- ---- -## Files and folders - -| Element | Convention | Example | -| :--- | :--- | :--- | -| Folders | lowercase, singular noun | `docker/`, `proxmox/`, `onboarding/` | -| Playbooks | `snake_case.yml` | `provision_swarm_vms.yml` | -| Roles | `snake_case` | `proxmox_post_install` | -| Templates | `filename.ext.j2` | `docker-compose.yml.j2` | -| Variable files | `snake_case.yml` | `swarm_defaults.yml` | - -### Playbook naming pattern - -Use **verb + object** format: - -| Verb | Use when | Example | -| :--- | :--- | :--- | -| `provision_` | Creating infrastructure | `provision_swarm_vms.yml` | -| `configure_` | Modifying settings | `configure_nas.yml` | -| `deploy_` | Pushing applications | `deploy_portainer.yml` | -| `init_` | First-time setup | `init_cluster.yml` | -| `update_` | Applying updates | `update_containers.yml` | -| `validate_` | Checking correctness | `validate_karakeep.yml` | -| `test_` | Running tests | `test_ollama.yml` | -| `enforce_` | Ensuring compliance | `enforce_access.yml` | -| `remove_` | Deleting resources | `remove_old_images.yml` | - -**Exceptions:** Master/orchestrator playbooks may be named after their target scope: -- `proxmox_host.yml` — orchestrates full PVE onboarding -- `ai_workstation.yml` — orchestrates AI host setup - ---- - -## Inventory - -| Element | Convention | Example | -| :--- | :--- | :--- | -| Group names | `snake_case` | `proxmox_cluster`, `swarm_managers` | -| Hostnames | `kebab-case` | `pve-01`, `swarm-manager-1` | -| Child groups | `parent:children` syntax | `ubuntu_lab:children` | - -### Hostname pattern - -``` -- -``` - -| Role | Pattern | Examples | -| :--- | :--- | :--- | -| Proxmox hosts | `pve-0X` | `pve-01`, `pve-02` | -| Swarm managers | `swarm-manager-X` | `swarm-manager-1` | -| Swarm workers | `swarm-worker-X` | `swarm-worker-1` | -| AI workstations | `ai-` | `ai-lenovo`, `ai-surface1` | -| Docker hosts | `` or `docker-0X` | `waldorf`, `docker-01` | -| Storage | `` | `synology`, `terramaster` | - ---- - 
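The file and inventory conventions above (snake_case playbooks, kebab-case hostnames) can be checked mechanically. A minimal sketch, assuming simple regular expressions are an acceptable reading of those conventions; the function names and patterns are illustrative and do not come from the repository:

```bash
#!/usr/bin/env bash
# Sketch only: function names and regexes are illustrative assumptions.

# Playbooks: snake_case with a .yml extension, e.g. provision_swarm_vms.yml
is_snake_case_playbook() {
  printf '%s\n' "$1" | grep -Eq '^[a-z][a-z0-9]*(_[a-z0-9]+)*\.yml$'
}

# Hostnames: kebab-case, e.g. pve-01 or swarm-manager-1.
# Single-word names such as synology also pass.
is_kebab_case_hostname() {
  printf '%s\n' "$1" | grep -Eq '^[a-z][a-z0-9]*(-[a-z0-9]+)*$'
}

# Demonstration on names taken from the tables above.
for name in provision_swarm_vms.yml deploy-portainer.yml; do
  is_snake_case_playbook "$name" && echo "$name: ok" || echo "$name: rename"
done
for host in swarm-manager-1 swarm_manager_1; do
  is_kebab_case_hostname "$host" && echo "$host: ok" || echo "$host: rename"
done
```

Checks like these only cover character shape. They do not verify the verb prefix of a playbook or the role part of a hostname, which still need review by a person.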
-## Variables - -| Element | Convention | Example | -| :--- | :--- | :--- | -| All variables | `snake_case` | `vm_disk_size` | -| Role defaults | Prefix with role name | `proxmox_post_install_enabled` | -| Boolean vars | Use positive names | `enable_ha` (not `disable_ha`) | -| List vars | Plural nouns | `required_packages`, `allowed_users` | -| Dict vars | Singular noun | `vm_config`, `network_settings` | - -### Variable prefixes by scope - -| Scope | Prefix | Example | -| :--- | :--- | :--- | -| Role-specific | `_` | `proxmox_post_install_enabled` | -| Playbook-local | `_` (single underscore) | `_temp_file` | -| Global/shared | none | `ansible_user`, `ssh_key_path` | - -### Reserved variable names - -Never override these Ansible built-ins: -- `inventory_hostname`, `ansible_host`, `ansible_user` -- `ansible_become`, `ansible_become_pass` -- `hostvars`, `groups`, `group_names` - ---- - -## Tasks and handlers - -| Element | Convention | Example | -| :--- | :--- | :--- | -| Task names | Sentence case, descriptive | `Install required packages` | -| Handler names | `Restart ` or `Reload ` | `Restart docker` | -| Block names | ` ` | `Configure SSH access` | -| Tags | `snake_case`, short | `install`, `configure`, `test` | - -### Task naming rules - -1. **Start with a verb:** `Install`, `Configure`, `Create`, `Remove`, `Ensure`, `Check` -2. **Be specific:** `Install Docker CE` not `Install Docker` -3. **No trailing punctuation:** `Install packages` not `Install packages.` -4. 
**Use present tense:** `Create user` not `Created user` - ---- - -## Tags - -Use tags to allow selective execution: - -| Tag | Purpose | Example usage | -| :--- | :--- | :--- | -| `install` | Package installation | `--tags install` | -| `configure` | Configuration changes | `--tags configure` | -| `test` | Validation/testing | `--tags test` | -| `cleanup` | Removal/pruning | `--tags cleanup` | -| `never` | Skip unless explicit | `--tags never,dangerous_task` | - ---- - -## Secrets and sensitive data - -| Element | Convention | Example | -| :--- | :--- | :--- | -| Vault files | `vault_.yml` | `vault_production.yml` | -| Secret vars | Suffix with `_secret` or `_pass` | `db_password`, `api_key_secret` | -| Encrypted strings | Use `!vault` tag | `password: !vault |...` | - ---- - -## Git branches (if applicable) - -| Branch | Purpose | -| :--- | :--- | -| `main` | Production-ready playbooks | -| `develop` | Integration branch | -| `feature/` | New features | -| `fix/` | Bug fixes | -| `docs/` | Documentation updates | - ---- - -## Quick reference card - -``` -Files: snake_case.yml -Folders: lowercase/ -Roles: snake_case -Hostnames: kebab-case -Groups: snake_case -Variables: snake_case -Tasks: Sentence case, verb first -Tags: snake_case -``` - ---- - -## References - -- [Ansible Best Practices — Variable Naming](https://docs.ansible.com/ansible/latest/tips_tricks/ansible_tips_tricks.html) -- [Ansible Lint — Naming Rules](https://ansible.readthedocs.io/projects/lint/rules/name/) -- [Google Shell Style Guide](https://google.github.io/styleguide/shellguide.html) — for script naming inspiration diff --git a/ansible/archive/documentation/standards/vm-vs-lxc-decision.md b/ansible/archive/documentation/standards/vm-vs-lxc-decision.md deleted file mode 100644 index 9c204ee..0000000 --- a/ansible/archive/documentation/standards/vm-vs-lxc-decision.md +++ /dev/null @@ -1,51 +0,0 @@ -# Decision: VM vs LXC for Docker Swarm nodes - -**Date:** 2026-01-10 -**Status:** Approved 
-**Author:** Chester + FrankGPT - -## Context - -We need to run Docker Swarm manager and worker nodes on Proxmox VE hosts. Two options exist: - -1. **QEMU/KVM Virtual Machines (VMs)** -2. **LXC Containers** - -## Decision - -**Use VMs for all Docker Swarm nodes.** - -## Rationale - -| Factor | VM | LXC | -| :--- | :--- | :--- | -| Docker support | Officially supported | Unsupported (requires hacks) | -| Stability | High | Medium (kernel updates can break) | -| Isolation | Full kernel isolation | Shared kernel | -| Resource overhead | Higher (~1-2 GB RAM baseline) | Lower (~256 MB baseline) | -| Maintenance | Standard Ubuntu updates | AppArmor/seccomp tuning required | - -**Trade-off accepted:** We accept the higher resource overhead of VMs in exchange for stability and official Docker support. - -## Specifications - -| Parameter | Value | -| :--- | :--- | -| Base image | Ubuntu 24.04 LTS (Noble) cloud-init | -| Disk | 32 GB per VM | -| RAM | 4 GB per VM | -| vCPU | 2 per VM | -| Network bridge | `vmbr0` (bridged to LAN) | -| Storage pool | `local-lvm` | - -## Capacity planning (per physical host) - -- Physical NVMe: 512 GB -- Available in `local-lvm`: ~357 GB -- Initial allocation: 2 VMs × 32 GB = 64 GB -- Remaining: ~293 GB (room for 4+ additional VMs) - -## References - -- [community-scripts/ProxmoxVE docker-vm.sh](https://github.com/community-scripts/ProxmoxVE) — reference implementation -- Docker documentation on supported platforms diff --git a/ansible/archive/get-docker.sh b/ansible/archive/get-docker.sh deleted file mode 100644 index 9a7bddb..0000000 --- a/ansible/archive/get-docker.sh +++ /dev/null @@ -1,764 +0,0 @@ -#!/bin/sh -set -e -# Docker Engine for Linux installation script. -# -# This script is intended as a convenient way to configure docker's package -# repositories and to install Docker Engine, This script is not recommended -# for production environments. 
Before running this script, make yourself familiar -# with potential risks and limitations, and refer to the installation manual -# at https://docs.docker.com/engine/install/ for alternative installation methods. -# -# The script: -# -# - Requires `root` or `sudo` privileges to run. -# - Attempts to detect your Linux distribution and version and configure your -# package management system for you. -# - Doesn't allow you to customize most installation parameters. -# - Installs dependencies and recommendations without asking for confirmation. -# - Installs the latest stable release (by default) of Docker CLI, Docker Engine, -# Docker Buildx, Docker Compose, containerd, and runc. When using this script -# to provision a machine, this may result in unexpected major version upgrades -# of these packages. Always test upgrades in a test environment before -# deploying to your production systems. -# - Isn't designed to upgrade an existing Docker installation. When using the -# script to update an existing installation, dependencies may not be updated -# to the expected version, resulting in outdated versions. -# -# Source code is available at https://github.com/docker/docker-install/ -# -# Usage -# ============================================================================== -# -# To install the latest stable versions of Docker CLI, Docker Engine, and their -# dependencies: -# -# 1. download the script -# -# $ curl -fsSL https://get.docker.com -o install-docker.sh -# -# 2. verify the script's content -# -# $ cat install-docker.sh -# -# 3. run the script with --dry-run to verify the steps it executes -# -# $ sh install-docker.sh --dry-run -# -# 4. run the script either as root, or using sudo to perform the installation. 
-# -# $ sudo sh install-docker.sh -# -# Command-line options -# ============================================================================== -# -# --version -# Use the --version option to install a specific version, for example: -# -# $ sudo sh install-docker.sh --version 23.0 -# -# --channel -# -# Use the --channel option to install from an alternative installation channel. -# The following example installs the latest versions from the "test" channel, -# which includes pre-releases (alpha, beta, rc): -# -# $ sudo sh install-docker.sh --channel test -# -# Alternatively, use the script at https://test.docker.com, which uses the test -# channel as default. -# -# --mirror -# -# Use the --mirror option to install from a mirror supported by this script. -# Available mirrors are "Aliyun" (https://mirrors.aliyun.com/docker-ce), and -# "AzureChinaCloud" (https://mirror.azure.cn/docker-ce), for example: -# -# $ sudo sh install-docker.sh --mirror AzureChinaCloud -# -# --setup-repo -# -# Use the --setup-repo option to configure Docker's package repositories without -# installing Docker packages. This is useful when you want to add the repository -# but install packages separately: -# -# $ sudo sh install-docker.sh --setup-repo -# -# Automatic Service Start -# -# By default, this script automatically starts the Docker daemon and enables the docker -# service after installation if systemd is used as init. -# -# If you prefer to start the service manually, use the --no-autostart option: -# -# $ sudo sh install-docker.sh --no-autostart -# -# Note: Starting the service requires appropriate privileges to manage system services. 
-# -# ============================================================================== - - -# Git commit from https://github.com/docker/docker-install when -# the script was uploaded (Should only be modified by upload job): -SCRIPT_COMMIT_SHA="f381ee68b32e515bb4dc034b339266aff1fbc460" - -# strip "v" prefix if present -VERSION="${VERSION#v}" - -# The channel to install from: -# * stable -# * test -DEFAULT_CHANNEL_VALUE="stable" -if [ -z "$CHANNEL" ]; then - CHANNEL=$DEFAULT_CHANNEL_VALUE -fi - -DEFAULT_DOWNLOAD_URL="https://download.docker.com" -if [ -z "$DOWNLOAD_URL" ]; then - DOWNLOAD_URL=$DEFAULT_DOWNLOAD_URL -fi - -DEFAULT_REPO_FILE="docker-ce.repo" -if [ -z "$REPO_FILE" ]; then - REPO_FILE="$DEFAULT_REPO_FILE" - # Automatically default to a staging repo fora - # a staging download url (download-stage.docker.com) - case "$DOWNLOAD_URL" in - *-stage*) REPO_FILE="docker-ce-staging.repo";; - esac -fi - -mirror='' -DRY_RUN=${DRY_RUN:-} -REPO_ONLY=${REPO_ONLY:-0} -NO_AUTOSTART=${NO_AUTOSTART:-0} -while [ $# -gt 0 ]; do - case "$1" in - --channel) - CHANNEL="$2" - shift - ;; - --dry-run) - DRY_RUN=1 - ;; - --mirror) - mirror="$2" - shift - ;; - --version) - VERSION="${2#v}" - shift - ;; - --setup-repo) - REPO_ONLY=1 - shift - ;; - --no-autostart) - NO_AUTOSTART=1 - ;; - --*) - echo "Illegal option $1" - ;; - esac - shift $(( $# > 0 ? 1 : 0 )) -done - -case "$mirror" in - Aliyun) - DOWNLOAD_URL="https://mirrors.aliyun.com/docker-ce" - ;; - AzureChinaCloud) - DOWNLOAD_URL="https://mirror.azure.cn/docker-ce" - ;; - "") - ;; - *) - >&2 echo "unknown mirror '$mirror': use either 'Aliyun', or 'AzureChinaCloud'." - exit 1 - ;; -esac - -case "$CHANNEL" in - stable|test) - ;; - *) - >&2 echo "unknown CHANNEL '$CHANNEL': use either stable or test." 
- exit 1 - ;; -esac - -command_exists() { - command -v "$@" > /dev/null 2>&1 -} - -# version_gte checks if the version specified in $VERSION is at least the given -# SemVer (Maj.Minor[.Patch]), or CalVer (YY.MM) version.It returns 0 (success) -# if $VERSION is either unset (=latest) or newer or equal than the specified -# version, or returns 1 (fail) otherwise. -# -# examples: -# -# VERSION=23.0 -# version_gte 23.0 // 0 (success) -# version_gte 20.10 // 0 (success) -# version_gte 19.03 // 0 (success) -# version_gte 26.1 // 1 (fail) -version_gte() { - if [ -z "$VERSION" ]; then - return 0 - fi - version_compare "$VERSION" "$1" -} - -# version_compare compares two version strings (either SemVer (Major.Minor.Path), -# or CalVer (YY.MM) version strings. It returns 0 (success) if version A is newer -# or equal than version B, or 1 (fail) otherwise. Patch releases and pre-release -# (-alpha/-beta) are not taken into account -# -# examples: -# -# version_compare 23.0.0 20.10 // 0 (success) -# version_compare 23.0 20.10 // 0 (success) -# version_compare 20.10 19.03 // 0 (success) -# version_compare 20.10 20.10 // 0 (success) -# version_compare 19.03 20.10 // 1 (fail) -version_compare() ( - set +x - - yy_a="$(echo "$1" | cut -d'.' -f1)" - yy_b="$(echo "$2" | cut -d'.' -f1)" - if [ "$yy_a" -lt "$yy_b" ]; then - return 1 - fi - if [ "$yy_a" -gt "$yy_b" ]; then - return 0 - fi - mm_a="$(echo "$1" | cut -d'.' -f2)" - mm_b="$(echo "$2" | cut -d'.' 
-f2)" - - # trim leading zeros to accommodate CalVer - mm_a="${mm_a#0}" - mm_b="${mm_b#0}" - - if [ "${mm_a:-0}" -lt "${mm_b:-0}" ]; then - return 1 - fi - - return 0 -) - -is_dry_run() { - if [ -z "$DRY_RUN" ]; then - return 1 - else - return 0 - fi -} - -is_wsl() { - case "$(uname -r)" in - *microsoft* ) true ;; # WSL 2 - *Microsoft* ) true ;; # WSL 1 - * ) false;; - esac -} - -is_darwin() { - case "$(uname -s)" in - *darwin* ) true ;; - *Darwin* ) true ;; - * ) false;; - esac -} - -deprecation_notice() { - distro=$1 - distro_version=$2 - echo - printf "\033[91;1mDEPRECATION WARNING\033[0m\n" - printf " This Linux distribution (\033[1m%s %s\033[0m) reached end-of-life and is no longer supported by this script.\n" "$distro" "$distro_version" - echo " No updates or security fixes will be released for this distribution, and users are recommended" - echo " to upgrade to a currently maintained version of $distro." - echo - printf "Press \033[1mCtrl+C\033[0m now to abort this script, or wait for the installation to continue." - echo - sleep 10 -} - -get_distribution() { - lsb_dist="" - # Every system that we officially support has /etc/os-release - if [ -r /etc/os-release ]; then - lsb_dist="$(. /etc/os-release && echo "$ID")" - fi - # Returning an empty string here should be alright since the - # case statements don't act unless you provide an actual value - echo "$lsb_dist" -} - -start_docker_daemon() { - # Use systemctl if available (for systemd-based systems) - if command_exists systemctl; then - is_dry_run || >&2 echo "Using systemd to manage Docker service" - if ( - is_dry_run || set -x - $sh_c systemctl enable --now docker.service 2>/dev/null - ); then - is_dry_run || echo "INFO: Docker daemon enabled and started" >&2 - else - is_dry_run || echo "WARNING: unable to enable the docker service" >&2 - fi - else - # No service management available (container environment) - if ! 
is_dry_run; then - >&2 echo "Note: Running in a container environment without service management" - >&2 echo "Docker daemon cannot be started automatically in this environment" - >&2 echo "The Docker packages have been installed successfully" - fi - fi - >&2 echo -} - -echo_docker_as_nonroot() { - if is_dry_run; then - return - fi - if command_exists docker && [ -e /var/run/docker.sock ]; then - ( - set -x - $sh_c 'docker version' - ) || true - fi - - # intentionally mixed spaces and tabs here -- tabs are stripped by "<<-EOF", spaces are kept in the output - echo - echo "================================================================================" - echo - if version_gte "20.10"; then - echo "To run Docker as a non-privileged user, consider setting up the" - echo "Docker daemon in rootless mode for your user:" - echo - echo " dockerd-rootless-setuptool.sh install" - echo - echo "Visit https://docs.docker.com/go/rootless/ to learn about rootless mode." - echo - fi - echo - echo "To run the Docker daemon as a fully privileged service, but granting non-root" - echo "users access, refer to https://docs.docker.com/go/daemon-access/" - echo - echo "WARNING: Access to the remote API on a privileged Docker daemon is equivalent" - echo " to root access on the host. Refer to the 'Docker daemon attack surface'" - echo " documentation for details: https://docs.docker.com/go/attack-surface/" - echo - echo "================================================================================" - echo -} - -# Check if this is a forked Linux distro -check_forked() { - - # Check for lsb_release command existence, it usually exists in forked distros - if command_exists lsb_release; then - # Check if the `-u` option is supported - set +e - lsb_release -a -u > /dev/null 2>&1 - lsb_release_exit_code=$? 
- set -e - - # Check if the command has exited successfully, it means we're in a forked distro - if [ "$lsb_release_exit_code" = "0" ]; then - # Print info about current distro - cat <<-EOF - You're using '$lsb_dist' version '$dist_version'. - EOF - - # Get the upstream release info - lsb_dist=$(lsb_release -a -u 2>&1 | tr '[:upper:]' '[:lower:]' | grep -E 'id' | cut -d ':' -f 2 | tr -d '[:space:]') - dist_version=$(lsb_release -a -u 2>&1 | tr '[:upper:]' '[:lower:]' | grep -E 'codename' | cut -d ':' -f 2 | tr -d '[:space:]') - - # Print info about upstream distro - cat <<-EOF - Upstream release is '$lsb_dist' version '$dist_version'. - EOF - else - if [ -r /etc/debian_version ] && [ "$lsb_dist" != "ubuntu" ] && [ "$lsb_dist" != "raspbian" ]; then - if [ "$lsb_dist" = "osmc" ]; then - # OSMC runs Raspbian - lsb_dist=raspbian - else - # We're Debian and don't even know it! - lsb_dist=debian - fi - dist_version="$(sed 's/\/.*//' /etc/debian_version | sed 's/\..*//')" - case "$dist_version" in - 13|14|forky) - dist_version="trixie" - ;; - 12) - dist_version="bookworm" - ;; - 11) - dist_version="bullseye" - ;; - 10) - dist_version="buster" - ;; - 9) - dist_version="stretch" - ;; - 8) - dist_version="jessie" - ;; - esac - fi - fi - fi -} - -do_install() { - echo "# Executing docker install script, commit: $SCRIPT_COMMIT_SHA" - - if command_exists docker; then - cat >&2 <<-'EOF' - Warning: the "docker" command appears to already exist on this system. - - If you already have Docker installed, this script can cause trouble, which is - why we're displaying this warning and provide the opportunity to cancel the - installation. - - If you installed the current Docker package using this script and are using it - again to update Docker, you can ignore this message, but be aware that the - script resets any custom changes in the deb and rpm repo configuration - files to match the parameters passed to the script. - - You may press Ctrl+C now to abort this script. 
- EOF - ( set -x; sleep 20 ) - fi - - user="$(id -un 2>/dev/null || true)" - - sh_c='sh -c' - if [ "$user" != 'root' ]; then - if command_exists sudo; then - sh_c='sudo -E sh -c' - elif command_exists su; then - sh_c='su -c' - else - cat >&2 <<-'EOF' - Error: this installer needs the ability to run commands as root. - We are unable to find either "sudo" or "su" available to make this happen. - EOF - exit 1 - fi - fi - - if is_dry_run; then - sh_c="echo" - fi - - # perform some very rudimentary platform detection - lsb_dist=$( get_distribution ) - lsb_dist="$(echo "$lsb_dist" | tr '[:upper:]' '[:lower:]')" - - if is_wsl; then - echo - echo "WSL DETECTED: We recommend using Docker Desktop for Windows." - echo "Please get Docker Desktop from https://www.docker.com/products/docker-desktop/" - echo - cat >&2 <<-'EOF' - - You may press Ctrl+C now to abort this script. - EOF - ( set -x; sleep 20 ) - fi - - case "$lsb_dist" in - - ubuntu) - if command_exists lsb_release; then - dist_version="$(lsb_release --codename | cut -f2)" - fi - if [ -z "$dist_version" ] && [ -r /etc/lsb-release ]; then - dist_version="$(. /etc/lsb-release && echo "$DISTRIB_CODENAME")" - fi - ;; - - debian|raspbian) - dist_version="$(sed 's/\/.*//' /etc/debian_version | sed 's/\..*//')" - case "$dist_version" in - 13) - dist_version="trixie" - ;; - 12) - dist_version="bookworm" - ;; - 11) - dist_version="bullseye" - ;; - 10) - dist_version="buster" - ;; - 9) - dist_version="stretch" - ;; - 8) - dist_version="jessie" - ;; - esac - ;; - - centos|rhel) - if [ -z "$dist_version" ] && [ -r /etc/os-release ]; then - dist_version="$(. /etc/os-release && echo "$VERSION_ID")" - fi - ;; - - *) - if command_exists lsb_release; then - dist_version="$(lsb_release --release | cut -f2)" - fi - if [ -z "$dist_version" ] && [ -r /etc/os-release ]; then - dist_version="$(. 
/etc/os-release && echo "$VERSION_ID")" - fi - ;; - - esac - - # Check if this is a forked Linux distro - check_forked - - # Print deprecation warnings for distro versions that recently reached EOL, - # but may still be commonly used (especially LTS versions). - case "$lsb_dist.$dist_version" in - centos.8|centos.7|rhel.7) - deprecation_notice "$lsb_dist" "$dist_version" - ;; - debian.buster|debian.stretch|debian.jessie) - deprecation_notice "$lsb_dist" "$dist_version" - ;; - raspbian.buster|raspbian.stretch|raspbian.jessie) - deprecation_notice "$lsb_dist" "$dist_version" - ;; - ubuntu.focal|ubuntu.bionic|ubuntu.xenial|ubuntu.trusty) - deprecation_notice "$lsb_dist" "$dist_version" - ;; - ubuntu.oracular|ubuntu.mantic|ubuntu.lunar|ubuntu.kinetic|ubuntu.impish|ubuntu.hirsute|ubuntu.groovy|ubuntu.eoan|ubuntu.disco|ubuntu.cosmic) - deprecation_notice "$lsb_dist" "$dist_version" - ;; - fedora.*) - if [ "$dist_version" -lt 41 ]; then - deprecation_notice "$lsb_dist" "$dist_version" - fi - ;; - esac - - # Run setup for each distro accordingly - case "$lsb_dist" in - ubuntu|debian|raspbian) - pre_reqs="ca-certificates curl" - apt_repo="deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] $DOWNLOAD_URL/linux/$lsb_dist $dist_version $CHANNEL" - ( - if ! 
is_dry_run; then - set -x - fi - $sh_c 'apt-get -qq update >/dev/null' - $sh_c "DEBIAN_FRONTEND=noninteractive apt-get -y -qq install $pre_reqs >/dev/null" - $sh_c 'install -m 0755 -d /etc/apt/keyrings' - $sh_c "curl -fsSL \"$DOWNLOAD_URL/linux/$lsb_dist/gpg\" -o /etc/apt/keyrings/docker.asc" - $sh_c "chmod a+r /etc/apt/keyrings/docker.asc" - $sh_c "echo \"$apt_repo\" > /etc/apt/sources.list.d/docker.list" - $sh_c 'apt-get -qq update >/dev/null' - ) - - if [ "$REPO_ONLY" = "1" ]; then - exit 0 - fi - - pkg_version="" - if [ -n "$VERSION" ]; then - if is_dry_run; then - echo "# WARNING: VERSION pinning is not supported in DRY_RUN" - else - # Will work for incomplete versions IE (17.12), but may not actually grab the "latest" if in the test channel - pkg_pattern="$(echo "$VERSION" | sed 's/-ce-/~ce~.*/g' | sed 's/-/.*/g')" - search_command="apt-cache madison docker-ce | grep '$pkg_pattern' | head -1 | awk '{\$1=\$1};1' | cut -d' ' -f 3" - pkg_version="$($sh_c "$search_command")" - echo "INFO: Searching repository for VERSION '$VERSION'" - echo "INFO: $search_command" - if [ -z "$pkg_version" ]; then - echo - echo "ERROR: '$VERSION' not found amongst apt-cache madison results" - echo - exit 1 - fi - if version_gte "18.09"; then - search_command="apt-cache madison docker-ce-cli | grep '$pkg_pattern' | head -1 | awk '{\$1=\$1};1' | cut -d' ' -f 3" - echo "INFO: $search_command" - cli_pkg_version="=$($sh_c "$search_command")" - fi - pkg_version="=$pkg_version" - fi - fi - ( - pkgs="docker-ce${pkg_version%=}" - if version_gte "18.09"; then - # older versions didn't ship the cli and containerd as separate packages - pkgs="$pkgs docker-ce-cli${cli_pkg_version%=} containerd.io" - fi - if version_gte "20.10"; then - pkgs="$pkgs docker-compose-plugin docker-ce-rootless-extras$pkg_version" - fi - if version_gte "23.0"; then - pkgs="$pkgs docker-buildx-plugin" - fi - if version_gte "28.2"; then - pkgs="$pkgs docker-model-plugin" - fi - if ! 
is_dry_run; then - set -x - fi - $sh_c "DEBIAN_FRONTEND=noninteractive apt-get -y -qq install $pkgs >/dev/null" - ) - if [ "$NO_AUTOSTART" != "1" ]; then - start_docker_daemon - fi - echo_docker_as_nonroot - exit 0 - ;; - centos|fedora|rhel) - if [ "$(uname -m)" = "s390x" ]; then - echo "Effective v27.5, please consult RHEL distro statement for s390x support." - exit 1 - fi - repo_file_url="$DOWNLOAD_URL/linux/$lsb_dist/$REPO_FILE" - ( - if ! is_dry_run; then - set -x - fi - if command_exists dnf5; then - $sh_c "dnf -y -q --setopt=install_weak_deps=False install dnf-plugins-core" - $sh_c "dnf5 config-manager addrepo --overwrite --save-filename=docker-ce.repo --from-repofile='$repo_file_url'" - - if [ "$CHANNEL" != "stable" ]; then - $sh_c "dnf5 config-manager setopt \"docker-ce-*.enabled=0\"" - $sh_c "dnf5 config-manager setopt \"docker-ce-$CHANNEL.enabled=1\"" - fi - $sh_c "dnf makecache" - elif command_exists dnf; then - $sh_c "dnf -y -q --setopt=install_weak_deps=False install dnf-plugins-core" - $sh_c "rm -f /etc/yum.repos.d/docker-ce.repo /etc/yum.repos.d/docker-ce-staging.repo" - $sh_c "dnf config-manager --add-repo $repo_file_url" - - if [ "$CHANNEL" != "stable" ]; then - $sh_c "dnf config-manager --set-disabled \"docker-ce-*\"" - $sh_c "dnf config-manager --set-enabled \"docker-ce-$CHANNEL\"" - fi - $sh_c "dnf makecache" - else - $sh_c "yum -y -q install yum-utils" - $sh_c "rm -f /etc/yum.repos.d/docker-ce.repo /etc/yum.repos.d/docker-ce-staging.repo" - $sh_c "yum-config-manager --add-repo $repo_file_url" - - if [ "$CHANNEL" != "stable" ]; then - $sh_c "yum-config-manager --disable \"docker-ce-*\"" - $sh_c "yum-config-manager --enable \"docker-ce-$CHANNEL\"" - fi - $sh_c "yum makecache" - fi - ) - - if [ "$REPO_ONLY" = "1" ]; then - exit 0 - fi - - pkg_version="" - if command_exists dnf; then - pkg_manager="dnf" - pkg_manager_flags="-y -q --best" - else - pkg_manager="yum" - pkg_manager_flags="-y -q" - fi - if [ -n "$VERSION" ]; then - if is_dry_run; then - 
echo "# WARNING: VERSION pinning is not supported in DRY_RUN" - else - if [ "$lsb_dist" = "fedora" ]; then - pkg_suffix="fc$dist_version" - else - pkg_suffix="el" - fi - pkg_pattern="$(echo "$VERSION" | sed 's/-ce-/\\\\.ce.*/g' | sed 's/-/.*/g').*$pkg_suffix" - search_command="$pkg_manager list --showduplicates docker-ce | grep '$pkg_pattern' | tail -1 | awk '{print \$2}'" - pkg_version="$($sh_c "$search_command")" - echo "INFO: Searching repository for VERSION '$VERSION'" - echo "INFO: $search_command" - if [ -z "$pkg_version" ]; then - echo - echo "ERROR: '$VERSION' not found amongst $pkg_manager list results" - echo - exit 1 - fi - if version_gte "18.09"; then - # older versions don't support a cli package - search_command="$pkg_manager list --showduplicates docker-ce-cli | grep '$pkg_pattern' | tail -1 | awk '{print \$2}'" - cli_pkg_version="$($sh_c "$search_command" | cut -d':' -f 2)" - fi - # Cut out the epoch and prefix with a '-' - pkg_version="-$(echo "$pkg_version" | cut -d':' -f 2)" - fi - fi - ( - pkgs="docker-ce$pkg_version" - if version_gte "18.09"; then - # older versions didn't ship the cli and containerd as separate packages - if [ -n "$cli_pkg_version" ]; then - pkgs="$pkgs docker-ce-cli-$cli_pkg_version containerd.io" - else - pkgs="$pkgs docker-ce-cli containerd.io" - fi - fi - if version_gte "20.10"; then - pkgs="$pkgs docker-compose-plugin docker-ce-rootless-extras$pkg_version" - fi - if version_gte "23.0"; then - pkgs="$pkgs docker-buildx-plugin docker-model-plugin" - fi - if ! is_dry_run; then - set -x - fi - $sh_c "$pkg_manager $pkg_manager_flags install $pkgs" - ) - if [ "$NO_AUTOSTART" != "1" ]; then - start_docker_daemon - fi - echo_docker_as_nonroot - exit 0 - ;; - sles) - echo "Effective v27.5, please consult SLES distro statement for s390x support." 
- exit 1 - ;; - *) - if [ -z "$lsb_dist" ]; then - if is_darwin; then - echo - echo "ERROR: Unsupported operating system 'macOS'" - echo "Please get Docker Desktop from https://www.docker.com/products/docker-desktop" - echo - exit 1 - fi - fi - echo - echo "ERROR: Unsupported distribution '$lsb_dist'" - echo - exit 1 - ;; - esac - exit 1 -} - -# wrapped up in a function so that we have some protection against only getting -# half the file during "curl | sh" -do_install diff --git a/ansible/archive/group_vars/all.yml b/ansible/archive/group_vars/all.yml deleted file mode 100644 index c05fd22..0000000 --- a/ansible/archive/group_vars/all.yml +++ /dev/null @@ -1,262 +0,0 @@ -# Central YAML Source of Truth for Nathan's Lab (2026) -# Edit and commit this file; Ansible playbooks should read this as canonical. -lab_name: "nathan-lab-2026" -canonical_source: "ansible/group_vars/all.yml" - -# The standard operational user created on every managed host. -# Override per-host in host_vars/ if a node uses a different login. -lab_ansible_user: "chester" - -# Omada Open API credentials are sourced from the encrypted vault file. 
-omada_client_id: "{{ vault_omada_client_id }}" -omada_client_secret: "{{ vault_omada_client_secret }}" -omada_id: "{{ vault_omada_id }}" -omada_base_url: "{{ vault_omada_base_url }}" - -networks: - main: - vlan: 1 - cidr: "10.0.0.0/24" - dhcp_pool: "10.0.0.100-10.0.0.240" - gateway: "10.0.0.1" - purpose: "Family / wired / main SSID" - - infra: - vlan: 10 - cidr: "10.0.10.0/24" - reserved: "10.0.10.2-10.0.10.50" - purpose: "Management / Proxmox / NAS / Heimdall mgmt" - - iot: - vlan: 50 - cidr: "10.0.50.0/24" - dhcp_pool: "10.0.50.100-10.0.50.199" - purpose: "IoT devices (Omada)" - - guest: - vlan: 30 - cidr: "10.0.30.0/24" - dhcp_pool: "10.0.30.100-10.0.30.200" - purpose: "Guest WiFi (isolated)" - - compute: - vlan: 200 - cidr: "10.0.200.0/24" - purpose: "Swarm / AI grid / ephemeral compute" - -lab_hosts: - er7212pc: - role: gateway - current_ip: "10.0.0.2" - desired_ip: "10.0.0.2" - note: "DHCP + Omada controller" - - pve01: - physical_backing_host: "pve04" - role: proxmox - current_ip: "10.0.0.201" - desired_ip: "10.0.10.11" - - pve02: - role: proxmox - current_ip: "10.0.0.202" - desired_ip: "10.0.10.12" - - pve03: - role: proxmox - current_ip: "10.0.0.203" - desired_ip: "10.0.10.13" - - pve04: - replacement_status: "retired-identity-now-backing-pve01" - role: retired_physical_alias - current_ip: "10.0.0.204" - desired_ip: "10.0.10.14" - - swarm-manager-1: - current_ip: "10.0.0.211" - desired_ip: "10.0.200.11" - - swarm-manager-2: - current_ip: "10.0.0.212" - desired_ip: "10.0.200.12" - - swarm-manager-3: - current_ip: "10.0.0.213" - desired_ip: "10.0.200.13" - - statler: - role: standalone_vm - current_ip: "10.0.0.210" - desired_ip: "10.0.0.210" - hypervisor_host: "pve02" - note: "Standalone Ubuntu 24.04 VM planned on pve02 with 2 vCPU, 10 GB RAM, and 32 GB disk." 
- - swarm-worker-1: - current_ip: "10.0.0.221" - desired_ip: "10.0.200.21" - - swarm-worker-2: - current_ip: "10.0.0.222" - desired_ip: "10.0.200.22" - - swarm-worker-3: - current_ip: "10.0.0.223" - desired_ip: "10.0.200.23" - - ai-lenovo: - current_ip: "10.0.0.220" - desired_ip: "10.0.200.20" - onboarding_status: "tbd-needs-onboarding-like-heimdall" - ansible_managed: false - note: "Pending onboarding workflow before inclusion in active automation and monitoring groups." - - synology: - current_ip: "10.0.0.249" - desired_ip: "10.0.10.40" - - terramaster: - current_ip: "10.0.0.250" - desired_ip: "10.0.10.41" - - waldorf: - current_ip: "10.0.0.251" - desired_ip: "10.0.200.30" - lifecycle_status: "retired-shutdown" - ansible_managed: false - monitoring_enabled: false - note: "Retired host; excluded from active monitoring and deployment inventories." - - watchtower: - current_ip: "10.0.0.200" - desired_ip: "10.0.10.200" - - heimdall: - role: beelink - current_ip: null - desired_ip: - mgmt: "10.0.10.2" - lan: "10.0.0.50" - -# === MONITORING INFRASTRUCTURE === -# Environment-specific configuration for monitoring stack -monitoring: - stack_user: "chester" - heimdall_redis: "10.0.0.151:6379" - watchtower_ip: "10.0.0.200" - grafana_domain: "grafana.castaldifamily.com" - uptime_domain: "status.castaldifamily.com" - dozzle_domain: "logs.castaldifamily.com" - authentik_host: "https://sso.castaldifamily.com" - # grafana_admin_password: DEFINE IN VAULT - -# === EDGE ROUTING TOPOLOGY === -# Canonical ingress model: Traefik runs on a dedicated edge host outside Swarm. -# Swarm and standalone hosts publish routes through traefik-kop agents. 
-edge_routing: - ingress_mode: "external-traefik" - edge_host: - name: "heimdall" - ip: "10.0.0.151" - ssh_port: 22 - http_port: 80 - https_port: 443 - integration: - # Watchtower-hosted traefik-kop instance (publishes Watchtower container routes) - agent_image: "ghcr.io/jittering/traefik-kop:latest" - redis_addr: "10.0.0.151:6379" - bind_ip: "10.0.0.200" # Watchtower IP; correct for routes originating on Watchtower - swarm: - # Swarm-hosted traefik-kop instance (publishes Swarm service routes) - # bind_ip MUST be a Swarm node IP: the Swarm routing mesh makes published - # ports available on ALL nodes, so Traefik routes inbound requests here. - bind_ip: "10.0.0.212" # swarm-manager-2 (current Leader; was swarm-manager-1 before it went down) - proxy_network: "proxy-net" # Swarm overlay network; separate from heimdall's bridge of same name - stack_deploy_target: "swarm-manager-2" - migration_rules: - deploy_traefik_in_swarm: false - use_external_proxy_network: true - notes: - - "Services should attach to swarm overlay proxy-net for east-west traffic." - - "Ingress is terminated by external Traefik at 10.0.0.151 via traefik-kop updates." - -# Per-stack placement node overrides. -# Update when the deploy target node changes (e.g., after node replacement).
-gitea_placement_node: "swarm-manager-2" -authentik_placement_node: "swarm-manager-2" - -# === SERVICE SECRETS (set via: ansible-vault encrypt_string) === -vault_gitea_db_password: !vault | - $ANSIBLE_VAULT;1.1;AES256 - 34623365623337336535656164623637656633356661373162356438646637333932663765323134 - 6261626565646166353966393366666434356434333263330a333666393765646233303663363738 - 65616665393235323132623462373435373637363262363539626163373061643930393730346633 - 3232373866663034310a343661306634313766313765623439626339353635626232663662323365 - 6666 -vault_authentik_secret_key: !vault | - $ANSIBLE_VAULT;1.1;AES256 - 61373834613362356638303166376135613133616139613963333632613430636136623062373161 - 6335636331386565386139376234663362396361653463660a613834313263653039376363396264 - 62383166346563326630323734643462326438643436626565656633636234323835333033353130 - 3535306539626339320a323431666164353038323166633663656265613266366535623130323165 - 38353833393934393764376331333464663337616432623033303830393464303966643036656538 - 34396337363163663566383063396130616530633363636461343531636438303963653733343830 - 66636165656563653164383364643032373135666263316137623761656332316130313235623232 - 33623462343639366566 -vault_authentik_postgres_password: !vault | - $ANSIBLE_VAULT;1.1;AES256 - 37356530373764353038343038663662333535323436336663613239333234363036626462656130 - 3138313535353838306563663565663230646561313234390a313166623232383364623766383961 - 30363065373065353365616239663562333833313139636137616561616465656462613238323932 - 3630333538366430370a616263633263336436303662373530323161316534313737366633643535 - 30326636383131353265613463363431666536313966366364666564623637343737 -vlan_defaults: - dns_domain: "home.lab" - ntp_servers: - - "10.0.10.2" - -# Plex bootstrap claim token β€” used only on first server claim. 
-vault_plex_claim: !vault | - $ANSIBLE_VAULT;1.1;AES256 - 31373365323534353264373735363937623566646633653434613038396463303164396138306661 - 3130323134656463383835366130663632323561326265350a653162643064643563383738373637 - 36363135613735663037303036613637313431336139343430313963393930303532666366336365 - 3734386639393336310a323964386233346134616164656663393731376632643037313734323830 - 65366334356531623339643066373237306263323063383963363330346665316435 - -# Authentik outpost tokens for standalone arr services on statler. -vault_authentik_token_sonarr: !vault | - $ANSIBLE_VAULT;1.1;AES256 - 39303463306665356436626265653339663163613464366237663234376135306366303739343266 - 3762646230666263393330373833393037613165373337380a336663646161613534353232663761 - 65376666663063643066323831366265633337653630666235636234393130646361383032383032 - 3433393235633762390a376561303866373739613663333461643938353931626134336665383164 - 34346538376436313438313733393963303735646632323739313137626466356138636266396434 - 61363737636139386665616438646439366139303739646530316566373563306565623637363661 - 343938653662646132373565303836353030 -vault_authentik_token_radarr: !vault | - $ANSIBLE_VAULT;1.1;AES256 - 32363735353663623031356362323765616232326234333564323839626236653634626263313765 - 6335653537656531396431366662616163366166633462390a346363633364363866373732373939 - 61666261616266333465393837383337313565613539303732396530333833666563653139353238 - 6537383336613933370a333662323339396463353134363635383430353133646331376533303861 - 30303765373566353633643261376430363837386239363261396235333033636563366231323564 - 35643564663866653831663633333436653330363130656631356166363731356639643238656530 - 643062636137396333383438623534346636 -vault_authentik_token_sabnzbd: !vault | - $ANSIBLE_VAULT;1.1;AES256 - 30373635366337343236353866623234383665386461356637353534666461613466373463616531 - 3837646263643864636331343364663563666531333861660a626335393762353862663564656465 - 
61373430336336373062623563633832383261333035353432666265313435363132316561383130 - 3236643962313765630a386634313331643639363035623663616166313532623932643162633762 - 64353335393764653031633033323862643732326434613564363935336166386239613932653765 - 32323335306634326133613334386262316464613166373031376362653266653937303131653165 - 376436643431366561323866383231343362 -# Usage notes: -# - Treat this file as the single source of truth for IPs and VLANs. -# - Ansible playbooks should read `networks` and `lab_hosts` to render configs, -# update `inventory/hosts.ini`, and generate DHCP reservation templates. -# -# Discussion queue (2026-03-13): -# - Decide NAS + Ansible + Watchtower reporting model (agentless scrape, exporter sidecar, or API/blackbox only). -# - Decide Omada onboarding scope and what should be automated via Ansible versus documented/manual operations. \ No newline at end of file diff --git a/ansible/archive/group_vars/vault/.gitignore b/ansible/archive/group_vars/vault/.gitignore deleted file mode 100644 index f631e60..0000000 --- a/ansible/archive/group_vars/vault/.gitignore +++ /dev/null @@ -1,25 +0,0 @@ -# Vault encrypted variables directory -# -# This directory contains ENCRYPTED credentials and API keys using Ansible Vault. -# -# SECURITY POLICY: -# - βœ… DO commit: Encrypted .yml files (e.g., all.yml, production.yml) -# These are safe because they are encrypted and cannot be decrypted without the vault password. 
-# - ❌ DON'T commit: Plaintext passwords or unencrypted files -# Keep these patterns blocked in .gitignore -# - ❌ DON'T commit: .vault_pass, password files, or temporary backups - -# Ignore plaintext password/backup files -*.orig -*.bak -*.tmp -.vault_pass -password -vault_password -vault_password.txt - -# Ignore editor temporary files -*~ -*.swp -*.swo -.DS_Store diff --git a/ansible/archive/group_vars/vault/all.yml b/ansible/archive/group_vars/vault/all.yml deleted file mode 100644 index e0543d0..0000000 --- a/ansible/archive/group_vars/vault/all.yml +++ /dev/null @@ -1,27 +0,0 @@ -$ANSIBLE_VAULT;1.1;AES256 -62376339373839396561386638616366313633303966333566386138313162616463366339323834 -3962656465346564343161643561353434613163623861350a366362363134396231616165333265 -32613166336432356165386562333764323030306266323764353833613235393766653565326564 -6235353936336131630a383637303033333161613361366230663733313031323162386431646464 -64303164376463316232386366633039316638326634376137313264326533613137306164633061 -64616164353933646166383735653464336436633364623739386438636438306434346234613331 -62396363336162316363386665643961636161623731356532393537333264323731313933613830 -35343363353231303235396438666364666134643831396139643433656436636631633061623032 -64326337336165373439666639663861393765633132663337363931306462323533646633323832 -39626331663764393032316134613033306334303862346533343230326437326638626436303438 -63646130633163616262306665313637383065633563613739373365363133623631326665316334 -31376238616630633037613939643235353031633962313666383030613833643832663763323035 -62333633393339636561313463306433303537356161303664663566383065393031663232623465 -38383737373933303161633566663832636564663838343038613333346338636666313134353334 -39333862393665333366396661643832366133313164363731656139326630633064633137343036 -32633630623532646132623230653064623432626537653261323235356238303861663330346239 
-35393563656634663339653862313136366537633130636538656439323437613164313836653136 -62346136646336363333303730616130616263623765366230663661626236663766616238336336 -31656561653062666563316439393733656636303164613433373265303266303038376465646533 -65626237383432353037636535646433336163316235343130343065643837653235343333326432 -31343766626531386338643232383865656362326266343034323238376232333433386535666537 -30333435366232303132306561643665303933393430373837326134393030323163303939376661 -35316661313035393531613865383234353766626338303439613136343634356131626137663437 -62663961623232373939356636333361666232626563383361323462666639653162636166666462 -31643434633162316532326336303335633466303731313438613936323364336336356631393032 -3263323261336361623430333331663263393862666435306639 diff --git a/ansible/archive/hosts.ini b/ansible/archive/hosts.ini deleted file mode 100644 index 7297ad8..0000000 --- a/ansible/archive/hosts.ini +++ /dev/null @@ -1,7 +0,0 @@ -# DEPRECATED FILE -# -# Canonical inventory path: -# ansible/inventory/hosts.ini -# -# This file is intentionally kept as a pointer to prevent accidental use of -# stale host definitions from older workflows. 
diff --git a/ansible/archive/inventory/host_vars/heimdall.yml b/ansible/archive/inventory/host_vars/heimdall.yml deleted file mode 100644 index c022772..0000000 --- a/ansible/archive/inventory/host_vars/heimdall.yml +++ /dev/null @@ -1,20 +0,0 @@ ---- -# host_vars/heimdall.yml -# Vault-encrypted host secrets for Heimdall edge role -heimdall_cf_dns_api_token: !vault | - $ANSIBLE_VAULT;1.1;AES256 - 39363263373530393233323165303336613536383739666137353635386163663536396539376233 - 6134386639313565336434656662343361353863303863610a643837353932393836316530623338 - 35656461346463386635336431383138376132666362353964363531613465383966616132366361 - 6133623330653562300a326134346666393462303739646266356633383366356364613432313533 - 32353462663233626664303630663139383031643034623930623630303837333933393062383031 - 3339663233626535633735303535353565323132303863633932 - -heimdall_dashboard_htpasswd: !vault | - $ANSIBLE_VAULT;1.1;AES256 - 34333333383665643861643735663664626538303836653565333430326530643434333835396630 - 6563386232623937626364323937356266363565353134370a616634363463633736663261646236 - 35653036666339663562653633393436366334343737666530626233323366373933636238383764 - 3261386363363766650a643266363636353730373161643762666430653233633033323634626166 - 66303230643836303933623564363766636531313436613232326138653764353037643965646136 - 3262393863306333383632396133386139663163376335333361 diff --git a/ansible/archive/inventory/host_vars/terramaster.yml b/ansible/archive/inventory/host_vars/terramaster.yml deleted file mode 100644 index aa42f20..0000000 --- a/ansible/archive/inventory/host_vars/terramaster.yml +++ /dev/null @@ -1,5 +0,0 @@ ---- -ansible_user: chester -ansible_ssh_private_key_file: /home/chester/.ssh/id_ed25519 -# TerraMaster key was deployed via terramaster_deploy_ssh_key.yml. -# If key auth breaks, re-run that playbook with --ask-pass to redeploy. 
diff --git a/ansible/archive/inventory/hosts.ini b/ansible/archive/inventory/hosts.ini deleted file mode 100644 index 2720cdd..0000000 --- a/ansible/archive/inventory/hosts.ini +++ /dev/null @@ -1,85 +0,0 @@ -# Generated inventory from ../group_vars/all.yml - -# --- Watchtower (local controller) --- -[watchtower] -localhost ansible_connection=local - -# --- Proxmox Cluster (management) --- -[proxmox_cluster] -pve01 ansible_host=10.0.0.201 ansible_user=root ansible_ssh_private_key_file=/home/chester/.ssh/id_ed25519 ansible_port=22 -pve02 ansible_host=10.0.0.202 ansible_user=root ansible_ssh_private_key_file=/home/chester/.ssh/id_ed25519 ansible_port=22 -pve03 ansible_host=10.0.0.203 ansible_user=root ansible_ssh_private_key_file=/home/chester/.ssh/id_ed25519 ansible_port=22 - -[proxmox_cluster:vars] -ansible_user=root -ansible_become=true -ansible_python_interpreter=/usr/bin/python3 - -# --- Swarm Managers --- -[swarm_managers] -swarm-manager-1 ansible_host=10.0.0.211 -swarm-manager-2 ansible_host=10.0.0.212 -swarm-manager-3 ansible_host=10.0.0.213 - -# --- Swarm Workers --- -[swarm_workers] -swarm-worker-1 ansible_host=10.0.0.221 -swarm-worker-2 ansible_host=10.0.0.222 -swarm-worker-3 ansible_host=10.0.0.223 - -[swarm_hosts:children] -swarm_managers -swarm_workers - -[swarm_hosts:vars] -ansible_user=chester -ansible_ssh_private_key_file=/home/chester/.ssh/id_ed25519 - -# --- Standalone Ubuntu VMs --- -[standalone_ubuntu] -statler ansible_host=10.0.0.210 - -[standalone_ubuntu:vars] -ansible_user=chester -ansible_ssh_private_key_file=/home/chester/.ssh/id_ed25519 - -# --- Heimdall (Edge Router / Traefik host) --- -[heimdall_hosts] -heimdall ansible_host=10.0.0.151 - -[heimdall_hosts:vars] -ansible_user=chester -ansible_ssh_private_key_file=/home/chester/.ssh/id_ed25519 - -# --- AI Grid --- -[ai_grid] - -# --- Docker Hosts --- -[docker_hosts] -statler ansible_host=10.0.0.210 - -# --- Storage --- -[storage] -synology ansible_host=10.0.0.249 ansible_scp_if_ssh=True 
-terramaster ansible_host=10.0.0.250 ansible_scp_if_ssh=True
-
-# --- Lifecycle: Onboarding TBD ---
-[onboarding_tbd]
-ai-lenovo ansible_host=10.0.0.220
-
-# --- Lifecycle: Retired / Shutdown ---
-[retired_hosts]
-waldorf ansible_host=10.0.0.251
-
-# --- Aggregate grouping ---
-[ubuntu_lab:children]
-swarm_managers
-swarm_workers
-standalone_ubuntu
-ai_grid
-docker_hosts
-storage
-
-[ubuntu_lab:vars]
-ansible_user=chester
-ansible_ssh_private_key_file=/home/chester/.ssh/id_ed25519
diff --git a/ansible/archive/outputs/SWARM_TOPOLOGY_ANALYSIS_20260312.md b/ansible/archive/outputs/SWARM_TOPOLOGY_ANALYSIS_20260312.md
deleted file mode 100644
index ed4fb6a..0000000
--- a/ansible/archive/outputs/SWARM_TOPOLOGY_ANALYSIS_20260312.md
+++ /dev/null
@@ -1,340 +0,0 @@
----
-# Hardware Specifications & Docker Swarm Topology Analysis
-# Generated: 2026-03-12
-# Subject Hosts: pve03 (10.0.0.203) vs pve04 (10.0.0.204)
-# Context: Evaluating 3-node identical Proxmox cluster for Docker Swarm workloads
-
----
-
-## EXECUTIVE SUMMARY
-
-**Finding**: pve03 and pve04 are **NOT identical**, with meaningful differences:
-- **pve03**: 10 cores, 23.6 GB RAM, unknown storage capacity (already clustered, running 3 VMs)
-- **pve04**: 14 cores, 15 GB RAM, 238.5 GB NVMe SSD (fresh, not yet clustered)
-
-**Recommendation for "3 identically-spec'd devices":**
-- **Option A (Recommended)**: Use **pve04 as the template model**. Procurement should source 3× Intel Core i5-13500T machines with 15+ GB RAM and 240+ GB NVMe storage. pve04 is the better baseline (better single-thread performance, dedicated NVMe, fresh OS).
-- **Option B**: Keep **pve03 as template**. Run a deeper audit on pve03's actual storage (it has 21 loop/dm devices; it is unclear whether additional storage is attached). Backfill pve04 and a 3rd host to match pve03's full config.
-
-**Verdict**: **pve04 > pve03 for Swarm baseline**.
The i5-13500T offers superior CPU performance (4600 MHz boost vs 2885 MHz) and dedicated fast storage, and it is freshly provisioned. Use pve04 as the reference architecture for the 3rd node.
-
----
-
-## DETAILED HARDWARE COMPARISON
-
-### CPU Specifications
-
-| Dimension | pve03 | pve04 | Status |
-|-----------|-------|-------|--------|
-| **Model** | Unknown / unrecognized | Intel Core i5-13500T | ✅ pve04 superior |
-| **Architecture** | x86_64 | x86_64 | ✅ Match |
-| **Socket Count** | 1 | 1 | ✅ Match |
-| **Cores per Socket** | 10 | 14 | ⚠️ **MISMATCH** |
-| **Logical CPUs (with HT)** | 10 | 20 | ⚠️ **MISMATCH** |
-| **Max Frequency** | 2,885 MHz | 4,600 MHz | ⚠️ **pve04 ~59% faster** |
-| **Min Frequency** | Unknown | 800 MHz | - |
-| **Microcode Level** | 0x437 | 0x3a | - |
-
-**Interpretation:**
-- pve04's i5-13500T is a **13th-gen Intel desktop CPU** (2023), significantly newer and faster than pve03
-- pve03's CPU could be a degraded/limited processor or a different i5/i7 SKU; this needs clarification
-- **For Docker Swarm workloads**: pve04's higher clock speed (4600 MHz) suits latency-sensitive tasks better; pve03's 10 cores are still adequate for the planned 2 VMs (manager + worker) per node
-
-**Recommendation**: If strict "identical" is the mandate, **pve04 is the better model to replicate**. Purchasing 3× i5-13500T machines ensures:
-1. Consistent single-threaded performance
-2. Known thermal/power envelope
-3.
Support (retail CPUs, widely available)
-
----
-
-### Memory (RAM) Specifications
-
-| Dimension | pve03 | pve04 | Status |
-|-----------|-------|-------|--------|
-| **Total RAM** | 23.6 GB | 15.0 GB | ⚠️ **MISMATCH** |
-| **Free RAM** | 12.4 GB | 13.0 GB | ⚠️ pve03 has extra, currently used |
-| **Used by OS + Proxmox** | ~11.2 GB | ~1.7 GB | ⚠️ pve03 heavier |
-
-**Interpretation:**
-- pve03: 23.6 GB total (likely 2× 12 GB or 4× 8 GB SODIMM/UDIMM sticks)
-- pve04: 15 GB total (likely 1× 16 GB, with 1 GB reserved for BIOS/SMM)
-- pve03 is using ~11 GB for the OS and Proxmox daemon + 3 running VMs
-- pve04 is minimal (fresh install, no VMs)
-
-**Validation Against Swarm Requirements:**
-- Each node will host 2 VMs: 1 manager (2 cores, 2 GB RAM) + 1 worker (2 cores, 2 GB RAM)
-- Proxmox overhead: ~2-4 GB per node
-- **Minimum needed: 8+ GB RAM per node** ✅ Both qualify
-- **Optimal: 16 GB** ✅ pve04 meets this; pve03 exceeds it
-
-**Recommendation**: Use **16 GB as the standard** for 3-node cluster (matches pve04). This is cost-effective and provides ample headroom.
-
----
-
-### Storage Specifications
-
-| Dimension | pve03 | pve04 | Status |
-|-----------|-------|-------|--------|
-| **Primary Disk(s)** | Unknown (21 loop/dm devices detected) | 1× 238.5 GB NVMe SSD | ⚠️ **pve04 transparent** |
-| **Root FS Capacity** | 68 GB | 238.5 GB | ⚠️ **MISMATCH** |
-| **Root FS Available** | 59 GB free | ~230 GB available | ⚠️ pve04 has more room |
-| **Storage Type** | Unknown (likely SATA SSD or array) | Enterprise-grade NVMe | - |
-
-**Interpretation:**
-- pve03's storage is **opaque**: 21 loop and device-mapper devices suggest:
-  - Possible RAID configuration (dm-* = device mapper)
-  - LVM (Logical Volume Manager) setup
-  - Possibly shared storage mounted
-  - Current state: ~68 GB LVM volume, 9 GB used
-- pve04's storage is **straightforward**: Single 238.5 GB NVMe SSD, clean LVM setup, minimal OS footprint
-
-**VM Storage Requirements (per node):**
-- 1 Manager VM: 32 GB disk (from provisionspec in your playbook)
-- 1 Worker VM: 32 GB disk
-- **Total per node: 64 GB guest storage** (+ Proxmox root FS)
-- **Total available after OS: pve03 ≈ 59 GB, pve04 ≈ 230 GB**
-
-**⚠️ CRITICAL FINDING**: pve03 has **insufficient disk capacity** for the planned topology (needs 64 GB for VMs + OS buffer = ~80 GB, only has ~59 GB free). **Unless pve03 has additional storage mounted (not visible in the scan), it cannot host 2 full 32 GB VMs.**
-
-**Recommendation**:
-1. **Immediate**: Verify pve03's storage architecture. Why 21 dm/loop devices? Is there additional NAS/SAN attached?
-2.
**For 3rd node procurement**: Use **pve04 as baseline**:
-   - 240+ GB NVMe SSD (minimum)
-   - Clean, single-drive configuration (KISS principle)
-   - Sufficient headroom for VMs + snapshots + log growth
-
----
-
-### Network Specifications
-
-| Dimension | pve03 | pve04 | Status |
-|-----------|-------|-------|--------|
-| **Interface Count** | 6 interfaces | 4 interfaces | - |
-| **Bridge** | vmbr0 + tap devices | vmbr0 visible | ✅ Both standard |
-| **Primary Network** | wlp0s20f3 + nic0 | wlp0s20f3 + nic0 | ✅ Match (suggest renaming nic0) |
-
-**Interpretation:**
-- Both nodes have the **same network card models** (wlp0s20f3 = wireless, nic0 = Ethernet)
-- pve03 has **2 tap devices** (tap301i0, tap302i0) = VM network interfaces from running VMs
-- pve04 has **no tap devices** = freshly imaged, no VMs yet
-- **Corosync / Proxmox Cluster**: Both will use vmbr0 for inter-node communication
-
-**Recommendation**: Both nodes are network-compatible. No issues for Docker Swarm overlay networking.
-
----
-
-### Proxmox & Cluster Status
-
-| Dimension | pve03 | pve04 | Status |
-|-----------|-------|-------|--------|
-| **Proxmox Version** | 9.1.6 | 9.1.1 | ⚠️ Versions differ by 5 patch releases |
-| **Kernel** | 6.17.2-1-pve | 6.17.2-1-pve | ✅ Match |
-| **OS Distro** | Debian trixie | Debian trixie | ✅ Match |
-| **Cluster Status** | ✅ Clustered (homelab) | ❌ Not clustered | - |
-| **Cluster Members** | pve01, pve02, pve03 | None yet | - |
-| **VMs Running** | 3 VMs/containers | 0 VMs | - |
-| **Uptime** | 4 days | ~0 days (fresh) | - |
-
-**Interpretation:**
-- pve03 is an **active, production node** in the homelab cluster
-- pve04 is a **fresh candidate** ready for integration
-- Minor version difference (9.1.6 vs 9.1.1) is **not a blocker**; routine updates will align them
-
-**Recommendation**: Update both to the latest Proxmox 9.x patch level before final cluster formation.
- ---- - -## DOCKER SWARM TOPOLOGY ANALYSIS - -### Target Design (from documentation/architecture/compute-plane.md) -- 3× identically-spec'd physical Proxmox nodes -- 3× Swarm Managers (1 per node, IPs: 10.0.0.211–213) -- 3× Swarm Workers (1 per node, IPs: 10.0.0.221–223) -- Each VM: 2 vCPU, 4 GB RAM, 32 GB disk -- Proxmox cluster with Corosync for HA -- No overcommit - -### Capacity Analysis: pve04 as Reference Model - -#### CPU -- **pve04 Spec**: 14 cores, 1 socket, 4600 MHz peak -- **Planned Usage**: 4 vCPU (2 for manager, 2 for worker) = **28.6% of the 14 cores allocated** -- **Proxmox/Corosync Overhead**: ~1 vCPU -- **Available Headroom**: 14 - 4 - 1 = **9 vCPU spare** -- **Verdict**: ✅ **EXCELLENT**. Can sustain workload + spikes + 2x VM migration - -#### Memory (15 GB) -- **Planned Usage**: 4 GB (manager) + 4 GB (worker) = 8 GB -- **Proxmox OS + daemons**: ~2–3 GB -- **Available Headroom**: 15 - 8 - 2.5 = **4.5 GB spare** -- **Verdict**: ✅ **ADEQUATE**. No aggressive swapping. Supports scheduled workload growth. - -#### Storage (240 GB) -- **Planned Usage**: 32 GB (manager) + 32 GB (worker) = 64 GB -- **Proxmox OS**: ~8 GB -- **Snapshots/Logs Buffer**: ~20 GB -- **Total Planned**: ~92 GB -- **Available Headroom**: 240 - 92 = **148 GB spare** -- **Verdict**: ✅ **EXCELLENT**. Ample room for workload scaling, backups, experiments. - -#### Network -- **Swarm Overlay**: vmbr0 at 1 Gbps -- **Expected inter-node throughput**: <100 Mbps for modest swarm (10–20 containers) -- **Verdict**: ✅ **ADEQUATE** for Docker Swarm in homelab. Upgrade to 10 Gbps if production-scale or data-intensive AI workloads planned.
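The headroom arithmetic in the capacity analysis above can be reproduced with a few lines of shell. This is a sketch under the report's own estimates, not a sizing tool: the `headroom` helper is a hypothetical name, and memory is expressed in MB (an assumption made here) so that the ~2.5 GB host overhead stays an integer.

```shell
#!/usr/bin/env bash
# Sketch: headroom = host capacity - guest allocation - host overhead.
# All figures are the pve04 estimates quoted in the capacity analysis.
set -euo pipefail

headroom() {
  # usage: headroom CAPACITY GUEST_TOTAL OVERHEAD
  echo $(( $1 - $2 - $3 ))
}

# CPU: 14 cores, 2 vCPU manager + 2 vCPU worker, ~1 vCPU for Proxmox/Corosync.
cpu_spare=$(headroom 14 4 1)
# Memory in MB: 15 GB total, 2 x 4 GB guests, ~2.5 GB for the host.
mem_spare_mb=$(headroom 15360 8192 2560)
# Storage in GB: 2 x 32 GB guests, ~8 GB OS plus ~20 GB snapshot/log buffer.
disk_spare_gb=$(headroom 240 64 28)

echo "cpu_spare=${cpu_spare} mem_spare_mb=${mem_spare_mb} disk_spare_gb=${disk_spare_gb}"
```

Swapping in another candidate node's capacities makes it easy to see which resource runs out first; with the pve04 figures, memory is the tightest of the three.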
- ---- - -### High-Availability & Resilience - -#### Quorum Analysis -- **3 Proxmox Nodes**: Corosync quorum = 2/3 nodes required - - Can tolerate 1 node failure ✅ Good - - If node1 fails: quorum = nodes 2+3 (still ≥2) → **cluster remains operational** -- **3 Swarm Managers**: Raft consensus quorum = 2/3 nodes required - - Can tolerate 1 manager failure ✅ Good - - If manager1 fails: quorum = managers 2+3 (still ≥2) → **swarm remains operational** - -#### Failure Scenarios -| Scenario | Outcome | Swarm Impact | -|----------|---------|--------------| -| 1 node power fails | Surviving nodes take over VMs | Containers restart on node 2&3 | -| 1 node storage corrupt | Proxmox HA can restart VMs on peer | Brief service interruption (~30s) | -| 1 node network partition | Corosync detects; quorum = 2 survivors | Cluster continues; isolated node reboots | -| 2 nodes fail simultaneously | Game over; cluster non-functional | **ALL workload lost** | - -**Verdict**: Design tolerates the loss of any single node (N-1). **Very good for homelab.** - ---- - -## SPECIAL CONSIDERATIONS FOR pve03 - -### Storage Mystery: 21 Loop/Device-Mapper Devices -**Questions to Investigate:** -1. Does pve03 mount external NAS/SAN storage (e.g., Synology 10.0.0.249)? -2. Is there a RAID or LVM snapshot setup? -3. Were multiple physical drives originally present that have since failed? - -**Action Items:** -```bash -# From watchtower or pve03: -pvesh get /storage --output-format json # List all Proxmox storage targets -zfs list # If ZFS in use -lvs # LVM volumes -pvdisplay # LVM physical volumes -df -i # Inode usage (helps diagnose loop mounts) -``` - -**Implication**: Until pve03's storage is clarified, it **cannot be used as a template** for the 3rd identical host. - ---- - -## FINAL RECOMMENDATIONS - -### 1. **Short-Term (Immediate)** - -**Action**: Clarify pve03's storage architecture.
-```bash -# SSH into pve03 via watchtower relay or direct if SSH key added -ssh root@10.0.0.203 "pvesh get /storage --output-format json" -ssh root@10.0.0.203 "lvs && pvs" -ssh root@10.0.0.203 "zfs list 2>/dev/null || echo 'ZFS not in use'" -``` - -**If pve03 has external storage**: -- Note the configuration (NAS IP, mount method, capacity) -- Plan to replicate in 3rd node - -**If pve03 is just a single drive**: -- Proceed with pve04 as template - -### 2. **Medium-Term (Before Final 3-Node Deployment)** - -**Option A: Adopt pve04 as Template (RECOMMENDED)** -- Procurement: 3× machines with **Intel i5-13500T, 16 GB RAM, 256 GB NVMe** -- Cost: ~$200–300 per node (retail Core i5 desktop equivalent) -- Timeline: 1–2 weeks (sourcing) -- Next step: Install Proxmox 9.x on 3rd node; cluster join - -**Option B: Backfill pve03 Config to pve04 & 3rd Node** -- Upgrade pve04 RAM from 15 GB → 24 GB (add 1× 8 GB SODIMM) -- Verify pve03's external storage is documented -- Replicate in pve04 and 3rd node -- Cost: ~$30–50 per node (additional RAM) -- Timeline: 1 week -- Risk: Depends on clarifying pve03 fully - -**Recommendation Pick**: **Option A is cleaner**. pve04 is fresher, faster, and has clear config. - -### 3.
**Long-Term (Post-3-Node Commissioning)** - -**Cluster Formation:** -```bash -# On pve04 (assuming elected as initial leader): -pvecm create homelab - -# On 3rd new node: -pvecm add - -# Verify: -pvesh get /cluster/status -``` - -**VM Provisioning:** -```bash -# Use your existing playbook: -ansible-playbook -i inventory/hosts.ini \ - playbooks/proxmox/provision_swarm_vms.yml \ - -e target_host=pve04 \ - -e target_host=pve0N # For 3rd node -``` - -**Docker Swarm Init:** -```bash -# On swarm-manager-1 (e.g., 10.0.0.211): -docker swarm init --advertise-addr 10.0.0.211 - -# On manager-2 & manager-3: -docker swarm join --token 10.0.0.211:2377 -``` - ---- - -## APPENDIX: Hardware Specs Collected - -### pve03 (10.0.0.203) – Full Details -``` -CPU: 10 cores, 1 socket, max 2885 MHz -Memory: 23.6 GB total, 12.4 GB free -Storage: 68 GB root LVM (59 GB free) + 21 dm/loop devices (TBD) -OS: Debian trixie, kernel 6.17.2-1-pve -Proxmox: 9.1.6 -Network: 6 interfaces (vmbr0, nic0, wlp0s20f3, tap301i0, tap302i0, lo) -Cluster Status: Clustered (homelab), 3 VMs running -Uptime: 4 days -``` - -### pve04 (10.0.0.204) – Full Details -``` -CPU: Intel Core i5-13500T, 14 cores, 1 socket, 20 vCPUs (HT), max 4600 MHz -Memory: 15.0 GB total, ~13.0 GB available, 8.0 GB swap -Storage: 238.5 GB NVMe SSD (nvme0n1), single drive -OS: Debian trixie, kernel 6.17.2-1-pve -Proxmox: 9.1.1 -Network: 4 interfaces (vmbr0, nic0, wlp0s20f3, lo) -Cluster Status: Not clustered yet, 0 VMs -Uptime: Fresh (just rebooted) -``` - ---- - -## CONCLUSION - -**pve04 is the superior choice** for replication to a 3-node cluster because of: -1. **CPU performance**: 4600 MHz vs 2885 MHz peak clock (about 59% higher) -2. **Storage clarity**: Single 240 GB NVMe (vs pve03's mysterious setup) -3. **Ballpark specifications**: 15 GB RAM + 240 GB SSD = excellent value for Swarm workloads -4. **Freshness**: No legacy config debt - -**Immediate action**: Clarify pve03's storage.
Then either adopt pve04 as template or provide additional pve03 context to backfill. - -**Expected outcome**: 3-node Proxmox cluster running 6 Docker Swarm nodes (3 managers, 3 workers) with excellent resilience, performance, and headroom for future growth. diff --git a/ansible/archive/outputs/cluster-reconcile/node-replacement-mar13-2026-20260313T143107/cluster-reconcile-summary.txt b/ansible/archive/outputs/cluster-reconcile/node-replacement-mar13-2026-20260313T143107/cluster-reconcile-summary.txt deleted file mode 100644 index a9a948a..0000000 --- a/ansible/archive/outputs/cluster-reconcile/node-replacement-mar13-2026-20260313T143107/cluster-reconcile-summary.txt +++ /dev/null @@ -1,52 +0,0 @@ -Project: node-replacement-mar13-2026 -Mode: validate -Join node: pve01 -Join anchor host: pve02 -Join anchor IP: 10.0.0.202 -Timestamp: 20260313T143107 - -=== pvecm nodes (anchor) === - -Membership information ----------------------- - Nodeid Votes Name - 1 1 pve01 - 2 1 pve02 (local) - 3 1 pve03 - -=== pvecm status (anchor) === -Cluster information -------------------- -Name: homelab -Config Version: 4 -Transport: knet -Secure auth: on - -Quorum information ------------------- -Date: Fri Mar 13 14:31:11 2026 -Quorum provider: corosync_votequorum -Nodes: 3 -Node ID: 0x00000002 -Ring ID: 1.3c -Quorate: Yes - -Votequorum information ----------------------- -Expected votes: 3 -Highest expected: 3 -Total votes: 3 -Quorum: 2 -Flags: Quorate - -Membership information ----------------------- - Nodeid Votes Name -0x00000001 1 10.0.0.201 -0x00000002 1 10.0.0.202 (local) -0x00000003 1 10.0.0.203 - -=== service state on join node === -active -active -inactive diff --git a/ansible/archive/outputs/cluster-reconcile/node-replacement-mar13-2026-20260313T143115/cluster-reconcile-summary.txt b/ansible/archive/outputs/cluster-reconcile/node-replacement-mar13-2026-20260313T143115/cluster-reconcile-summary.txt deleted file mode 100644 index fa36b99..0000000 --- 
a/ansible/archive/outputs/cluster-reconcile/node-replacement-mar13-2026-20260313T143115/cluster-reconcile-summary.txt +++ /dev/null @@ -1,52 +0,0 @@ -Project: node-replacement-mar13-2026 -Mode: join -Join node: pve01 -Join anchor host: pve02 -Join anchor IP: 10.0.0.202 -Timestamp: 20260313T143115 - -=== pvecm nodes (anchor) === - -Membership information ----------------------- - Nodeid Votes Name - 1 1 pve01 - 2 1 pve02 (local) - 3 1 pve03 - -=== pvecm status (anchor) === -Cluster information -------------------- -Name: homelab -Config Version: 5 -Transport: knet -Secure auth: on - -Quorum information ------------------- -Date: Fri Mar 13 14:31:29 2026 -Quorum provider: corosync_votequorum -Nodes: 3 -Node ID: 0x00000002 -Ring ID: 1.3c -Quorate: Yes - -Votequorum information ----------------------- -Expected votes: 3 -Highest expected: 3 -Total votes: 3 -Quorum: 2 -Flags: Quorate - -Membership information ----------------------- - Nodeid Votes Name -0x00000001 1 10.0.0.201 -0x00000002 1 10.0.0.202 (local) -0x00000003 1 10.0.0.203 - -=== service state on join node === -active -active -active diff --git a/ansible/archive/outputs/cluster-reconcile/node-replacement-mar13-2026-20260313T143430/cluster-reconcile-summary.txt b/ansible/archive/outputs/cluster-reconcile/node-replacement-mar13-2026-20260313T143430/cluster-reconcile-summary.txt deleted file mode 100644 index 9c900ec..0000000 --- a/ansible/archive/outputs/cluster-reconcile/node-replacement-mar13-2026-20260313T143430/cluster-reconcile-summary.txt +++ /dev/null @@ -1,52 +0,0 @@ -Project: node-replacement-mar13-2026 -Mode: join -Join node: pve01 -Join anchor host: pve02 -Join anchor IP: 10.0.0.202 -Timestamp: 20260313T143430 - -=== pvecm nodes (anchor) === - -Membership information ----------------------- - Nodeid Votes Name - 1 1 pve01 - 2 1 pve02 (local) - 3 1 pve03 - -=== pvecm status (anchor) === -Cluster information -------------------- -Name: homelab -Config Version: 5 -Transport: knet -Secure auth: on - 
-Quorum information ------------------- -Date: Fri Mar 13 14:34:36 2026 -Quorum provider: corosync_votequorum -Nodes: 3 -Node ID: 0x00000002 -Ring ID: 1.3c -Quorate: Yes - -Votequorum information ----------------------- -Expected votes: 3 -Highest expected: 3 -Total votes: 3 -Quorum: 2 -Flags: Quorate - -Membership information ----------------------- - Nodeid Votes Name -0x00000001 1 10.0.0.201 -0x00000002 1 10.0.0.202 (local) -0x00000003 1 10.0.0.203 - -=== service state on join node === -active -active -active diff --git a/ansible/archive/outputs/dhcp_reservations.csv b/ansible/archive/outputs/dhcp_reservations.csv deleted file mode 100644 index 6daac80..0000000 --- a/ansible/archive/outputs/dhcp_reservations.csv +++ /dev/null @@ -1,18 +0,0 @@ -hostname,desired_ip,current_ip,mac,role,notes -er7212pc,10.0.0.2,10.0.0.2,,gateway,"DHCP server / Omada controller β€” no reservation needed" -pve01,10.0.10.11,10.0.0.201,,proxmox,"Proxmox mgmt - reserve for management interface" -pve02,10.0.10.12,10.0.0.202,,proxmox,"Proxmox mgmt - reserve for management interface" -pve03,10.0.10.13,10.0.0.203,,proxmox,"Proxmox mgmt - reserve for management interface" -swarm-manager-1,10.0.200.11,10.0.0.211,,swarm_manager,"Swarm manager - static preferred" -swarm-manager-2,10.0.200.12,10.0.0.212,,swarm_manager,"Swarm manager - static preferred" -swarm-manager-3,10.0.200.13,10.0.0.213,,swarm_manager,"Swarm manager - static preferred" -swarm-worker-1,10.0.200.21,10.0.0.221,,swarm_worker,"Worker - can be DHCP reservation or static" -swarm-worker-2,10.0.200.22,10.0.0.222,,swarm_worker,"Worker - can be DHCP reservation or static" -swarm-worker-3,10.0.200.23,10.0.0.223,,swarm_worker,"Worker - can be DHCP reservation or static" -ai-lenovo,10.0.200.20,10.0.0.220,,ai_node,"AI node - reserve" -synology,10.0.10.40,10.0.0.249,,nas,"NAS management IP - reserve" -terramaster,10.0.10.41,10.0.0.250,,nas,"NAS management IP - reserve" -waldorf,10.0.200.30,10.0.0.251,,docker_host,"Docker host - reserve" 
-watchtower,10.0.10.200,10.0.0.200,,controller,"Watchtower (Pi) - reserve if controller" -heimdall-mgmt,10.0.10.2,, ,beelink,"Heimdall (Beelink) management NIC" -heimdall-lan,10.0.0.50,, ,beelink,"Heimdall service LAN NIC" diff --git a/ansible/archive/outputs/hardware_facts_20260311T204909.yml b/ansible/archive/outputs/hardware_facts_20260311T204909.yml deleted file mode 100644 index ce222ae..0000000 --- a/ansible/archive/outputs/hardware_facts_20260311T204909.yml +++ /dev/null @@ -1,88 +0,0 @@ ---- -# Hardware Facts Report -# Generated: 2026-03-12T00:49:09Z -# Hosts Analyzed: 4 -# -# Usage: -# This report compares hardware specifications for Docker Swarm topology planning. -# See README in documentation/architecture/ for capacity analysis. - -pve03: - cpu: - cores_per_socket: 10 - cpu_load_percent: 0% - current_1min_load: 0 - max_frequency_mhz: 2885 - model: '0' - sockets: 1 - total_cores: 10 - fqdn: pve03.local - hostname: pve03 - ip_address: 10.0.0.203 - memory: - free_gb: 12 - free_mb: 12433 - total_gb: 23 - total_mb: 23726 - network: - interface_list: - - tap301i0 - - vmbr0 - - lo - - tap302i0 - - wlp0s20f3 - - nic0 - interfaces_count: 6 - proxmox: - cluster_members: - - homelab - - pve01 - - pve02 - - pve03 - cluster_name: not-clustered - is_clustered: true - version: '' - version_full: 'pve-manager/9.1.6/71482d1833ded40a (running kernel: 6.17.2-1-pve)' - vms_and_containers: 3 - storage: - disk_list: - - loop1 - - dm-1 - - dm-10 - - nvme0n1 - - dm-8 - - loop6 - - dm-6 - - loop4 - - dm-4 - - loop2 - - dm-2 - - dm-11 - - loop0 - - dm-0 - - dm-9 - - loop7 - - dm-7 - - loop5 - - dm-5 - - loop3 - - dm-3 - disks_detected: 21 - mounts_summary: - - udev (12G available of 12G) - - tmpfs (2.4G available of 2.4G) - - /dev/mapper/pve-root (59G available of 68G) - - tmpfs (12G available of 12G) - - efivarfs (68K available of 438K) - - tmpfs (5.0M available of 5.0M) - - tmpfs (12G available of 12G) - - /dev/nvme0n1p2 (1014M available of 1022M) - - tmpfs (1.0M available of 
1.0M) - - tmpfs (1.0M available of 1.0M) - - tmpfs (2.4G available of 2.4G) - - /dev/fuse (128M available of 128M) - system: - kernel: 6.17.2-1-pve - os: Debian trixie - uptime_days: 4 - timestamp: '2026-03-12T00:49:09Z' diff --git a/ansible/archive/outputs/hardware_facts_20260312T215928.yml b/ansible/archive/outputs/hardware_facts_20260312T215928.yml deleted file mode 100644 index c15bd1a..0000000 --- a/ansible/archive/outputs/hardware_facts_20260312T215928.yml +++ /dev/null @@ -1,88 +0,0 @@ ---- -# Hardware Facts Report -# Generated: 2026-03-13T01:59:28Z -# Hosts Analyzed: 4 -# -# Usage: -# This report compares hardware specifications for Docker Swarm topology planning. -# See README in documentation/architecture/ for capacity analysis. - -pve03: - cpu: - cores_per_socket: '10' - cpu_load_percent: 0% - current_1min_load: '0' - max_frequency_mhz: '2276' - model: '0' - sockets: '1' - total_cores: '10' - fqdn: pve03.local - hostname: pve03 - ip_address: 10.0.0.203 - memory: - free_gb: '11' - free_mb: '12126' - total_gb: '23' - total_mb: '23726' - network: - interface_list: - - vmbr0 - - wlp0s20f3 - - nic0 - - tap302i0 - - lo - - tap301i0 - interfaces_count: '6' - proxmox: - cluster_members: - - homelab - - pve02 - - pve03 - - pve01 - cluster_name: not-clustered - is_clustered: true - version: '' - version_full: 'pve-manager/9.1.6/71482d1833ded40a (running kernel: 6.17.2-1-pve)' - vms_and_containers: '3' - storage: - disk_list: - - loop1 - - dm-1 - - dm-10 - - nvme0n1 - - dm-8 - - loop6 - - dm-6 - - loop4 - - dm-4 - - loop2 - - dm-2 - - dm-11 - - loop0 - - dm-0 - - dm-9 - - loop7 - - dm-7 - - loop5 - - dm-5 - - loop3 - - dm-3 - disks_detected: '21' - mounts_summary: - - udev (12G available of 12G) - - tmpfs (2.4G available of 2.4G) - - /dev/mapper/pve-root (59G available of 68G) - - tmpfs (12G available of 12G) - - efivarfs (68K available of 438K) - - tmpfs (5.0M available of 5.0M) - - tmpfs (12G available of 12G) - - /dev/nvme0n1p2 (1014M available of 1022M) - - 
tmpfs (1.0M available of 1.0M) - - tmpfs (1.0M available of 1.0M) - - tmpfs (2.4G available of 2.4G) - - /dev/fuse (128M available of 128M) - system: - kernel: 6.17.2-1-pve - os: Debian trixie - uptime_days: '5' - timestamp: '2026-03-13T01:59:28Z' diff --git a/ansible/archive/outputs/heimdall-baseline-20260312T211908/compose_files/_home_chester_traefik_docker-compose.yml b/ansible/archive/outputs/heimdall-baseline-20260312T211908/compose_files/_home_chester_traefik_docker-compose.yml deleted file mode 100644 index 0ca28c0..0000000 --- a/ansible/archive/outputs/heimdall-baseline-20260312T211908/compose_files/_home_chester_traefik_docker-compose.yml +++ /dev/null @@ -1,98 +0,0 @@ -services: - redis: - image: redis:7-alpine - container_name: redis - restart: unless-stopped - ports: - - "6379:6379" - networks: - - proxy-net - volumes: - - redis-data:/data - command: redis-server --appendonly yes - healthcheck: - test: ["CMD", "redis-cli", "ping"] - interval: 10s - timeout: 5s - retries: 5 - - docker-socket-proxy: - image: tecnativa/docker-socket-proxy:latest - container_name: docker-socket-proxy - restart: unless-stopped - userns_mode: "host" - user: "0:0" - security_opt: - - apparmor=unconfined - privileged: true - group_add: - - "988" - environment: - - CONTAINERS=1 - - SERVICES=1 - - TASKS=1 - - NETWORKS=1 - - EVENTS=1 - - VERSION=1 - - PING=1 - - AUTH=1 - - INFO=1 - - VOLUMES=1 - volumes: - - /var/run/docker.sock:/var/run/docker.sock - networks: - - proxy-net - - traefik: - image: traefik:v3.6.5 - container_name: traefik - restart: unless-stopped - user: "0:0" - read_only: false - depends_on: - redis: - condition: service_healthy - docker-socket-proxy: - condition: service_started - environment: - - DOCKER_HOST=tcp://docker-socket-proxy:2375 -# - DOCKER_API_VERSION=1.41 - - CLOUDFLARE_DNS_API_TOKEN=${CLOUDFLARE_DNS_API_TOKEN} - - CLOUDFLARE_ZONE_API_TOKEN=${CLOUDFLARE_DNS_API_TOKEN} - networks: - - proxy-net - ports: - - "80:80" - - "443:443" - volumes: - - 
./traefik.yml:/traefik.yml:ro - - ./traefik-data/dynamic:/dynamic:ro - - ./traefik-data/certs:/certs - - ./traefik-data/access-logs:/var/log/traefik - labels: - - "traefik.enable=true" - # Dashboard - - "traefik.http.routers.traefik-secure.rule=Host(`proxy.castaldifamily.com`) && (PathPrefix(`/api`) || PathPrefix(`/dashboard`))" - - "traefik.http.routers.traefik-secure.entrypoints=websecure" - - "traefik.http.routers.traefik-secure.tls=true" - - "traefik.http.routers.traefik-secure.tls.certresolver=cloudflare" - - "traefik.http.routers.traefik-secure.service=api@internal" - - "traefik.http.routers.traefik-secure.middlewares=dashboard-auth@file,security-headers@file,ratelimit-basic@file,dashboard-slash@file" - # Root redirect - - "traefik.http.routers.traefik-root.rule=Host(`proxy.castaldifamily.com`) && Path(`/`)" - - "traefik.http.routers.traefik-root.entrypoints=websecure" - - "traefik.http.routers.traefik-root.tls=true" - - "traefik.http.routers.traefik-root.tls.certresolver=cloudflare" - - "traefik.http.routers.traefik-root.service=api@internal" - - "traefik.http.routers.traefik-root.middlewares=redirect-to-dashboard" - - "traefik.http.middlewares.redirect-to-dashboard.redirectregex.regex=^/$$" - - "traefik.http.middlewares.redirect-to-dashboard.redirectregex.replacement=/dashboard" - - "traefik.http.middlewares.redirect-to-dashboard.redirectregex.permanent=true" - -networks: - proxy-net: - driver: bridge - name: proxy-net - -volumes: - redis-data: diff --git a/ansible/archive/outputs/heimdall-baseline-20260312T211908/containers.yml b/ansible/archive/outputs/heimdall-baseline-20260312T211908/containers.yml deleted file mode 100644 index 3aece73..0000000 --- a/ansible/archive/outputs/heimdall-baseline-20260312T211908/containers.yml +++ /dev/null @@ -1,975 +0,0 @@ -- AppArmorProfile: docker-default - Args: - - --path.procfs=/host/proc - - --path.sysfs=/host/sys - - --path.rootfs=/rootfs - - 
--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/) - Config: - AttachStderr: true - AttachStdin: false - AttachStdout: true - Cmd: - - --path.procfs=/host/proc - - --path.sysfs=/host/sys - - --path.rootfs=/rootfs - - --collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/) - Domainname: '' - Entrypoint: - - /bin/node_exporter - Env: - - PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin - ExposedPorts: - 9100/tcp: {} - Hostname: heimdall - Image: prom/node-exporter:latest - Labels: - maintainer: The Prometheus Authors - OpenStdin: false - StdinOnce: false - Tty: false - User: nobody - Volumes: null - WorkingDir: '' - Created: '2026-03-09T23:15:53.531184328Z' - Driver: overlayfs - ExecIDs: null - HostConfig: - AutoRemove: false - Binds: - - /proc:/host/proc:ro - - /sys:/host/sys:ro - - /:/rootfs:ro - BlkioDeviceReadBps: null - BlkioDeviceReadIOps: null - BlkioDeviceWriteBps: null - BlkioDeviceWriteIOps: null - BlkioWeight: 0 - BlkioWeightDevice: null - CapAdd: null - CapDrop: - - ALL - Cgroup: '' - CgroupParent: '' - CgroupnsMode: private - ConsoleSize: - - 0 - - 0 - ContainerIDFile: '' - CpuCount: 0 - CpuPercent: 0 - CpuPeriod: 0 - CpuQuota: 0 - CpuRealtimePeriod: 0 - CpuRealtimeRuntime: 0 - CpuShares: 0 - CpusetCpus: '' - CpusetMems: '' - DeviceCgroupRules: null - DeviceRequests: null - Devices: null - Dns: null - DnsOptions: null - DnsSearch: null - ExtraHosts: null - GroupAdd: null - IOMaximumBandwidth: 0 - IOMaximumIOps: 0 - IpcMode: private - Isolation: '' - Links: null - LogConfig: - Config: {} - Type: json-file - MaskedPaths: - - /proc/acpi - - /proc/asound - - /proc/interrupts - - /proc/kcore - - /proc/keys - - /proc/latency_stats - - /proc/sched_debug - - /proc/scsi - - /proc/timer_list - - /proc/timer_stats - - /sys/devices/virtual/powercap - - /sys/firmware - - /sys/devices/system/cpu/cpu0/thermal_throttle - - /sys/devices/system/cpu/cpu1/thermal_throttle - - 
/sys/devices/system/cpu/cpu2/thermal_throttle - - /sys/devices/system/cpu/cpu3/thermal_throttle - Memory: 134217728 - MemoryReservation: 0 - MemorySwap: 268435456 - MemorySwappiness: null - NanoCpus: 500000000 - NetworkMode: host - OomKillDisable: null - OomScoreAdj: 0 - PidMode: '' - PidsLimit: null - PortBindings: {} - Privileged: false - PublishAllPorts: false - ReadonlyPaths: - - /proc/bus - - /proc/fs - - /proc/irq - - /proc/sys - - /proc/sysrq-trigger - ReadonlyRootfs: true - RestartPolicy: - MaximumRetryCount: 0 - Name: unless-stopped - Runtime: runc - SecurityOpt: - - no-new-privileges:true - ShmSize: 67108864 - UTSMode: '' - Ulimits: null - UsernsMode: '' - VolumeDriver: '' - VolumesFrom: null - HostnamePath: /var/lib/docker/containers/3f397bc8b39d3a9ae4b903f1daf99fdfddd842cb86b549b86c7aba30fe4d7a4f/hostname - HostsPath: /var/lib/docker/containers/3f397bc8b39d3a9ae4b903f1daf99fdfddd842cb86b549b86c7aba30fe4d7a4f/hosts - Id: 3f397bc8b39d3a9ae4b903f1daf99fdfddd842cb86b549b86c7aba30fe4d7a4f - Image: sha256:3ac34ce007accad95afed72149e0d2b927b7e42fd1c866149b945b84737c62c3 - ImageManifestDescriptor: - digest: sha256:7bcf2839f207d926b908cd3c566c9f1577efb72268062be0c96cd3b17a5cb283 - mediaType: application/vnd.docker.distribution.manifest.v2+json - platform: - architecture: amd64 - os: linux - size: 949 - LogPath: /var/lib/docker/containers/3f397bc8b39d3a9ae4b903f1daf99fdfddd842cb86b549b86c7aba30fe4d7a4f/3f397bc8b39d3a9ae4b903f1daf99fdfddd842cb86b549b86c7aba30fe4d7a4f-json.log - MountLabel: '' - Mounts: - - Destination: /host/proc - Mode: ro - Propagation: rprivate - RW: false - Source: /proc - Type: bind - - Destination: /host/sys - Mode: ro - Propagation: rprivate - RW: false - Source: /sys - Type: bind - - Destination: /rootfs - Mode: ro - Propagation: rslave - RW: false - Source: / - Type: bind - Name: /node-exporter - NetworkSettings: - Networks: - host: - Aliases: null - DNSNames: null - DriverOpts: null - EndpointID: 
d2673440c953463f22ab1da395595e8f898bfab6baa043b2638fa2654fd04e4a - Gateway: '' - GlobalIPv6Address: '' - GlobalIPv6PrefixLen: 0 - GwPriority: 0 - IPAMConfig: null - IPAddress: '' - IPPrefixLen: 0 - IPv6Gateway: '' - Links: null - MacAddress: '' - NetworkID: b63c150f50197cfb21939a1369d37f0a309118dfb79be11d4c6082d963f8f70a - Ports: {} - SandboxID: 770e56f6832d109ab47e3b523e838be28d0bdf51a520cc5c9a07351bcb84f10d - SandboxKey: /var/run/docker/netns/default - Path: /bin/node_exporter - Platform: linux - ProcessLabel: '' - ResolvConfPath: /var/lib/docker/containers/3f397bc8b39d3a9ae4b903f1daf99fdfddd842cb86b549b86c7aba30fe4d7a4f/resolv.conf - RestartCount: 0 - State: - Dead: false - Error: '' - ExitCode: 0 - FinishedAt: '0001-01-01T00:00:00Z' - OOMKilled: false - Paused: false - Pid: 2616285 - Restarting: false - Running: true - StartedAt: '2026-03-09T23:15:53.649932822Z' - Status: running - Storage: - RootFS: - Snapshot: - Name: overlayfs -- AppArmorProfile: docker-default - Args: - - traefik - Config: - AttachStderr: true - AttachStdin: false - AttachStdout: true - Cmd: - - traefik - Domainname: '' - Entrypoint: - - /entrypoint.sh - Env: - - CLOUDFLARE_ZONE_API_TOKEN= - - DOCKER_HOST=tcp://docker-socket-proxy:2375 - - CLOUDFLARE_DNS_API_TOKEN= - - PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin - ExposedPorts: - 443/tcp: {} - 80/tcp: {} - Hostname: f0c70cc4667e - Image: traefik:v3.6.5 - Labels: - com.docker.compose.config-hash: 42df1402e650e630bde14fa90b6287582d9b29068566faaff58ed7ca6d60fffa - com.docker.compose.container-number: '1' - com.docker.compose.depends_on: redis:service_healthy:false,docker-socket-proxy:service_started:false - com.docker.compose.image: sha256:67622638cd88dbfcfba40159bc652ecf0aea0e032f8a3c7e3134ae7c037b9910 - com.docker.compose.oneoff: 'False' - com.docker.compose.project: traefik - com.docker.compose.project.config_files: /home/chester/traefik/docker-compose.yml - com.docker.compose.project.working_dir: /home/chester/traefik 
- com.docker.compose.replace: traefik - com.docker.compose.service: traefik - com.docker.compose.version: 5.0.2 - org.opencontainers.image.description: A modern reverse-proxy - org.opencontainers.image.documentation: https://docs.traefik.io - org.opencontainers.image.source: https://github.com/traefik/traefik - org.opencontainers.image.title: Traefik - org.opencontainers.image.url: https://traefik.io - org.opencontainers.image.vendor: Traefik Labs - org.opencontainers.image.version: v3.6.5 - traefik.enable: 'true' - traefik.http.middlewares.redirect-to-dashboard.redirectregex.permanent: 'true' - traefik.http.middlewares.redirect-to-dashboard.redirectregex.regex: ^/$ - traefik.http.middlewares.redirect-to-dashboard.redirectregex.replacement: /dashboard - traefik.http.routers.traefik-root.entrypoints: websecure - traefik.http.routers.traefik-root.middlewares: redirect-to-dashboard - traefik.http.routers.traefik-root.rule: Host(`proxy.castaldifamily.com`) - && Path(`/`) - traefik.http.routers.traefik-root.service: api@internal - traefik.http.routers.traefik-root.tls: 'true' - traefik.http.routers.traefik-root.tls.certresolver: cloudflare - traefik.http.routers.traefik-secure.entrypoints: websecure - traefik.http.routers.traefik-secure.middlewares: dashboard-auth@file,security-headers@file,ratelimit-basic@file,dashboard-slash@file - traefik.http.routers.traefik-secure.rule: Host(`proxy.castaldifamily.com`) - && (PathPrefix(`/api`) || PathPrefix(`/dashboard`)) - traefik.http.routers.traefik-secure.service: api@internal - traefik.http.routers.traefik-secure.tls: 'true' - traefik.http.routers.traefik-secure.tls.certresolver: cloudflare - OpenStdin: false - StdinOnce: false - Tty: false - User: 0:0 - Volumes: null - WorkingDir: / - Created: '2026-01-28T00:34:54.992079505Z' - Driver: overlayfs - ExecIDs: null - HostConfig: - AutoRemove: false - Binds: - - /home/chester/traefik/traefik-data/certs:/certs:rw - - 
/home/chester/traefik/traefik-data/access-logs:/var/log/traefik:rw - - /home/chester/traefik/traefik.yml:/traefik.yml:ro - - /home/chester/traefik/traefik-data/dynamic:/dynamic:ro - BlkioDeviceReadBps: null - BlkioDeviceReadIOps: null - BlkioDeviceWriteBps: null - BlkioDeviceWriteIOps: null - BlkioWeight: 0 - BlkioWeightDevice: null - CapAdd: null - CapDrop: null - Cgroup: '' - CgroupParent: '' - CgroupnsMode: private - ConsoleSize: - - 0 - - 0 - ContainerIDFile: '' - CpuCount: 0 - CpuPercent: 0 - CpuPeriod: 0 - CpuQuota: 0 - CpuRealtimePeriod: 0 - CpuRealtimeRuntime: 0 - CpuShares: 0 - CpusetCpus: '' - CpusetMems: '' - DeviceCgroupRules: null - DeviceRequests: null - Devices: null - Dns: [] - DnsOptions: [] - DnsSearch: [] - ExtraHosts: [] - GroupAdd: null - IOMaximumBandwidth: 0 - IOMaximumIOps: 0 - IpcMode: private - Isolation: '' - Links: null - LogConfig: - Config: {} - Type: json-file - MaskedPaths: - - /proc/acpi - - /proc/asound - - /proc/interrupts - - /proc/kcore - - /proc/keys - - /proc/latency_stats - - /proc/sched_debug - - /proc/scsi - - /proc/timer_list - - /proc/timer_stats - - /sys/devices/virtual/powercap - - /sys/firmware - - /sys/devices/system/cpu/cpu0/thermal_throttle - - /sys/devices/system/cpu/cpu1/thermal_throttle - - /sys/devices/system/cpu/cpu2/thermal_throttle - - /sys/devices/system/cpu/cpu3/thermal_throttle - Memory: 0 - MemoryReservation: 0 - MemorySwap: 0 - MemorySwappiness: null - NanoCpus: 0 - NetworkMode: proxy-net - OomKillDisable: null - OomScoreAdj: 0 - PidMode: '' - PidsLimit: null - PortBindings: - 443/tcp: - - HostIp: '' - HostPort: '443' - 80/tcp: - - HostIp: '' - HostPort: '80' - Privileged: false - PublishAllPorts: false - ReadonlyPaths: - - /proc/bus - - /proc/fs - - /proc/irq - - /proc/sys - - /proc/sysrq-trigger - ReadonlyRootfs: false - RestartPolicy: - MaximumRetryCount: 0 - Name: unless-stopped - Runtime: runc - SecurityOpt: null - ShmSize: 67108864 - UTSMode: '' - Ulimits: null - UsernsMode: '' - VolumeDriver: '' - 
VolumesFrom: null - HostnamePath: /var/lib/docker/containers/f0c70cc4667e2bfb834ed92486be28d836c399dbeb84fa26bd84f49579562c64/hostname - HostsPath: /var/lib/docker/containers/f0c70cc4667e2bfb834ed92486be28d836c399dbeb84fa26bd84f49579562c64/hosts - Id: f0c70cc4667e2bfb834ed92486be28d836c399dbeb84fa26bd84f49579562c64 - Image: sha256:67622638cd88dbfcfba40159bc652ecf0aea0e032f8a3c7e3134ae7c037b9910 - ImageManifestDescriptor: - annotations: - com.docker.official-images.bashbrew.arch: amd64 - org.opencontainers.image.base.digest: sha256:1882fa4569e0c591ea092d3766c4893e19b8901a8e649de7067188aba3cc0679 - org.opencontainers.image.base.name: alpine:3.23 - org.opencontainers.image.created: '2025-12-18T00:37:28Z' - org.opencontainers.image.revision: 87ae3f90a938b0159e557ba5b6abcfd63effb714 - org.opencontainers.image.source: https://github.com/traefik/traefik-library-image.git#87ae3f90a938b0159e557ba5b6abcfd63effb714:v3.6/alpine - org.opencontainers.image.url: https://hub.docker.com/_/traefik - org.opencontainers.image.version: v3.6.5 - digest: sha256:d944e3693bbf5a361ddd2e411bb713049cfb4f5ff3da200b30ee7a347dbd6abd - mediaType: application/vnd.oci.image.manifest.v1+json - platform: - architecture: amd64 - os: linux - size: 1728 - LogPath: /var/lib/docker/containers/f0c70cc4667e2bfb834ed92486be28d836c399dbeb84fa26bd84f49579562c64/f0c70cc4667e2bfb834ed92486be28d836c399dbeb84fa26bd84f49579562c64-json.log - MountLabel: '' - Mounts: - - Destination: /certs - Mode: rw - Propagation: rprivate - RW: true - Source: /home/chester/traefik/traefik-data/certs - Type: bind - - Destination: /dynamic - Mode: ro - Propagation: rprivate - RW: false - Source: /home/chester/traefik/traefik-data/dynamic - Type: bind - - Destination: /traefik.yml - Mode: ro - Propagation: rprivate - RW: false - Source: /home/chester/traefik/traefik.yml - Type: bind - - Destination: /var/log/traefik - Mode: rw - Propagation: rprivate - RW: true - Source: /home/chester/traefik/traefik-data/access-logs - Type: bind - 
Name: /traefik - NetworkSettings: - Networks: - proxy-net: - Aliases: - - traefik - - traefik - DNSNames: - - traefik - - f0c70cc4667e - DriverOpts: null - EndpointID: 85312d375679f81387f54387dc176918f159b3c5527b527a10da91b36dc3c8f5 - Gateway: 172.18.0.1 - GlobalIPv6Address: '' - GlobalIPv6PrefixLen: 0 - GwPriority: 0 - IPAMConfig: null - IPAddress: 172.18.0.3 - IPPrefixLen: 16 - IPv6Gateway: '' - Links: null - MacAddress: c2:85:cb:12:fe:61 - NetworkID: c451239da54e830d98844b541d0b707cc63426ce475d5103dc86300c0ebb7160 - Ports: - 443/tcp: - - HostIp: 0.0.0.0 - HostPort: '443' - - HostIp: '::' - HostPort: '443' - 80/tcp: - - HostIp: 0.0.0.0 - HostPort: '80' - - HostIp: '::' - HostPort: '80' - SandboxID: 39e089426b97fd8075a6b4fad29d0cdc3fa77b73e28f8ef96bef68e3418b7fb1 - SandboxKey: /var/run/docker/netns/39e089426b97 - Path: /entrypoint.sh - Platform: linux - ProcessLabel: '' - ResolvConfPath: /var/lib/docker/containers/f0c70cc4667e2bfb834ed92486be28d836c399dbeb84fa26bd84f49579562c64/resolv.conf - RestartCount: 0 - State: - Dead: false - Error: '' - ExitCode: 0 - FinishedAt: '2026-02-21T18:15:51.551714695Z' - OOMKilled: false - Paused: false - Pid: 1213 - Restarting: false - Running: true - StartedAt: '2026-02-21T18:30:42.488013871Z' - Status: running - Storage: - RootFS: - Snapshot: - Name: overlayfs -- AppArmorProfile: unconfined - Args: - - haproxy - - -f - - /tmp/haproxy.cfg - Config: - AttachStderr: true - AttachStdin: false - AttachStdout: true - Cmd: - - haproxy - - -f - - /tmp/haproxy.cfg - Domainname: '' - Entrypoint: - - docker-entrypoint.sh - Env: - - INFO=1 - - SERVICES=1 - - TASKS=1 - - PING=1 - - AUTH=1 - - VERSION=1 - - EVENTS=1 - - NETWORKS=1 - - CONTAINERS=1 - - VOLUMES=1 - - PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin - - HAPROXY_VERSION=3.2.4 - - HAPROXY_URL=https://www.haproxy.org/download/3.2/src/haproxy-3.2.4.tar.gz - - HAPROXY_SHA256=5d4b2ee6fe56b8098ebb9c91a899d728f87d64cd7be8804d2ddcc5f937498c1d - - ALLOW_RESTARTS=0 - - 
ALLOW_STOP=0 - - ALLOW_START=0 - - BUILD=0 - - COMMIT=0 - - CONFIGS=0 - - DISABLE_IPV6=0 - - DISTRIBUTION=0 - - EXEC=0 - - GRPC=0 - - IMAGES=0 - - LOG_LEVEL=info - - NODES=0 - - PLUGINS=0 - - POST=0 - - SECRETS=0 - - SESSION=0 - - SOCKET_PATH=/var/run/docker.sock - - SWARM=0 - - SYSTEM=0 - ExposedPorts: - 2375/tcp: {} - Hostname: f59c3a7d4c30 - Image: tecnativa/docker-socket-proxy:latest - Labels: - com.docker.compose.config-hash: 711c15ad420cb4274f3a65832d36be4bc31327a53f09b84b803d0e1ab18a0917 - com.docker.compose.container-number: '1' - com.docker.compose.depends_on: '' - com.docker.compose.image: sha256:1f3a6f303320723d199d2316a3e82b2e2685d86c275d5e3deeaf182573b47476 - com.docker.compose.oneoff: 'False' - com.docker.compose.project: traefik - com.docker.compose.project.config_files: /home/chester/traefik/docker-compose.yml - com.docker.compose.project.working_dir: /home/chester/traefik - com.docker.compose.replace: docker-socket-proxy - com.docker.compose.service: docker-socket-proxy - com.docker.compose.version: 5.0.2 - org.opencontainers.image.created: '2025-12-16T07:26:21.623Z' - org.opencontainers.image.description: Proxy over your Docker socket to - restrict which requests it accepts - org.opencontainers.image.licenses: Apache-2.0 - org.opencontainers.image.revision: 2f04313b042c1bf4dfbd039475dfc42db79bde7a - org.opencontainers.image.source: https://github.com/Tecnativa/docker-socket-proxy - org.opencontainers.image.title: docker-socket-proxy - org.opencontainers.image.url: https://github.com/Tecnativa/docker-socket-proxy - org.opencontainers.image.version: v0.4.2 - OpenStdin: false - StdinOnce: false - StopSignal: SIGUSR1 - Tty: false - User: 0:0 - Volumes: null - WorkingDir: /var/lib/haproxy - Created: '2026-01-28T00:34:44.663698444Z' - Driver: overlayfs - ExecIDs: null - HostConfig: - AutoRemove: false - Binds: - - /var/run/docker.sock:/var/run/docker.sock:rw - BlkioDeviceReadBps: null - BlkioDeviceReadIOps: null - BlkioDeviceWriteBps: null - 
BlkioDeviceWriteIOps: null - BlkioWeight: 0 - BlkioWeightDevice: null - CapAdd: null - CapDrop: null - Cgroup: '' - CgroupParent: '' - CgroupnsMode: private - ConsoleSize: - - 0 - - 0 - ContainerIDFile: '' - CpuCount: 0 - CpuPercent: 0 - CpuPeriod: 0 - CpuQuota: 0 - CpuRealtimePeriod: 0 - CpuRealtimeRuntime: 0 - CpuShares: 0 - CpusetCpus: '' - CpusetMems: '' - DeviceCgroupRules: null - DeviceRequests: null - Devices: null - Dns: [] - DnsOptions: [] - DnsSearch: [] - ExtraHosts: [] - GroupAdd: - - '988' - IOMaximumBandwidth: 0 - IOMaximumIOps: 0 - IpcMode: private - Isolation: '' - Links: null - LogConfig: - Config: {} - Type: json-file - MaskedPaths: null - Memory: 0 - MemoryReservation: 0 - MemorySwap: 0 - MemorySwappiness: null - NanoCpus: 0 - NetworkMode: proxy-net - OomKillDisable: null - OomScoreAdj: 0 - PidMode: '' - PidsLimit: null - PortBindings: {} - Privileged: true - PublishAllPorts: false - ReadonlyPaths: null - ReadonlyRootfs: false - RestartPolicy: - MaximumRetryCount: 0 - Name: unless-stopped - Runtime: runc - SecurityOpt: - - apparmor=unconfined - - label=disable - ShmSize: 67108864 - UTSMode: '' - Ulimits: null - UsernsMode: host - VolumeDriver: '' - VolumesFrom: null - HostnamePath: /var/lib/docker/containers/f59c3a7d4c3036a26bb8f060aa209b06bcb52d9d0bc41e32a750b36f4df3ae56/hostname - HostsPath: /var/lib/docker/containers/f59c3a7d4c3036a26bb8f060aa209b06bcb52d9d0bc41e32a750b36f4df3ae56/hosts - Id: f59c3a7d4c3036a26bb8f060aa209b06bcb52d9d0bc41e32a750b36f4df3ae56 - Image: sha256:1f3a6f303320723d199d2316a3e82b2e2685d86c275d5e3deeaf182573b47476 - ImageManifestDescriptor: - digest: sha256:bd2241b3bec83abcff25927a0a7ae518e0c5bef624b3cc247dcb31e68b53f417 - mediaType: application/vnd.oci.image.manifest.v1+json - platform: - architecture: amd64 - os: linux - size: 1993 - LogPath: /var/lib/docker/containers/f59c3a7d4c3036a26bb8f060aa209b06bcb52d9d0bc41e32a750b36f4df3ae56/f59c3a7d4c3036a26bb8f060aa209b06bcb52d9d0bc41e32a750b36f4df3ae56-json.log - MountLabel: 
'' - Mounts: - - Destination: /var/run/docker.sock - Mode: rw - Propagation: rprivate - RW: true - Source: /var/run/docker.sock - Type: bind - Name: /docker-socket-proxy - NetworkSettings: - Networks: - proxy-net: - Aliases: - - docker-socket-proxy - - docker-socket-proxy - DNSNames: - - docker-socket-proxy - - f59c3a7d4c30 - DriverOpts: null - EndpointID: cb18a5396cca6ed0b3c3502b8e8e2d46eb39a5afaa7350e2dd2ea9ee5448d7d3 - Gateway: 172.18.0.1 - GlobalIPv6Address: '' - GlobalIPv6PrefixLen: 0 - GwPriority: 0 - IPAMConfig: null - IPAddress: 172.18.0.2 - IPPrefixLen: 16 - IPv6Gateway: '' - Links: null - MacAddress: 42:a5:f6:d2:52:08 - NetworkID: c451239da54e830d98844b541d0b707cc63426ce475d5103dc86300c0ebb7160 - Ports: - 2375/tcp: null - SandboxID: e0902b280ba958f8f4ee51c20eb33a563b8bfc1717f3fbf4dd012a05672f3e74 - SandboxKey: /var/run/docker/netns/e0902b280ba9 - Path: docker-entrypoint.sh - Platform: linux - ProcessLabel: '' - ResolvConfPath: /var/lib/docker/containers/f59c3a7d4c3036a26bb8f060aa209b06bcb52d9d0bc41e32a750b36f4df3ae56/resolv.conf - RestartCount: 0 - State: - Dead: false - Error: '' - ExitCode: 0 - FinishedAt: '2026-02-21T18:16:00.055009796Z' - OOMKilled: false - Paused: false - Pid: 1225 - Restarting: false - Running: true - StartedAt: '2026-02-21T18:30:42.49130796Z' - Status: running - Storage: - RootFS: - Snapshot: - Name: overlayfs -- AppArmorProfile: docker-default - Args: - - redis-server - - --appendonly - - 'yes' - Config: - AttachStderr: true - AttachStdin: false - AttachStdout: true - Cmd: - - redis-server - - --appendonly - - 'yes' - Domainname: '' - Entrypoint: - - docker-entrypoint.sh - Env: - - PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin - - GOSU_VERSION=1.17 - - REDIS_VERSION=7.4.7 - - REDIS_DOWNLOAD_URL=http://download.redis.io/releases/redis-7.4.7.tar.gz - - REDIS_DOWNLOAD_SHA=c97e57b0df330a9e091cacff012bebe763c275398cf36ff44cdba876814b595b - ExposedPorts: - 6379/tcp: {} - Healthcheck: - Interval: 10000000000 - 
Retries: 5 - Test: - - CMD - - redis-cli - - ping - Timeout: 5000000000 - Hostname: 57439684f5ef - Image: redis:7-alpine - Labels: - com.docker.compose.config-hash: eb5826610c0f348a70810f75902caa3d6b889a5e442c0d9ddc539355c0113f49 - com.docker.compose.container-number: '1' - com.docker.compose.depends_on: '' - com.docker.compose.image: sha256:ee64a64eaab618d88051c3ade8f6352d11531fcf79d9a4818b9b183d8c1d18ba - com.docker.compose.oneoff: 'False' - com.docker.compose.project: traefik - com.docker.compose.project.config_files: /home/chester/traefik/docker-compose.yml - com.docker.compose.project.working_dir: /home/chester/traefik - com.docker.compose.replace: redis - com.docker.compose.service: redis - com.docker.compose.version: 5.0.2 - OpenStdin: false - StdinOnce: false - Tty: false - User: '' - Volumes: - /data: {} - WorkingDir: /data - Created: '2026-01-28T00:34:44.662867915Z' - Driver: overlayfs - ExecIDs: null - HostConfig: - AutoRemove: false - Binds: - - traefik_redis-data:/data:rw - BlkioDeviceReadBps: null - BlkioDeviceReadIOps: null - BlkioDeviceWriteBps: null - BlkioDeviceWriteIOps: null - BlkioWeight: 0 - BlkioWeightDevice: null - CapAdd: null - CapDrop: null - Cgroup: '' - CgroupParent: '' - CgroupnsMode: private - ConsoleSize: - - 0 - - 0 - ContainerIDFile: '' - CpuCount: 0 - CpuPercent: 0 - CpuPeriod: 0 - CpuQuota: 0 - CpuRealtimePeriod: 0 - CpuRealtimeRuntime: 0 - CpuShares: 0 - CpusetCpus: '' - CpusetMems: '' - DeviceCgroupRules: null - DeviceRequests: null - Devices: null - Dns: [] - DnsOptions: [] - DnsSearch: [] - ExtraHosts: [] - GroupAdd: null - IOMaximumBandwidth: 0 - IOMaximumIOps: 0 - IpcMode: private - Isolation: '' - Links: null - LogConfig: - Config: {} - Type: json-file - MaskedPaths: - - /proc/acpi - - /proc/asound - - /proc/interrupts - - /proc/kcore - - /proc/keys - - /proc/latency_stats - - /proc/sched_debug - - /proc/scsi - - /proc/timer_list - - /proc/timer_stats - - /sys/devices/virtual/powercap - - /sys/firmware - - 
/sys/devices/system/cpu/cpu0/thermal_throttle - - /sys/devices/system/cpu/cpu1/thermal_throttle - - /sys/devices/system/cpu/cpu2/thermal_throttle - - /sys/devices/system/cpu/cpu3/thermal_throttle - Memory: 0 - MemoryReservation: 0 - MemorySwap: 0 - MemorySwappiness: null - NanoCpus: 0 - NetworkMode: proxy-net - OomKillDisable: null - OomScoreAdj: 0 - PidMode: '' - PidsLimit: null - PortBindings: - 6379/tcp: - - HostIp: '' - HostPort: '6379' - Privileged: false - PublishAllPorts: false - ReadonlyPaths: - - /proc/bus - - /proc/fs - - /proc/irq - - /proc/sys - - /proc/sysrq-trigger - ReadonlyRootfs: false - RestartPolicy: - MaximumRetryCount: 0 - Name: unless-stopped - Runtime: runc - SecurityOpt: null - ShmSize: 67108864 - UTSMode: '' - Ulimits: null - UsernsMode: '' - VolumeDriver: '' - VolumesFrom: null - HostnamePath: /var/lib/docker/containers/57439684f5eff5afa67108c958725c641ff4b0299917774c93d91d5ce7b614b2/hostname - HostsPath: /var/lib/docker/containers/57439684f5eff5afa67108c958725c641ff4b0299917774c93d91d5ce7b614b2/hosts - Id: 57439684f5eff5afa67108c958725c641ff4b0299917774c93d91d5ce7b614b2 - Image: sha256:ee64a64eaab618d88051c3ade8f6352d11531fcf79d9a4818b9b183d8c1d18ba - ImageManifestDescriptor: - annotations: - com.docker.official-images.bashbrew.arch: amd64 - org.opencontainers.image.base.digest: sha256:41c81533144786e0beb2b148667355a6c7659aa99a14ed837ff15a98ca9d71f3 - org.opencontainers.image.base.name: alpine:3.21 - org.opencontainers.image.created: '2025-11-03T17:38:49Z' - org.opencontainers.image.revision: d42d7aec93b1c54dd46f37a66a92f62478456039 - org.opencontainers.image.source: https://github.com/redis/docker-library-redis.git#d42d7aec93b1c54dd46f37a66a92f62478456039:7.4/alpine - org.opencontainers.image.url: https://hub.docker.com/_/redis - org.opencontainers.image.version: 7.4.7-alpine - digest: sha256:4706ecab5371690fecfdd782268929c94ad5b5ce9ce0b35bfdfe191c4ad17851 - mediaType: application/vnd.oci.image.manifest.v1+json - platform: - 
architecture: amd64 - os: linux - size: 2483 - LogPath: /var/lib/docker/containers/57439684f5eff5afa67108c958725c641ff4b0299917774c93d91d5ce7b614b2/57439684f5eff5afa67108c958725c641ff4b0299917774c93d91d5ce7b614b2-json.log - MountLabel: '' - Mounts: - - Destination: /data - Driver: local - Mode: rw - Name: traefik_redis-data - Propagation: '' - RW: true - Source: /var/lib/docker/volumes/traefik_redis-data/_data - Type: volume - Name: /redis - NetworkSettings: - Networks: - proxy-net: - Aliases: - - redis - - redis - DNSNames: - - redis - - 57439684f5ef - DriverOpts: null - EndpointID: 7f950d9aab3bf29937a2c66723f8fd483984fa9ccd74a859166e810c77a9ca0b - Gateway: 172.18.0.1 - GlobalIPv6Address: '' - GlobalIPv6PrefixLen: 0 - GwPriority: 0 - IPAMConfig: null - IPAddress: 172.18.0.4 - IPPrefixLen: 16 - IPv6Gateway: '' - Links: null - MacAddress: e2:9b:a3:07:2f:81 - NetworkID: c451239da54e830d98844b541d0b707cc63426ce475d5103dc86300c0ebb7160 - Ports: - 6379/tcp: - - HostIp: 0.0.0.0 - HostPort: '6379' - - HostIp: '::' - HostPort: '6379' - SandboxID: dfafbd7bf0a46788747bcf7e8cbe9dcfc05886cdbb73add6cde8d3f50eeed30d - SandboxKey: /var/run/docker/netns/dfafbd7bf0a4 - Path: docker-entrypoint.sh - Platform: linux - ProcessLabel: '' - ResolvConfPath: /var/lib/docker/containers/57439684f5eff5afa67108c958725c641ff4b0299917774c93d91d5ce7b614b2/resolv.conf - RestartCount: 0 - State: - Dead: false - Error: '' - ExitCode: 0 - FinishedAt: '2026-02-21T18:15:50.121096266Z' - Health: - FailingStreak: 0 - Log: - - End: '2026-03-12T21:18:28.607327472Z' - ExitCode: 0 - Output: 'PONG - - ' - Start: '2026-03-12T21:18:28.555451253Z' - - End: '2026-03-12T21:18:38.654395517Z' - ExitCode: 0 - Output: 'PONG - - ' - Start: '2026-03-12T21:18:38.60798899Z' - - End: '2026-03-12T21:18:48.712837864Z' - ExitCode: 0 - Output: 'PONG - - ' - Start: '2026-03-12T21:18:48.655551711Z' - - End: '2026-03-12T21:18:58.75775082Z' - ExitCode: 0 - Output: 'PONG - - ' - Start: '2026-03-12T21:18:58.713415195Z' - - End: 
'2026-03-12T21:19:08.803904596Z' - ExitCode: 0 - Output: 'PONG - - ' - Start: '2026-03-12T21:19:08.758205815Z' - Status: healthy - OOMKilled: false - Paused: false - Pid: 1220 - Restarting: false - Running: true - StartedAt: '2026-02-21T18:30:42.486966925Z' - Status: running - Storage: - RootFS: - Snapshot: - Name: overlayfs diff --git a/ansible/archive/outputs/heimdall-baseline-20260312T211908/docker_info.yml b/ansible/archive/outputs/heimdall-baseline-20260312T211908/docker_info.yml deleted file mode 100644 index c526144..0000000 --- a/ansible/archive/outputs/heimdall-baseline-20260312T211908/docker_info.yml +++ /dev/null @@ -1,8 +0,0 @@ -cgroup_driver: systemd -containers_running: 4 -containers_total: 4 -daemon_config: {} -logging_driver: json-file -server_version: 29.2.0 -storage_driver: overlayfs -swarm_state: inactive diff --git a/ansible/archive/outputs/heimdall-baseline-20260312T211908/env_keys/_home_chester_traefik_.env.redacted b/ansible/archive/outputs/heimdall-baseline-20260312T211908/env_keys/_home_chester_traefik_.env.redacted deleted file mode 100644 index d6e6492..0000000 --- a/ansible/archive/outputs/heimdall-baseline-20260312T211908/env_keys/_home_chester_traefik_.env.redacted +++ /dev/null @@ -1,7 +0,0 @@ -# Env key inventory - values REDACTED for security -# Source: /home/chester/traefik/.env -# Host: heimdall | Captured: 2026-03-12T21:19:10Z -# -# To restore secrets: ansible-vault encrypt_string '' --name '' -CLOUDFLARE_DNS_API_TOKEN= -CLOUDFLARE_ZONE_API_TOKEN= diff --git a/ansible/archive/outputs/heimdall-baseline-20260312T211908/firewall_rules.txt b/ansible/archive/outputs/heimdall-baseline-20260312T211908/firewall_rules.txt deleted file mode 100644 index 653a8bc..0000000 --- a/ansible/archive/outputs/heimdall-baseline-20260312T211908/firewall_rules.txt +++ /dev/null @@ -1,49 +0,0 @@ -# Firewall state on heimdall -# Captured: 2026-03-12T21:19:10Z - -## UFW STATUS -Status: inactive - -## IPTABLES (reference) -Chain INPUT (policy ACCEPT) 
-num target prot opt source destination - -Chain FORWARD (policy DROP) -num target prot opt source destination -1 DOCKER-USER 0 -- 0.0.0.0/0 0.0.0.0/0 -2 DOCKER-FORWARD 0 -- 0.0.0.0/0 0.0.0.0/0 - -Chain OUTPUT (policy ACCEPT) -num target prot opt source destination - -Chain DOCKER (2 references) -num target prot opt source destination -1 ACCEPT 6 -- 0.0.0.0/0 172.18.0.4 tcp dpt:6379 -2 ACCEPT 6 -- 0.0.0.0/0 172.18.0.3 tcp dpt:443 -3 ACCEPT 6 -- 0.0.0.0/0 172.18.0.3 tcp dpt:80 -4 DROP 0 -- 0.0.0.0/0 0.0.0.0/0 -5 DROP 0 -- 0.0.0.0/0 0.0.0.0/0 - -Chain DOCKER-BRIDGE (1 references) -num target prot opt source destination -1 DOCKER 0 -- 0.0.0.0/0 0.0.0.0/0 -2 DOCKER 0 -- 0.0.0.0/0 0.0.0.0/0 - -Chain DOCKER-CT (1 references) -num target prot opt source destination -1 ACCEPT 0 -- 0.0.0.0/0 0.0.0.0/0 ctstate RELATED,ESTABLISHED -2 ACCEPT 0 -- 0.0.0.0/0 0.0.0.0/0 ctstate RELATED,ESTABLISHED - -Chain DOCKER-FORWARD (1 references) -num target prot opt source destination -1 DOCKER-CT 0 -- 0.0.0.0/0 0.0.0.0/0 -2 DOCKER-INTERNAL 0 -- 0.0.0.0/0 0.0.0.0/0 -3 DOCKER-BRIDGE 0 -- 0.0.0.0/0 0.0.0.0/0 -4 ACCEPT 0 -- 0.0.0.0/0 0.0.0.0/0 -5 ACCEPT 0 -- 0.0.0.0/0 0.0.0.0/0 - -Chain DOCKER-INTERNAL (1 references) -num target prot opt source destination - -Chain DOCKER-USER (1 references) -num target prot opt source destination diff --git a/ansible/archive/outputs/heimdall-baseline-20260312T211908/host_facts.yml b/ansible/archive/outputs/heimdall-baseline-20260312T211908/host_facts.yml deleted file mode 100644 index 8a3c0e3..0000000 --- a/ansible/archive/outputs/heimdall-baseline-20260312T211908/host_facts.yml +++ /dev/null @@ -1,36 +0,0 @@ -ansible_user: root -architecture: x86_64 -cpu_vcpus: 4 -default_ipv4: - address: 10.0.0.151 - alias: enp1s0 - broadcast: 10.0.0.255 - gateway: 10.0.0.2 - interface: enp1s0 - macaddress: 7c:83:34:bf:79:a5 - mtu: 1500 - netmask: 255.255.255.0 - network: 10.0.0.0 - prefix: '24' - type: ether -distribution: Ubuntu -distribution_release: noble 
-distribution_version: '24.04' -fqdn: heimdall -hostname: heimdall -interfaces: -- veth57f15b2 -- wlo1 -- veth2088d3d -- enp1s0 -- lo -- vethe43b71e -- br-c451239da54e -- enp2s0 -- docker0 -kernel: 6.8.0-100-generic -memory_free_mb: 377 -memory_total_mb: 15767 -os_family: Debian -python_version: 3.12.3 -uptime_seconds: 1651833 diff --git a/ansible/archive/outputs/heimdall-baseline-20260312T211908/manifest.yml b/ansible/archive/outputs/heimdall-baseline-20260312T211908/manifest.yml deleted file mode 100644 index 05bf2a0..0000000 --- a/ansible/archive/outputs/heimdall-baseline-20260312T211908/manifest.yml +++ /dev/null @@ -1,61 +0,0 @@ ---- ---- -# Heimdall baseline capture manifest -# Generated: 2026-03-12T21:19:10Z -# Host: heimdall (10.0.0.151) -# Review this file before proceeding to heimdall_edge role refactor. - -capture_timestamp: "2026-03-12T21:19:10Z" -capture_dir: "/home/chester/homelab/ansible/playbooks/preflight/../../outputs/heimdall-baseline-20260312T211908" - -host: - hostname: "heimdall" - ip: "10.0.0.151" - os: "Ubuntu 24.04" - kernel: "6.8.0-100-generic" - -docker: - version: "29.2.0" - storage_driver: "overlayfs" - swarm_state: "inactive" - containers_running: 4 - containers_total: 4 - -inventory: - containers_found: 4 - compose_files_found: 2 - env_files_found: 2 - -critical_paths: -/etc/docker/daemon.json: false - /home/chester/traefik: true - /home/chester/traefik/.env: true - /home/chester/traefik/docker-compose.yml: true - /opt/stacks/heimdall: false - /opt/stacks/heimdall/.env: false - /opt/stacks/heimdall/docker-compose.yml: false - /opt/stacks/heimdall/redis-data: false - /opt/stacks/heimdall/runner-data: false - /opt/stacks/heimdall/traefik-certs: false - /opt/stacks/heimdall/traefik-certs/acme.json: false - -compose_file_paths: -- /home/chester/traefik/docker-compose.yml - - /home/chester/traefik/docker-compose.yml - -env_file_paths: -- /home/chester/traefik/.env - - /home/chester/traefik/.env - -containers_running: -- node-exporter - - 
traefik - - docker-socket-proxy - - redis - -validation: - compose_files_present: True - containers_present: True - stack_dir_present: False - compose_present: False - env_present: False diff --git a/ansible/archive/outputs/heimdall-baseline-20260312T211908/networks_and_volumes.yml b/ansible/archive/outputs/heimdall-baseline-20260312T211908/networks_and_volumes.yml deleted file mode 100644 index 8e23c08..0000000 --- a/ansible/archive/outputs/heimdall-baseline-20260312T211908/networks_and_volumes.yml +++ /dev/null @@ -1,25 +0,0 @@ ---- -# Docker network and volume inventory -# Host: heimdall | Captured: 2026-03-12T21:19:10Z - -networks: -- Driver: bridge - Id: 4f3815cff81bd0c59f62e0151bc58bc0289eca4634f77bf544e1fc3e34c0bab7 - Name: bridge - Scope: local - - Driver: 'null' - Id: a55e7a3ec6e204eae20086edec67507e3c7ef59f5e383d4b8631d614c657e0d0 - Name: none - Scope: local - - Driver: host - Id: b63c150f50197cfb21939a1369d37f0a309118dfb79be11d4c6082d963f8f70a - Name: host - Scope: local - - Driver: bridge - Id: c451239da54e830d98844b541d0b707cc63426ce475d5103dc86300c0ebb7160 - Name: proxy-net - Scope: local - -volumes: -- Driver: local - Name: traefik_redis-data diff --git a/ansible/archive/outputs/heimdall-baseline-20260312T211908/systemd_units.txt b/ansible/archive/outputs/heimdall-baseline-20260312T211908/systemd_units.txt deleted file mode 100644 index dba47a0..0000000 --- a/ansible/archive/outputs/heimdall-baseline-20260312T211908/systemd_units.txt +++ /dev/null @@ -1,153 +0,0 @@ - UNIT LOAD ACTIVE SUB DESCRIPTION - apparmor.service loaded active exited Load AppArmor profiles - apport-autoreport.service loaded inactive dead Process error reports when automatic reporting is enabled - apport.service loaded active exited automatic crash report generation - apt-daily-upgrade.service loaded inactive dead Daily apt upgrade and clean activities - apt-daily.service loaded inactive dead Daily apt download activities - blk-availability.service loaded active exited 
Availability of block devices - cloud-init-local.service loaded inactive dead Cloud-init: Local Stage (pre-network) - console-setup.service loaded active exited Set console font and keymap - containerd.service loaded active running containerd container runtime - cron.service loaded active running Regular background program processing daemon - dbus.service loaded active running D-Bus System Message Bus - dm-event.service loaded inactive dead Device-mapper event daemon - dmesg.service loaded inactive dead Save initial kernel messages after boot - docker.service loaded active running Docker Application Container Engine - dpkg-db-backup.service loaded inactive dead Daily dpkg database backup service - e2scrub_all.service loaded inactive dead Online ext4 Metadata Check for All Filesystems - e2scrub_reap.service loaded inactive dead Remove Stale Online ext4 Metadata Check Snapshots - emergency.service loaded inactive dead Emergency Shell - finalrd.service loaded active exited Create final runtime dir for shutdown pivot root - fstrim.service loaded inactive dead Discard unused blocks on filesystems from /etc/fstab - fwupd-refresh.service loaded inactive dead Refresh fwupd metadata and update motd - getty-static.service loaded inactive dead getty on tty2-tty6 if dbus and logind are not available - getty@tty1.service loaded active running Getty on tty1 - grub-common.service loaded inactive dead Record successful boot for GRUB - grub-initrd-fallback.service loaded inactive dead GRUB failed boot detection - initrd-cleanup.service loaded inactive dead Cleaning Up and Shutting Down Daemons - initrd-parse-etc.service loaded inactive dead Mountpoints Configured in the Real Root - initrd-switch-root.service loaded inactive dead Switch Root - initrd-udevadm-cleanup-db.service loaded inactive dead Cleanup udev Database - iscsid.service loaded inactive dead iSCSI initiator daemon (iscsid) - keyboard-setup.service loaded active exited Set the console keyboard layout - 
kmod-static-nodes.service loaded active exited Create List of Static Device Nodes - ldconfig.service loaded inactive dead Rebuild Dynamic Linker Cache - logrotate.service loaded inactive dead Rotate log files - lvm2-lvmpolld.service loaded inactive dead LVM2 poll daemon - lvm2-monitor.service loaded active exited Monitoring of LVM2 mirrors, snapshots etc. using dmeventd or progress polling - man-db.service loaded inactive dead Daily man-db regeneration - ModemManager.service loaded active running Modem Manager - modprobe@configfs.service loaded inactive dead Load Kernel Module configfs - modprobe@dm_mod.service loaded inactive dead Load Kernel Module dm_mod - modprobe@drm.service loaded inactive dead Load Kernel Module drm - modprobe@efi_pstore.service loaded inactive dead Load Kernel Module efi_pstore - modprobe@fuse.service loaded inactive dead Load Kernel Module fuse - modprobe@loop.service loaded inactive dead Load Kernel Module loop - motd-news.service loaded inactive dead Message of the Day - multipathd.service loaded active running Device-Mapper Multipath Device Controller - netplan-ovs-cleanup.service loaded inactive dead OpenVSwitch configuration for cleanup - networkd-dispatcher.service loaded inactive dead Dispatcher daemon for systemd-networkd - open-iscsi.service loaded inactive dead Login to default iSCSI targets - open-vm-tools.service loaded inactive dead Service for virtual machines hosted on VMware - plymouth-quit-wait.service loaded active exited Hold until boot process finishes up - plymouth-quit.service loaded active exited Terminate Plymouth Boot Screen - plymouth-read-write.service loaded active exited Tell Plymouth To Write Out Runtime Data - plymouth-start.service loaded inactive dead Show Plymouth Boot Screen - plymouth-switch-root.service loaded inactive dead Plymouth switch root service - polkit.service loaded active running Authorization Manager - pollinate.service loaded inactive dead Pollinate to seed the pseudo random number 
generator - rc-local.service loaded inactive dead /etc/rc.local Compatibility - rescue.service loaded inactive dead Rescue Shell - rsyslog.service loaded active running System Logging Service - secureboot-db.service loaded inactive dead Secure Boot updates for DB and DBX - setvtrgb.service loaded active exited Set console scheme - snapd.apparmor.service loaded active exited Load AppArmor profiles managed internally by snapd - snapd.autoimport.service loaded inactive dead Auto import assertions from block devices - snapd.core-fixup.service loaded inactive dead Automatically repair incorrect owner/permissions on core devices - snapd.failure.service loaded inactive dead Failure handling of the snapd snap - snapd.recovery-chooser-trigger.service loaded inactive dead Wait for the Ubuntu Core chooser trigger - snapd.seeded.service loaded active exited Wait until snapd is fully seeded - snapd.service loaded inactive dead Snap Daemon - snapd.snap-repair.service loaded inactive dead Automatically fetch and run repair assertions - snapd.system-shutdown.service loaded inactive dead Ubuntu core (all-snaps) system shutdown helper setup service - ssh.service loaded active running OpenBSD Secure Shell server - sysstat-collect.service loaded inactive dead system activity accounting tool - sysstat-summary.service loaded inactive dead Generate a daily summary of process accounting - sysstat.service loaded active exited Resets System Activity Logs - systemd-ask-password-console.service loaded inactive dead Dispatch Password Requests to Console - systemd-ask-password-plymouth.service loaded inactive dead Forward Password Requests to Plymouth - systemd-ask-password-wall.service loaded inactive dead Forward Password Requests to Wall - systemd-battery-check.service loaded inactive dead Check battery level during early boot - systemd-binfmt.service loaded active exited Set Up Additional Binary Formats - systemd-bsod.service loaded inactive dead Displays emergency message in full screen. 
- systemd-firstboot.service loaded inactive dead First Boot Wizard - systemd-fsck-root.service loaded inactive dead File System Check on Root Device - systemd-fsck@dev-disk-by\x2duuid-36D5\x2d0248.service loaded active exited File System Check on /dev/disk/by-uuid/36D5-0248 - systemd-fsck@dev-disk-by\x2duuid-da3c4a6e\x2df851\x2d471f\x2d81e4\x2dcd9b3b26acf1.service loaded active exited File System Check on /dev/disk/by-uuid/da3c4a6e-f851-471f-81e4-cd9b3b26acf1 - systemd-fsckd.service loaded inactive dead File System Check Daemon to report status - systemd-hibernate-resume.service loaded inactive dead Resume from hibernation - systemd-hibernate.service loaded inactive dead System Hibernate - systemd-hwdb-update.service loaded inactive dead Rebuild Hardware Database - systemd-hybrid-sleep.service loaded inactive dead System Hybrid Suspend+Hibernate - systemd-initctl.service loaded inactive dead initctl Compatibility Daemon - systemd-journal-catalog-update.service loaded inactive dead Rebuild Journal Catalog - systemd-journal-flush.service loaded active exited Flush Journal to Persistent Storage - systemd-journald.service loaded active running Journal Service - systemd-logind.service loaded active running User Login Management - systemd-machine-id-commit.service loaded inactive dead Commit a transient machine-id on disk - systemd-modules-load.service loaded active exited Load Kernel Modules -● systemd-networkd-wait-online.service loaded failed failed Wait for Network to be Configured - systemd-networkd.service loaded active running Network Configuration - systemd-pcrmachine.service loaded inactive dead TPM2 PCR Machine ID Measurement - systemd-pcrphase-initrd.service loaded inactive dead TPM2 PCR Barrier (initrd) - systemd-pcrphase-sysinit.service loaded inactive dead TPM2 PCR Barrier (Initialization) - systemd-pcrphase.service loaded inactive dead TPM2 PCR Barrier (User) - systemd-pstore.service loaded inactive dead Platform Persistent Storage Archival - 
systemd-quotacheck.service loaded inactive dead File System Quota Check - systemd-random-seed.service loaded active exited Load/Save OS Random Seed - systemd-remount-fs.service loaded active exited Remount Root and Kernel File Systems - systemd-repart.service loaded inactive dead Repartition Root Disk - systemd-resolved.service loaded active running Network Name Resolution - systemd-rfkill.service loaded inactive dead Load/Save RF Kill Switch Status - systemd-soft-reboot.service loaded inactive dead Reboot System Userspace - systemd-suspend-then-hibernate.service loaded inactive dead System Suspend then Hibernate - systemd-suspend.service loaded inactive dead System Suspend - systemd-sysctl.service loaded active exited Apply Kernel Variables - systemd-sysext.service loaded inactive dead Merge System Extension Images into /usr/ and /opt/ - systemd-sysusers.service loaded inactive dead Create System Users - systemd-timesyncd.service loaded active running Network Time Synchronization - systemd-tmpfiles-clean.service loaded inactive dead Cleanup of Temporary Directories - systemd-tmpfiles-setup-dev-early.service loaded active exited Create Static Device Nodes in /dev gracefully - systemd-tmpfiles-setup-dev.service loaded active exited Create Static Device Nodes in /dev - systemd-tmpfiles-setup.service loaded active exited Create Volatile Files and Directories - systemd-tpm2-setup-early.service loaded inactive dead TPM2 SRK Setup (Early) - systemd-tpm2-setup.service loaded inactive dead TPM2 SRK Setup - systemd-udev-settle.service loaded inactive dead Wait for udev To Complete Device Initialization - systemd-udev-trigger.service loaded active exited Coldplug All udev Devices - systemd-udevd.service loaded active running Rule-based Manager for Device Events and Files - systemd-update-done.service loaded inactive dead Update is Completed - systemd-update-utmp-runlevel.service loaded inactive dead Record Runlevel Change in UTMP - systemd-update-utmp.service loaded active 
exited Record System Boot/Shutdown in UTMP - systemd-user-sessions.service loaded active exited Permit User Sessions - thermald.service loaded active running Thermal Daemon Service - tpm-udev.service loaded inactive dead Handle dynamically added tpm devices - ua-reboot-cmds.service loaded inactive dead Ubuntu Pro reboot cmds - ua-timer.service loaded inactive dead Ubuntu Pro Timer for running repeated jobs - ubuntu-advantage.service loaded inactive dead Ubuntu Pro Background Auto Attach - udisks2.service loaded active running Disk Manager - ufw.service loaded active exited Uncomplicated firewall - unattended-upgrades.service loaded active running Unattended Upgrades Shutdown - update-notifier-download.service loaded inactive dead Download data for packages that failed at package install time - update-notifier-motd.service loaded inactive dead Check to see whether there is a new version of Ubuntu available - upower.service loaded active running Daemon for power management - user-runtime-dir@1000.service loaded active exited User Runtime Directory /run/user/1000 - user@1000.service loaded active running User Manager for UID 1000 - uuidd.service loaded inactive dead Daemon for generating UUIDs - vgauth.service loaded inactive dead Authentication service for virtual machines hosted on VMware - wpa_supplicant.service loaded active running WPA supplicant - -Legend: LOAD → Reflects whether the unit definition was properly loaded. - ACTIVE → The high-level unit activation state, i.e. generalization of SUB. - SUB → The low-level unit activation state, values depend on unit type. - -146 loaded units listed. 
\ No newline at end of file diff --git a/ansible/archive/outputs/heimdall-baseline-20260312T214117/compose_files/_home_chester_traefik_docker-compose.yml b/ansible/archive/outputs/heimdall-baseline-20260312T214117/compose_files/_home_chester_traefik_docker-compose.yml deleted file mode 100644 index 0ca28c0..0000000 --- a/ansible/archive/outputs/heimdall-baseline-20260312T214117/compose_files/_home_chester_traefik_docker-compose.yml +++ /dev/null @@ -1,98 +0,0 @@ -services: - redis: - image: redis:7-alpine - container_name: redis - restart: unless-stopped - ports: - - "6379:6379" - networks: - - proxy-net - volumes: - - redis-data:/data - command: redis-server --appendonly yes - healthcheck: - test: ["CMD", "redis-cli", "ping"] - interval: 10s - timeout: 5s - retries: 5 - - docker-socket-proxy: - image: tecnativa/docker-socket-proxy:latest - container_name: docker-socket-proxy - restart: unless-stopped - userns_mode: "host" - user: "0:0" - security_opt: - - apparmor=unconfined - privileged: true - group_add: - - "988" - environment: - - CONTAINERS=1 - - SERVICES=1 - - TASKS=1 - - NETWORKS=1 - - EVENTS=1 - - VERSION=1 - - PING=1 - - AUTH=1 - - INFO=1 - - VOLUMES=1 - volumes: - - /var/run/docker.sock:/var/run/docker.sock - networks: - - proxy-net - - traefik: - image: traefik:v3.6.5 - container_name: traefik - restart: unless-stopped - user: "0:0" - read_only: false - depends_on: - redis: - condition: service_healthy - docker-socket-proxy: - condition: service_started - environment: - - DOCKER_HOST=tcp://docker-socket-proxy:2375 -# - DOCKER_API_VERSION=1.41 - - CLOUDFLARE_DNS_API_TOKEN=${CLOUDFLARE_DNS_API_TOKEN} - - CLOUDFLARE_ZONE_API_TOKEN=${CLOUDFLARE_DNS_API_TOKEN} - networks: - - proxy-net - ports: - - "80:80" - - "443:443" - volumes: - - ./traefik.yml:/traefik.yml:ro - - ./traefik-data/dynamic:/dynamic:ro - - ./traefik-data/certs:/certs - - ./traefik-data/access-logs:/var/log/traefik - labels: - - "traefik.enable=true" - # Dashboard - - 
"traefik.http.routers.traefik-secure.rule=Host(`proxy.castaldifamily.com`) && (PathPrefix(`/api`) || PathPrefix(`/dashboard`))" - - "traefik.http.routers.traefik-secure.entrypoints=websecure" - - "traefik.http.routers.traefik-secure.tls=true" - - "traefik.http.routers.traefik-secure.tls.certresolver=cloudflare" - - "traefik.http.routers.traefik-secure.service=api@internal" - - "traefik.http.routers.traefik-secure.middlewares=dashboard-auth@file,security-headers@file,ratelimit-basic@file,dashboard-slash@file" - # Root redirect - - "traefik.http.routers.traefik-root.rule=Host(`proxy.castaldifamily.com`) && Path(`/`)" - - "traefik.http.routers.traefik-root.entrypoints=websecure" - - "traefik.http.routers.traefik-root.tls=true" - - "traefik.http.routers.traefik-root.tls.certresolver=cloudflare" - - "traefik.http.routers.traefik-root.service=api@internal" - - "traefik.http.routers.traefik-root.middlewares=redirect-to-dashboard" - - "traefik.http.middlewares.redirect-to-dashboard.redirectregex.regex=^/$$" - - "traefik.http.middlewares.redirect-to-dashboard.redirectregex.replacement=/dashboard" - - "traefik.http.middlewares.redirect-to-dashboard.redirectregex.permanent=true" - -networks: - proxy-net: - driver: bridge - name: proxy-net - -volumes: - redis-data: diff --git a/ansible/archive/outputs/heimdall-baseline-20260312T214117/containers.yml b/ansible/archive/outputs/heimdall-baseline-20260312T214117/containers.yml deleted file mode 100644 index 231816f..0000000 --- a/ansible/archive/outputs/heimdall-baseline-20260312T214117/containers.yml +++ /dev/null @@ -1,975 +0,0 @@ -- AppArmorProfile: docker-default - Args: - - --path.procfs=/host/proc - - --path.sysfs=/host/sys - - --path.rootfs=/rootfs - - --collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/) - Config: - AttachStderr: true - AttachStdin: false - AttachStdout: true - Cmd: - - --path.procfs=/host/proc - - --path.sysfs=/host/sys - - --path.rootfs=/rootfs - - 
--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/) - Domainname: '' - Entrypoint: - - /bin/node_exporter - Env: - - PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin - ExposedPorts: - 9100/tcp: {} - Hostname: heimdall - Image: prom/node-exporter:latest - Labels: - maintainer: The Prometheus Authors - OpenStdin: false - StdinOnce: false - Tty: false - User: nobody - Volumes: null - WorkingDir: '' - Created: '2026-03-09T23:15:53.531184328Z' - Driver: overlayfs - ExecIDs: null - HostConfig: - AutoRemove: false - Binds: - - /proc:/host/proc:ro - - /sys:/host/sys:ro - - /:/rootfs:ro - BlkioDeviceReadBps: null - BlkioDeviceReadIOps: null - BlkioDeviceWriteBps: null - BlkioDeviceWriteIOps: null - BlkioWeight: 0 - BlkioWeightDevice: null - CapAdd: null - CapDrop: - - ALL - Cgroup: '' - CgroupParent: '' - CgroupnsMode: private - ConsoleSize: - - 0 - - 0 - ContainerIDFile: '' - CpuCount: 0 - CpuPercent: 0 - CpuPeriod: 0 - CpuQuota: 0 - CpuRealtimePeriod: 0 - CpuRealtimeRuntime: 0 - CpuShares: 0 - CpusetCpus: '' - CpusetMems: '' - DeviceCgroupRules: null - DeviceRequests: null - Devices: null - Dns: null - DnsOptions: null - DnsSearch: null - ExtraHosts: null - GroupAdd: null - IOMaximumBandwidth: 0 - IOMaximumIOps: 0 - IpcMode: private - Isolation: '' - Links: null - LogConfig: - Config: {} - Type: json-file - MaskedPaths: - - /proc/acpi - - /proc/asound - - /proc/interrupts - - /proc/kcore - - /proc/keys - - /proc/latency_stats - - /proc/sched_debug - - /proc/scsi - - /proc/timer_list - - /proc/timer_stats - - /sys/devices/virtual/powercap - - /sys/firmware - - /sys/devices/system/cpu/cpu0/thermal_throttle - - /sys/devices/system/cpu/cpu1/thermal_throttle - - /sys/devices/system/cpu/cpu2/thermal_throttle - - /sys/devices/system/cpu/cpu3/thermal_throttle - Memory: 134217728 - MemoryReservation: 0 - MemorySwap: 268435456 - MemorySwappiness: null - NanoCpus: 500000000 - NetworkMode: host - OomKillDisable: null - OomScoreAdj: 0 - 
PidMode: '' - PidsLimit: null - PortBindings: {} - Privileged: false - PublishAllPorts: false - ReadonlyPaths: - - /proc/bus - - /proc/fs - - /proc/irq - - /proc/sys - - /proc/sysrq-trigger - ReadonlyRootfs: true - RestartPolicy: - MaximumRetryCount: 0 - Name: unless-stopped - Runtime: runc - SecurityOpt: - - no-new-privileges:true - ShmSize: 67108864 - UTSMode: '' - Ulimits: null - UsernsMode: '' - VolumeDriver: '' - VolumesFrom: null - HostnamePath: /var/lib/docker/containers/3f397bc8b39d3a9ae4b903f1daf99fdfddd842cb86b549b86c7aba30fe4d7a4f/hostname - HostsPath: /var/lib/docker/containers/3f397bc8b39d3a9ae4b903f1daf99fdfddd842cb86b549b86c7aba30fe4d7a4f/hosts - Id: 3f397bc8b39d3a9ae4b903f1daf99fdfddd842cb86b549b86c7aba30fe4d7a4f - Image: sha256:3ac34ce007accad95afed72149e0d2b927b7e42fd1c866149b945b84737c62c3 - ImageManifestDescriptor: - digest: sha256:7bcf2839f207d926b908cd3c566c9f1577efb72268062be0c96cd3b17a5cb283 - mediaType: application/vnd.docker.distribution.manifest.v2+json - platform: - architecture: amd64 - os: linux - size: 949 - LogPath: /var/lib/docker/containers/3f397bc8b39d3a9ae4b903f1daf99fdfddd842cb86b549b86c7aba30fe4d7a4f/3f397bc8b39d3a9ae4b903f1daf99fdfddd842cb86b549b86c7aba30fe4d7a4f-json.log - MountLabel: '' - Mounts: - - Destination: /host/proc - Mode: ro - Propagation: rprivate - RW: false - Source: /proc - Type: bind - - Destination: /host/sys - Mode: ro - Propagation: rprivate - RW: false - Source: /sys - Type: bind - - Destination: /rootfs - Mode: ro - Propagation: rslave - RW: false - Source: / - Type: bind - Name: /node-exporter - NetworkSettings: - Networks: - host: - Aliases: null - DNSNames: null - DriverOpts: null - EndpointID: d2673440c953463f22ab1da395595e8f898bfab6baa043b2638fa2654fd04e4a - Gateway: '' - GlobalIPv6Address: '' - GlobalIPv6PrefixLen: 0 - GwPriority: 0 - IPAMConfig: null - IPAddress: '' - IPPrefixLen: 0 - IPv6Gateway: '' - Links: null - MacAddress: '' - NetworkID: 
b63c150f50197cfb21939a1369d37f0a309118dfb79be11d4c6082d963f8f70a - Ports: {} - SandboxID: 770e56f6832d109ab47e3b523e838be28d0bdf51a520cc5c9a07351bcb84f10d - SandboxKey: /var/run/docker/netns/default - Path: /bin/node_exporter - Platform: linux - ProcessLabel: '' - ResolvConfPath: /var/lib/docker/containers/3f397bc8b39d3a9ae4b903f1daf99fdfddd842cb86b549b86c7aba30fe4d7a4f/resolv.conf - RestartCount: 0 - State: - Dead: false - Error: '' - ExitCode: 0 - FinishedAt: '0001-01-01T00:00:00Z' - OOMKilled: false - Paused: false - Pid: 2616285 - Restarting: false - Running: true - StartedAt: '2026-03-09T23:15:53.649932822Z' - Status: running - Storage: - RootFS: - Snapshot: - Name: overlayfs -- AppArmorProfile: docker-default - Args: - - traefik - Config: - AttachStderr: true - AttachStdin: false - AttachStdout: true - Cmd: - - traefik - Domainname: '' - Entrypoint: - - /entrypoint.sh - Env: - - CLOUDFLARE_ZONE_API_TOKEN= - - DOCKER_HOST=tcp://docker-socket-proxy:2375 - - CLOUDFLARE_DNS_API_TOKEN= - - PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin - ExposedPorts: - 443/tcp: {} - 80/tcp: {} - Hostname: f0c70cc4667e - Image: traefik:v3.6.5 - Labels: - com.docker.compose.config-hash: 42df1402e650e630bde14fa90b6287582d9b29068566faaff58ed7ca6d60fffa - com.docker.compose.container-number: '1' - com.docker.compose.depends_on: redis:service_healthy:false,docker-socket-proxy:service_started:false - com.docker.compose.image: sha256:67622638cd88dbfcfba40159bc652ecf0aea0e032f8a3c7e3134ae7c037b9910 - com.docker.compose.oneoff: 'False' - com.docker.compose.project: traefik - com.docker.compose.project.config_files: /home/chester/traefik/docker-compose.yml - com.docker.compose.project.working_dir: /home/chester/traefik - com.docker.compose.replace: traefik - com.docker.compose.service: traefik - com.docker.compose.version: 5.0.2 - org.opencontainers.image.description: A modern reverse-proxy - org.opencontainers.image.documentation: https://docs.traefik.io - 
org.opencontainers.image.source: https://github.com/traefik/traefik - org.opencontainers.image.title: Traefik - org.opencontainers.image.url: https://traefik.io - org.opencontainers.image.vendor: Traefik Labs - org.opencontainers.image.version: v3.6.5 - traefik.enable: 'true' - traefik.http.middlewares.redirect-to-dashboard.redirectregex.permanent: 'true' - traefik.http.middlewares.redirect-to-dashboard.redirectregex.regex: ^/$ - traefik.http.middlewares.redirect-to-dashboard.redirectregex.replacement: /dashboard - traefik.http.routers.traefik-root.entrypoints: websecure - traefik.http.routers.traefik-root.middlewares: redirect-to-dashboard - traefik.http.routers.traefik-root.rule: Host(`proxy.castaldifamily.com`) - && Path(`/`) - traefik.http.routers.traefik-root.service: api@internal - traefik.http.routers.traefik-root.tls: 'true' - traefik.http.routers.traefik-root.tls.certresolver: cloudflare - traefik.http.routers.traefik-secure.entrypoints: websecure - traefik.http.routers.traefik-secure.middlewares: dashboard-auth@file,security-headers@file,ratelimit-basic@file,dashboard-slash@file - traefik.http.routers.traefik-secure.rule: Host(`proxy.castaldifamily.com`) - && (PathPrefix(`/api`) || PathPrefix(`/dashboard`)) - traefik.http.routers.traefik-secure.service: api@internal - traefik.http.routers.traefik-secure.tls: 'true' - traefik.http.routers.traefik-secure.tls.certresolver: cloudflare - OpenStdin: false - StdinOnce: false - Tty: false - User: 0:0 - Volumes: null - WorkingDir: / - Created: '2026-01-28T00:34:54.992079505Z' - Driver: overlayfs - ExecIDs: null - HostConfig: - AutoRemove: false - Binds: - - /home/chester/traefik/traefik-data/certs:/certs:rw - - /home/chester/traefik/traefik-data/access-logs:/var/log/traefik:rw - - /home/chester/traefik/traefik.yml:/traefik.yml:ro - - /home/chester/traefik/traefik-data/dynamic:/dynamic:ro - BlkioDeviceReadBps: null - BlkioDeviceReadIOps: null - BlkioDeviceWriteBps: null - BlkioDeviceWriteIOps: null - BlkioWeight: 0 
- BlkioWeightDevice: null - CapAdd: null - CapDrop: null - Cgroup: '' - CgroupParent: '' - CgroupnsMode: private - ConsoleSize: - - 0 - - 0 - ContainerIDFile: '' - CpuCount: 0 - CpuPercent: 0 - CpuPeriod: 0 - CpuQuota: 0 - CpuRealtimePeriod: 0 - CpuRealtimeRuntime: 0 - CpuShares: 0 - CpusetCpus: '' - CpusetMems: '' - DeviceCgroupRules: null - DeviceRequests: null - Devices: null - Dns: [] - DnsOptions: [] - DnsSearch: [] - ExtraHosts: [] - GroupAdd: null - IOMaximumBandwidth: 0 - IOMaximumIOps: 0 - IpcMode: private - Isolation: '' - Links: null - LogConfig: - Config: {} - Type: json-file - MaskedPaths: - - /proc/acpi - - /proc/asound - - /proc/interrupts - - /proc/kcore - - /proc/keys - - /proc/latency_stats - - /proc/sched_debug - - /proc/scsi - - /proc/timer_list - - /proc/timer_stats - - /sys/devices/virtual/powercap - - /sys/firmware - - /sys/devices/system/cpu/cpu0/thermal_throttle - - /sys/devices/system/cpu/cpu1/thermal_throttle - - /sys/devices/system/cpu/cpu2/thermal_throttle - - /sys/devices/system/cpu/cpu3/thermal_throttle - Memory: 0 - MemoryReservation: 0 - MemorySwap: 0 - MemorySwappiness: null - NanoCpus: 0 - NetworkMode: proxy-net - OomKillDisable: null - OomScoreAdj: 0 - PidMode: '' - PidsLimit: null - PortBindings: - 443/tcp: - - HostIp: '' - HostPort: '443' - 80/tcp: - - HostIp: '' - HostPort: '80' - Privileged: false - PublishAllPorts: false - ReadonlyPaths: - - /proc/bus - - /proc/fs - - /proc/irq - - /proc/sys - - /proc/sysrq-trigger - ReadonlyRootfs: false - RestartPolicy: - MaximumRetryCount: 0 - Name: unless-stopped - Runtime: runc - SecurityOpt: null - ShmSize: 67108864 - UTSMode: '' - Ulimits: null - UsernsMode: '' - VolumeDriver: '' - VolumesFrom: null - HostnamePath: /var/lib/docker/containers/f0c70cc4667e2bfb834ed92486be28d836c399dbeb84fa26bd84f49579562c64/hostname - HostsPath: /var/lib/docker/containers/f0c70cc4667e2bfb834ed92486be28d836c399dbeb84fa26bd84f49579562c64/hosts - Id: 
f0c70cc4667e2bfb834ed92486be28d836c399dbeb84fa26bd84f49579562c64 - Image: sha256:67622638cd88dbfcfba40159bc652ecf0aea0e032f8a3c7e3134ae7c037b9910 - ImageManifestDescriptor: - annotations: - com.docker.official-images.bashbrew.arch: amd64 - org.opencontainers.image.base.digest: sha256:1882fa4569e0c591ea092d3766c4893e19b8901a8e649de7067188aba3cc0679 - org.opencontainers.image.base.name: alpine:3.23 - org.opencontainers.image.created: '2025-12-18T00:37:28Z' - org.opencontainers.image.revision: 87ae3f90a938b0159e557ba5b6abcfd63effb714 - org.opencontainers.image.source: https://github.com/traefik/traefik-library-image.git#87ae3f90a938b0159e557ba5b6abcfd63effb714:v3.6/alpine - org.opencontainers.image.url: https://hub.docker.com/_/traefik - org.opencontainers.image.version: v3.6.5 - digest: sha256:d944e3693bbf5a361ddd2e411bb713049cfb4f5ff3da200b30ee7a347dbd6abd - mediaType: application/vnd.oci.image.manifest.v1+json - platform: - architecture: amd64 - os: linux - size: 1728 - LogPath: /var/lib/docker/containers/f0c70cc4667e2bfb834ed92486be28d836c399dbeb84fa26bd84f49579562c64/f0c70cc4667e2bfb834ed92486be28d836c399dbeb84fa26bd84f49579562c64-json.log - MountLabel: '' - Mounts: - - Destination: /traefik.yml - Mode: ro - Propagation: rprivate - RW: false - Source: /home/chester/traefik/traefik.yml - Type: bind - - Destination: /var/log/traefik - Mode: rw - Propagation: rprivate - RW: true - Source: /home/chester/traefik/traefik-data/access-logs - Type: bind - - Destination: /certs - Mode: rw - Propagation: rprivate - RW: true - Source: /home/chester/traefik/traefik-data/certs - Type: bind - - Destination: /dynamic - Mode: ro - Propagation: rprivate - RW: false - Source: /home/chester/traefik/traefik-data/dynamic - Type: bind - Name: /traefik - NetworkSettings: - Networks: - proxy-net: - Aliases: - - traefik - - traefik - DNSNames: - - traefik - - f0c70cc4667e - DriverOpts: null - EndpointID: 85312d375679f81387f54387dc176918f159b3c5527b527a10da91b36dc3c8f5 - Gateway: 
172.18.0.1 - GlobalIPv6Address: '' - GlobalIPv6PrefixLen: 0 - GwPriority: 0 - IPAMConfig: null - IPAddress: 172.18.0.3 - IPPrefixLen: 16 - IPv6Gateway: '' - Links: null - MacAddress: c2:85:cb:12:fe:61 - NetworkID: c451239da54e830d98844b541d0b707cc63426ce475d5103dc86300c0ebb7160 - Ports: - 443/tcp: - - HostIp: 0.0.0.0 - HostPort: '443' - - HostIp: '::' - HostPort: '443' - 80/tcp: - - HostIp: 0.0.0.0 - HostPort: '80' - - HostIp: '::' - HostPort: '80' - SandboxID: 39e089426b97fd8075a6b4fad29d0cdc3fa77b73e28f8ef96bef68e3418b7fb1 - SandboxKey: /var/run/docker/netns/39e089426b97 - Path: /entrypoint.sh - Platform: linux - ProcessLabel: '' - ResolvConfPath: /var/lib/docker/containers/f0c70cc4667e2bfb834ed92486be28d836c399dbeb84fa26bd84f49579562c64/resolv.conf - RestartCount: 0 - State: - Dead: false - Error: '' - ExitCode: 0 - FinishedAt: '2026-02-21T18:15:51.551714695Z' - OOMKilled: false - Paused: false - Pid: 1213 - Restarting: false - Running: true - StartedAt: '2026-02-21T18:30:42.488013871Z' - Status: running - Storage: - RootFS: - Snapshot: - Name: overlayfs -- AppArmorProfile: unconfined - Args: - - haproxy - - -f - - /tmp/haproxy.cfg - Config: - AttachStderr: true - AttachStdin: false - AttachStdout: true - Cmd: - - haproxy - - -f - - /tmp/haproxy.cfg - Domainname: '' - Entrypoint: - - docker-entrypoint.sh - Env: - - INFO=1 - - SERVICES=1 - - TASKS=1 - - PING=1 - - AUTH=1 - - VERSION=1 - - EVENTS=1 - - NETWORKS=1 - - CONTAINERS=1 - - VOLUMES=1 - - PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin - - HAPROXY_VERSION=3.2.4 - - HAPROXY_URL=https://www.haproxy.org/download/3.2/src/haproxy-3.2.4.tar.gz - - HAPROXY_SHA256=5d4b2ee6fe56b8098ebb9c91a899d728f87d64cd7be8804d2ddcc5f937498c1d - - ALLOW_RESTARTS=0 - - ALLOW_STOP=0 - - ALLOW_START=0 - - BUILD=0 - - COMMIT=0 - - CONFIGS=0 - - DISABLE_IPV6=0 - - DISTRIBUTION=0 - - EXEC=0 - - GRPC=0 - - IMAGES=0 - - LOG_LEVEL=info - - NODES=0 - - PLUGINS=0 - - POST=0 - - SECRETS=0 - - SESSION=0 - - 
SOCKET_PATH=/var/run/docker.sock - - SWARM=0 - - SYSTEM=0 - ExposedPorts: - 2375/tcp: {} - Hostname: f59c3a7d4c30 - Image: tecnativa/docker-socket-proxy:latest - Labels: - com.docker.compose.config-hash: 711c15ad420cb4274f3a65832d36be4bc31327a53f09b84b803d0e1ab18a0917 - com.docker.compose.container-number: '1' - com.docker.compose.depends_on: '' - com.docker.compose.image: sha256:1f3a6f303320723d199d2316a3e82b2e2685d86c275d5e3deeaf182573b47476 - com.docker.compose.oneoff: 'False' - com.docker.compose.project: traefik - com.docker.compose.project.config_files: /home/chester/traefik/docker-compose.yml - com.docker.compose.project.working_dir: /home/chester/traefik - com.docker.compose.replace: docker-socket-proxy - com.docker.compose.service: docker-socket-proxy - com.docker.compose.version: 5.0.2 - org.opencontainers.image.created: '2025-12-16T07:26:21.623Z' - org.opencontainers.image.description: Proxy over your Docker socket to - restrict which requests it accepts - org.opencontainers.image.licenses: Apache-2.0 - org.opencontainers.image.revision: 2f04313b042c1bf4dfbd039475dfc42db79bde7a - org.opencontainers.image.source: https://github.com/Tecnativa/docker-socket-proxy - org.opencontainers.image.title: docker-socket-proxy - org.opencontainers.image.url: https://github.com/Tecnativa/docker-socket-proxy - org.opencontainers.image.version: v0.4.2 - OpenStdin: false - StdinOnce: false - StopSignal: SIGUSR1 - Tty: false - User: 0:0 - Volumes: null - WorkingDir: /var/lib/haproxy - Created: '2026-01-28T00:34:44.663698444Z' - Driver: overlayfs - ExecIDs: null - HostConfig: - AutoRemove: false - Binds: - - /var/run/docker.sock:/var/run/docker.sock:rw - BlkioDeviceReadBps: null - BlkioDeviceReadIOps: null - BlkioDeviceWriteBps: null - BlkioDeviceWriteIOps: null - BlkioWeight: 0 - BlkioWeightDevice: null - CapAdd: null - CapDrop: null - Cgroup: '' - CgroupParent: '' - CgroupnsMode: private - ConsoleSize: - - 0 - - 0 - ContainerIDFile: '' - CpuCount: 0 - CpuPercent: 0 - 
CpuPeriod: 0 - CpuQuota: 0 - CpuRealtimePeriod: 0 - CpuRealtimeRuntime: 0 - CpuShares: 0 - CpusetCpus: '' - CpusetMems: '' - DeviceCgroupRules: null - DeviceRequests: null - Devices: null - Dns: [] - DnsOptions: [] - DnsSearch: [] - ExtraHosts: [] - GroupAdd: - - '988' - IOMaximumBandwidth: 0 - IOMaximumIOps: 0 - IpcMode: private - Isolation: '' - Links: null - LogConfig: - Config: {} - Type: json-file - MaskedPaths: null - Memory: 0 - MemoryReservation: 0 - MemorySwap: 0 - MemorySwappiness: null - NanoCpus: 0 - NetworkMode: proxy-net - OomKillDisable: null - OomScoreAdj: 0 - PidMode: '' - PidsLimit: null - PortBindings: {} - Privileged: true - PublishAllPorts: false - ReadonlyPaths: null - ReadonlyRootfs: false - RestartPolicy: - MaximumRetryCount: 0 - Name: unless-stopped - Runtime: runc - SecurityOpt: - - apparmor=unconfined - - label=disable - ShmSize: 67108864 - UTSMode: '' - Ulimits: null - UsernsMode: host - VolumeDriver: '' - VolumesFrom: null - HostnamePath: /var/lib/docker/containers/f59c3a7d4c3036a26bb8f060aa209b06bcb52d9d0bc41e32a750b36f4df3ae56/hostname - HostsPath: /var/lib/docker/containers/f59c3a7d4c3036a26bb8f060aa209b06bcb52d9d0bc41e32a750b36f4df3ae56/hosts - Id: f59c3a7d4c3036a26bb8f060aa209b06bcb52d9d0bc41e32a750b36f4df3ae56 - Image: sha256:1f3a6f303320723d199d2316a3e82b2e2685d86c275d5e3deeaf182573b47476 - ImageManifestDescriptor: - digest: sha256:bd2241b3bec83abcff25927a0a7ae518e0c5bef624b3cc247dcb31e68b53f417 - mediaType: application/vnd.oci.image.manifest.v1+json - platform: - architecture: amd64 - os: linux - size: 1993 - LogPath: /var/lib/docker/containers/f59c3a7d4c3036a26bb8f060aa209b06bcb52d9d0bc41e32a750b36f4df3ae56/f59c3a7d4c3036a26bb8f060aa209b06bcb52d9d0bc41e32a750b36f4df3ae56-json.log - MountLabel: '' - Mounts: - - Destination: /var/run/docker.sock - Mode: rw - Propagation: rprivate - RW: true - Source: /var/run/docker.sock - Type: bind - Name: /docker-socket-proxy - NetworkSettings: - Networks: - proxy-net: - Aliases: - - 
docker-socket-proxy - - docker-socket-proxy - DNSNames: - - docker-socket-proxy - - f59c3a7d4c30 - DriverOpts: null - EndpointID: cb18a5396cca6ed0b3c3502b8e8e2d46eb39a5afaa7350e2dd2ea9ee5448d7d3 - Gateway: 172.18.0.1 - GlobalIPv6Address: '' - GlobalIPv6PrefixLen: 0 - GwPriority: 0 - IPAMConfig: null - IPAddress: 172.18.0.2 - IPPrefixLen: 16 - IPv6Gateway: '' - Links: null - MacAddress: 42:a5:f6:d2:52:08 - NetworkID: c451239da54e830d98844b541d0b707cc63426ce475d5103dc86300c0ebb7160 - Ports: - 2375/tcp: null - SandboxID: e0902b280ba958f8f4ee51c20eb33a563b8bfc1717f3fbf4dd012a05672f3e74 - SandboxKey: /var/run/docker/netns/e0902b280ba9 - Path: docker-entrypoint.sh - Platform: linux - ProcessLabel: '' - ResolvConfPath: /var/lib/docker/containers/f59c3a7d4c3036a26bb8f060aa209b06bcb52d9d0bc41e32a750b36f4df3ae56/resolv.conf - RestartCount: 0 - State: - Dead: false - Error: '' - ExitCode: 0 - FinishedAt: '2026-02-21T18:16:00.055009796Z' - OOMKilled: false - Paused: false - Pid: 1225 - Restarting: false - Running: true - StartedAt: '2026-02-21T18:30:42.49130796Z' - Status: running - Storage: - RootFS: - Snapshot: - Name: overlayfs -- AppArmorProfile: docker-default - Args: - - redis-server - - --appendonly - - 'yes' - Config: - AttachStderr: true - AttachStdin: false - AttachStdout: true - Cmd: - - redis-server - - --appendonly - - 'yes' - Domainname: '' - Entrypoint: - - docker-entrypoint.sh - Env: - - PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin - - GOSU_VERSION=1.17 - - REDIS_VERSION=7.4.7 - - REDIS_DOWNLOAD_URL=http://download.redis.io/releases/redis-7.4.7.tar.gz - - REDIS_DOWNLOAD_SHA=c97e57b0df330a9e091cacff012bebe763c275398cf36ff44cdba876814b595b - ExposedPorts: - 6379/tcp: {} - Healthcheck: - Interval: 10000000000 - Retries: 5 - Test: - - CMD - - redis-cli - - ping - Timeout: 5000000000 - Hostname: 57439684f5ef - Image: redis:7-alpine - Labels: - com.docker.compose.config-hash: eb5826610c0f348a70810f75902caa3d6b889a5e442c0d9ddc539355c0113f49 - 
com.docker.compose.container-number: '1' - com.docker.compose.depends_on: '' - com.docker.compose.image: sha256:ee64a64eaab618d88051c3ade8f6352d11531fcf79d9a4818b9b183d8c1d18ba - com.docker.compose.oneoff: 'False' - com.docker.compose.project: traefik - com.docker.compose.project.config_files: /home/chester/traefik/docker-compose.yml - com.docker.compose.project.working_dir: /home/chester/traefik - com.docker.compose.replace: redis - com.docker.compose.service: redis - com.docker.compose.version: 5.0.2 - OpenStdin: false - StdinOnce: false - Tty: false - User: '' - Volumes: - /data: {} - WorkingDir: /data - Created: '2026-01-28T00:34:44.662867915Z' - Driver: overlayfs - ExecIDs: null - HostConfig: - AutoRemove: false - Binds: - - traefik_redis-data:/data:rw - BlkioDeviceReadBps: null - BlkioDeviceReadIOps: null - BlkioDeviceWriteBps: null - BlkioDeviceWriteIOps: null - BlkioWeight: 0 - BlkioWeightDevice: null - CapAdd: null - CapDrop: null - Cgroup: '' - CgroupParent: '' - CgroupnsMode: private - ConsoleSize: - - 0 - - 0 - ContainerIDFile: '' - CpuCount: 0 - CpuPercent: 0 - CpuPeriod: 0 - CpuQuota: 0 - CpuRealtimePeriod: 0 - CpuRealtimeRuntime: 0 - CpuShares: 0 - CpusetCpus: '' - CpusetMems: '' - DeviceCgroupRules: null - DeviceRequests: null - Devices: null - Dns: [] - DnsOptions: [] - DnsSearch: [] - ExtraHosts: [] - GroupAdd: null - IOMaximumBandwidth: 0 - IOMaximumIOps: 0 - IpcMode: private - Isolation: '' - Links: null - LogConfig: - Config: {} - Type: json-file - MaskedPaths: - - /proc/acpi - - /proc/asound - - /proc/interrupts - - /proc/kcore - - /proc/keys - - /proc/latency_stats - - /proc/sched_debug - - /proc/scsi - - /proc/timer_list - - /proc/timer_stats - - /sys/devices/virtual/powercap - - /sys/firmware - - /sys/devices/system/cpu/cpu0/thermal_throttle - - /sys/devices/system/cpu/cpu1/thermal_throttle - - /sys/devices/system/cpu/cpu2/thermal_throttle - - /sys/devices/system/cpu/cpu3/thermal_throttle - Memory: 0 - MemoryReservation: 0 - MemorySwap: 0 - 
MemorySwappiness: null - NanoCpus: 0 - NetworkMode: proxy-net - OomKillDisable: null - OomScoreAdj: 0 - PidMode: '' - PidsLimit: null - PortBindings: - 6379/tcp: - - HostIp: '' - HostPort: '6379' - Privileged: false - PublishAllPorts: false - ReadonlyPaths: - - /proc/bus - - /proc/fs - - /proc/irq - - /proc/sys - - /proc/sysrq-trigger - ReadonlyRootfs: false - RestartPolicy: - MaximumRetryCount: 0 - Name: unless-stopped - Runtime: runc - SecurityOpt: null - ShmSize: 67108864 - UTSMode: '' - Ulimits: null - UsernsMode: '' - VolumeDriver: '' - VolumesFrom: null - HostnamePath: /var/lib/docker/containers/57439684f5eff5afa67108c958725c641ff4b0299917774c93d91d5ce7b614b2/hostname - HostsPath: /var/lib/docker/containers/57439684f5eff5afa67108c958725c641ff4b0299917774c93d91d5ce7b614b2/hosts - Id: 57439684f5eff5afa67108c958725c641ff4b0299917774c93d91d5ce7b614b2 - Image: sha256:ee64a64eaab618d88051c3ade8f6352d11531fcf79d9a4818b9b183d8c1d18ba - ImageManifestDescriptor: - annotations: - com.docker.official-images.bashbrew.arch: amd64 - org.opencontainers.image.base.digest: sha256:41c81533144786e0beb2b148667355a6c7659aa99a14ed837ff15a98ca9d71f3 - org.opencontainers.image.base.name: alpine:3.21 - org.opencontainers.image.created: '2025-11-03T17:38:49Z' - org.opencontainers.image.revision: d42d7aec93b1c54dd46f37a66a92f62478456039 - org.opencontainers.image.source: https://github.com/redis/docker-library-redis.git#d42d7aec93b1c54dd46f37a66a92f62478456039:7.4/alpine - org.opencontainers.image.url: https://hub.docker.com/_/redis - org.opencontainers.image.version: 7.4.7-alpine - digest: sha256:4706ecab5371690fecfdd782268929c94ad5b5ce9ce0b35bfdfe191c4ad17851 - mediaType: application/vnd.oci.image.manifest.v1+json - platform: - architecture: amd64 - os: linux - size: 2483 - LogPath: /var/lib/docker/containers/57439684f5eff5afa67108c958725c641ff4b0299917774c93d91d5ce7b614b2/57439684f5eff5afa67108c958725c641ff4b0299917774c93d91d5ce7b614b2-json.log - MountLabel: '' - Mounts: - - 
Destination: /data - Driver: local - Mode: rw - Name: traefik_redis-data - Propagation: '' - RW: true - Source: /var/lib/docker/volumes/traefik_redis-data/_data - Type: volume - Name: /redis - NetworkSettings: - Networks: - proxy-net: - Aliases: - - redis - - redis - DNSNames: - - redis - - 57439684f5ef - DriverOpts: null - EndpointID: 7f950d9aab3bf29937a2c66723f8fd483984fa9ccd74a859166e810c77a9ca0b - Gateway: 172.18.0.1 - GlobalIPv6Address: '' - GlobalIPv6PrefixLen: 0 - GwPriority: 0 - IPAMConfig: null - IPAddress: 172.18.0.4 - IPPrefixLen: 16 - IPv6Gateway: '' - Links: null - MacAddress: e2:9b:a3:07:2f:81 - NetworkID: c451239da54e830d98844b541d0b707cc63426ce475d5103dc86300c0ebb7160 - Ports: - 6379/tcp: - - HostIp: 0.0.0.0 - HostPort: '6379' - - HostIp: '::' - HostPort: '6379' - SandboxID: dfafbd7bf0a46788747bcf7e8cbe9dcfc05886cdbb73add6cde8d3f50eeed30d - SandboxKey: /var/run/docker/netns/dfafbd7bf0a4 - Path: docker-entrypoint.sh - Platform: linux - ProcessLabel: '' - ResolvConfPath: /var/lib/docker/containers/57439684f5eff5afa67108c958725c641ff4b0299917774c93d91d5ce7b614b2/resolv.conf - RestartCount: 0 - State: - Dead: false - Error: '' - ExitCode: 0 - FinishedAt: '2026-02-21T18:15:50.121096266Z' - Health: - FailingStreak: 0 - Log: - - End: '2026-03-12T21:40:46.09861824Z' - ExitCode: 0 - Output: 'PONG - - ' - Start: '2026-03-12T21:40:46.035694287Z' - - End: '2026-03-12T21:40:56.156972993Z' - ExitCode: 0 - Output: 'PONG - - ' - Start: '2026-03-12T21:40:56.09903008Z' - - End: '2026-03-12T21:41:06.212479164Z' - ExitCode: 0 - Output: 'PONG - - ' - Start: '2026-03-12T21:41:06.158068315Z' - - End: '2026-03-12T21:41:16.254915792Z' - ExitCode: 0 - Output: 'PONG - - ' - Start: '2026-03-12T21:41:16.213809696Z' - - End: '2026-03-12T21:41:26.295890532Z' - ExitCode: 0 - Output: 'PONG - - ' - Start: '2026-03-12T21:41:26.255822169Z' - Status: healthy - OOMKilled: false - Paused: false - Pid: 1220 - Restarting: false - Running: true - StartedAt: '2026-02-21T18:30:42.486966925Z' 
- Status: running - Storage: - RootFS: - Snapshot: - Name: overlayfs diff --git a/ansible/archive/outputs/heimdall-baseline-20260312T214117/docker_info.yml b/ansible/archive/outputs/heimdall-baseline-20260312T214117/docker_info.yml deleted file mode 100644 index c526144..0000000 --- a/ansible/archive/outputs/heimdall-baseline-20260312T214117/docker_info.yml +++ /dev/null @@ -1,8 +0,0 @@ -cgroup_driver: systemd -containers_running: 4 -containers_total: 4 -daemon_config: {} -logging_driver: json-file -server_version: 29.2.0 -storage_driver: overlayfs -swarm_state: inactive diff --git a/ansible/archive/outputs/heimdall-baseline-20260312T214117/env_keys/_home_chester_traefik_.env.redacted b/ansible/archive/outputs/heimdall-baseline-20260312T214117/env_keys/_home_chester_traefik_.env.redacted deleted file mode 100644 index f046e42..0000000 --- a/ansible/archive/outputs/heimdall-baseline-20260312T214117/env_keys/_home_chester_traefik_.env.redacted +++ /dev/null @@ -1,7 +0,0 @@ -# Env key inventory — values REDACTED for security -# Source: /home/chester/traefik/.env -# Host: heimdall | Captured: 2026-03-12T21:41:19Z -# -# To restore secrets: ansible-vault encrypt_string '' --name '' -CLOUDFLARE_DNS_API_TOKEN= -CLOUDFLARE_ZONE_API_TOKEN= diff --git a/ansible/archive/outputs/heimdall-baseline-20260312T214117/firewall_rules.txt b/ansible/archive/outputs/heimdall-baseline-20260312T214117/firewall_rules.txt deleted file mode 100644 index 18a53d4..0000000 --- a/ansible/archive/outputs/heimdall-baseline-20260312T214117/firewall_rules.txt +++ /dev/null @@ -1,49 +0,0 @@ -# Firewall state on heimdall -# Captured: 2026-03-12T21:41:19Z - -## UFW STATUS -Status: inactive - -## IPTABLES (reference) -Chain INPUT (policy ACCEPT) -num target prot opt source destination - -Chain FORWARD (policy DROP) -num target prot opt source destination -1 DOCKER-USER 0 -- 0.0.0.0/0 0.0.0.0/0 -2 DOCKER-FORWARD 0 -- 0.0.0.0/0 0.0.0.0/0 - -Chain OUTPUT (policy ACCEPT) -num target prot opt source 
destination - -Chain DOCKER (2 references) -num target prot opt source destination -1 ACCEPT 6 -- 0.0.0.0/0 172.18.0.4 tcp dpt:6379 -2 ACCEPT 6 -- 0.0.0.0/0 172.18.0.3 tcp dpt:443 -3 ACCEPT 6 -- 0.0.0.0/0 172.18.0.3 tcp dpt:80 -4 DROP 0 -- 0.0.0.0/0 0.0.0.0/0 -5 DROP 0 -- 0.0.0.0/0 0.0.0.0/0 - -Chain DOCKER-BRIDGE (1 references) -num target prot opt source destination -1 DOCKER 0 -- 0.0.0.0/0 0.0.0.0/0 -2 DOCKER 0 -- 0.0.0.0/0 0.0.0.0/0 - -Chain DOCKER-CT (1 references) -num target prot opt source destination -1 ACCEPT 0 -- 0.0.0.0/0 0.0.0.0/0 ctstate RELATED,ESTABLISHED -2 ACCEPT 0 -- 0.0.0.0/0 0.0.0.0/0 ctstate RELATED,ESTABLISHED - -Chain DOCKER-FORWARD (1 references) -num target prot opt source destination -1 DOCKER-CT 0 -- 0.0.0.0/0 0.0.0.0/0 -2 DOCKER-INTERNAL 0 -- 0.0.0.0/0 0.0.0.0/0 -3 DOCKER-BRIDGE 0 -- 0.0.0.0/0 0.0.0.0/0 -4 ACCEPT 0 -- 0.0.0.0/0 0.0.0.0/0 -5 ACCEPT 0 -- 0.0.0.0/0 0.0.0.0/0 - -Chain DOCKER-INTERNAL (1 references) -num target prot opt source destination - -Chain DOCKER-USER (1 references) -num target prot opt source destination diff --git a/ansible/archive/outputs/heimdall-baseline-20260312T214117/host_facts.yml b/ansible/archive/outputs/heimdall-baseline-20260312T214117/host_facts.yml deleted file mode 100644 index fe06f35..0000000 --- a/ansible/archive/outputs/heimdall-baseline-20260312T214117/host_facts.yml +++ /dev/null @@ -1,36 +0,0 @@ -ansible_user: root -architecture: x86_64 -cpu_vcpus: 4 -default_ipv4: - address: 10.0.0.151 - alias: enp1s0 - broadcast: 10.0.0.255 - gateway: 10.0.0.2 - interface: enp1s0 - macaddress: 7c:83:34:bf:79:a5 - mtu: 1500 - netmask: 255.255.255.0 - network: 10.0.0.0 - prefix: '24' - type: ether -distribution: Ubuntu -distribution_release: noble -distribution_version: '24.04' -fqdn: heimdall -hostname: heimdall -interfaces: -- enp2s0 -- wlo1 -- enp1s0 -- vethe43b71e -- br-c451239da54e -- lo -- veth2088d3d -- veth57f15b2 -- docker0 -kernel: 6.8.0-100-generic -memory_free_mb: 342 -memory_total_mb: 15767 
-os_family: Debian -python_version: 3.12.3 -uptime_seconds: 1653162 diff --git a/ansible/archive/outputs/heimdall-baseline-20260312T214117/manifest.yml b/ansible/archive/outputs/heimdall-baseline-20260312T214117/manifest.yml deleted file mode 100644 index 92abc35..0000000 --- a/ansible/archive/outputs/heimdall-baseline-20260312T214117/manifest.yml +++ /dev/null @@ -1,65 +0,0 @@ ---- ---- -# Heimdall baseline capture manifest -# Generated: 2026-03-12T21:41:19Z -# Host: heimdall (10.0.0.151) -# Review this file before proceeding to heimdall_edge role refactor. - -capture_timestamp: "2026-03-12T21:41:19Z" -capture_dir: "/home/chester/homelab/ansible/playbooks/preflight/../../outputs/heimdall-baseline-20260312T214117" - -host: - hostname: "heimdall" - ip: "10.0.0.151" - os: "Ubuntu 24.04" - kernel: "6.8.0-100-generic" - -docker: - version: "29.2.0" - storage_driver: "overlayfs" - swarm_state: "inactive" - containers_running: 4 - containers_total: 4 - -inventory: - containers_found: 4 - compose_files_found: 2 - env_files_found: 2 - -critical_paths: -/etc/docker/daemon.json: false - /home/chester/traefik: true - /home/chester/traefik/.env: true - /home/chester/traefik/docker-compose.yml: true - /home/chester/traefik/traefik-data/certs/acme.json: true - /home/chester/traefik/traefik-data/dynamic/middleware.yml: true - /home/chester/traefik/traefik-data/dynamic/static-backends.yml: true - /home/chester/traefik/traefik.yml: true - /opt/stacks/heimdall: false - /opt/stacks/heimdall/.env: false - /opt/stacks/heimdall/docker-compose.yml: false - /opt/stacks/heimdall/redis-data: false - /opt/stacks/heimdall/runner-data: false - /opt/stacks/heimdall/traefik-certs: false - /opt/stacks/heimdall/traefik-certs/acme.json: false - -compose_file_paths: -- /home/chester/traefik/docker-compose.yml - - /home/chester/traefik/docker-compose.yml - -env_file_paths: -- /home/chester/traefik/.env - - /home/chester/traefik/.env - -containers_running: -- node-exporter - - traefik - - 
docker-socket-proxy - - redis - -validation: - compose_files_present: True - containers_present: True - stack_dir_present: False - compose_present: False - env_present: False diff --git a/ansible/archive/outputs/heimdall-baseline-20260312T214117/networks_and_volumes.yml b/ansible/archive/outputs/heimdall-baseline-20260312T214117/networks_and_volumes.yml deleted file mode 100644 index 882657a..0000000 --- a/ansible/archive/outputs/heimdall-baseline-20260312T214117/networks_and_volumes.yml +++ /dev/null @@ -1,25 +0,0 @@ ---- -# Docker network and volume inventory -# Host: heimdall | Captured: 2026-03-12T21:41:19Z - -networks: -- Driver: host - Id: b63c150f50197cfb21939a1369d37f0a309118dfb79be11d4c6082d963f8f70a - Name: host - Scope: local - - Driver: bridge - Id: c451239da54e830d98844b541d0b707cc63426ce475d5103dc86300c0ebb7160 - Name: proxy-net - Scope: local - - Driver: bridge - Id: 4f3815cff81bd0c59f62e0151bc58bc0289eca4634f77bf544e1fc3e34c0bab7 - Name: bridge - Scope: local - - Driver: 'null' - Id: a55e7a3ec6e204eae20086edec67507e3c7ef59f5e383d4b8631d614c657e0d0 - Name: none - Scope: local - -volumes: -- Driver: local - Name: traefik_redis-data diff --git a/ansible/archive/outputs/heimdall-baseline-20260312T214117/systemd_units.txt b/ansible/archive/outputs/heimdall-baseline-20260312T214117/systemd_units.txt deleted file mode 100644 index dba47a0..0000000 --- a/ansible/archive/outputs/heimdall-baseline-20260312T214117/systemd_units.txt +++ /dev/null @@ -1,153 +0,0 @@ - UNIT LOAD ACTIVE SUB DESCRIPTION - apparmor.service loaded active exited Load AppArmor profiles - apport-autoreport.service loaded inactive dead Process error reports when automatic reporting is enabled - apport.service loaded active exited automatic crash report generation - apt-daily-upgrade.service loaded inactive dead Daily apt upgrade and clean activities - apt-daily.service loaded inactive dead Daily apt download activities - blk-availability.service loaded active exited Availability of block 
devices - cloud-init-local.service loaded inactive dead Cloud-init: Local Stage (pre-network) - console-setup.service loaded active exited Set console font and keymap - containerd.service loaded active running containerd container runtime - cron.service loaded active running Regular background program processing daemon - dbus.service loaded active running D-Bus System Message Bus - dm-event.service loaded inactive dead Device-mapper event daemon - dmesg.service loaded inactive dead Save initial kernel messages after boot - docker.service loaded active running Docker Application Container Engine - dpkg-db-backup.service loaded inactive dead Daily dpkg database backup service - e2scrub_all.service loaded inactive dead Online ext4 Metadata Check for All Filesystems - e2scrub_reap.service loaded inactive dead Remove Stale Online ext4 Metadata Check Snapshots - emergency.service loaded inactive dead Emergency Shell - finalrd.service loaded active exited Create final runtime dir for shutdown pivot root - fstrim.service loaded inactive dead Discard unused blocks on filesystems from /etc/fstab - fwupd-refresh.service loaded inactive dead Refresh fwupd metadata and update motd - getty-static.service loaded inactive dead getty on tty2-tty6 if dbus and logind are not available - getty@tty1.service loaded active running Getty on tty1 - grub-common.service loaded inactive dead Record successful boot for GRUB - grub-initrd-fallback.service loaded inactive dead GRUB failed boot detection - initrd-cleanup.service loaded inactive dead Cleaning Up and Shutting Down Daemons - initrd-parse-etc.service loaded inactive dead Mountpoints Configured in the Real Root - initrd-switch-root.service loaded inactive dead Switch Root - initrd-udevadm-cleanup-db.service loaded inactive dead Cleanup udev Database - iscsid.service loaded inactive dead iSCSI initiator daemon (iscsid) - keyboard-setup.service loaded active exited Set the console keyboard layout - kmod-static-nodes.service loaded 
active exited Create List of Static Device Nodes - ldconfig.service loaded inactive dead Rebuild Dynamic Linker Cache - logrotate.service loaded inactive dead Rotate log files - lvm2-lvmpolld.service loaded inactive dead LVM2 poll daemon - lvm2-monitor.service loaded active exited Monitoring of LVM2 mirrors, snapshots etc. using dmeventd or progress polling - man-db.service loaded inactive dead Daily man-db regeneration - ModemManager.service loaded active running Modem Manager - modprobe@configfs.service loaded inactive dead Load Kernel Module configfs - modprobe@dm_mod.service loaded inactive dead Load Kernel Module dm_mod - modprobe@drm.service loaded inactive dead Load Kernel Module drm - modprobe@efi_pstore.service loaded inactive dead Load Kernel Module efi_pstore - modprobe@fuse.service loaded inactive dead Load Kernel Module fuse - modprobe@loop.service loaded inactive dead Load Kernel Module loop - motd-news.service loaded inactive dead Message of the Day - multipathd.service loaded active running Device-Mapper Multipath Device Controller - netplan-ovs-cleanup.service loaded inactive dead OpenVSwitch configuration for cleanup - networkd-dispatcher.service loaded inactive dead Dispatcher daemon for systemd-networkd - open-iscsi.service loaded inactive dead Login to default iSCSI targets - open-vm-tools.service loaded inactive dead Service for virtual machines hosted on VMware - plymouth-quit-wait.service loaded active exited Hold until boot process finishes up - plymouth-quit.service loaded active exited Terminate Plymouth Boot Screen - plymouth-read-write.service loaded active exited Tell Plymouth To Write Out Runtime Data - plymouth-start.service loaded inactive dead Show Plymouth Boot Screen - plymouth-switch-root.service loaded inactive dead Plymouth switch root service - polkit.service loaded active running Authorization Manager - pollinate.service loaded inactive dead Pollinate to seed the pseudo random number generator - rc-local.service loaded 
inactive dead /etc/rc.local Compatibility - rescue.service loaded inactive dead Rescue Shell - rsyslog.service loaded active running System Logging Service - secureboot-db.service loaded inactive dead Secure Boot updates for DB and DBX - setvtrgb.service loaded active exited Set console scheme - snapd.apparmor.service loaded active exited Load AppArmor profiles managed internally by snapd - snapd.autoimport.service loaded inactive dead Auto import assertions from block devices - snapd.core-fixup.service loaded inactive dead Automatically repair incorrect owner/permissions on core devices - snapd.failure.service loaded inactive dead Failure handling of the snapd snap - snapd.recovery-chooser-trigger.service loaded inactive dead Wait for the Ubuntu Core chooser trigger - snapd.seeded.service loaded active exited Wait until snapd is fully seeded - snapd.service loaded inactive dead Snap Daemon - snapd.snap-repair.service loaded inactive dead Automatically fetch and run repair assertions - snapd.system-shutdown.service loaded inactive dead Ubuntu core (all-snaps) system shutdown helper setup service - ssh.service loaded active running OpenBSD Secure Shell server - sysstat-collect.service loaded inactive dead system activity accounting tool - sysstat-summary.service loaded inactive dead Generate a daily summary of process accounting - sysstat.service loaded active exited Resets System Activity Logs - systemd-ask-password-console.service loaded inactive dead Dispatch Password Requests to Console - systemd-ask-password-plymouth.service loaded inactive dead Forward Password Requests to Plymouth - systemd-ask-password-wall.service loaded inactive dead Forward Password Requests to Wall - systemd-battery-check.service loaded inactive dead Check battery level during early boot - systemd-binfmt.service loaded active exited Set Up Additional Binary Formats - systemd-bsod.service loaded inactive dead Displays emergency message in full screen. 
- systemd-firstboot.service loaded inactive dead First Boot Wizard - systemd-fsck-root.service loaded inactive dead File System Check on Root Device - systemd-fsck@dev-disk-by\x2duuid-36D5\x2d0248.service loaded active exited File System Check on /dev/disk/by-uuid/36D5-0248 - systemd-fsck@dev-disk-by\x2duuid-da3c4a6e\x2df851\x2d471f\x2d81e4\x2dcd9b3b26acf1.service loaded active exited File System Check on /dev/disk/by-uuid/da3c4a6e-f851-471f-81e4-cd9b3b26acf1 - systemd-fsckd.service loaded inactive dead File System Check Daemon to report status - systemd-hibernate-resume.service loaded inactive dead Resume from hibernation - systemd-hibernate.service loaded inactive dead System Hibernate - systemd-hwdb-update.service loaded inactive dead Rebuild Hardware Database - systemd-hybrid-sleep.service loaded inactive dead System Hybrid Suspend+Hibernate - systemd-initctl.service loaded inactive dead initctl Compatibility Daemon - systemd-journal-catalog-update.service loaded inactive dead Rebuild Journal Catalog - systemd-journal-flush.service loaded active exited Flush Journal to Persistent Storage - systemd-journald.service loaded active running Journal Service - systemd-logind.service loaded active running User Login Management - systemd-machine-id-commit.service loaded inactive dead Commit a transient machine-id on disk - systemd-modules-load.service loaded active exited Load Kernel Modules -● systemd-networkd-wait-online.service loaded failed failed Wait for Network to be Configured - systemd-networkd.service loaded active running Network Configuration - systemd-pcrmachine.service loaded inactive dead TPM2 PCR Machine ID Measurement - systemd-pcrphase-initrd.service loaded inactive dead TPM2 PCR Barrier (initrd) - systemd-pcrphase-sysinit.service loaded inactive dead TPM2 PCR Barrier (Initialization) - systemd-pcrphase.service loaded inactive dead TPM2 PCR Barrier (User) - systemd-pstore.service loaded inactive dead Platform Persistent Storage Archival - 
systemd-quotacheck.service loaded inactive dead File System Quota Check - systemd-random-seed.service loaded active exited Load/Save OS Random Seed - systemd-remount-fs.service loaded active exited Remount Root and Kernel File Systems - systemd-repart.service loaded inactive dead Repartition Root Disk - systemd-resolved.service loaded active running Network Name Resolution - systemd-rfkill.service loaded inactive dead Load/Save RF Kill Switch Status - systemd-soft-reboot.service loaded inactive dead Reboot System Userspace - systemd-suspend-then-hibernate.service loaded inactive dead System Suspend then Hibernate - systemd-suspend.service loaded inactive dead System Suspend - systemd-sysctl.service loaded active exited Apply Kernel Variables - systemd-sysext.service loaded inactive dead Merge System Extension Images into /usr/ and /opt/ - systemd-sysusers.service loaded inactive dead Create System Users - systemd-timesyncd.service loaded active running Network Time Synchronization - systemd-tmpfiles-clean.service loaded inactive dead Cleanup of Temporary Directories - systemd-tmpfiles-setup-dev-early.service loaded active exited Create Static Device Nodes in /dev gracefully - systemd-tmpfiles-setup-dev.service loaded active exited Create Static Device Nodes in /dev - systemd-tmpfiles-setup.service loaded active exited Create Volatile Files and Directories - systemd-tpm2-setup-early.service loaded inactive dead TPM2 SRK Setup (Early) - systemd-tpm2-setup.service loaded inactive dead TPM2 SRK Setup - systemd-udev-settle.service loaded inactive dead Wait for udev To Complete Device Initialization - systemd-udev-trigger.service loaded active exited Coldplug All udev Devices - systemd-udevd.service loaded active running Rule-based Manager for Device Events and Files - systemd-update-done.service loaded inactive dead Update is Completed - systemd-update-utmp-runlevel.service loaded inactive dead Record Runlevel Change in UTMP - systemd-update-utmp.service loaded active 
exited Record System Boot/Shutdown in UTMP - systemd-user-sessions.service loaded active exited Permit User Sessions - thermald.service loaded active running Thermal Daemon Service - tpm-udev.service loaded inactive dead Handle dynamically added tpm devices - ua-reboot-cmds.service loaded inactive dead Ubuntu Pro reboot cmds - ua-timer.service loaded inactive dead Ubuntu Pro Timer for running repeated jobs - ubuntu-advantage.service loaded inactive dead Ubuntu Pro Background Auto Attach - udisks2.service loaded active running Disk Manager - ufw.service loaded active exited Uncomplicated firewall - unattended-upgrades.service loaded active running Unattended Upgrades Shutdown - update-notifier-download.service loaded inactive dead Download data for packages that failed at package install time - update-notifier-motd.service loaded inactive dead Check to see whether there is a new version of Ubuntu available - upower.service loaded active running Daemon for power management - user-runtime-dir@1000.service loaded active exited User Runtime Directory /run/user/1000 - user@1000.service loaded active running User Manager for UID 1000 - uuidd.service loaded inactive dead Daemon for generating UUIDs - vgauth.service loaded inactive dead Authentication service for virtual machines hosted on VMware - wpa_supplicant.service loaded active running WPA supplicant - -Legend: LOAD → Reflects whether the unit definition was properly loaded. - ACTIVE → The high-level unit activation state, i.e. generalization of SUB. - SUB → The low-level unit activation state, values depend on unit type. - -146 loaded units listed.
\ No newline at end of file diff --git a/ansible/archive/outputs/heimdall-baseline-20260312T214117/traefik_configs/middleware.yml b/ansible/archive/outputs/heimdall-baseline-20260312T214117/traefik_configs/middleware.yml deleted file mode 100644 index f4d2078..0000000 --- a/ansible/archive/outputs/heimdall-baseline-20260312T214117/traefik_configs/middleware.yml +++ /dev/null @@ -1,37 +0,0 @@ -http: - middlewares: - # Security headers - security-headers: - headers: - stsSeconds: 63072000 - stsIncludeSubdomains: true - stsPreload: true - frameDeny: true - contentTypeNosniff: true - browserXssFilter: true - referrerPolicy: "same-origin" - - # Rate limiting - ratelimit-basic: - rateLimit: - average: 50 - burst: 100 - - # Basic auth for dashboard - dashboard-auth: - basicAuth: - users: - - "chester:$apr1$hrRDQ/tR$ZwyxHOCDZjm/55GAs5/Ew1" - - # HTTPS redirect - https-redirect: - redirectScheme: - scheme: https - permanent: true - - # Dashboard slash redirect - dashboard-slash: - redirectregex: - regex: ^/dashboard$ - replacement: /dashboard/ - permanent: true diff --git a/ansible/archive/outputs/heimdall-baseline-20260312T214117/traefik_configs/static-backends.yml b/ansible/archive/outputs/heimdall-baseline-20260312T214117/traefik_configs/static-backends.yml deleted file mode 100644 index 4d00de3..0000000 --- a/ansible/archive/outputs/heimdall-baseline-20260312T214117/traefik_configs/static-backends.yml +++ /dev/null @@ -1,74 +0,0 @@ -http: - # Transport for self-signed certs - serversTransports: - insecure-transport: - insecureSkipVerify: true - - # Static routers for on-prem backends - routers: - tnas-router: - rule: "Host(`tnas.castaldifamily.com`)" - entryPoints: - - websecure - tls: - certResolver: cloudflare - service: tnas-service - middlewares: - - security-headers@file - - dsm-router: - rule: "Host(`dsm.castaldifamily.com`)" - entryPoints: - - websecure - tls: - certResolver: cloudflare - service: dsm-service - middlewares: - - security-headers@file - - 
watchtower-router: - rule: "Host(`watchtower.castaldifamily.com`)" - entryPoints: - - websecure - tls: - certResolver: cloudflare - service: watchtower-service - middlewares: - - security-headers@file - - gatus-router: - rule: "Host(`status.castaldifamily.com`)" - entryPoints: - - websecure - tls: - certResolver: cloudflare - service: gatus-service - middlewares: - - security-headers@file - - # Services (backends) - services: - tnas-service: - loadBalancer: - servers: - - url: "https://10.0.0.250:5443/tos/#/" - serversTransport: insecure-transport - - dsm-service: - loadBalancer: - servers: - - url: "https://10.0.0.249:5001" - passHostHeader: true - serversTransport: insecure-transport - - watchtower-service: - loadBalancer: - servers: - - url: "https://10.0.0.200:9090" - serversTransport: insecure-transport - - gatus-service: - loadBalancer: - servers: - - url: "http://10.0.0.200:8080" - serversTransport: insecure-transport diff --git a/ansible/archive/outputs/heimdall-baseline-20260312T214117/traefik_configs/traefik.yml b/ansible/archive/outputs/heimdall-baseline-20260312T214117/traefik_configs/traefik.yml deleted file mode 100644 index c8725a5..0000000 --- a/ansible/archive/outputs/heimdall-baseline-20260312T214117/traefik_configs/traefik.yml +++ /dev/null @@ -1,57 +0,0 @@ -global: - checkNewVersion: false - sendAnonymousUsage: false - -log: - level: DEBUG - format: json - -accessLog: - format: json - filePath: /var/log/traefik/access.log - bufferingSize: 100 - -api: - dashboard: true - insecure: false - -entryPoints: - web: - address: ":80" - http: - redirections: - entryPoint: - to: websecure - scheme: https - websecure: - address: ":443" - ping: - address: ":8082" - -ping: - entryPoint: ping - -providers: - docker: - endpoint: "tcp://docker-socket-proxy:2375" - exposedByDefault: false - network: proxy-net - redis: - endpoints: - - redis:6379 - file: - directory: /dynamic - watch: true - -certificatesResolvers: - cloudflare: - acme: - email: 
nathan@castaldifamily.com - storage: /certs/acme.json - dnsChallenge: - provider: cloudflare - propagation: - delayBeforeChecks: 0 - resolvers: - - 1.1.1.1:53 - - 8.8.8.8:53 diff --git a/ansible/archive/outputs/node-replacement/node-replacement-2026-20260313T131954/cutover-todo.txt b/ansible/archive/outputs/node-replacement/node-replacement-2026-20260313T131954/cutover-todo.txt deleted file mode 100644 index a58fdae..0000000 --- a/ansible/archive/outputs/node-replacement/node-replacement-2026-20260313T131954/cutover-todo.txt +++ /dev/null @@ -1,17 +0,0 @@ -EXECUTION MODE ENABLED - -Phase 2 execution switch: -- replacement_phase2_rebuild_and_rejoin=true - -Phase 3 execution switch: -- replacement_phase3_identity_cutover=false - -Phase 4 execution switch: -- replacement_phase4_validate_cutover=false - -Manual steps still required around identity cutover: -1. If phase 2 enabled, rebuild and rejoin replacement swarm nodes on pve04. -2. If phase 3 enabled, update inventory/group_vars source-of-truth with rollback snapshots. -3. If phase 4 enabled, validate swarm quorum and optional service endpoints. -4. Move network identity 10.0.0.201 to replacement physical host. -5. If stable and approved, power off old host. diff --git a/ansible/archive/outputs/node-replacement/node-replacement-2026-20260313T132142/cutover-todo.txt b/ansible/archive/outputs/node-replacement/node-replacement-2026-20260313T132142/cutover-todo.txt deleted file mode 100644 index a58fdae..0000000 --- a/ansible/archive/outputs/node-replacement/node-replacement-2026-20260313T132142/cutover-todo.txt +++ /dev/null @@ -1,17 +0,0 @@ -EXECUTION MODE ENABLED - -Phase 2 execution switch: -- replacement_phase2_rebuild_and_rejoin=true - -Phase 3 execution switch: -- replacement_phase3_identity_cutover=false - -Phase 4 execution switch: -- replacement_phase4_validate_cutover=false - -Manual steps still required around identity cutover: -1. If phase 2 enabled, rebuild and rejoin replacement swarm nodes on pve04. -2. 
If phase 3 enabled, update inventory/group_vars source-of-truth with rollback snapshots. -3. If phase 4 enabled, validate swarm quorum and optional service endpoints. -4. Move network identity 10.0.0.201 to replacement physical host. -5. If stable and approved, power off old host. diff --git a/ansible/archive/outputs/node-replacement/node-replacement-2026-20260313T132229/cutover-todo.txt b/ansible/archive/outputs/node-replacement/node-replacement-2026-20260313T132229/cutover-todo.txt deleted file mode 100644 index a58fdae..0000000 --- a/ansible/archive/outputs/node-replacement/node-replacement-2026-20260313T132229/cutover-todo.txt +++ /dev/null @@ -1,17 +0,0 @@ -EXECUTION MODE ENABLED - -Phase 2 execution switch: -- replacement_phase2_rebuild_and_rejoin=true - -Phase 3 execution switch: -- replacement_phase3_identity_cutover=false - -Phase 4 execution switch: -- replacement_phase4_validate_cutover=false - -Manual steps still required around identity cutover: -1. If phase 2 enabled, rebuild and rejoin replacement swarm nodes on pve04. -2. If phase 3 enabled, update inventory/group_vars source-of-truth with rollback snapshots. -3. If phase 4 enabled, validate swarm quorum and optional service endpoints. -4. Move network identity 10.0.0.201 to replacement physical host. -5. If stable and approved, power off old host. 
diff --git a/ansible/archive/outputs/node-replacement/node-replacement-2026-20260313T140253/cutover-todo.txt b/ansible/archive/outputs/node-replacement/node-replacement-2026-20260313T140253/cutover-todo.txt deleted file mode 100644 index 85bf7ec..0000000 --- a/ansible/archive/outputs/node-replacement/node-replacement-2026-20260313T140253/cutover-todo.txt +++ /dev/null @@ -1,17 +0,0 @@ -EXECUTION MODE ENABLED - -Phase 2 execution switch: -- replacement_phase2_rebuild_and_rejoin=false - -Phase 3 execution switch: -- replacement_phase3_identity_cutover=false - -Phase 4 execution switch: -- replacement_phase4_validate_cutover=true - -Manual steps still required around identity cutover: -1. If phase 2 enabled, rebuild and rejoin replacement swarm nodes on pve04. -2. If phase 3 enabled, update inventory/group_vars source-of-truth with rollback snapshots. -3. If phase 4 enabled, validate swarm quorum and optional service endpoints. -4. Move network identity 10.0.0.201 to replacement physical host. -5. If stable and approved, power off old host. 
diff --git a/ansible/archive/outputs/node-replacement/node-replacement-2026-20260313T140253/phase4-validation-summary.txt b/ansible/archive/outputs/node-replacement/node-replacement-2026-20260313T140253/phase4-validation-summary.txt deleted file mode 100644 index f4014c3..0000000 --- a/ansible/archive/outputs/node-replacement/node-replacement-2026-20260313T140253/phase4-validation-summary.txt +++ /dev/null @@ -1,17 +0,0 @@ -Project: node-replacement-2026 -Validation manager: swarm-manager-2 -Logical pve01 host: pve01 -Swarm manager identity: swarm-manager-1 -Swarm worker identity: swarm-worker-1 - -=== docker node ls === -ID HOSTNAME STATUS AVAILABILITY MANAGER STATUS ENGINE VERSION -hxcagfwxmrkqoyjo2mgfjeubm swarm-manager-1 Ready Active Reachable 29.3.0 -lalct6bxzf2nn5cpe68wxmjjh * swarm-manager-2 Ready Active Leader 29.3.0 -3aqljmk6dj41q6g6e2uac83nc swarm-manager-3 Ready Active Reachable 29.3.0 -3l735ukunrkbekq72fi0xzg97 swarm-worker-1 Ready Active 29.3.0 -j3j7o853tn00b38bxo3flbi0l swarm-worker-2 Ready Active 29.3.0 -54hq74d2ey5yjhtqh9hl5ieo9 swarm-worker-3 Ready Active 29.3.0 - -=== endpoint checks === -No endpoint checks configured. diff --git a/ansible/archive/outputs/node-replacement/node-replacement-2026-20260313T140730/cutover-todo.txt b/ansible/archive/outputs/node-replacement/node-replacement-2026-20260313T140730/cutover-todo.txt deleted file mode 100644 index 85bf7ec..0000000 --- a/ansible/archive/outputs/node-replacement/node-replacement-2026-20260313T140730/cutover-todo.txt +++ /dev/null @@ -1,17 +0,0 @@ -EXECUTION MODE ENABLED - -Phase 2 execution switch: -- replacement_phase2_rebuild_and_rejoin=false - -Phase 3 execution switch: -- replacement_phase3_identity_cutover=false - -Phase 4 execution switch: -- replacement_phase4_validate_cutover=true - -Manual steps still required around identity cutover: -1. If phase 2 enabled, rebuild and rejoin replacement swarm nodes on pve04. -2. 
If phase 3 enabled, update inventory/group_vars source-of-truth with rollback snapshots. -3. If phase 4 enabled, validate swarm quorum and optional service endpoints. -4. Move network identity 10.0.0.201 to replacement physical host. -5. If stable and approved, power off old host. diff --git a/ansible/archive/outputs/node-replacement/node-replacement-2026-20260313T140730/phase4-validation-summary.txt b/ansible/archive/outputs/node-replacement/node-replacement-2026-20260313T140730/phase4-validation-summary.txt deleted file mode 100644 index d098916..0000000 --- a/ansible/archive/outputs/node-replacement/node-replacement-2026-20260313T140730/phase4-validation-summary.txt +++ /dev/null @@ -1,17 +0,0 @@ -Project: node-replacement-2026 -Validation manager: swarm-manager-3 -Logical pve01 host: pve01 -Swarm manager identity: swarm-manager-1 -Swarm worker identity: swarm-worker-1 - -=== docker node ls === -ID HOSTNAME STATUS AVAILABILITY MANAGER STATUS ENGINE VERSION -hxcagfwxmrkqoyjo2mgfjeubm swarm-manager-1 Ready Active Reachable 29.3.0 -lalct6bxzf2nn5cpe68wxmjjh swarm-manager-2 Ready Active Leader 29.3.0 -3aqljmk6dj41q6g6e2uac83nc * swarm-manager-3 Ready Active Reachable 29.3.0 -3l735ukunrkbekq72fi0xzg97 swarm-worker-1 Ready Active 29.3.0 -j3j7o853tn00b38bxo3flbi0l swarm-worker-2 Ready Active 29.3.0 -54hq74d2ey5yjhtqh9hl5ieo9 swarm-worker-3 Ready Active 29.3.0 - -=== endpoint checks === -No endpoint checks configured. 
diff --git a/ansible/archive/outputs/node-replacement/node-replacement-apply-20260313-20260313T131217/cutover-todo.txt b/ansible/archive/outputs/node-replacement/node-replacement-apply-20260313-20260313T131217/cutover-todo.txt deleted file mode 100644 index 8e713f1..0000000 --- a/ansible/archive/outputs/node-replacement/node-replacement-apply-20260313-20260313T131217/cutover-todo.txt +++ /dev/null @@ -1,17 +0,0 @@ -EXECUTION MODE ENABLED - -Phase 2 execution switch: -- replacement_phase2_rebuild_and_rejoin=false - -Phase 3 execution switch: -- replacement_phase3_identity_cutover=true - -Phase 4 execution switch: -- replacement_phase4_validate_cutover=false - -Manual steps still required around identity cutover: -1. If phase 2 enabled, rebuild and rejoin replacement swarm nodes on pve04. -2. If phase 3 enabled, update inventory/group_vars source-of-truth with rollback snapshots. -3. If phase 4 enabled, validate swarm quorum and optional service endpoints. -4. Move network identity 10.0.0.201 to replacement physical host. -5. If stable and approved, power off old host. 
diff --git a/ansible/archive/outputs/node-replacement/node-replacement-apply-20260313-20260313T131217/phase3-cutover-summary.txt b/ansible/archive/outputs/node-replacement/node-replacement-apply-20260313-20260313T131217/phase3-cutover-summary.txt deleted file mode 100644 index 9e5c712..0000000 --- a/ansible/archive/outputs/node-replacement/node-replacement-apply-20260313-20260313T131217/phase3-cutover-summary.txt +++ /dev/null @@ -1,11 +0,0 @@ -Project: node-replacement-apply-20260313 -Phase: identity cutover source-of-truth update -Inventory file: /home/chester/homelab/ansible/playbooks/proxmox/../../inventory/hosts.ini -Group vars file: /home/chester/homelab/ansible/playbooks/proxmox/../../group_vars/all.yml -Rollback inventory backup: /home/chester/homelab/ansible/playbooks/proxmox/../../outputs/node-replacement/node-replacement-apply-20260313-20260313T131217/rollback/hosts.ini.pre-cutover -Rollback group vars backup: /home/chester/homelab/ansible/playbooks/proxmox/../../outputs/node-replacement/node-replacement-apply-20260313-20260313T131217/rollback/all.yml.pre-cutover - -Applied updates: -- Removed pve04 from proxmox_cluster inventory: True -- Set physical_backing_host for pve01 to pve04 -- Set replacement_status in pve04 metadata diff --git a/ansible/archive/outputs/node-replacement/node-replacement-apply-20260313-20260313T131217/rollback/all.yml.pre-cutover b/ansible/archive/outputs/node-replacement/node-replacement-apply-20260313-20260313T131217/rollback/all.yml.pre-cutover deleted file mode 100644 index 0e9204e..0000000 --- a/ansible/archive/outputs/node-replacement/node-replacement-apply-20260313-20260313T131217/rollback/all.yml.pre-cutover +++ /dev/null @@ -1,183 +0,0 @@ -# Central YAML Source of Truth for Nathan's Lab (2026) -# Edit and commit this file; Ansible playbooks should read this as canonical. 
-lab_name: "nathan-lab-2026" -canonical_source: "ansible/group_vars/all.yml" - -networks: - main: - vlan: 1 - cidr: "10.0.0.0/24" - dhcp_pool: "10.0.0.100-10.0.0.240" - gateway: "10.0.0.1" - purpose: "Family / wired / main SSID" - - infra: - vlan: 10 - cidr: "10.0.10.0/24" - reserved: "10.0.10.2-10.0.10.50" - purpose: "Management / Proxmox / NAS / Heimdall mgmt" - - iot: - vlan: 50 - cidr: "10.0.50.0/24" - dhcp_pool: "10.0.50.100-10.0.50.199" - purpose: "IoT devices (Omada)" - - guest: - vlan: 30 - cidr: "10.0.30.0/24" - dhcp_pool: "10.0.30.100-10.0.30.200" - purpose: "Guest WiFi (isolated)" - - compute: - vlan: 200 - cidr: "10.0.200.0/24" - purpose: "Swarm / AI grid / ephemeral compute" - -lab_hosts: - er7212pc: - role: gateway - current_ip: "10.0.0.2" - desired_ip: "10.0.0.2" - note: "DHCP + Omada controller" - - pve01: - role: proxmox - current_ip: "10.0.0.201" - desired_ip: "10.0.10.11" - - pve02: - role: proxmox - current_ip: "10.0.0.202" - desired_ip: "10.0.10.12" - - pve03: - role: proxmox - current_ip: "10.0.0.203" - desired_ip: "10.0.10.13" - - pve04: - role: proxmox - current_ip: "10.0.0.204" - desired_ip: "10.0.10.14" - - swarm-manager-1: - current_ip: "10.0.0.211" - desired_ip: "10.0.200.11" - - swarm-manager-2: - current_ip: "10.0.0.212" - desired_ip: "10.0.200.12" - - swarm-manager-3: - current_ip: "10.0.0.213" - desired_ip: "10.0.200.13" - - swarm-worker-1: - current_ip: "10.0.0.221" - desired_ip: "10.0.200.21" - - swarm-worker-2: - current_ip: "10.0.0.222" - desired_ip: "10.0.200.22" - - swarm-worker-3: - current_ip: "10.0.0.223" - desired_ip: "10.0.200.23" - - ai-lenovo: - current_ip: "10.0.0.220" - desired_ip: "10.0.200.20" - - synology: - current_ip: "10.0.0.249" - desired_ip: "10.0.10.40" - - terramaster: - current_ip: "10.0.0.250" - desired_ip: "10.0.10.41" - - waldorf: - current_ip: "10.0.0.251" - desired_ip: "10.0.200.30" - - watchtower: - current_ip: "10.0.0.200" - desired_ip: "10.0.10.200" - - heimdall: - role: beelink - current_ip: null - 
desired_ip: - mgmt: "10.0.10.2" - lan: "10.0.0.50" - -# === MONITORING INFRASTRUCTURE === -# Environment-specific configuration for monitoring stack -monitoring: - stack_user: "chester" - heimdall_redis: "10.0.0.151:6379" - watchtower_ip: "10.0.0.200" - grafana_domain: "grafana.castaldifamily.com" - uptime_domain: "status.castaldifamily.com" - dozzle_domain: "logs.castaldifamily.com" - authentik_host: "https://sso.castaldifamily.com" - # grafana_admin_password: DEFINE IN VAULT - -# === EDGE ROUTING TOPOLOGY === -# Canonical ingress model: Traefik runs on a dedicated edge host outside Swarm. -# Swarm and standalone hosts publish routes through traefik-kop agents. -edge_routing: - ingress_mode: "external-traefik" - edge_host: - name: "heimdall" - ip: "10.0.0.151" - ssh_port: 22 - http_port: 80 - https_port: 443 - integration: - # Watchtower-hosted traefik-kop instance (publishes Watchtower container routes) - agent_image: "ghcr.io/jittering/traefik-kop:latest" - redis_addr: "10.0.0.151:6379" - bind_ip: "10.0.0.200" # Watchtower IP; correct for routes originating on Watchtower - swarm: - # Swarm-hosted traefik-kop instance (publishes Swarm service routes) - # bind_ip MUST be a Swarm node IP: the Swarm routing mesh makes published - # ports available on ALL nodes, so Traefik routes inbound requests here. - bind_ip: "10.0.0.211" # swarm-manager-1; any Swarm node IP is valid via routing mesh - proxy_network: "proxy-net" # Swarm overlay network; separate from heimdall's bridge of same name - stack_deploy_target: "swarm-manager-1" - migration_rules: - deploy_traefik_in_swarm: false - use_external_proxy_network: true - notes: - - "Services should attach to swarm overlay proxy-net for east-west traffic." - - "Ingress is terminated by external Traefik at 10.0.0.151 via traefik-kop updates." 
 - -# === SERVICE SECRETS (set via: ansible-vault encrypt_string) === -vault_gitea_db_password: !vault | - $ANSIBLE_VAULT;1.1;AES256 - 62323135663563386162633134616430633034366465376439663133346634616639376431356165 - 6361376530363938656235623330396530643631616266330a323962373736383339353064633634 - 36636664383530386539366137666632393134366435356634383061643566366335376164656531 - 6464333566326261610a306366346638366439333535393161643066643234653165636636623832 - 3135 - -vlan_defaults: - dns_domain: "home.lab" - ntp_servers: - - "10.0.10.2" - -# Plex bootstrap claim token; used only on first server claim. -vault_plex_claim: !vault | - $ANSIBLE_VAULT;1.1;AES256 - 65626432323737386462666132336161303635633438326432666631383339663835356238343838 - 3533306232623437376263353161633530646533343739300a323730643330386633626661353234 - 31643631346666666431666534613539333835623562306335376534626463633936643838323666 - 6432626262323231660a323965393163366230363838623165643532356438393863346361656162 - 63323966386333323236353861623333623339626538396565643965323562383636 - -# Usage notes: -# - Treat this file as the single source of truth for IPs and VLANs. -# - Ansible playbooks should read `networks` and `lab_hosts` to render configs, -# update `inventory/hosts.ini`, and generate DHCP reservation templates. 
\ No newline at end of file diff --git a/ansible/archive/outputs/node-replacement/node-replacement-apply-20260313-20260313T131217/rollback/hosts.ini.pre-cutover b/ansible/archive/outputs/node-replacement/node-replacement-apply-20260313-20260313T131217/rollback/hosts.ini.pre-cutover deleted file mode 100644 index c17e2ed..0000000 --- a/ansible/archive/outputs/node-replacement/node-replacement-apply-20260313-20260313T131217/rollback/hosts.ini.pre-cutover +++ /dev/null @@ -1,63 +0,0 @@ -# Generated inventory from ../group_vars/all.yml - -# --- Watchtower (local controller) --- -[watchtower] -localhost ansible_connection=local - -# --- Proxmox Cluster (management) --- -[proxmox_cluster] -pve01 ansible_host=10.0.0.201 ansible_user=root ansible_ssh_private_key_file=/home/chester/.ssh/id_ed25519 ansible_port=22 -pve02 ansible_host=10.0.0.202 ansible_user=root ansible_ssh_private_key_file=/home/chester/.ssh/id_ed25519 ansible_port=22 -pve03 ansible_host=10.0.0.203 ansible_user=root ansible_ssh_private_key_file=/home/chester/.ssh/id_ed25519 ansible_port=22 -pve04 ansible_host=10.0.0.204 ansible_user=root ansible_ssh_private_key_file=/home/chester/.ssh/id_ed25519 ansible_port=22 - -[proxmox_cluster:vars] -ansible_user=root -ansible_become=true -ansible_python_interpreter=/usr/bin/python3 - -# --- Swarm Managers --- -[swarm_managers] -swarm-manager-1 ansible_host=10.0.0.211 -swarm-manager-2 ansible_host=10.0.0.212 -swarm-manager-3 ansible_host=10.0.0.213 - -# --- Swarm Workers --- -[swarm_workers] -swarm-worker-1 ansible_host=10.0.0.221 -swarm-worker-2 ansible_host=10.0.0.222 -swarm-worker-3 ansible_host=10.0.0.223 - -[swarm_hosts:children] -swarm_managers -swarm_workers - -[swarm_hosts:vars] -ansible_user=chester -ansible_ssh_private_key_file=/home/chester/.ssh/id_ed25519 - -# --- AI Grid --- -[ai_grid] -ai-lenovo ansible_host=10.0.0.220 - -# --- Docker Hosts --- -[docker_hosts] -heimdall ansible_host=10.0.0.151 -waldorf ansible_host=10.0.0.251 - -# --- Storage --- 
-[storage] -synology ansible_host=10.0.0.249 ansible_scp_if_ssh=True -terramaster ansible_host=10.0.0.250 ansible_scp_if_ssh=True - -# --- Aggregate grouping --- -[ubuntu_lab:children] -swarm_managers -swarm_workers -ai_grid -docker_hosts -storage - -[ubuntu_lab:vars] -ansible_user=chester -ansible_ssh_private_key_file=/home/chester/.ssh/id_ed25519 diff --git a/ansible/archive/outputs_vault_authentik_postgres_password.txt b/ansible/archive/outputs_vault_authentik_postgres_password.txt deleted file mode 100644 index 15f4e2d..0000000 --- a/ansible/archive/outputs_vault_authentik_postgres_password.txt +++ /dev/null @@ -1,9 +0,0 @@ -vault_authentik_postgres_password: !vault | - $ANSIBLE_VAULT;1.1;AES256 - 32396365316438323862616536633232356436656366333561383864393932386531323935313463 - 6235313233303938653530313039363530376439343634370a386263326335356330633332633039 - 37373965303236383463396162356534336661396437383365336630363533383462383165366666 - 3532353937336635330a656633356164383639313433326366316334333538613463336239383663 - 37383263353930333039336534373166616633653239393932613937343164383935363139373935 - 63643430303339396262613135373635636363663662663730326130633666303131383532613262 - 663962393933663230333761623239343365 diff --git a/ansible/archive/outputs_vault_authentik_secret_key.txt b/ansible/archive/outputs_vault_authentik_secret_key.txt deleted file mode 100644 index 775d3f1..0000000 --- a/ansible/archive/outputs_vault_authentik_secret_key.txt +++ /dev/null @@ -1,10 +0,0 @@ -vault_authentik_secret_key: !vault | - $ANSIBLE_VAULT;1.1;AES256 - 63656438336336383936333735303639336131613835313833646331376331346635363062313833 - 3561373665646664393137303533333630336663313366640a343538316162336263393862366235 - 65326239613662376434313539653064666636313037343936356338643663313264366430356639 - 3930316136383166380a636666633737663735306238313534626637656439383664356332396231 - 37326366633861386636326565363338613766643134643830313763646139383763393638633431 - 
38623335333566356235366238313436353333663736316234333761646665663865393339656262 - 33383430633139353163663666373532646466663131666539613061326666363033363832323033 - 37623034333065336430 diff --git a/ansible/archive/playbooks/ai/deploy_ansible_mcp_watchtower.yml b/ansible/archive/playbooks/ai/deploy_ansible_mcp_watchtower.yml deleted file mode 100644 index a7afdf8..0000000 --- a/ansible/archive/playbooks/ai/deploy_ansible_mcp_watchtower.yml +++ /dev/null @@ -1,174 +0,0 @@ ---- -# Deploy a custom Ansible MCP server on Watchtower. -# -# Usage: -# cd /home/chester/homelab/ansible -# ansible-playbook -i inventory/hosts.ini playbooks/ai/deploy_ansible_mcp_watchtower.yml -# -# Validate only: -# ansible-playbook -i inventory/hosts.ini playbooks/ai/deploy_ansible_mcp_watchtower.yml --check - -- name: Deploy Ansible MCP server on Watchtower - hosts: watchtower - become: true - gather_facts: true - - vars: - mcp_service_name: ansible-mcp - mcp_install_dir: /opt/ansible-mcp - mcp_state_dir: /var/lib/ansible-mcp - mcp_user: chester - mcp_group: chester - mcp_transport: streamable-http - mcp_host: 0.0.0.0 - mcp_port: 8449 - - mcp_repo_root: /home/chester/homelab/ansible - mcp_inventory: inventory/hosts.ini - mcp_allowed_playbook_dirs: playbooks - mcp_allowed_playbooks: "" - mcp_api_token: "{{ lookup('env', 'ANSIBLE_MCP_API_TOKEN') | default('', true) }}" - mcp_max_extra_vars_bytes: 16384 - mcp_blocked_extra_vars_keys: "ansible_password,ansible_become_password,vault_password" - - # Full-write mode is enabled by default here to match requested behavior. - # Keep confirm enforcement enabled in server guardrails. 
- mcp_allow_write: true - mcp_require_confirm_for_write: true - - mcp_default_timeout: 900 - mcp_max_timeout: 3600 - - mcp_python_packages: - - ansible-core>=2.16,<2.19 - - mcp>=1.0.0 - - tasks: - - name: Assert API token is configured for HTTP transport - ansible.builtin.assert: - that: - - mcp_transport == "stdio" or (mcp_api_token | length) > 0 - fail_msg: >- - HTTP transport requires ANSIBLE_MCP_API_TOKEN to be set in the control - shell environment before running this playbook. - success_msg: "Transport/auth configuration validated." - - - name: Assert service account exists - ansible.builtin.getent: - database: passwd - key: "{{ mcp_user }}" - - - name: Ensure installation and state directories exist - ansible.builtin.file: - path: "{{ item.path }}" - state: directory - owner: "{{ item.owner }}" - group: "{{ item.group }}" - mode: "{{ item.mode }}" - loop: - - { path: "{{ mcp_install_dir }}", owner: "{{ mcp_user }}", group: "{{ mcp_group }}", mode: "0755" } - - { path: "{{ mcp_state_dir }}", owner: "{{ mcp_user }}", group: "{{ mcp_group }}", mode: "0750" } - - - name: Copy MCP server script - ansible.builtin.copy: - src: ../../scripts/ansible_mcp_server.py - dest: "{{ mcp_install_dir }}/ansible_mcp_server.py" - owner: "{{ mcp_user }}" - group: "{{ mcp_group }}" - mode: "0755" - notify: Restart ansible mcp service - - - name: Ensure Python venv exists - ansible.builtin.command: "python3 -m venv {{ mcp_install_dir }}/.venv" - args: - creates: "{{ mcp_install_dir }}/.venv/bin/python" - changed_when: false - - - name: Install MCP server dependencies in venv - ansible.builtin.pip: - name: "{{ mcp_python_packages }}" - virtualenv: "{{ mcp_install_dir }}/.venv" - state: present - notify: Restart ansible mcp service - - - name: Install systemd unit for ansible mcp service - ansible.builtin.copy: - dest: "/etc/systemd/system/{{ mcp_service_name }}.service" - owner: root - group: root - mode: "0644" - content: | - [Unit] - Description=Ansible MCP Server - 
Wants=network-online.target - After=network-online.target - - [Service] - Type=simple - User={{ mcp_user }} - Group={{ mcp_group }} - WorkingDirectory={{ mcp_repo_root }} - Environment=ANSIBLE_MCP_REPO_ROOT={{ mcp_repo_root }} - Environment=ANSIBLE_MCP_INVENTORY={{ mcp_inventory }} - Environment=ANSIBLE_MCP_ALLOWED_PLAYBOOK_DIRS={{ mcp_allowed_playbook_dirs }} - Environment=ANSIBLE_MCP_ALLOWED_PLAYBOOKS={{ mcp_allowed_playbooks }} - Environment=ANSIBLE_MCP_API_TOKEN={{ mcp_api_token }} - Environment=ANSIBLE_MCP_ALLOW_WRITE={{ mcp_allow_write | ternary('true', 'false') }} - Environment=ANSIBLE_MCP_REQUIRE_CONFIRM={{ mcp_require_confirm_for_write | ternary('true', 'false') }} - Environment=ANSIBLE_MCP_DEFAULT_TIMEOUT={{ mcp_default_timeout }} - Environment=ANSIBLE_MCP_MAX_TIMEOUT={{ mcp_max_timeout }} - Environment=ANSIBLE_MCP_MAX_EXTRA_VARS_BYTES={{ mcp_max_extra_vars_bytes }} - Environment=ANSIBLE_MCP_BLOCKED_EXTRA_VARS_KEYS={{ mcp_blocked_extra_vars_keys }} - Environment=ANSIBLE_MCP_STATE_DIR={{ mcp_state_dir }} - Environment=ANSIBLE_MCP_TRANSPORT={{ mcp_transport }} - Environment=ANSIBLE_MCP_HOST={{ mcp_host }} - Environment=ANSIBLE_MCP_PORT={{ mcp_port }} - ExecStart={{ mcp_install_dir }}/.venv/bin/python {{ mcp_install_dir }}/ansible_mcp_server.py --transport {{ mcp_transport }} --host {{ mcp_host }} --port {{ mcp_port }} - Restart=on-failure - RestartSec=3 - - [Install] - WantedBy=multi-user.target - notify: - - Reload systemd - - Restart ansible mcp service - - - name: Ensure ansible mcp service is enabled and running - ansible.builtin.systemd: - name: "{{ mcp_service_name }}" - enabled: true - state: started - - - name: Verify MCP health endpoint - ansible.builtin.uri: - url: "http://127.0.0.1:{{ mcp_port }}" - method: GET - return_content: true - status_code: 200 - changed_when: false - register: _mcp_http_probe - failed_when: false - - - name: Show deployment summary - ansible.builtin.debug: - msg: - - "Ansible MCP deployed to watchtower" - - "Service: {{ 
mcp_service_name }}" - - "Transport: {{ mcp_transport }}" - - "Endpoint: {{ mcp_host }}:{{ mcp_port }}" - - "Repo root: {{ mcp_repo_root }}" - - "Allow write: {{ mcp_allow_write }}" - - "Auth enabled: {{ (mcp_api_token | length) > 0 }}" - - "Require confirm for write: {{ mcp_require_confirm_for_write }}" - - "Explicit playbook allowlist set: {{ (mcp_allowed_playbooks | length) > 0 }}" - - "HTTP probe status: {{ _mcp_http_probe.status | default('n/a') }}" - - handlers: - - name: Reload systemd - ansible.builtin.systemd: - daemon_reload: true - - - name: Restart ansible mcp service - ansible.builtin.systemd: - name: "{{ mcp_service_name }}" - state: restarted diff --git a/ansible/archive/playbooks/ai/test_ollama.yml b/ansible/archive/playbooks/ai/test_ollama.yml deleted file mode 100644 index 47e3427..0000000 --- a/ansible/archive/playbooks/ai/test_ollama.yml +++ /dev/null @@ -1,74 +0,0 @@ ---- -- name: Test Karakeep to Ollama connection - hosts: localhost - gather_facts: false - - vars: - karakeep_host: "10.0.0.251" - ollama_host: "10.0.0.220" - ollama_port: 11434 - container_name: "hoarder-web" - - tasks: - - name: Check Ollama API is reachable - ansible.builtin.uri: - url: "http://{{ ollama_host }}:{{ ollama_port }}/api/tags" - method: GET - return_content: true - status_code: 200 - register: ollama_check - changed_when: false - - - name: Show available models - ansible.builtin.debug: - msg: "Ollama models: {{ ollama_check.json.models | map(attribute='name') | list }}" - - - name: Test connectivity from Karakeep container - community.docker.docker_container_exec: - container: "{{ container_name }}" - command: "/bin/sh -c 'wget -qO- http://{{ ollama_host }}:{{ ollama_port }}/api/tags'" - delegate_to: "{{ karakeep_host }}" - vars: - ansible_user: chester - ansible_ssh_private_key_file: /home/chester/.ssh/id_ed25519 - register: container_test - changed_when: false - - - name: Verify container can reach Ollama - ansible.builtin.assert: - that: - - "'models' in 
container_test.stdout" - success_msg: "Container can reach Ollama" - fail_msg: "Container cannot reach Ollama" - - - name: Extract Ollama-related environment variables - community.docker.docker_container_info: - name: "{{ container_name }}" - delegate_to: "{{ karakeep_host }}" - vars: - ansible_user: chester - ansible_ssh_private_key_file: /home/chester/.ssh/id_ed25519 - register: container_info - - - name: Show configuration - ansible.builtin.debug: - msg: "{{ container_info.container.Config.Env | select('match', '^(OLLAMA|INFERENCE).*') | list }}" - - - name: Verify configuration is correct - ansible.builtin.assert: - that: - - "'OLLAMA_BASE_URL=http://' + ollama_host + ':' + (ollama_port | string) in container_info.container.Config.Env" - - "'INFERENCE_TEXT_MODEL=llama3.1:8b' in container_info.container.Config.Env" - - "'INFERENCE_IMAGE_MODEL=llama3.2-vision:11b' in container_info.container.Config.Env" - success_msg: "Configuration is correct" - fail_msg: "Configuration needs updating" - - - name: Display validation summary - ansible.builtin.debug: - msg: - - "Validation complete" - - "Ollama: {{ ollama_host }}:{{ ollama_port }}" - - "Karakeep: {{ karakeep_host }}" - - "Container: {{ container_name }}" - - "Connection: Working" - - "Config: Valid" diff --git a/ansible/archive/playbooks/ai/validate_karakeep.yml b/ansible/archive/playbooks/ai/validate_karakeep.yml deleted file mode 100644 index a4ba7c6..0000000 --- a/ansible/archive/playbooks/ai/validate_karakeep.yml +++ /dev/null @@ -1,152 +0,0 @@ ---- -- name: Validate Ollama service and models - hosts: ai_grid - gather_facts: true - tags: [ollama, models] - - vars: - ollama_base_url: "http://{{ ansible_host }}:11434" - required_models: - - name: "llama3.1:8b" - type: "text" - - name: "llama3.2-vision:11b" - type: "vision" - - tasks: - - name: Check Ollama service is responding - ansible.builtin.uri: - url: "{{ ollama_base_url }}/api/tags" - method: GET - return_content: true - status_code: 200 - register: 
ollama_response - changed_when: false - - - name: Parse available models - ansible.builtin.set_fact: - available_models: "{{ ollama_response.json.models | map(attribute='name') | list }}" - - - name: Display available models - ansible.builtin.debug: - msg: "Available models: {{ available_models }}" - - - name: Verify required models are installed - ansible.builtin.assert: - that: - - item.name in available_models - fail_msg: "Required model {{ item.name }} ({{ item.type }}) is not installed" - success_msg: "Model {{ item.name }} ({{ item.type }}) is available" - loop: "{{ required_models }}" - loop_control: - label: "{{ item.name }}" - - - name: Test text model inference - ansible.builtin.uri: - url: "{{ ollama_base_url }}/api/generate" - method: POST - body_format: json - body: - model: "llama3.1:8b" - prompt: "Hello" - stream: false - return_content: true - status_code: 200 - timeout: 30 - register: text_inference_test - changed_when: false - - - name: Verify text model response - ansible.builtin.assert: - that: - - text_inference_test.json.response is defined - - text_inference_test.json.response | length > 0 - success_msg: "Text model inference successful" - fail_msg: "Text model inference failed" - - - name: Show Ollama validation summary - ansible.builtin.debug: - msg: - - "Ollama validation passed" - - "Host: {{ inventory_hostname }} ({{ ansible_host }})" - - "Models available: {{ available_models | length }}" - - "Text inference: Working" - -- name: Validate legacy Karakeep integration - hosts: localhost - gather_facts: false - tags: [karakeep, integration, legacy] - vars: - test_legacy_karakeep: "{{ test_legacy_karakeep | default(false) }}" - container_name: "hoarder-web" - ollama_host: "10.0.0.220" - ollama_port: 11434 - legacy_host: "{{ legacy_host | default('10.0.0.251') }}" - - tasks: - - name: Skip legacy validation when disabled - ansible.builtin.meta: end_play - when: not (test_legacy_karakeep | bool) - - - name: Check whether Karakeep container is 
running - community.docker.docker_container_info: - name: "{{ container_name }}" - delegate_to: "{{ legacy_host }}" - vars: - ansible_user: chester - ansible_ssh_private_key_file: /home/chester/.ssh/id_ed25519 - register: karakeep_container - - - name: Verify Karakeep container status - ansible.builtin.assert: - that: - - karakeep_container.exists - - karakeep_container.container.State.Running - - karakeep_container.container.State.Health.Status == "healthy" - fail_msg: "Karakeep container is not running or unhealthy" - success_msg: "Karakeep container is running and healthy" - - - name: Extract Ollama environment values - ansible.builtin.set_fact: - ollama_config: "{{ karakeep_container.container.Config.Env | select('match', '^(OLLAMA|INFERENCE).*') | list }}" - - - name: Verify Karakeep Ollama environment variables - ansible.builtin.assert: - that: - - "'OLLAMA_BASE_URL=http://' + ollama_host + ':' + (ollama_port | string) in ollama_config" - - "'INFERENCE_TEXT_MODEL=llama3.1:8b' in ollama_config" - - "'INFERENCE_IMAGE_MODEL=llama3.2-vision:11b' in ollama_config" - fail_msg: "Ollama environment variables are incorrect" - success_msg: "Ollama environment variables are correctly configured" - - - name: Test Ollama connectivity from Karakeep container - community.docker.docker_container_exec: - container: "{{ container_name }}" - command: "/bin/sh -c 'wget -qO- http://{{ ollama_host }}:{{ ollama_port }}/api/tags'" - delegate_to: "{{ legacy_host }}" - vars: - ansible_user: chester - ansible_ssh_private_key_file: /home/chester/.ssh/id_ed25519 - register: container_connectivity - changed_when: false - failed_when: container_connectivity.rc != 0 - - - name: Verify container can reach Ollama API - ansible.builtin.assert: - that: - - "'models' in container_connectivity.stdout" - success_msg: "Karakeep container can reach Ollama API" - fail_msg: "Karakeep container cannot reach Ollama API" - -- name: Display integration test summary - hosts: localhost - gather_facts: false 
- tags: [summary] - - tasks: - - name: Show final validation report - ansible.builtin.debug: - msg: - - "Service validation complete" - - "Ollama endpoint: http://10.0.0.220:11434" - - "Models: llama3.1:8b, llama3.2-vision:11b" - - "Legacy Karakeep tested: {{ test_legacy_karakeep | default(false) }}" diff --git a/ansible/archive/playbooks/docker/bootstrap_swarm.yml b/ansible/archive/playbooks/docker/bootstrap_swarm.yml deleted file mode 100644 index c0a2722..0000000 --- a/ansible/archive/playbooks/docker/bootstrap_swarm.yml +++ /dev/null @@ -1,49 +0,0 @@ ---- -# Bootstrap Docker and Swarm cluster state for all swarm nodes. - -# -------------------------------------------------- -# PRE-PLAY: Ensure NFS storage mounts are present before Swarm starts. -# WHY first: Docker bind-mount paths (/mnt/homelab, /mnt/media) must exist -# as live NFS mounts before any stack deploy runs. If absent, Docker -# creates an empty local directory instead (silent wrong-state behavior). -# WHY storage_mounts role: idempotent via ansible.posix.mount; safe to re-run -# on already-mounted hosts (no-op when mount table already matches fstab). -# -------------------------------------------------- -- name: Ensure NFS storage mounts are present on all Swarm nodes - hosts: swarm_hosts - become: true - gather_facts: true - roles: - - storage_mounts - -# -------------------------------------------------- -# PRE-PLAY: Ensure the operational user is in the docker group on every node. -# WHY separate play: the swarm_bootstrap role runs from `hosts: localhost` via -# delegate_to, so `--limit swarm-node` silently skips that play. Running this -# directly on swarm_hosts makes it independently targetable and idempotent. -# WHY before the bootstrap play: docker daemon must accept socket connections -# from ansible_user before any subsequent docker-cli tasks succeed. 
-# -------------------------------------------------- -- name: Ensure docker group membership for the operational user on all swarm nodes - hosts: swarm_hosts - become: true - gather_facts: false - tags: [docker-users, docker-install] - tasks: - - name: Add ansible user to the docker group - ansible.builtin.user: - name: "{{ ansible_user }}" - groups: docker - append: true - -- name: Bootstrap Docker Swarm cluster - hosts: localhost - gather_facts: false - vars_files: - - ../../group_vars/all.yml - - tasks: - - name: Run swarm bootstrap role from the primary manager context - ansible.builtin.include_role: - name: swarm_bootstrap - tags: [swarm-join] diff --git a/ansible/archive/playbooks/docker/deploy_authentik.yml b/ansible/archive/playbooks/docker/deploy_authentik.yml deleted file mode 100644 index ae9d12b..0000000 --- a/ansible/archive/playbooks/docker/deploy_authentik.yml +++ /dev/null @@ -1,186 +0,0 @@ ---- -# playbooks/docker/deploy_authentik.yml -# -# Purpose: -# Deploy Authentik as a Swarm stack pinned to swarm-manager-1 with persistent -# bind mounts under /mnt/homelab/apps/authentik. -# -# Data protection: -# This playbook validates all required Authentik data paths before deploy. -# If paths are missing, deployment fails early to avoid creating empty data -# roots that could mask or diverge from an existing Authentik installation. 
-# -# Usage: -# ansible-playbook -i inventory/hosts.ini playbooks/docker/deploy_authentik.yml -# -# ansible-playbook -i inventory/hosts.ini playbooks/docker/deploy_authentik.yml \ -# -e "stack_validate_only=true" -# -# ansible-playbook -i inventory/hosts.ini playbooks/docker/deploy_authentik.yml \ -# -e "authentik_deploy_state=absent" - -- name: Deploy Authentik Swarm stack - hosts: swarm_managers - become: true - gather_facts: false - vars_files: - - ../../group_vars/all.yml - vars: - authentik_deploy_target: "{{ edge_routing.swarm.stack_deploy_target | default(groups['swarm_managers'][0]) }}" - - tasks: - # -------------------------------------------------- - # STEP 0: Assert required secrets are present - # -------------------------------------------------- - - - name: Assert vault_authentik_secret_key is defined and non-empty - ansible.builtin.assert: - that: - - vault_authentik_secret_key is defined - - vault_authentik_secret_key | trim | length > 0 - fail_msg: >- - vault_authentik_secret_key is not defined or is empty. - Encrypt and store it in group_vars/vault/all.yml with: - ansible-vault encrypt_string 'your-random-secret' --name 'vault_authentik_secret_key' - when: inventory_hostname == authentik_deploy_target - - - name: Assert vault_authentik_postgres_password is defined and non-empty - ansible.builtin.assert: - that: - - vault_authentik_postgres_password is defined - - vault_authentik_postgres_password | trim | length > 0 - fail_msg: >- - vault_authentik_postgres_password is not defined or is empty. 
- Encrypt and store it in group_vars/vault/all.yml with: - ansible-vault encrypt_string 'your-db-password' --name 'vault_authentik_postgres_password' - when: inventory_hostname == authentik_deploy_target - - - name: Assert Authentik secrets are not placeholders - ansible.builtin.assert: - that: - - vault_authentik_secret_key not in ['change-me', 'changeme', 'your-random-secret'] - - vault_authentik_postgres_password not in ['change-me', 'changeme', 'your-db-password'] - fail_msg: "Authentik secrets still appear to be placeholders. Set real vault values before deploy." - when: inventory_hostname == authentik_deploy_target - - # -------------------------------------------------- - # STEP 1: Assert Swarm manager is active - # -------------------------------------------------- - - - name: Collect Swarm manager state - ansible.builtin.command: > - docker info --format '{{ "{{" }}.Swarm.LocalNodeState{{ "}}" }}|{{ "{{" }}.Swarm.ControlAvailable{{ "}}" }}' - register: _swarm_info - changed_when: false - when: inventory_hostname == authentik_deploy_target - - - name: Assert target is an active Swarm manager - ansible.builtin.assert: - that: - - _swarm_info.stdout is search('active') - - _swarm_info.stdout is search('true') - fail_msg: >- - {{ inventory_hostname }} must be an active Swarm manager. 
- Current state: {{ _swarm_info.stdout | default('unknown') }} - when: inventory_hostname == authentik_deploy_target - - # -------------------------------------------------- - # STEP 2: Validate pre-existing persistent data paths - # -------------------------------------------------- - - - name: Stat required Authentik bind-mount paths - ansible.builtin.stat: - path: "{{ item }}" - register: _authentik_path_stat - loop: - - /mnt/homelab/apps/authentik - - /mnt/homelab/apps/authentik/data - - /mnt/homelab/apps/authentik/data/database - - /mnt/homelab/apps/authentik/data/redis - - /mnt/homelab/apps/authentik/data/media - - /mnt/homelab/apps/authentik/data/config - - /mnt/homelab/apps/authentik/data/blueprints - when: inventory_hostname == authentik_deploy_target - - - name: Assert required Authentik paths exist before deploy - ansible.builtin.assert: - that: - - item.stat.exists - - item.stat.isdir - fail_msg: >- - Required Authentik path '{{ item.item }}' is missing on {{ inventory_hostname }}. - Create/restore this directory first to avoid accidental fresh bootstrap over existing data. - loop: "{{ _authentik_path_stat.results }}" - when: inventory_hostname == authentik_deploy_target - - # -------------------------------------------------- - # STEP 3: Deploy Authentik stack - # -------------------------------------------------- - - - name: Deploy Authentik stack - ansible.builtin.include_role: - name: swarm_stack_deploy - vars: - stack_name: "authentik" - stack_compose_src: "{{ playbook_dir }}/../../templates/stacks/authentik.stack.yml" - # authentik_placement_node resolved from group_vars (swarm-manager-2) - # Use service-specific state var to avoid self-reference recursion. 
- stack_state: "{{ authentik_deploy_state | default('present') }}" - stack_required_external_networks: - - proxy-net - stack_required_directories: - - /mnt/homelab/apps/authentik - - /mnt/homelab/apps/authentik/data - - /mnt/homelab/apps/authentik/data/database - - /mnt/homelab/apps/authentik/data/redis - - /mnt/homelab/apps/authentik/data/media - - /mnt/homelab/apps/authentik/data/config - - /mnt/homelab/apps/authentik/data/blueprints - when: inventory_hostname == authentik_deploy_target - - # -------------------------------------------------- - # STEP 4: Wait for service convergence - # -------------------------------------------------- - - - name: Wait for Authentik server service to converge - ansible.builtin.command: > - docker service ls --filter name=authentik_authentik-server --format '{{ "{{" }}.Replicas{{ "}}" }}' - register: _authentik_server_replicas - retries: 18 - delay: 10 - until: _authentik_server_replicas.stdout is search('1/1') - changed_when: false - when: - - inventory_hostname == authentik_deploy_target - - authentik_deploy_state | default('present') == 'present' - - not ansible_check_mode - tags: [verify] - - - name: Wait for Authentik worker service to converge - ansible.builtin.command: > - docker service ls --filter name=authentik_authentik-worker --format '{{ "{{" }}.Replicas{{ "}}" }}' - register: _authentik_worker_replicas - retries: 18 - delay: 10 - until: _authentik_worker_replicas.stdout is search('1/1') - changed_when: false - when: - - inventory_hostname == authentik_deploy_target - - authentik_deploy_state | default('present') == 'present' - - not ansible_check_mode - tags: [verify] - - - name: Report Authentik deployment result - ansible.builtin.debug: - msg: - - "================================================" - - "Authentik deployment complete." 
- - "================================================" - - "Stack : authentik" - - "Manager : {{ inventory_hostname }} ({{ ansible_host | default('') }})" - - "URL : https://sso.castaldifamily.com" - - "Data root : /mnt/homelab/apps/authentik" - - "Services : authentik-postgres, authentik-redis, authentik-server, authentik-worker" - - "================================================" - when: inventory_hostname == authentik_deploy_target - tags: [always] \ No newline at end of file diff --git a/ansible/archive/playbooks/docker/deploy_authentik_standalone.yml b/ansible/archive/playbooks/docker/deploy_authentik_standalone.yml deleted file mode 100644 index e273811..0000000 --- a/ansible/archive/playbooks/docker/deploy_authentik_standalone.yml +++ /dev/null @@ -1,173 +0,0 @@ ---- -# playbooks/docker/deploy_authentik_standalone.yml -# Deploy Authentik on a standalone Docker host (statler by default). - -- name: Deploy Authentik on standalone Docker host - hosts: "{{ target_host | default('statler') }}" - become: true - gather_facts: false - vars_files: - - ../../group_vars/all.yml - - vars: - authentik_base_dir: "{{ standalone_authentik_base_dir | default('/mnt/homelab/apps/authentik') }}" - authentik_db_dir: "{{ authentik_base_dir }}/data/database" - authentik_redis_dir: "{{ authentik_base_dir }}/data/redis" - authentik_media_dir: "{{ authentik_base_dir }}/data/media" - authentik_config_dir: "{{ authentik_base_dir }}/data/config" - authentik_blueprints_dir: "{{ authentik_base_dir }}/data/blueprints" - authentik_network: "proxy-net" - authentik_host_domain: "{{ standalone_authentik_domain | default('sso.castaldifamily.com') }}" - authentik_bind_ip: "{{ ansible_host }}" - authentik_redis_addr: "{{ edge_routing.integration.redis_addr }}" - - tasks: - - name: Assert target_host is explicit and safe - ansible.builtin.assert: - that: - - target_host is defined - - target_host | length > 0 - - target_host not in ['all', '*', 'ubuntu_lab', 'docker_hosts', 'swarm_hosts'] - 
fail_msg: >- - Invalid target_host scope. Use an explicit host, for example: - -e "target_host=statler" - run_once: true - delegate_to: localhost - - - name: Assert Authentik secrets are available and decrypted - ansible.builtin.assert: - that: - - vault_authentik_secret_key is defined - - vault_authentik_secret_key | trim | length > 0 - - vault_authentik_postgres_password is defined - - vault_authentik_postgres_password | trim | length > 0 - - vault_authentik_secret_key is not search('^\\$ANSIBLE_VAULT;') - - vault_authentik_postgres_password is not search('^\\$ANSIBLE_VAULT;') - fail_msg: >- - Authentik secrets are unavailable or not decrypted. - Ensure vault credentials are available before deployment. - - - name: Ensure Authentik app directories exist - ansible.builtin.file: - path: "{{ item }}" - state: directory - owner: "1000" - group: "1000" - mode: '0755' - loop: - - "{{ authentik_base_dir }}" - - "{{ authentik_media_dir }}" - - "{{ authentik_config_dir }}" - - "{{ authentik_blueprints_dir }}" - - - name: Ensure Authentik service data directories exist - ansible.builtin.file: - path: "{{ item }}" - state: directory - mode: '0755' - loop: - - "{{ authentik_db_dir }}" - - "{{ authentik_redis_dir }}" - - - name: Ensure Authentik network exists - community.docker.docker_network: - name: "{{ authentik_network }}" - state: present - - - name: Deploy Authentik Postgres - community.docker.docker_container: - name: authentik-postgres - image: docker.io/library/postgres:16-alpine - pull: always - restart_policy: unless-stopped - state: started - env: - TZ: America/New_York - POSTGRES_DB: authentik - POSTGRES_USER: authentik - POSTGRES_PASSWORD: "{{ vault_authentik_postgres_password }}" - volumes: - - "{{ authentik_db_dir }}:/var/lib/postgresql/data" - networks: - - name: "{{ authentik_network }}" - - - name: Deploy Authentik Redis - community.docker.docker_container: - name: authentik-redis - image: redis:7-alpine - pull: always - command: - - --save - - "60" - - 
"1" - - --loglevel - - warning - restart_policy: unless-stopped - state: started - volumes: - - "{{ authentik_redis_dir }}:/data" - networks: - - name: "{{ authentik_network }}" - - - name: Deploy Authentik server with Traefik labels - community.docker.docker_container: - name: authentik-server - image: ghcr.io/goauthentik/server:2025.10.1 - pull: always - command: ["server"] - restart_policy: unless-stopped - state: started - published_ports: - - "9000:9000" - env: - TZ: America/New_York - AUTHENTIK_POSTGRESQL__HOST: authentik-postgres - AUTHENTIK_POSTGRESQL__NAME: authentik - AUTHENTIK_POSTGRESQL__USER: authentik - AUTHENTIK_POSTGRESQL__PASSWORD: "{{ vault_authentik_postgres_password }}" - AUTHENTIK_SECRET_KEY: "{{ vault_authentik_secret_key }}" - AUTHENTIK_REDIS__HOST: authentik-redis - volumes: - - "{{ authentik_media_dir }}:/media" - - "{{ authentik_config_dir }}:/config" - - "{{ authentik_blueprints_dir }}:/blueprints/custom:ro" - networks: - - name: "{{ authentik_network }}" - labels: - traefik.enable: "true" - traefik.http.routers.authentik.rule: "Host(`{{ authentik_host_domain }}`)" - traefik.http.routers.authentik.entrypoints: websecure - traefik.http.routers.authentik.tls: "true" - traefik.http.routers.authentik.tls.certresolver: cloudflare - traefik.http.services.authentik.loadbalancer.server.port: "9000" - - - name: Deploy Authentik worker - community.docker.docker_container: - name: authentik-worker - image: ghcr.io/goauthentik/server:2025.10.1 - pull: always - command: ["worker"] - restart_policy: unless-stopped - state: started - env: - TZ: America/New_York - AUTHENTIK_POSTGRESQL__HOST: authentik-postgres - AUTHENTIK_POSTGRESQL__NAME: authentik - AUTHENTIK_POSTGRESQL__USER: authentik - AUTHENTIK_POSTGRESQL__PASSWORD: "{{ vault_authentik_postgres_password }}" - AUTHENTIK_SECRET_KEY: "{{ vault_authentik_secret_key }}" - AUTHENTIK_REDIS__HOST: authentik-redis - volumes: - - "{{ authentik_media_dir }}:/media" - - "{{ authentik_config_dir }}:/config" - 
networks: - - name: "{{ authentik_network }}" - - - name: Show deployment summary - ansible.builtin.debug: - msg: - - "Standalone Authentik deployed to {{ inventory_hostname }}" - - "Base dir: {{ authentik_base_dir }}" - - "Domain: {{ authentik_host_domain }}" - - "Traefik-kop Redis: {{ authentik_redis_addr }}" - - "Bind IP: {{ authentik_bind_ip }}" diff --git a/ansible/archive/playbooks/docker/deploy_example_stack.yml b/ansible/archive/playbooks/docker/deploy_example_stack.yml deleted file mode 100644 index 1659be0..0000000 --- a/ansible/archive/playbooks/docker/deploy_example_stack.yml +++ /dev/null @@ -1,178 +0,0 @@ ---- -# ============================================================================= -# FUTURE-STACK DEPLOYMENT BLUEPRINT: copy, rename, and fill in TODO items. -# This playbook is the minimum viable deploy playbook for any new Swarm stack. -# -# COPY CHECKLIST: -# 1. Rename this file to deploy_.yml -# 2. Search for TODO and fill in every occurrence -# 3. Run validate-only first: -# ansible-playbook -i inventory/hosts.ini playbooks/docker/deploy_.yml \ -# -e "stack_validate_only=true" -# 4. Run full deploy and verify convergence -# 5. Run deploy a second time and confirm "changed=0" (idempotency proof) -# ============================================================================= -# -# IDEMPOTENCY CONTRACT (required for all new stacks): -# - All required secrets MUST be asserted before any Swarm state is touched. -# - All required bind-mount paths MUST be statted and asserted before deploy. -# - All command/shell tasks MUST declare changed_when. -# - validate-only mode MUST work without any Swarm mutations. -# - Deploy MUST be replay-safe: running twice produces no unintended changes.
-# -# Usage: -# Normal deploy: -# ansible-playbook -i inventory/hosts.ini playbooks/docker/deploy_.yml -# -# Validate only (no Swarm changes): -# ansible-playbook -i inventory/hosts.ini playbooks/docker/deploy_.yml \ -# -e "stack_validate_only=true" -# -# Tear down: -# ansible-playbook -i inventory/hosts.ini playbooks/docker/deploy_.yml \ -# -e "_deploy_state=absent" - -# TODO: set the play name and stack name. -- name: Deploy Swarm stack - hosts: swarm_managers - become: true - gather_facts: false - vars_files: - - ../../group_vars/all.yml - vars: - # TODO: set the deploy target. Default: first Swarm manager. - _deploy_target: "{{ groups['swarm_managers'][0] }}" - - tasks: - - # -------------------------------------------------- - # STEP 0: Assert required secrets are present - # WHY: Fail before any Swarm state is touched. An empty/placeholder secret - # causes a silent misconfiguration that is hard to diagnose at runtime. - # -------------------------------------------------- - - # TODO: add one assert block per required vault variable. - # Remove this block entirely if the stack has no secrets. - - name: Assert vault__secret is defined and non-empty - ansible.builtin.assert: - that: - - vault_example_secret is defined - - vault_example_secret | trim | length > 0 - - vault_example_secret not in ['change-me', 'changeme', 'TODO'] - fail_msg: >- - vault_example_secret is not defined, empty, or still a placeholder. - Encrypt a real value with: - ansible-vault encrypt_string 'value' --name 'vault_example_secret' - then add it to group_vars/vault/all.yml. - when: inventory_hostname == _deploy_target - - # -------------------------------------------------- - # STEP 1: Assert Swarm manager is active - # WHY: Exact equality check prevents 'inactive' passing as a substring of - # 'active' via regex. Docker format yields 'active|true' for a healthy - # manager and nothing else valid. 
- # -------------------------------------------------- - - - name: Collect Swarm manager state - ansible.builtin.command: > - docker info --format '{{ "{{" }}.Swarm.LocalNodeState{{ "}}" }}|{{ "{{" }}.Swarm.ControlAvailable{{ "}}" }}' - register: _swarm_info - changed_when: false - when: inventory_hostname == _deploy_target - - - name: Assert target is an active Swarm manager - ansible.builtin.assert: - that: - - _swarm_info.stdout == 'active|true' - fail_msg: >- - {{ inventory_hostname }} must be an active Swarm manager. - Expected 'active|true', got '{{ _swarm_info.stdout | default('unknown') }}'. - when: inventory_hostname == _deploy_target - - # -------------------------------------------------- - # STEP 2: Validate required bind-mount paths - # WHY: A missing path causes the service to start against an empty/wrong - # directory. Pre-existence assertion protects against accidental fresh - # bootstrap over existing data. - # TODO: add/remove paths to match the stack's volume mounts. - # IMPORTANT: do NOT create missing paths here; require the operator to - # provision or restore them first (data safety). - # -------------------------------------------------- - - - name: Stat required bind-mount paths - ansible.builtin.stat: - path: "{{ item }}" - register: _path_stat - loop: - - /mnt/homelab/apps/example/data # TODO: adjust per service - when: inventory_hostname == _deploy_target - - - name: Assert required paths exist before deploy - ansible.builtin.assert: - that: - - item.stat.exists - - item.stat.isdir - fail_msg: >- - Required path '{{ item.item }}' is missing on {{ inventory_hostname }}. - Create or restore this directory before deploying.
- loop: "{{ _path_stat.results }}" - when: inventory_hostname == _deploy_target - - # -------------------------------------------------- - # STEP 3: Deploy stack via shared role - # WHY swarm_stack_deploy: handles template render, YAML syntax validation, - # external-network pre-check, bind-mount directory creation, and - # idempotent docker stack deploy with correct changed semantics. - # -------------------------------------------------- - - - name: Deploy stack - ansible.builtin.include_role: - name: swarm_stack_deploy - vars: - stack_name: "example" # TODO: change to service name - stack_compose_src: "{{ playbook_dir }}/../../templates/stacks/example.service.stack.yml" # TODO: change path - # WHY _deploy_state (not stack_state): using stack_state here - # creates a Jinja2 self-reference loop inside the role. Use a - # service-specific var that defaults cleanly. - stack_state: "{{ example_deploy_state | default('present') }}" # TODO: rename var - stack_required_external_networks: - - proxy-net - # OPTIONAL: directories the role should CREATE if absent (non-data dirs). - # Do NOT list data directories here; assert their existence in STEP 2. - stack_required_directories: [] - when: inventory_hostname == _deploy_target - - # -------------------------------------------------- - # STEP 4: Wait for service convergence - # WHY: Confirms the scheduler placed and started the task successfully. - # changed_when: false because querying replica count is read-only. - # TODO: adjust filter name and replica count to match stack_name.
- # -------------------------------------------------- - - - name: Wait for to converge - ansible.builtin.command: > - docker service ls --filter name=example_example-app --format '{{ "{{" }}.Replicas{{ "}}" }}' - register: _replicas - retries: 12 - delay: 10 - until: _replicas.stdout is search('1/1') - changed_when: false - when: - - inventory_hostname == _deploy_target - - example_deploy_state | default('present') == 'present' - - not ansible_check_mode - tags: [verify] - - - name: Report deployment result - ansible.builtin.debug: - msg: - - "================================================" - - " deployment complete." # TODO: rename - - "================================================" - - "Stack : example" # TODO: rename - - "Manager : {{ inventory_hostname }} ({{ ansible_host | default('') }})" - - "URL : https://example.castaldifamily.com" # TODO: change - - "Data : /mnt/homelab/apps/example" # TODO: change - - "================================================" - when: inventory_hostname == _deploy_target - tags: [always] diff --git a/ansible/archive/playbooks/docker/deploy_gitea.yml b/ansible/archive/playbooks/docker/deploy_gitea.yml deleted file mode 100644 index 89cef61..0000000 --- a/ansible/archive/playbooks/docker/deploy_gitea.yml +++ /dev/null @@ -1,158 +0,0 @@ ---- -# playbooks/docker/deploy_gitea.yml -# -# Purpose: -# Deploy Gitea as a Swarm stack pinned to swarm-manager-1, with a dedicated -# Postgres sidecar and persistent bind mounts under /mnt/homelab/apps/gitea. -# -# Data protection: -# Preflight checks require all data paths to exist before deploy. -# If paths are missing, deployment fails early to avoid creating an empty -# data root over an existing Gitea installation. 
-# -# Usage: -# ansible-playbook -i inventory/hosts.ini playbooks/docker/deploy_gitea.yml -# -# Validate only (no Swarm mutations): -# ansible-playbook -i inventory/hosts.ini playbooks/docker/deploy_gitea.yml \ -# -e "stack_validate_only=true" -# -# Tear down: -# ansible-playbook -i inventory/hosts.ini playbooks/docker/deploy_gitea.yml \ -# -e "gitea_deploy_state=absent" - -- name: Deploy Gitea Swarm stack - hosts: swarm_managers - become: true - gather_facts: false - vars_files: - - ../../group_vars/all.yml - vars: - gitea_deploy_target: "{{ edge_routing.swarm.stack_deploy_target | default(groups['swarm_managers'][0]) }}" - - tasks: - - # -------------------------------------------------- - # STEP 0: Assert required secrets are present - # -------------------------------------------------- - - - name: Assert vault_gitea_db_password is defined and non-empty - ansible.builtin.assert: - that: - - vault_gitea_db_password is defined - - vault_gitea_db_password | trim | length > 0 - fail_msg: >- - vault_gitea_db_password is not defined or is empty. - Encrypt and store it in group_vars/vault/all.yml with: - ansible-vault encrypt_string 'your-db-password' --name 'vault_gitea_db_password' - when: inventory_hostname == gitea_deploy_target - - - name: Assert vault_gitea_db_password is not a placeholder - ansible.builtin.assert: - that: - - vault_gitea_db_password not in ['change-me', 'changeme', 'your-db-password'] - fail_msg: "vault_gitea_db_password still appears to be a placeholder. Set a real vault value before deploy." - when: inventory_hostname == gitea_deploy_target - - # -------------------------------------------------- - # STEP 1: Assert Swarm manager is active - # WHY exact equality: search('active') matches 'inactive' as a substring. - # The format string yields 'active|true' only for a healthy manager. 
- # -------------------------------------------------- - - - name: Collect Swarm manager state - ansible.builtin.command: > - docker info --format '{{ "{{" }}.Swarm.LocalNodeState{{ "}}" }}|{{ "{{" }}.Swarm.ControlAvailable{{ "}}" }}' - register: _swarm_info - changed_when: false - when: inventory_hostname == gitea_deploy_target - - - name: Assert target is an active Swarm manager - ansible.builtin.assert: - that: - - _swarm_info.stdout == 'active|true' - fail_msg: >- - {{ inventory_hostname }} must be an active Swarm manager. - Expected 'active|true', got '{{ _swarm_info.stdout | default('unknown') }}'. - when: inventory_hostname == gitea_deploy_target - - # -------------------------------------------------- - # STEP 2: Validate pre-existing persistent data paths - # WHY: Missing paths cause Gitea to bootstrap a fresh install over existing - # data. The operator must create or restore paths before deploying. - # -------------------------------------------------- - - - name: Stat required Gitea bind-mount paths - ansible.builtin.stat: - path: "{{ item }}" - register: _gitea_path_stat - loop: - - /mnt/homelab/apps/gitea - - /mnt/homelab/apps/gitea/data - - /mnt/homelab/apps/gitea/data/db - when: inventory_hostname == gitea_deploy_target - - - name: Assert required Gitea paths exist before deploy - ansible.builtin.assert: - that: - - item.stat.exists - - item.stat.isdir - fail_msg: >- - Required Gitea path '{{ item.item }}' is missing on {{ inventory_hostname }}. - Create or restore this directory first to protect existing data. 
- loop: "{{ _gitea_path_stat.results }}" - when: inventory_hostname == gitea_deploy_target - - # -------------------------------------------------- - # STEP 3: Deploy Gitea stack - # -------------------------------------------------- - - - name: Deploy Gitea stack - ansible.builtin.include_role: - name: swarm_stack_deploy - vars: - stack_name: "gitea" - stack_compose_src: "{{ playbook_dir }}/../../templates/stacks/gitea.stack.yml" - # WHY gitea_deploy_state (not stack_state): using stack_state directly - # creates a Jinja2 self-reference loop inside the role. - stack_state: "{{ gitea_deploy_state | default('present') }}" - stack_required_external_networks: - - proxy-net - stack_required_directories: - - /mnt/homelab/apps/gitea - - /mnt/homelab/apps/gitea/data - - /mnt/homelab/apps/gitea/data/db - when: inventory_hostname == gitea_deploy_target - - # -------------------------------------------------- - # STEP 4: Wait for service convergence - # -------------------------------------------------- - - - name: Wait for Gitea server service to converge - ansible.builtin.command: > - docker service ls --filter name=gitea_server --format '{{ "{{" }}.Replicas{{ "}}" }}' - register: _gitea_replicas - retries: 18 - delay: 10 - until: _gitea_replicas.stdout is search('1/1') - changed_when: false - when: - - inventory_hostname == gitea_deploy_target - - gitea_deploy_state | default('present') == 'present' - - not ansible_check_mode - tags: [verify] - - - name: Report Gitea deployment result - ansible.builtin.debug: - msg: - - "================================================" - - "Gitea deployment complete." 
- - "================================================" - - "Stack : gitea" - - "Manager : {{ inventory_hostname }} ({{ ansible_host | default('') }})" - - "URL : https://git.castaldifamily.com" - - "Data root : /mnt/homelab/apps/gitea" - - "Services : gitea_server, gitea_gitea-db" - - "================================================" - when: inventory_hostname == gitea_deploy_target - tags: [always] diff --git a/ansible/archive/playbooks/docker/deploy_plex.yml b/ansible/archive/playbooks/docker/deploy_plex.yml deleted file mode 100644 index 6d6f24d..0000000 --- a/ansible/archive/playbooks/docker/deploy_plex.yml +++ /dev/null @@ -1,235 +0,0 @@ ---- -# playbooks/docker/deploy_plex.yml -# -# Purpose: -# Deploy Plex Media Server as a Swarm stack, pinned to swarm-manager-1 which -# hosts the media volumes and hardware transcoding devices. -# -# Architecture: -# Plex listens on port 32400. Traefik on Heimdall routes inbound HTTPS for -# plex.castaldifamily.com via traefik-kop, which reads deploy.labels from -# the Swarm service and publishes routes into Redis. -# Media is served from bind-mounted host paths; config persists under -# /mnt/homelab/apps/plex. -# -# Pre-requisites: -# - Swarm must be active; swarm-manager-1 (10.0.0.211) must be reachable. -# - proxy-net overlay must exist (deploy_traefik_kop.yml must have run). -# - traefik-kop must be running on Swarm. 
-# - vault_plex_claim must be present in group_vars/vault/all.yml: -# ansible-vault encrypt_string 'claim-XXXX' --name 'vault_plex_claim' -# - Media paths on swarm-manager-1 must be mounted: -# /mnt/media/tvshows -# /mnt/media/movies -# - community.docker collection installed: -# ansible-galaxy collection install -r requirements.yml -# -# Usage: -# Normal deploy: -# ansible-playbook -i inventory/hosts.ini playbooks/docker/deploy_plex.yml -# -# Validate only (preflight and syntax checks; no changes applied to Swarm): -# ansible-playbook -i inventory/hosts.ini playbooks/docker/deploy_plex.yml \ -# -e "stack_validate_only=true" -# -# Tear down: -# ansible-playbook -i inventory/hosts.ini playbooks/docker/deploy_plex.yml \ -# -e "plex_deploy_state=absent" -# -# Verification after deploy: -# docker stack services plex -# docker service ps plex_plex -# docker exec redis redis-cli keys 'traefik/*plex*' -# curl -sf https://plex.castaldifamily.com/web/index.html | head -5 - -- name: Deploy Plex Media Server Swarm stack - hosts: swarm_managers - become: true - gather_facts: false - vars_files: - - ../../group_vars/all.yml - - tasks: - # -------------------------------------------------- - # STEP 0: Assert required secrets are present - # WHY: If vault_plex_claim is missing or still holds the placeholder value, - # the stack template renders with an empty PLEX_CLAIM and Plex starts - # unclaimed, which is a silent failure. Catching it here produces a clear, - # actionable error before any Swarm state is touched. - # -------------------------------------------------- - - - name: Assert vault_plex_claim is defined and non-empty - ansible.builtin.assert: - that: - - vault_plex_claim is defined - - vault_plex_claim | length > 0 - fail_msg: >- - vault_plex_claim is not defined or is empty. - Encrypt your Plex claim token with: - ansible-vault encrypt_string 'claim-XXXX' --name 'vault_plex_claim' - then add the result to group_vars/vault/all.yml.
- when: inventory_hostname == groups['swarm_managers'][0] - - - name: Assert vault_plex_claim is not the placeholder literal - ansible.builtin.assert: - that: - - vault_plex_claim != 'claim-XXXX' - fail_msg: >- - vault_plex_claim contains the placeholder value 'claim-XXXX'. - Replace it with a real token from https://www.plex.tv/claim/ - when: inventory_hostname == groups['swarm_managers'][0] - - # -------------------------------------------------- - # STEP 1: Assert Swarm is active and reachable - # WHY: Fail fast before touching the stack; the role also validates this - # but an early assert here produces a cleaner error message. - # -------------------------------------------------- - - - name: Collect Swarm manager state - ansible.builtin.command: > - docker info --format '{{ "{{" }}.Swarm.LocalNodeState{{ "}}" }}|{{ "{{" }}.Swarm.ControlAvailable{{ "}}" }}' - register: _swarm_info - changed_when: false - when: inventory_hostname == groups['swarm_managers'][0] - - - name: Assert target is an active Swarm manager - ansible.builtin.assert: - that: - # WHY exact equality: search('active') matches 'inactive' as a substring. - # The format string yields 'active|true' only for a healthy manager. - - _swarm_info.stdout == 'active|true' - fail_msg: >- - {{ inventory_hostname }} must be an active Swarm manager. - Expected 'active|true', got '{{ _swarm_info.stdout | default('unknown') }}'. - when: inventory_hostname == groups['swarm_managers'][0] - - # -------------------------------------------------- - # STEP 1b: Validate Docker Engine version and hardware device availability - # WHY: Device passthrough requires Docker >= 20.10. Missing devices fall - # back to CPU transcoding silently, so warn here for operator visibility. - # These checks are NON-BLOCKING: deploy proceeds regardless of result.
- # -------------------------------------------------- - - - name: Get Docker Engine version on placement node - ansible.builtin.command: docker info --format '{{ "{{" }}.ServerVersion{{ "}}" }}' - register: _docker_ver - changed_when: false - when: inventory_hostname == groups['swarm_managers'][0] - - - name: Warn if Docker Engine is below 20.10 (device passthrough may fail) - ansible.builtin.debug: - msg: >- - WARNING: Docker Engine {{ _docker_ver.stdout }} may not support Swarm - device passthrough. Required: >= 20.10. Hardware transcoding may be - unavailable; CPU transcoding will be used as fallback. - when: - - inventory_hostname == groups['swarm_managers'][0] - - _docker_ver.stdout is version('20.10', '<') - - - name: Stat GPU device nodes on placement node - ansible.builtin.stat: - path: "{{ item }}" - register: _device_stat - loop: - - /dev/renderD128 - - /dev/dri - when: inventory_hostname == groups['swarm_managers'][0] - - - name: Warn on missing GPU device nodes (CPU fallback will be used) - ansible.builtin.debug: - msg: >- - WARNING: Device {{ item.item }} not present on {{ inventory_hostname }}. - Plex will fall back to CPU transcoding. - loop: "{{ _device_stat.results }}" - when: - - inventory_hostname == groups['swarm_managers'][0] - - not item.stat.exists - - # -------------------------------------------------- - # STEP 2: Verify media bind-mount paths exist on placement node - # WHY: A missing media path causes Plex to start but serve no content. - # Catch this before deploy to prevent a misleading "success" state. 
- # -------------------------------------------------- - - - name: Stat required media paths on placement node - ansible.builtin.stat: - path: "{{ item }}" - register: _media_path_stat - loop: - - /mnt/media/tvshows - - /mnt/media/movies - when: inventory_hostname == groups['swarm_managers'][0] - - - name: Assert media paths are present - ansible.builtin.assert: - that: - - item.stat.exists - fail_msg: >- - Required media path '{{ item.item }}' does not exist on - {{ inventory_hostname }}. Mount or create the path before deploying Plex. - loop: "{{ _media_path_stat.results }}" - when: inventory_hostname == groups['swarm_managers'][0] - - # -------------------------------------------------- - # STEP 3: Deploy Plex stack - # WHY swarm_stack_deploy role: handles template render, compose config - # validation, external-network pre-check, directory creation, and - # idempotent docker stack deploy with prune and registry auth. - # -------------------------------------------------- - - - name: Deploy Plex stack - ansible.builtin.include_role: - name: swarm_stack_deploy - vars: - stack_name: "plex" - stack_compose_src: "{{ playbook_dir }}/../../templates/stacks/plex.stack.yml" - # WHY plex_deploy_state (not stack_state): using stack_state here would - # create a Jinja2 self-reference loop: the role stores stack_state as - # a template string, then any evaluation of stack_state recurses into - # itself. plex_deploy_state is never internally defined, so - # | default('present') always resolves cleanly. - stack_state: "{{ plex_deploy_state | default('present') }}" - stack_required_external_networks: - - proxy-net - stack_required_directories: - - /mnt/homelab/apps/plex/data - when: inventory_hostname == groups['swarm_managers'][0] - - # -------------------------------------------------- - # STEP 4: Wait for service to reach desired replica count - # WHY: Confirms the scheduler placed and started the task successfully, - # rather than leaving the caller to check manually.
- # -------------------------------------------------- - - - name: Wait for Plex service to converge - ansible.builtin.command: > - docker service ls --filter name=plex_plex --format '{{ "{{" }}.Replicas{{ "}}" }}' - register: _plex_replicas - retries: 12 - delay: 10 - until: _plex_replicas.stdout is search('1/1') - changed_when: false - when: - - inventory_hostname == groups['swarm_managers'][0] - - plex_deploy_state | default('present') == 'present' - - not ansible_check_mode - tags: [verify] - - - name: Report deployment result - ansible.builtin.debug: - msg: - - "================================================" - - "Plex deployment complete." - - "================================================" - - "Stack : plex" - - "Manager : {{ inventory_hostname }} ({{ ansible_host | default('') }})" - - "Port : 32400" - - "URL : https://plex.castaldifamily.com" - - "Config : /mnt/homelab/apps/plex/data" - - "Media : /mnt/media/tvshows, /mnt/media/movies" - - "------------------------------------------------" - - "Verify route keys in Traefik Redis:" - - " docker exec redis redis-cli keys 'traefik/*plex*'" - - "================================================" - when: inventory_hostname == groups['swarm_managers'][0] - tags: [always] diff --git a/ansible/archive/playbooks/docker/deploy_plex_standalone.yml b/ansible/archive/playbooks/docker/deploy_plex_standalone.yml deleted file mode 100644 index ddd603c..0000000 --- a/ansible/archive/playbooks/docker/deploy_plex_standalone.yml +++ /dev/null @@ -1,448 +0,0 @@ ---- -# playbooks/docker/deploy_plex_standalone.yml -# -# Purpose: -# Deploy the full Plex media stack on a standalone Docker host (statler). -# Includes: Plex, Radarr, Sonarr, SABnzbd, Overseerr, Wizarr, and their -# Authentik proxy outposts. -# -# Architecture: -# All containers share the proxy-net bridge network. Traefik-kop on statler -# reads container labels and publishes routes to Heimdall's Redis, where -# the external Traefik picks them up. 
-# Plex config is served from the TNAS share at /mnt/homelab/apps/plex/data. -# Media (TV/Movies/Downloads) is served from /mnt/media (TNAS Volume2). -# Service configs (Radarr, Sonarr, etc.) are served from /mnt/homelab/apps. -# -# Pre-requisites: -# - NFS shares mounted on target host (mount_nfs_shares.yml must have run): -# /mnt/homelab (TNAS Volume1/appdata) -# /mnt/media (TNAS Volume2/media) -# - traefik-kop-agent must be running on the target host. -# - vault_plex_claim and vault_authentik_token_* must be present and decrypted. -# -# Usage: -# ansible-playbook -i inventory/hosts.ini playbooks/docker/deploy_plex_standalone.yml \ -# -e "target_host=statler" -# -# Tear down a single service (example): -# ansible-playbook ... -e "target_host=statler plex_deploy_state=absent" -# -# Verification after deploy: -# docker ps on statler -# curl http://10.0.0.210:32400/identity -# redis-cli -h 10.0.0.151 keys 'traefik/*sonarr*' - -- name: Deploy Plex media stack on standalone Docker host - hosts: "{{ target_host | default('statler') }}" - become: true - gather_facts: false - vars_files: - - ../../group_vars/all.yml - - vars: - plex_network: "proxy-net" - plex_config_dir: "/mnt/homelab/apps/plex/data" - plex_tv_dir: "/mnt/media/tvshows" - plex_movies_dir: "/mnt/media/movies" - media_base: "/mnt/media" - sabnzbd_config_dir: "/mnt/homelab/apps/sabnzbd/data" - sonarr_config_dir: "/mnt/homelab/apps/sonarr/data" - radarr_config_dir: "/mnt/homelab/apps/radarr/data" - overseerr_config_dir: "/mnt/homelab/apps/overseerr/data" - wizarr_config_dir: "/mnt/homelab/apps/wizarr/data/database" - - tasks: - # -------------------------------------------------- - # STEP 0: Safety assertions - # -------------------------------------------------- - - - name: Assert target_host is explicit and safe - ansible.builtin.assert: - that: - - target_host is defined - - target_host | length > 0 - - target_host not in ['all', '*', 'ubuntu_lab', 'docker_hosts', 'swarm_hosts'] - fail_msg: >- - Invalid 
target_host scope. Use an explicit host, e.g.: - -e "target_host=statler" - run_once: true - delegate_to: localhost - - - name: Assert required secrets are available and decrypted - ansible.builtin.assert: - that: - - vault_plex_claim is defined - - vault_plex_claim | trim | length > 0 - - vault_plex_claim is not search('^\$ANSIBLE_VAULT;') - - vault_authentik_token_sonarr is defined - - vault_authentik_token_sonarr | trim | length > 0 - - vault_authentik_token_sonarr is not search('^\$ANSIBLE_VAULT;') - - vault_authentik_token_radarr is defined - - vault_authentik_token_radarr | trim | length > 0 - - vault_authentik_token_radarr is not search('^\$ANSIBLE_VAULT;') - - vault_authentik_token_sabnzbd is defined - - vault_authentik_token_sabnzbd | trim | length > 0 - - vault_authentik_token_sabnzbd is not search('^\$ANSIBLE_VAULT;') - fail_msg: >- - One or more required secrets are unavailable or not decrypted. - Required: vault_plex_claim, vault_authentik_token_sonarr, - vault_authentik_token_radarr, vault_authentik_token_sabnzbd. - - - name: Assert TNAS Plex config directory is mounted and accessible - ansible.builtin.stat: - path: "{{ plex_config_dir }}" - register: _plex_config_stat - - - name: Fail if TNAS Plex config path does not exist - ansible.builtin.assert: - that: - - _plex_config_stat.stat.exists - - _plex_config_stat.stat.isdir - fail_msg: >- - {{ plex_config_dir }} does not exist or is not a directory. - Ensure the TNAS NFS share is mounted: run mount_nfs_shares.yml first. - - - name: Assert media NFS shares are mounted - ansible.builtin.stat: - path: "{{ item }}" - register: _media_stat - loop: - - "{{ plex_tv_dir }}" - - "{{ plex_movies_dir }}" - - - name: Fail if media paths are not mounted - ansible.builtin.assert: - that: - - item.stat.exists - - item.stat.isdir - fail_msg: >- - Media path {{ item.item }} is not accessible on {{ inventory_hostname }}. - Ensure /mnt/media NFS share is mounted: run mount_nfs_shares.yml first. 
- loop: "{{ _media_stat.results }}" - - # -------------------------------------------------- - # STEP 1: Ensure proxy-net bridge network exists - # -------------------------------------------------- - - - name: Ensure proxy-net bridge network exists - community.docker.docker_network: - name: "{{ plex_network }}" - driver: bridge - state: present - - # -------------------------------------------------- - # STEP 2: Ensure service config directories exist on appdata mount - # WHY these dirs are on /mnt/homelab: shared appdata policy for statler - # services while keeping explicit paths in deployment automation. - # -------------------------------------------------- - - - name: Ensure local service config directories exist - ansible.builtin.file: - path: "{{ item }}" - state: directory - owner: "1000" - group: "1000" - mode: '0755' - loop: - - "{{ sabnzbd_config_dir }}" - - "{{ sonarr_config_dir }}" - - "{{ radarr_config_dir }}" - - "{{ overseerr_config_dir }}" - - "{{ wizarr_config_dir }}" - - # -------------------------------------------------- - # STEP 3: Plex - # -------------------------------------------------- - - - name: Deploy Plex Media Server - community.docker.docker_container: - name: plex - image: lscr.io/linuxserver/plex:latest - pull: always - restart_policy: unless-stopped - state: "{{ plex_deploy_state | default('started') }}" - published_ports: - - "32400:32400" - env: - PUID: "1000" - PGID: "1000" - TZ: America/New_York - PLEX_CLAIM: "{{ vault_plex_claim }}" - VERSION: docker - volumes: - - "{{ plex_config_dir }}:/config" - - "{{ plex_tv_dir }}:/tv" - - "{{ plex_movies_dir }}:/movies" - networks: - - name: "{{ plex_network }}" - memory: 4g - cpus: 2.0 - - # -------------------------------------------------- - # STEP 4: SABnzbd + outpost - # -------------------------------------------------- - - - name: Deploy SABnzbd - community.docker.docker_container: - name: sabnzbd - image: lscr.io/linuxserver/sabnzbd:4.5.5-ls239 - pull: always - restart_policy: 
unless-stopped - state: "{{ plex_deploy_state | default('started') }}" - published_ports: - - "8155:8080" - env: - PUID: "1000" - PGID: "1000" - TZ: America/New_York - volumes: - - "{{ sabnzbd_config_dir }}:/config" - - "{{ media_base }}/incoming/downloads-sab/complete:/downloads" - - "{{ media_base }}/incoming/downloads-sab/incomplete:/incomplete-downloads" - - "{{ media_base }}/incoming/downloads-sab/history:/history" - networks: - - name: "{{ plex_network }}" - labels: - homepage.name: SABnzbd - homepage.icon: si:sabnzbd - homepage.url: https://sab.castaldifamily.com - homepage.description: Usenet downloader - memory: 1g - cpus: 0.5 - - - name: Deploy Authentik outpost for SABnzbd - community.docker.docker_container: - name: authentik-outpost-sabnzbd - image: ghcr.io/goauthentik/proxy:2025.10.3 - pull: always - restart_policy: unless-stopped - state: "{{ plex_deploy_state | default('started') }}" - published_ports: - - "9004:9000" - - "9447:9443" - env: - AUTHENTIK_HOST: https://sso.castaldifamily.com - AUTHENTIK_INSECURE: "false" - AUTHENTIK_TOKEN: "{{ vault_authentik_token_sabnzbd }}" - AUTHENTIK_HOST_BROWSER: https://sso.castaldifamily.com - networks: - - name: "{{ plex_network }}" - labels: - traefik.enable: "true" - traefik.http.routers.sabnzbd.entrypoints: websecure - traefik.http.routers.sabnzbd.rule: "Host(`sab.castaldifamily.com`)" - traefik.http.routers.sabnzbd.tls: "true" - traefik.http.routers.sabnzbd.tls.certresolver: cloudflare - traefik.http.services.sabnzbd.loadbalancer.server.port: "9004" - memory: 256m - cpus: 0.25 - - # -------------------------------------------------- - # STEP 5: Sonarr + outpost - # -------------------------------------------------- - - - name: Deploy Sonarr - community.docker.docker_container: - name: sonarr - image: lscr.io/linuxserver/sonarr:4.0.16.2944-ls300 - pull: always - restart_policy: unless-stopped - state: "{{ plex_deploy_state | default('started') }}" - published_ports: - - "8989:8989" - env: - PUID: "1000" - 
PGID: "1000" - TZ: America/New_York - volumes: - - "{{ sonarr_config_dir }}:/config" - - "{{ plex_tv_dir }}:/tv" - - "{{ media_base }}/incoming/downloads-sab/complete/sonarr:/downloads/sonarr" - networks: - - name: "{{ plex_network }}" - labels: - homepage.name: Sonarr - homepage.icon: si:sonarr - homepage.url: https://sonarr.castaldifamily.com - homepage.description: TV Shows - memory: 1g - cpus: 0.5 - - - name: Deploy Authentik outpost for Sonarr - community.docker.docker_container: - name: authentik-outpost-sonarr - image: ghcr.io/goauthentik/proxy:2025.10.3 - pull: always - restart_policy: unless-stopped - state: "{{ plex_deploy_state | default('started') }}" - published_ports: - - "9001:9000" - - "9444:9443" - env: - AUTHENTIK_HOST: https://sso.castaldifamily.com - AUTHENTIK_INSECURE: "false" - AUTHENTIK_TOKEN: "{{ vault_authentik_token_sonarr }}" - AUTHENTIK_HOST_BROWSER: https://sso.castaldifamily.com - networks: - - name: "{{ plex_network }}" - labels: - traefik.enable: "true" - traefik.http.routers.sonarr.entrypoints: websecure - traefik.http.routers.sonarr.rule: "Host(`sonarr.castaldifamily.com`)" - traefik.http.routers.sonarr.tls: "true" - traefik.http.routers.sonarr.tls.certresolver: cloudflare - traefik.http.services.sonarr.loadbalancer.server.port: "9001" - memory: 256m - cpus: 0.25 - - # -------------------------------------------------- - # STEP 6: Radarr + outpost - # -------------------------------------------------- - - - name: Deploy Radarr - community.docker.docker_container: - name: radarr - image: lscr.io/linuxserver/radarr:6.0.4.10291-ls289 - pull: always - restart_policy: unless-stopped - state: "{{ plex_deploy_state | default('started') }}" - published_ports: - - "7878:7878" - env: - PUID: "1000" - PGID: "1000" - TZ: America/New_York - volumes: - - "{{ radarr_config_dir }}:/config" - - "{{ plex_movies_dir }}:/movies" - - "{{ media_base }}/incoming/downloads-sab/complete/radarr:/downloads/radarr" - networks: - - name: "{{ plex_network }}" - 
labels: - homepage.name: Radarr - homepage.icon: si:radarr - homepage.url: https://radarr.castaldifamily.com - homepage.description: Movies & shows - memory: 1g - cpus: 0.5 - - - name: Deploy Authentik outpost for Radarr - community.docker.docker_container: - name: authentik-outpost-radarr - image: ghcr.io/goauthentik/proxy:2025.10.3 - pull: always - restart_policy: unless-stopped - state: "{{ plex_deploy_state | default('started') }}" - published_ports: - - "9002:9000" - - "9445:9443" - env: - AUTHENTIK_HOST: https://sso.castaldifamily.com - AUTHENTIK_INSECURE: "false" - AUTHENTIK_TOKEN: "{{ vault_authentik_token_radarr }}" - AUTHENTIK_HOST_BROWSER: https://sso.castaldifamily.com - AUTHENTIK_INSECURE_SKIP_VERIFY: "false" - TRUST_PROXY_HEADERS: "true" - networks: - - name: "{{ plex_network }}" - labels: - traefik.enable: "true" - traefik.http.routers.radarr.entrypoints: websecure - traefik.http.routers.radarr.rule: "Host(`radarr.castaldifamily.com`)" - traefik.http.routers.radarr.tls: "true" - traefik.http.routers.radarr.tls.certresolver: cloudflare - traefik.http.services.radarr.loadbalancer.server.port: "9002" - memory: 256m - cpus: 0.25 - - # -------------------------------------------------- - # STEP 7: Overseerr - # -------------------------------------------------- - - - name: Deploy Overseerr - community.docker.docker_container: - name: overseerr - image: lscr.io/linuxserver/overseerr:1.34.0 - pull: always - restart_policy: unless-stopped - state: "{{ plex_deploy_state | default('started') }}" - published_ports: - - "8150:5055" - env: - PUID: "1000" - PGID: "1000" - TZ: America/New_York - volumes: - - "{{ overseerr_config_dir }}:/config" - networks: - - name: "{{ plex_network }}" - labels: - traefik.enable: "true" - traefik.http.routers.overseerr.entrypoints: websecure - traefik.http.routers.overseerr.rule: "Host(`overseerr.castaldifamily.com`)" - traefik.http.routers.overseerr.tls: "true" - traefik.http.routers.overseerr.tls.certresolver: cloudflare - 
traefik.http.routers.overseerr.service: overseerr - traefik.http.services.overseerr.loadbalancer.server.port: "8150" - homepage.name: Overseerr - homepage.icon: si:overseerr - homepage.url: https://overseerr.castaldifamily.com - homepage.description: Media request management - memory: 512m - cpus: 0.2 - - # -------------------------------------------------- - # STEP 8: Wizarr - # NOTE: homelab_status=broken in source-compose. Deploying as-is; SSO - # integration requires a dedicated Authentik outpost token (not yet - # configured). DISABLE_BUILTIN_AUTH=True means the web UI will be - # unprotected until the outpost is wired up. - # -------------------------------------------------- - - - name: Deploy Wizarr - community.docker.docker_container: - name: wizarr - image: ghcr.io/wizarrrr/wizarr:v2025.12.0 - pull: always - restart_policy: unless-stopped - state: "{{ plex_deploy_state | default('started') }}" - published_ports: - - "8157:5690" - env: - PUID: "1000" - PGID: "1000" - TZ: America/New_York - DISABLE_BUILTIN_AUTH: "True" - volumes: - - "{{ wizarr_config_dir }}:/data/database" - networks: - - name: "{{ plex_network }}" - labels: - traefik.enable: "true" - traefik.http.routers.wizarr.entrypoints: websecure - traefik.http.routers.wizarr.rule: "Host(`wizarr.castaldifamily.com`)" - traefik.http.routers.wizarr.tls: "true" - traefik.http.routers.wizarr.tls.certresolver: cloudflare - traefik.http.routers.wizarr.service: wizarr - traefik.http.services.wizarr.loadbalancer.server.port: "8157" - homepage.name: Wizarr - homepage.icon: si:wizarr - homepage.url: https://wizarr.castaldifamily.com - homepage.description: Media management - memory: 512m - cpus: 0.2 - - # -------------------------------------------------- - # STEP 9: Summary - # -------------------------------------------------- - - - name: Show deployment summary - ansible.builtin.debug: - msg: - - "Plex media stack deployed to {{ inventory_hostname }}" - - "Plex config : {{ plex_config_dir }} (TNAS)" - - 
"Media : {{ media_base }} (TNAS)" - - "Network : {{ plex_network }}" - - "Services : plex, sabnzbd, sonarr, radarr, overseerr, wizarr" - - "Outposts : sabnzbd (9004), sonarr (9001), radarr (9002)" diff --git a/ansible/archive/playbooks/docker/deploy_swarm_stack.yml b/ansible/archive/playbooks/docker/deploy_swarm_stack.yml deleted file mode 100644 index 750c363..0000000 --- a/ansible/archive/playbooks/docker/deploy_swarm_stack.yml +++ /dev/null @@ -1,20 +0,0 @@ ---- -# Generic playbook to deploy one Swarm stack from a repo-tracked compose file. -# Usage example: -# ansible-playbook -i inventory/hosts.ini playbooks/docker/deploy_swarm_stack.yml \ -# -e "stack_name=gitea" \ -# -e "stack_compose_src=/home/chester/homelab/ansible/templates/stacks/gitea.stack.yml" \ -# -e "stack_required_directories=['/mnt/appdata/gitea']" - -- name: Deploy one stack from source-controlled compose - hosts: swarm_managers - become: false - gather_facts: false - vars_files: - - ../../group_vars/all.yml - - tasks: - - name: Deploy from primary manager only - ansible.builtin.include_role: - name: swarm_stack_deploy - when: inventory_hostname == groups['swarm_managers'][0] diff --git a/ansible/archive/playbooks/docker/deploy_traefik_kop.yml b/ansible/archive/playbooks/docker/deploy_traefik_kop.yml deleted file mode 100644 index a279054..0000000 --- a/ansible/archive/playbooks/docker/deploy_traefik_kop.yml +++ /dev/null @@ -1,160 +0,0 @@ ---- -# playbooks/docker/deploy_traefik_kop.yml -# -# Purpose: -# Deploy the traefik-kop Swarm service, which bridges Swarm service labels -# to Traefik routing via Redis. Once deployed, any Swarm service labelled -# with traefik.enable=true will have its routes published automatically. -# -# Architecture: -# Swarm services → traefik-kop → Redis (10.0.0.151:6379) → Traefik (heimdall) -# traefik-kop reads Docker service state on the Swarm manager and writes -# routing rules to Redis. Traefik's redis provider picks them up in real time.
-# -# Pre-requisites: -# - Swarm must be active and swarm-manager-1 (10.0.0.211) must be reachable -# - Redis on Heimdall (10.0.0.151:6379) must be running -# - community.docker collection installed: ansible-galaxy collection install community.docker -# -# Usage: -# ansible-playbook -i inventory/hosts.ini playbooks/docker/deploy_traefik_kop.yml -# -# Dry-run (no changes to Swarm): -# ansible-playbook -i inventory/hosts.ini playbooks/docker/deploy_traefik_kop.yml --check -# -# Tear down: -# ansible-playbook -i inventory/hosts.ini playbooks/docker/deploy_traefik_kop.yml \ -# -e "stack_state=absent" -# -# Labelling Swarm services for auto-discovery: -# After this deploys, Swarm services only need these labels (under deploy.labels): -# -# deploy: -# labels: -# - "traefik.enable=true" -# - "traefik.http.routers..rule=Host(`.castaldifamily.com`)" -# - "traefik.http.routers..entrypoints=websecure" -# - "traefik.http.routers..tls.certresolver=cloudflare" -# - "traefik.http.services..loadbalancer.server.port=" -# -# NOTE: Use deploy.labels (not top-level labels) for Swarm services. -# Top-level labels apply to the container image; deploy.labels apply -# to the Swarm service — which is what traefik-kop reads.
- -- name: Deploy traefik-kop Swarm stack - hosts: swarm_managers - become: false - gather_facts: false - vars: - traefik_kop_stack_state: "{{ stack_state | default('present') }}" - vars_files: - - ../../group_vars/all.yml - - tasks: - # -------------------------------------------------- - # STEP 1: Assert Swarm is active and reachable - # -------------------------------------------------- - - - name: Verify target is an active Swarm manager - ansible.builtin.command: > - docker info --format '{{ "{{" }}.Swarm.LocalNodeState{{ "}}" }}|{{ "{{" }}.Swarm.ControlAvailable{{ "}}" }}' - register: _swarm_info - changed_when: false - when: inventory_hostname == groups['swarm_managers'][0] - - - name: Assert Swarm manager pre-conditions - ansible.builtin.assert: - that: - - _swarm_info.stdout is search('active') - - _swarm_info.stdout is search('true') - fail_msg: >- - {{ inventory_hostname }} must be an active Swarm manager. - Current state: {{ _swarm_info.stdout | default('unknown') }} - when: inventory_hostname == groups['swarm_managers'][0] - - # -------------------------------------------------- - # STEP 2: Ensure proxy-net overlay network exists - # WHY: The traefik-kop stack declares proxy-net as an external overlay. - # Future Swarm services join this network to be discoverable by kop. - # This network is separate from the bridge of the same name on Heimdall. - # WHY attachable: allows standalone containers to join for debugging. 
- # -------------------------------------------------- - - - name: Ensure proxy-net overlay network exists on Swarm - community.docker.docker_network: - name: "{{ edge_routing.swarm.proxy_network }}" - driver: overlay - attachable: true - state: present - when: inventory_hostname == groups['swarm_managers'][0] - tags: [network] - - # -------------------------------------------------- - # STEP 3: Verify Redis is reachable from manager - # WHY: Fail fast before deploying — if kop can't reach Redis, the - # container will start but immediately fail to publish routes. - # -------------------------------------------------- - - - name: Verify Redis on Heimdall is reachable from Swarm manager - ansible.builtin.wait_for: - host: "{{ edge_routing.edge_host.ip }}" - port: 6379 - timeout: 10 - state: started - when: inventory_hostname == groups['swarm_managers'][0] - tags: [preflight] - - # -------------------------------------------------- - # STEP 4: Deploy traefik-kop stack - # WHY swarm_stack_deploy role: handles template render, compose validation, - # docker stack deploy idempotently, and external network pre-checks.
- # -------------------------------------------------- - - - name: Deploy traefik-kop stack - ansible.builtin.include_role: - name: swarm_stack_deploy - vars: - stack_name: "traefik-kop" - stack_compose_src: "{{ playbook_dir }}/../../templates/stacks/traefik-kop.stack.yml" - stack_state: "{{ traefik_kop_stack_state }}" - stack_required_external_networks: - - "{{ edge_routing.swarm.proxy_network }}" - stack_required_directories: [] - when: inventory_hostname == groups['swarm_managers'][0] - tags: [deploy] - - # -------------------------------------------------- - # STEP 5: Verify the service is running - # -------------------------------------------------- - - - name: Wait for traefik-kop service to converge - ansible.builtin.command: > - docker service ls --filter name=traefik-kop_traefik-kop --format '{{ "{{" }}.Replicas{{ "}}" }}' - register: _kop_replicas - retries: 6 - delay: 5 - until: _kop_replicas.stdout is search('1/1') - changed_when: false - when: - - inventory_hostname == groups['swarm_managers'][0] - - traefik_kop_stack_state == 'present' - - not ansible_check_mode - tags: [verify] - - - name: Report deployment result - ansible.builtin.debug: - msg: - - "================================================" - - "traefik-kop deployment complete." 
- - "================================================" - - "Stack : traefik-kop" - - "Manager : {{ inventory_hostname }} ({{ ansible_host | default('') }})" - - "Redis : {{ edge_routing.integration.redis_addr }}" - - "Bind IP : {{ edge_routing.swarm.bind_ip }}" - - "Network : {{ edge_routing.swarm.proxy_network }} (overlay)" - - "------------------------------------------------" - - "To verify routes in Redis, run on Heimdall:" - - " docker exec redis redis-cli keys 'traefik/*'" - - "================================================" - when: inventory_hostname == groups['swarm_managers'][0] - tags: [always] diff --git a/ansible/archive/playbooks/docker/heimdall_audit.yml b/ansible/archive/playbooks/docker/heimdall_audit.yml deleted file mode 100644 index f02cab4..0000000 --- a/ansible/archive/playbooks/docker/heimdall_audit.yml +++ /dev/null @@ -1,181 +0,0 @@ ---- -# playbooks/docker/heimdall_audit.yml -# Read-only OS and stack health audit for the Heimdall edge router. -# Safe to schedule. Makes no changes to any host. 
-# -# What this asserts: -# OS: kernel, distro, swap, swappiness, bridge netfilter, ip_forward -# Docker: log rotation configured -# Stack: traefik, redis, docker-socket-proxy containers are running -# -# Usage: -# ansible-playbook -i inventory/hosts.ini playbooks/docker/heimdall_audit.yml -# -# Output: -# outputs/heimdall_audit_.md (repo root) - -- name: "Play 1: Gather Heimdall state" - hosts: heimdall - become: true - gather_facts: true - - tasks: - - name: Read sysctl values - ansible.builtin.shell: "sysctl -n {{ item }} 2>/dev/null || echo 0" - register: sysctl_raw - loop: - - vm.swappiness - - net.bridge.bridge-nf-call-iptables - - net.bridge.bridge-nf-call-ip6tables - - net.ipv4.ip_forward - changed_when: false - check_mode: false - - - name: Read Docker daemon.json - ansible.builtin.command: cat /etc/docker/daemon.json - register: daemon_json_content - changed_when: false - failed_when: false - check_mode: false - - - name: Get running container names - ansible.builtin.command: > - docker ps --format '{{ '{{' }}.Names{{ '}}' }}' - register: running_containers - changed_when: false - failed_when: false - check_mode: false - - - name: Stash audit facts - ansible.builtin.set_fact: - heimdall_audit: - kernel: "{{ ansible_kernel }}" - distro: "{{ ansible_distribution }}" - distro_version: "{{ ansible_distribution_version }}" - swap_mb: "{{ ansible_swaptotal_mb }}" - swappiness: "{{ (sysctl_raw.results | selectattr('item', 'equalto', 'vm.swappiness') | first).stdout | trim }}" - bridge_iptables: "{{ (sysctl_raw.results | selectattr('item', 'equalto', 'net.bridge.bridge-nf-call-iptables') | first).stdout | trim }}" - bridge_ip6tables: "{{ (sysctl_raw.results | selectattr('item', 'equalto', 'net.bridge.bridge-nf-call-ip6tables') | first).stdout | trim }}" - ip_forward: "{{ (sysctl_raw.results | selectattr('item', 'equalto', 'net.ipv4.ip_forward') | first).stdout | trim }}" - log_rotation_configured: "{{ 'max-size' in (daemon_json_content.stdout | default('{}')) }}" 
- running_containers: "{{ running_containers.stdout_lines | default([]) }}" - traefik_running: "{{ running_containers.stdout_lines | default([]) | select('search', 'traefik') | list | length > 0 }}" - redis_running: "{{ running_containers.stdout_lines | default([]) | select('search', 'redis') | list | length > 0 }}" - socket_proxy_running: "{{ running_containers.stdout_lines | default([]) | select('search', 'socket-proxy|socketproxy|docker-socket') | list | length > 0 }}" - - -- name: "Play 2: Assertions and drift report" - hosts: localhost - gather_facts: false - - vars: - audit_timestamp: "{{ lookup('pipe', 'date +%Y%m%dT%H%M%S') }}" - report_path: "{{ playbook_dir }}/../../../outputs/heimdall_audit_{{ audit_timestamp }}.md" - h: "{{ hostvars['heimdall']['heimdall_audit'] }}" - - tasks: - - name: Ensure outputs directory exists - ansible.builtin.file: - path: "{{ playbook_dir }}/../../../outputs" - state: directory - mode: '0755' - - - name: Write drift report - ansible.builtin.copy: - dest: "{{ report_path }}" - mode: '0644' - content: | - # Heimdall Edge Router Audit Report - - Generated: {{ audit_timestamp }} - - ## System - - | Property | Value | - |----------|-------| - | Kernel | `{{ h.kernel }}` | - | Distro | {{ h.distro }} {{ h.distro_version }} | - | Swap | {{ h.swap_mb }}MB | - - ## Sysctl - - | Parameter | Value | Expected | - |-----------|-------|----------| - | vm.swappiness | {{ h.swappiness }} | 0 | - | net.bridge.bridge-nf-call-iptables | {{ h.bridge_iptables }} | 1 | - | net.bridge.bridge-nf-call-ip6tables | {{ h.bridge_ip6tables }} | 1 | - | net.ipv4.ip_forward | {{ h.ip_forward }} | 1 | - - ## Docker - - | Check | Status | - |-------|--------| - | Log rotation configured | {{ '✅' if h.log_rotation_configured | bool else '❌' }} | - - ## Stack Health - - | Container | Status | - |-----------|--------| - | traefik | {{ '✅ running' if h.traefik_running | bool else '❌ not running' }} | - | redis | {{ '✅ running' if h.redis_running | bool else
'❌ not running' }} | - | docker-socket-proxy | {{ '✅ running' if h.socket_proxy_running | bool else '❌ not running' }} | - - ## Running Containers - - {% for c in h.running_containers %} - - {{ c }} - {% endfor %} - - - name: Assert swap is disabled - ansible.builtin.assert: - that: h.swap_mb | int == 0 - fail_msg: "❌ Swap enabled: {{ h.swap_mb }}MB — run heimdall_baseline.yml --tags storage" - success_msg: "✅ Heimdall: swap disabled" - - - name: Assert vm.swappiness=0 - ansible.builtin.assert: - that: h.swappiness | int == 0 - fail_msg: "❌ vm.swappiness={{ h.swappiness }} — run heimdall_baseline.yml --tags sysctl" - success_msg: "✅ Heimdall: vm.swappiness=0" - - - name: Assert bridge netfilter enabled - ansible.builtin.assert: - that: - - h.bridge_iptables | int == 1 - - h.bridge_ip6tables | int == 1 - fail_msg: >- - ❌ Bridge netfilter not fully enabled: - bridge-nf-call-iptables={{ h.bridge_iptables }} - bridge-nf-call-ip6tables={{ h.bridge_ip6tables }} - Run heimdall_baseline.yml --tags sysctl.
- success_msg: "✅ Heimdall: bridge netfilter enabled" - - - name: Assert ip_forward enabled - ansible.builtin.assert: - that: h.ip_forward | int == 1 - fail_msg: "❌ net.ipv4.ip_forward={{ h.ip_forward }} — run heimdall_baseline.yml --tags sysctl" - success_msg: "✅ Heimdall: ip_forward=1" - - - name: Assert Docker log rotation configured - ansible.builtin.assert: - that: h.log_rotation_configured | bool - fail_msg: "❌ Docker log rotation not configured — run heimdall_baseline.yml --tags docker" - success_msg: "✅ Heimdall: Docker log rotation configured" - - - name: Assert Traefik container is running - ansible.builtin.assert: - that: h.traefik_running | bool - fail_msg: "❌ Traefik container is not running — check: docker ps -a | grep traefik" - success_msg: "✅ Heimdall: Traefik running" - - - name: Assert Redis container is running - ansible.builtin.assert: - that: h.redis_running | bool - fail_msg: "❌ Redis container is not running — check: docker ps -a | grep redis" - success_msg: "✅ Heimdall: Redis running" - - - name: Assert docker-socket-proxy container is running - ansible.builtin.assert: - that: h.socket_proxy_running | bool - fail_msg: "❌ docker-socket-proxy container is not running — check: docker ps -a | grep socket" - success_msg: "✅ Heimdall: docker-socket-proxy running" diff --git a/ansible/archive/playbooks/docker/heimdall_baseline.yml b/ansible/archive/playbooks/docker/heimdall_baseline.yml deleted file mode 100644 index c80dd49..0000000 --- a/ansible/archive/playbooks/docker/heimdall_baseline.yml +++ /dev/null @@ -1,156 +0,0 @@ ---- -# playbooks/docker/heimdall_baseline.yml -# Idempotent OS baseline enforcement for the Heimdall edge router host. -# -# ───────────────────────────────────────────────────────────────────────────── -# PURPOSE: Ongoing OS drift enforcement — safe to run any time, safe to schedule. -# Does NOT upgrade packages. Does NOT reboot. -# Does NOT touch the Traefik/Redis application stack.
-# For the application stack: use playbooks/self-heal/heimdall.yml -# For OS updates: use playbooks/docker/heimdall_update.yml -# For audit: use playbooks/docker/heimdall_audit.yml -# ───────────────────────────────────────────────────────────────────────────── -# -# What this enforces (all idempotent): -# 0. Packages: Required system packages present (docker-ce, nfs-common, etc.) -# 1. Storage: Swap disabled (swapoff + fstab + zram masked) -# 2. Sysctl: vm.swappiness=0, bridge netfilter, ip_forward -# 3. Docker: /etc/docker/daemon.json with log rotation -# -# Usage: -# ansible-playbook -i inventory/hosts.ini playbooks/docker/heimdall_baseline.yml -# -# # Dry-run: -# ansible-playbook -i inventory/hosts.ini playbooks/docker/heimdall_baseline.yml --check --diff -# -# # Target a specific section: -# ansible-playbook -i inventory/hosts.ini playbooks/docker/heimdall_baseline.yml --tags sysctl - -- name: Heimdall OS baseline enforcement - hosts: heimdall - become: true - - vars: - lab_user: "{{ lab_ansible_user | default('chester') }}" - - handlers: - - name: Restart Docker - ansible.builtin.service: - name: docker - state: restarted - - tasks: - - name: "0. Packages: ensure required system packages are present" - tags: [packages, baseline] - ansible.builtin.apt: - name: - - docker-ce - - docker-ce-cli - - containerd.io - - nfs-common - - curl - - htop - - ca-certificates - state: present - update_cache: true - - - name: "1. 
Storage: disable swap" - tags: [storage, baseline] - block: - - name: Disable swap immediately (covers traditional + zram) - ansible.builtin.command: swapoff -a - when: ansible_swaptotal_mb > 0 - changed_when: ansible_swaptotal_mb > 0 - - - name: Comment out swap entries in /etc/fstab - ansible.builtin.replace: - path: /etc/fstab - regexp: '^([^#].*\s+swap\s+.*)$' - replace: '# \1' - - - name: Remove zram-generator config to prevent zram swap at boot - ansible.builtin.copy: - dest: /etc/systemd/zram-generator.conf - owner: root - group: root - mode: '0644' - content: | - # Managed by Ansible — heimdall_baseline.yml - # Empty config disables zram swap on Ubuntu 24.04. - - - name: Stop and mask systemd-zram-generator service if present - ansible.builtin.systemd: - name: systemd-zram-generator - state: stopped - enabled: false - masked: true - failed_when: false - - - name: Swapoff zram devices explicitly - ansible.builtin.shell: | - for dev in $(ls /dev/zram* 2>/dev/null); do - swapoff "$dev" 2>/dev/null || true - done - changed_when: false - - - name: "2. Sysctl: Docker networking parameters" - tags: [sysctl, baseline] - block: - - name: Ensure br_netfilter module is loaded - community.general.modprobe: - name: br_netfilter - state: present - - - name: Persist br_netfilter module load at boot - ansible.builtin.copy: - dest: /etc/modules-load.d/br_netfilter.conf - content: "br_netfilter\n" - owner: root - group: root - mode: '0644' - - - name: Apply and persist sysctl parameters - ansible.posix.sysctl: - name: "{{ item.key }}" - value: "{{ item.value }}" - sysctl_file: /etc/sysctl.d/90-heimdall.conf - state: present - reload: true - loop: - - { key: vm.swappiness, value: "0" } - - { key: net.bridge.bridge-nf-call-iptables, value: "1" } - - { key: net.bridge.bridge-nf-call-ip6tables, value: "1" } - - { key: net.ipv4.ip_forward, value: "1" } - - - name: "3.
Docker: daemon configuration and log rotation" - tags: [docker, baseline] - block: - - name: Ensure /etc/docker directory exists - ansible.builtin.file: - path: /etc/docker - state: directory - owner: root - group: root - mode: '0755' - - - name: Deploy Docker daemon.json with log rotation - ansible.builtin.copy: - dest: /etc/docker/daemon.json - owner: root - group: root - mode: '0644' - content: | - { - "log-driver": "json-file", - "log-opts": { - "max-size": "10m", - "max-file": "3" - } - } - notify: Restart Docker - - - name: Ensure '{{ lab_user }}' is in the docker group - ansible.builtin.user: - name: "{{ lab_user }}" - groups: docker - append: true diff --git a/ansible/archive/playbooks/docker/heimdall_update.yml b/ansible/archive/playbooks/docker/heimdall_update.yml deleted file mode 100644 index 8acdb72..0000000 --- a/ansible/archive/playbooks/docker/heimdall_update.yml +++ /dev/null @@ -1,81 +0,0 @@ ---- -# playbooks/docker/heimdall_update.yml -# OS package update for the Heimdall edge router host. -# -# ───────────────────────────────────────────────────────────────────────────── -# ⚠️ HUMAN-TRIGGERED ONLY — do not automate or schedule. -# Heimdall is a standalone Docker host (not in Swarm) — no drain needed. -# Reboot will take Traefik/edge routing offline briefly. -# ───────────────────────────────────────────────────────────────────────────── -# -# What this does: -# 1. Runs apt dist-upgrade -# 2. Reboots if a newer kernel was installed and waits for return -# 3.
Verifies Docker is back up before completing -# -# Usage: -# ansible-playbook -i inventory/hosts.ini playbooks/docker/heimdall_update.yml -# -# # Dry-run: -# ansible-playbook -i inventory/hosts.ini playbooks/docker/heimdall_update.yml --check -# -# # Update packages but skip reboot even if kernel changed: -# ansible-playbook -i inventory/hosts.ini playbooks/docker/heimdall_update.yml --skip-tags reboot - -- name: Heimdall OS update - hosts: heimdall - become: true - - tasks: - - name: Update apt cache - ansible.builtin.apt: - update_cache: true - cache_valid_time: 0 - - - name: Run apt dist-upgrade - ansible.builtin.apt: - upgrade: dist - update_cache: false - register: dist_upgrade_result - tags: [update] - - - name: Check if a newer kernel is installed but not yet booted - ansible.builtin.shell: | - LATEST=$(ls /boot/vmlinuz-* | sort -V | tail -1 | sed 's|/boot/vmlinuz-||') - RUNNING=$(uname -r) - if [ "$LATEST" != "$RUNNING" ]; then echo "reboot_needed"; fi - register: reboot_check - changed_when: false - check_mode: false - tags: [reboot] - - - name: Reboot if a newer kernel is installed - ansible.builtin.reboot: - msg: "Rebooting into updated kernel — initiated by heimdall_update.yml" - reboot_timeout: 300 - when: reboot_check.stdout | trim == 'reboot_needed' - tags: [reboot] - - - name: Wait for Heimdall to return post-reboot - ansible.builtin.wait_for_connection: - delay: 10 - timeout: 300 - when: reboot_check.stdout | trim == 'reboot_needed' - tags: [reboot] - - - name: Wait for Docker daemon to be ready after reboot - ansible.builtin.command: docker info - register: docker_ready - until: docker_ready.rc == 0 - retries: 18 - delay: 10 - changed_when: false - check_mode: false - when: reboot_check.stdout | trim == 'reboot_needed' - tags: [reboot] - - - name: Report result - ansible.builtin.debug: - msg: >- - ✅ Heimdall updated. - {{ 'Rebooted into new kernel.'
if reboot_check.stdout | trim == 'reboot_needed' else 'No kernel change — reboot not required.' }} diff --git a/ansible/archive/playbooks/docker/install_portainer.yml b/ansible/archive/playbooks/docker/install_portainer.yml deleted file mode 100644 index 7b705ac..0000000 --- a/ansible/archive/playbooks/docker/install_portainer.yml +++ /dev/null @@ -1,113 +0,0 @@ ---- -- name: Install Portainer server - hosts: watchtower - become: true - gather_facts: true - vars: - portainer_version: "latest" - portainer_data_dir: "/opt/portainer/data" - portainer_http_port: 9000 - portainer_https_port: 9443 - - tasks: - - name: Ensure Portainer data directory exists - ansible.builtin.file: - path: "{{ portainer_data_dir }}" - state: directory - mode: '0755' - - - name: Deploy Portainer server container - community.docker.docker_container: - name: portainer - image: "portainer/portainer-ce:{{ portainer_version }}" - state: started - restart_policy: always - recreate: false - pull: true - ports: - - "{{ portainer_http_port }}:9000" - - "{{ portainer_https_port }}:9443" - volumes: - - "/var/run/docker.sock:/var/run/docker.sock" - - "{{ portainer_data_dir }}:/data" - - - name: Wait for Portainer server to become reachable - ansible.builtin.wait_for: - port: "{{ portainer_http_port }}" - delay: 5 - timeout: 60 - state: started - - - name: Show Portainer server endpoints - ansible.builtin.debug: - msg: - - "Portainer server is running on {{ inventory_hostname }}" - - "HTTP: http://{{ ansible_default_ipv4.address }}:{{ portainer_http_port }}" - - "HTTPS: https://{{ ansible_default_ipv4.address }}:{{ portainer_https_port }}" - -- name: Deploy Portainer agent service - hosts: swarm_managers[0] - become: true - gather_facts: false - vars: - portainer_agent_version: "2.33.6" - portainer_agent_port: 9001 - portainer_agent_network: "portainer_agent_network" - - tasks: - - name: Ensure Portainer overlay network exists - community.docker.docker_network: - name: "{{ portainer_agent_network }}"
- driver: overlay - attachable: true - state: present - - - name: Deploy Portainer agent as global swarm service - community.docker.docker_swarm_service: - name: portainer_agent - image: "portainer/agent:{{ portainer_agent_version }}" - state: present - mode: global - publish: - - published_port: "{{ portainer_agent_port }}" - target_port: 9001 - protocol: tcp - networks: - - name: "{{ portainer_agent_network }}" - constraints: - - node.platform.os == linux - mounts: - - source: /var/run/docker.sock - target: /var/run/docker.sock - type: bind - - source: /var/lib/docker/volumes - target: /var/lib/docker/volumes - type: bind - - source: / - target: /host - type: bind - - - name: Show Portainer agent deployment status - ansible.builtin.command: docker service ls --filter name=portainer_agent - register: portainer_agent_status - changed_when: false - - - name: Display Portainer agent summary - ansible.builtin.debug: - msg: - - "Portainer agent service is deployed" - - "Network: {{ portainer_agent_network }}" - - "Status: {{ portainer_agent_status.stdout }}" - -- name: Display Portainer installation summary - hosts: watchtower - gather_facts: true - - tasks: - - name: Show post-install summary - ansible.builtin.debug: - msg: - - "Portainer installation complete" - - "Server URL: http://{{ ansible_default_ipv4.address }}:9000" - - "HTTPS URL: https://{{ ansible_default_ipv4.address }}:9443" - - "Add Swarm environment in Portainer using any manager IP on port 9001" diff --git a/ansible/archive/playbooks/docker/manage_containers.yml b/ansible/archive/playbooks/docker/manage_containers.yml deleted file mode 100644 index f0c0c76..0000000 --- a/ansible/archive/playbooks/docker/manage_containers.yml +++ /dev/null @@ -1,158 +0,0 @@ ---- -- name: Manage Docker environment - hosts: docker_hosts - become: true - vars: - docker_users: - - chester - docker_daemon_options: - log-driver: "json-file" - log-opts: - max-size: "10m" - max-file: "3" - storage-driver: "overlay2" - 
docker_cleanup_enabled: false - docker_cleanup_older_than_days: 30 - - tasks: - - name: Install Docker prerequisite packages - ansible.builtin.apt: - name: - - apt-transport-https - - ca-certificates - - curl - - gnupg - - lsb-release - - python3-pip - - python3-docker - state: present - update_cache: true - - - name: Add Docker apt signing key - ansible.builtin.apt_key: - url: "https://download.docker.com/linux/ubuntu/gpg" - state: present - - - name: Add Docker apt repository - ansible.builtin.apt_repository: - repo: "deb [arch=amd64] https://download.docker.com/linux/ubuntu {{ ansible_distribution_release }} stable" - state: present - - - name: Install Docker Engine packages - ansible.builtin.apt: - name: - - docker-ce - - docker-ce-cli - - containerd.io - - docker-buildx-plugin - - docker-compose-plugin - state: present - update_cache: true - - - name: Ensure Docker service is enabled and started - ansible.builtin.systemd: - name: docker - state: started - enabled: true - - - name: Configure Docker daemon options - ansible.builtin.copy: - content: "{{ docker_daemon_options | to_nice_json }}" - dest: /etc/docker/daemon.json - mode: '0644' - notify: Restart Docker - - - name: Add configured users to docker group - ansible.builtin.user: - name: "{{ item }}" - groups: docker - append: true - loop: "{{ docker_users }}" - - - name: Ensure Docker networks directory exists - ansible.builtin.file: - path: /etc/docker/networks - state: directory - mode: '0755' - - - name: Gather Docker host information - community.docker.docker_host_info: - register: docker_info - - - name: Show Docker version - ansible.builtin.debug: - msg: "Docker version {{ docker_info.host_info.ServerVersion }}" - - - name: Ensure required Docker networks exist - community.docker.docker_network: - name: "{{ item }}" - state: present - loop: - - backend - - frontend - - - name: Check Docker disk usage - ansible.builtin.command: docker system df - register: docker_disk_usage - changed_when: false - - - 
name: Show Docker disk usage output - ansible.builtin.debug: - var: docker_disk_usage.stdout_lines - - - name: Check for unhealthy containers - ansible.builtin.command: docker ps --filter health=unhealthy --format '{{"{{.Names}}\t{{.Status}}"}}' - register: unhealthy_containers - changed_when: false - failed_when: false - - - name: Report unhealthy containers - ansible.builtin.debug: - msg: "Unhealthy containers detected: {{ unhealthy_containers.stdout_lines }}" - when: unhealthy_containers.stdout | length > 0 - - - name: Prune Docker resources when cleanup is enabled - community.docker.docker_prune: - containers: true - images: true - images_filters: - until: "{{ docker_cleanup_older_than_days * 24 }}h" - networks: true - volumes: true - when: docker_cleanup_enabled - register: docker_prune_result - - - name: Show Docker cleanup results - ansible.builtin.debug: - var: docker_prune_result - when: docker_cleanup_enabled - - - name: Create Docker backup directory - ansible.builtin.file: - path: /opt/docker-backups - state: directory - mode: '0750' - - - name: Find docker-compose files - ansible.builtin.find: - paths: - - /opt - - /home - patterns: "docker-compose*.yml" - recurse: true - register: compose_files - - - name: Back up docker-compose files - ansible.builtin.copy: - src: "{{ item.path }}" - dest: "/opt/docker-backups/{{ item.path | basename }}.{{ ansible_date_time.date }}" - remote_src: true - mode: '0644' - loop: "{{ compose_files.files }}" - when: compose_files.files | length > 0 - - handlers: - - name: Restart Docker - ansible.builtin.systemd: - name: docker - state: restarted diff --git a/ansible/archive/playbooks/docker/swarm_audit.yml b/ansible/archive/playbooks/docker/swarm_audit.yml deleted file mode 100644 index 7429f17..0000000 --- a/ansible/archive/playbooks/docker/swarm_audit.yml +++ /dev/null @@ -1,201 +0,0 @@ ---- -# playbooks/docker/swarm_audit.yml -# Read-only cross-node consistency audit for the Docker Swarm cluster. -# Safe to schedule. 
Makes no changes to any host. -# -# What this does: -# Play 1 — Gathers key state from all swarm_hosts nodes (kernel, distro, -# swap, sysctl, daemon.json, Docker Swarm role) -# Play 2 — Asserts consistency across all 6 nodes and writes a markdown -# drift report to outputs/swarm_audit_.md -# -# Usage: -# ansible-playbook -i inventory/hosts.ini playbooks/docker/swarm_audit.yml -# -# Output: -# outputs/swarm_audit_.md (repo root) - -- name: "Play 1: Gather Swarm node state" - hosts: swarm_hosts - become: true - gather_facts: true - - tasks: - - name: Read sysctl values for audit - ansible.builtin.shell: "sysctl -n {{ item }} 2>/dev/null || echo 0" - register: sysctl_raw - loop: - - vm.swappiness - - net.bridge.bridge-nf-call-iptables - - net.bridge.bridge-nf-call-ip6tables - - net.ipv4.ip_forward - changed_when: false - check_mode: false - - - name: Read Docker daemon.json - ansible.builtin.command: cat /etc/docker/daemon.json - register: daemon_json_content - changed_when: false - failed_when: false - check_mode: false - - - name: Get Docker Swarm node role - ansible.builtin.shell: > - docker info --format '{{ '{{' }}.Swarm.LocalNodeState{{ '}}' }}:{{ '{{' }}.Swarm.ControlAvailable{{ '}}' }}' - register: docker_swarm_info - changed_when: false - failed_when: false - check_mode: false - - - name: Stash per-node audit facts - ansible.builtin.set_fact: - swarm_audit: - kernel: "{{ ansible_kernel }}" - distro: "{{ ansible_distribution }}" - distro_version: "{{ ansible_distribution_version }}" - swap_mb: "{{ ansible_swaptotal_mb }}" - swappiness: "{{ (sysctl_raw.results | selectattr('item', 'equalto', 'vm.swappiness') | first).stdout | trim }}" - bridge_iptables: "{{ (sysctl_raw.results | selectattr('item', 'equalto', 'net.bridge.bridge-nf-call-iptables') | first).stdout | trim }}" - bridge_ip6tables: "{{ (sysctl_raw.results | selectattr('item', 'equalto', 'net.bridge.bridge-nf-call-ip6tables') | first).stdout | trim }}" - ip_forward: "{{ (sysctl_raw.results | 
selectattr('item', 'equalto', 'net.ipv4.ip_forward') | first).stdout | trim }}" - daemon_json: "{{ daemon_json_content.stdout | default('{}') }}" - log_rotation_configured: "{{ 'max-size' in (daemon_json_content.stdout | default('{}')) }}" - swarm_local_state: "{{ docker_swarm_info.stdout.split(':')[0] | trim }}" - swarm_is_manager: "{{ docker_swarm_info.stdout.split(':')[1] | trim | lower == 'true' }}" - - -- name: "Play 2: Cross-node consistency assertions and drift report" - hosts: localhost - gather_facts: false - - vars: - swarm_nodes: "{{ groups['swarm_hosts'] }}" - managers: "{{ groups['swarm_managers'] }}" - workers: "{{ groups['swarm_workers'] }}" - audit_timestamp: "{{ lookup('pipe', 'date +%Y%m%dT%H%M%S') }}" - report_path: "{{ playbook_dir }}/../../../outputs/swarm_audit_{{ audit_timestamp }}.md" - - tasks: - - name: Ensure outputs directory exists - ansible.builtin.file: - path: "{{ playbook_dir }}/../../../outputs" - state: directory - mode: '0755' - - - name: Write drift report - ansible.builtin.copy: - dest: "{{ report_path }}" - mode: '0644' - content: | - # Swarm Cluster Audit Report - - Generated: {{ audit_timestamp }} - Nodes audited: {{ swarm_nodes | join(', ') }} - - ## Node Summary - - | Node | Role | Kernel | Distro | Swap | Swappiness | Bridge IPTables | IP Forward | Log Rotation | - |------|------|--------|--------|------|------------|-----------------|------------|--------------| - {% for node in swarm_nodes %} - | {{ node }} | {{ 'Manager' if hostvars[node]['swarm_audit']['swarm_is_manager'] | bool else 'Worker' }} | `{{ hostvars[node]['swarm_audit']['kernel'] }}` | {{ hostvars[node]['swarm_audit']['distro'] }} {{ hostvars[node]['swarm_audit']['distro_version'] }} | {{ hostvars[node]['swarm_audit']['swap_mb'] }}MB | {{ hostvars[node]['swarm_audit']['swappiness'] }} | {{ hostvars[node]['swarm_audit']['bridge_iptables'] }} | {{ hostvars[node]['swarm_audit']['ip_forward'] }} | {{ '✅' if 
hostvars[node]['swarm_audit']['log_rotation_configured'] | bool else '❌' }} | - {% endfor %} - - ## Swarm Role Mapping - - | Node | Inventory Role | Docker ControlAvailable | - |------|----------------|------------------------| - {% for node in managers %} - | {{ node }} | Manager | {{ '✅ true' if hostvars[node]['swarm_audit']['swarm_is_manager'] | bool else '❌ false (DRIFT!)' }} | - {% endfor %} - {% for node in workers %} - | {{ node }} | Worker | {{ '❌ true (UNEXPECTED!)' if hostvars[node]['swarm_audit']['swarm_is_manager'] | bool else '✅ false' }} | - {% endfor %} - - ## Docker Swarm State - - | Node | LocalNodeState | - |------|----------------| - {% for node in swarm_nodes %} - | {{ node }} | {{ hostvars[node]['swarm_audit']['swarm_local_state'] }} | - {% endfor %} - - - name: Assert kernel consistency across all nodes - ansible.builtin.assert: - that: - - hostvars[item]['swarm_audit']['kernel'] == hostvars[swarm_nodes[0]]['swarm_audit']['kernel'] - fail_msg: >- - ❌ Kernel drift: {{ item }} has {{ hostvars[item]['swarm_audit']['kernel'] }} - but {{ swarm_nodes[0] }} has {{ hostvars[swarm_nodes[0]]['swarm_audit']['kernel'] }} - success_msg: "✅ {{ item }}: kernel {{ hostvars[item]['swarm_audit']['kernel'] }}" - loop: "{{ swarm_nodes }}" - - - name: Assert distro version consistency across all nodes - ansible.builtin.assert: - that: - - hostvars[item]['swarm_audit']['distro_version'] == hostvars[swarm_nodes[0]]['swarm_audit']['distro_version'] - fail_msg: >- - ❌ Distro version drift: {{ item }} has {{ hostvars[item]['swarm_audit']['distro_version'] }} - but {{ swarm_nodes[0] }} has {{ hostvars[swarm_nodes[0]]['swarm_audit']['distro_version'] }} - success_msg: "✅ {{ item }}: distro {{ hostvars[item]['swarm_audit']['distro'] }} {{ hostvars[item]['swarm_audit']['distro_version'] }}" - loop: "{{ swarm_nodes }}" - - - name: Assert swap is disabled on all nodes - ansible.builtin.assert: - that: - - hostvars[item]['swarm_audit']['swap_mb'] | int == 0 - 
fail_msg: "❌ Swap is enabled on {{ item }}: {{ hostvars[item]['swarm_audit']['swap_mb'] }}MB — run swarm_baseline.yml --tags storage" - success_msg: "✅ {{ item }}: swap disabled" - loop: "{{ swarm_nodes }}" - - - name: Assert vm.swappiness=0 on all nodes - ansible.builtin.assert: - that: - - hostvars[item]['swarm_audit']['swappiness'] | int == 0 - fail_msg: "❌ vm.swappiness={{ hostvars[item]['swarm_audit']['swappiness'] }} on {{ item }} — run swarm_baseline.yml --tags sysctl" - success_msg: "✅ {{ item }}: vm.swappiness=0" - loop: "{{ swarm_nodes }}" - - - name: Assert bridge netfilter is enabled on all nodes - ansible.builtin.assert: - that: - - hostvars[item]['swarm_audit']['bridge_iptables'] | int == 1 - - hostvars[item]['swarm_audit']['bridge_ip6tables'] | int == 1 - fail_msg: >- - ❌ Bridge netfilter not fully enabled on {{ item }}: - bridge-nf-call-iptables={{ hostvars[item]['swarm_audit']['bridge_iptables'] }} - bridge-nf-call-ip6tables={{ hostvars[item]['swarm_audit']['bridge_ip6tables'] }} - Run swarm_baseline.yml --tags sysctl to fix. 
- success_msg: "✅ {{ item }}: bridge netfilter enabled" - loop: "{{ swarm_nodes }}" - - - name: Assert ip_forward is enabled on all nodes - ansible.builtin.assert: - that: - - hostvars[item]['swarm_audit']['ip_forward'] | int == 1 - fail_msg: "❌ net.ipv4.ip_forward={{ hostvars[item]['swarm_audit']['ip_forward'] }} on {{ item }} — run swarm_baseline.yml --tags sysctl" - success_msg: "✅ {{ item }}: ip_forward=1" - loop: "{{ swarm_nodes }}" - - - name: Assert Docker log rotation configured on all nodes - ansible.builtin.assert: - that: - - hostvars[item]['swarm_audit']['log_rotation_configured'] | bool - fail_msg: "❌ Docker log rotation not configured on {{ item }} — run swarm_baseline.yml --tags docker" - success_msg: "✅ {{ item }}: Docker log rotation configured" - loop: "{{ swarm_nodes }}" - - - name: Assert swarm_managers are Docker managers - ansible.builtin.assert: - that: - - hostvars[item]['swarm_audit']['swarm_is_manager'] | bool - fail_msg: "❌ {{ item }} is in swarm_managers inventory group but Docker reports it is NOT a manager" - success_msg: "✅ {{ item }}: confirmed Docker manager" - loop: "{{ managers }}" - - - name: Assert swarm_workers are not Docker managers - ansible.builtin.assert: - that: - - not (hostvars[item]['swarm_audit']['swarm_is_manager'] | bool) - fail_msg: "❌ {{ item }} is in swarm_workers inventory group but Docker reports it is a Manager" - success_msg: "✅ {{ item }}: confirmed Docker worker" - loop: "{{ workers }}" diff --git a/ansible/archive/playbooks/docker/swarm_baseline.yml b/ansible/archive/playbooks/docker/swarm_baseline.yml deleted file mode 100644 index aa61dfa..0000000 --- a/ansible/archive/playbooks/docker/swarm_baseline.yml +++ /dev/null @@ -1,188 +0,0 @@ ---- -# playbooks/docker/swarm_baseline.yml -# Idempotent Ubuntu/Docker Swarm node baseline enforcement. 
-# -# ───────────────────────────────────────────────────────────────────────────── -# PURPOSE: Ongoing drift enforcement — safe to run any time, safe to schedule. -# Does NOT upgrade packages. Does NOT reboot. -# For rolling OS updates: use playbooks/docker/swarm_update.yml -# For cross-node consistency audit: use playbooks/docker/swarm_audit.yml -# ───────────────────────────────────────────────────────────────────────────── -# -# What this enforces (all idempotent): -# 0. Identity: Operational user, SSH key, passwordless sudo, docker group -# 1. Packages: Required packages present (docker-ce, nfs-common, curl, htop) -# 2. Storage: Swap disabled (swapoff -a + fstab commented) -# 3. Sysctl: vm.swappiness=0, bridge netfilter, ip_forward (Docker requirements) -# 4. Docker: /etc/docker/daemon.json with log rotation -# -# Usage: -# # All swarm nodes: -# ansible-playbook -i inventory/hosts.ini playbooks/docker/swarm_baseline.yml -# -# # Single node: -# ansible-playbook -i inventory/hosts.ini playbooks/docker/swarm_baseline.yml --limit swarm-manager-1 -# -# # Dry-run: -# ansible-playbook -i inventory/hosts.ini playbooks/docker/swarm_baseline.yml --check --diff -# -# # Target a specific section only: -# ansible-playbook -i inventory/hosts.ini playbooks/docker/swarm_baseline.yml --tags sysctl -# ansible-playbook -i inventory/hosts.ini playbooks/docker/swarm_baseline.yml --tags docker - -- name: Swarm node baseline enforcement - hosts: swarm_hosts - become: true - - vars: - lab_user: "{{ lab_ansible_user | default('chester') }}" - controller_ssh_pubkey_candidates: - - "{{ lookup('env', 'HOME') }}/.ssh/id_ed25519_homelab.pub" - - "{{ lookup('env', 'HOME') }}/.ssh/id_ed25519.pub" - - handlers: - - name: Restart Docker - ansible.builtin.service: - name: docker - state: restarted - - tasks: - - name: "0. 
Identity: ensure user '{{ lab_user }}' is configured" - tags: [identity, baseline] - block: - - name: "Ensure group '{{ lab_user }}' exists" - ansible.builtin.group: - name: "{{ lab_user }}" - state: present - - - name: "Ensure user '{{ lab_user }}' exists with sudo and docker access" - ansible.builtin.user: - name: "{{ lab_user }}" - group: "{{ lab_user }}" - groups: - - sudo - - docker - append: true - shell: /bin/bash - password: '!' - password_lock: true - - - name: Locate SSH public key on control machine - ansible.builtin.set_fact: - controller_ssh_pubkey_path: >- - {{ lookup('ansible.builtin.first_found', {'files': controller_ssh_pubkey_candidates, 'skip': true}) }} - delegate_to: localhost - become: false - - - name: Fail early if SSH public key is missing - ansible.builtin.fail: - msg: >- - SSH public key not found on the control machine. - Checked: {{ controller_ssh_pubkey_candidates | join(', ') }} - Generate one with: ssh-keygen -t ed25519 -f ~/.ssh/id_ed25519 - when: controller_ssh_pubkey_path | default('') | length == 0 - - - name: "Deploy SSH key to {{ lab_user }}" - ansible.posix.authorized_key: - user: "{{ lab_user }}" - state: present - key: "{{ lookup('file', controller_ssh_pubkey_path) }}" - - - name: "Grant '{{ lab_user }}' passwordless sudo" - ansible.builtin.copy: - dest: "/etc/sudoers.d/{{ lab_user }}" - content: "{{ lab_user }} ALL=(ALL) NOPASSWD: ALL\n" - mode: '0440' - owner: root - group: root - validate: '/usr/sbin/visudo -cf %s' - - - name: "1. Packages: ensure required packages are present" - tags: [packages, baseline] - block: - - name: Update apt cache - ansible.builtin.apt: - update_cache: true - cache_valid_time: 3600 - - - name: Ensure required packages present - ansible.builtin.apt: - name: - - docker-ce - - docker-ce-cli - - containerd.io - - nfs-common - - curl - - htop - - ca-certificates - state: present - - - name: "2. 
Storage: disable swap" - tags: [storage, baseline] - block: - - name: Disable swap immediately - ansible.builtin.command: swapoff -a - when: ansible_swaptotal_mb > 0 - changed_when: ansible_swaptotal_mb > 0 - - - name: Comment out swap entries in /etc/fstab - ansible.builtin.replace: - path: /etc/fstab - regexp: '^([^#].*\s+swap\s+.*)$' - replace: '# \1' - - - name: "3. Sysctl: apply Docker Swarm networking parameters" - tags: [sysctl, baseline] - block: - - name: Ensure br_netfilter module is loaded - community.general.modprobe: - name: br_netfilter - state: present - - - name: Persist br_netfilter module load at boot - ansible.builtin.copy: - dest: /etc/modules-load.d/br_netfilter.conf - content: "br_netfilter\n" - owner: root - group: root - mode: '0644' - - - name: Apply and persist sysctl parameters for swarm - ansible.posix.sysctl: - name: "{{ item.key }}" - value: "{{ item.value }}" - sysctl_file: /etc/sysctl.d/90-swarm.conf - state: present - reload: true - loop: - - { key: vm.swappiness, value: "0" } - - { key: net.bridge.bridge-nf-call-iptables, value: "1" } - - { key: net.bridge.bridge-nf-call-ip6tables, value: "1" } - - { key: net.ipv4.ip_forward, value: "1" } - - - name: "4. 
Docker: daemon configuration and log rotation" - tags: [docker, baseline] - block: - - name: Ensure /etc/docker directory exists - ansible.builtin.file: - path: /etc/docker - state: directory - owner: root - group: root - mode: '0755' - - - name: Deploy Docker daemon.json with log rotation - ansible.builtin.copy: - dest: /etc/docker/daemon.json - owner: root - group: root - mode: '0644' - content: | - { - "log-driver": "json-file", - "log-opts": { - "max-size": "10m", - "max-file": "3" - } - } - notify: Restart Docker diff --git a/ansible/archive/playbooks/docker/swarm_preflight.yml b/ansible/archive/playbooks/docker/swarm_preflight.yml deleted file mode 100644 index b840e06..0000000 --- a/ansible/archive/playbooks/docker/swarm_preflight.yml +++ /dev/null @@ -1,110 +0,0 @@ ---- -# ansible/playbooks/docker/swarm_preflight.yml -# -# Swarm Foundation Pre-flight -# =========================== -# Addresses all four hard prerequisites before any service can be deployed to -# the swarm. Run this once after swarm_bootstrap and before swarm_stack_deploy. -# -# Prerequisites satisfied: -# 1. NFS mounts — /mnt/homelab + /mnt/media mounted on every node -# 2. proxy-net — overlay network present on the swarm (172.20.0.0/24) -# 3. Node labels — role=manager / role=worker applied to every node -# 4. 
/opt/stacks — deploy root created on every node (owned by lab user) -# -# Usage: -# # Dry-run (safe, no changes): -# ansible-playbook -i inventory/hosts.ini playbooks/docker/swarm_preflight.yml --check -# -# # Live run: -# ansible-playbook -i inventory/hosts.ini playbooks/docker/swarm_preflight.yml -# -# # Single-concern run: -# ansible-playbook -i inventory/hosts.ini playbooks/docker/swarm_preflight.yml --tags storage -# ansible-playbook -i inventory/hosts.ini playbooks/docker/swarm_preflight.yml --tags network -# ansible-playbook -i inventory/hosts.ini playbooks/docker/swarm_preflight.yml --tags labels -# ansible-playbook -i inventory/hosts.ini playbooks/docker/swarm_preflight.yml --tags stacks_root -# -# Verification (post-run): -# ansible swarm_hosts -i inventory/hosts.ini -m command -a "findmnt /mnt/homelab" -# ansible swarm_hosts -i inventory/hosts.ini -m stat -a "path=/opt/stacks" -# docker node ls --format '{{ "{{" }}.Hostname{{ "}}" }}\t{{ "{{" }}.Labels{{ "}}" }}' -# docker network inspect proxy-net - -############################################################################### -# PLAY 1 — Storage: NFS mounts + /opt/stacks on every swarm node # -############################################################################### -- name: "Swarm pre-flight | Storage" - hosts: swarm_hosts - become: true - gather_facts: false - tags: [storage, stacks_root] - vars: - lab_user: "{{ lab_ansible_user | default('chester') }}" - - roles: - - role: storage_mounts - tags: [storage] - - tasks: - - name: "Create /opt/stacks deploy root" - ansible.builtin.file: - path: /opt/stacks - state: directory - owner: "{{ lab_user }}" - group: "{{ lab_user }}" - mode: "0755" - tags: [stacks_root] - -############################################################################### -# PLAY 2 — Network: ensure proxy-net overlay exists (run from one manager) # -############################################################################### -- name: "Swarm pre-flight | proxy-net 
overlay network" - hosts: swarm_managers[0] - become: false - gather_facts: false - tags: [network] - - roles: - - role: swarm_overlay_network - tags: [network] - -############################################################################### -# PLAY 3 — Labels: apply role=manager / role=worker to every swarm node # -############################################################################### -- name: "Swarm pre-flight | Node labels" - hosts: swarm_managers[0] - become: false - gather_facts: false - tags: [labels] - - tasks: - - name: "Apply role=manager label to manager nodes" - ansible.builtin.command: >- - docker node update --label-add role=manager {{ item }} - loop: "{{ groups['swarm_managers'] }}" - changed_when: false - # docker node update is idempotent — labels are additive and - # re-applying the same label does not change cluster state. - tags: [labels] - - - name: "Apply role=worker label to worker nodes" - ansible.builtin.command: >- - docker node update --label-add role=worker {{ item }} - loop: "{{ groups['swarm_workers'] }}" - changed_when: false - tags: [labels] - - - name: "Show node label summary" - ansible.builtin.shell: >- - for node in $(docker node ls --format "{{ '{{' }}.Hostname{{ '}}' }}"); do - echo "$node $(docker node inspect $node --format '{{ '{{' }}json .Spec.Labels{{ '}}' }}')"; - done - register: swarm_node_summary - changed_when: false - tags: [labels] - - - name: "Print node label summary" - ansible.builtin.debug: - msg: "{{ swarm_node_summary.stdout_lines }}" - tags: [labels] diff --git a/ansible/archive/playbooks/docker/swarm_update.yml b/ansible/archive/playbooks/docker/swarm_update.yml deleted file mode 100644 index a9b4667..0000000 --- a/ansible/archive/playbooks/docker/swarm_update.yml +++ /dev/null @@ -1,170 +0,0 @@ ---- -# playbooks/docker/swarm_update.yml -# Rolling Docker Swarm node OS update with drain-before-reboot. 
-# -# ───────────────────────────────────────────────────────────────────────────── -# ⚠️ HUMAN-TRIGGERED ONLY — do not automate or schedule. -# serial: 1 ensures one node is updated at a time. -# Each node is drained before update and re-activated after reboot. -# ───────────────────────────────────────────────────────────────────────────── -# -# What this does per node: -# 1. Pre-checks that Docker Swarm is healthy on the node -# 2. Drains the node (tasks migrate to remaining nodes) -# 3. Runs apt dist-upgrade -# 4. Reboots if a newer kernel was installed -# 5. Waits for the node and Docker daemon to return online -# 6. Re-activates the node in the swarm -# 7. Asserts node is Ready + Active before proceeding to the next node -# -# NOTE: drain/restore commands are delegated to a healthy manager. -# When updating swarm-manager-1, delegation falls back to swarm-manager-2. -# Assumes inventory_hostname matches the Docker Swarm node name (VM hostname). -# -# Usage: -# # All nodes (rolling — managers first, then workers): -# ansible-playbook -i inventory/hosts.ini playbooks/docker/swarm_update.yml -# -# # Single node: -# ansible-playbook -i inventory/hosts.ini playbooks/docker/swarm_update.yml --limit swarm-worker-1 -# -# # Dry-run (confirms serial order and reboot conditions without modifying): -# ansible-playbook -i inventory/hosts.ini playbooks/docker/swarm_update.yml --check -# -# # Update packages but skip reboot even if kernel changed: -# ansible-playbook -i inventory/hosts.ini playbooks/docker/swarm_update.yml --skip-tags reboot - -- name: Rolling Swarm node update - hosts: swarm_hosts - become: true - serial: 1 - - vars: - # Delegate swarm CLI commands to a healthy manager. - # If we are updating swarm-manager-1 itself, fall back to swarm-manager-2. 
- swarm_delegate: >- - {{ 'swarm-manager-2' if inventory_hostname == 'swarm-manager-1' else 'swarm-manager-1' }} - - tasks: - - name: "Pre-flight: verify Swarm is healthy before touching this node" - block: - - name: Check Docker Swarm state on this node - ansible.builtin.shell: > - docker info --format '{{ '{{' }}.Swarm.LocalNodeState{{ '}}' }}' - register: swarm_pre - changed_when: false - check_mode: false - - - name: Fail if node is not an active swarm member - ansible.builtin.assert: - that: - - swarm_pre.stdout | trim == 'active' - fail_msg: >- - ⛔ {{ inventory_hostname }} reports Swarm.LocalNodeState={{ swarm_pre.stdout | trim }}. - Expected 'active'. Resolve swarm health before proceeding. - success_msg: "✅ {{ inventory_hostname }} is an active swarm member — safe to drain" - - - name: "Drain: migrate tasks off {{ inventory_hostname }}" - tags: [drain] - when: not ansible_check_mode - block: - - name: Set node availability to drain - ansible.builtin.command: > - docker node update --availability drain {{ inventory_hostname }} - delegate_to: "{{ swarm_delegate }}" - become: false - changed_when: true - - - name: Wait for running tasks to evacuate - ansible.builtin.shell: > - docker node ps {{ inventory_hostname }} --filter desired-state=running -q 2>/dev/null | wc -l - delegate_to: "{{ swarm_delegate }}" - become: false - register: running_tasks - until: running_tasks.stdout | trim | int == 0 - retries: 18 - delay: 10 - changed_when: false - - - name: "Update packages" - block: - - name: Update apt cache - ansible.builtin.apt: - update_cache: true - cache_valid_time: 0 - - - name: Run apt dist-upgrade - ansible.builtin.apt: - upgrade: dist - update_cache: false - register: dist_upgrade_result - tags: [update] - - - name: Check if a newer kernel is installed but not yet booted - ansible.builtin.shell: | - LATEST=$(ls /boot/vmlinuz-* | sort -V | tail -1 | sed 's|/boot/vmlinuz-||') - RUNNING=$(uname -r) - if [ "$LATEST" != "$RUNNING" ]; then echo 
"reboot_needed"; fi - register: reboot_check - changed_when: false - check_mode: false - tags: [reboot] - - - name: Reboot if a newer kernel is installed - ansible.builtin.reboot: - msg: "Rebooting into updated kernel — initiated by swarm_update.yml" - reboot_timeout: 600 - when: reboot_check.stdout | trim == 'reboot_needed' - tags: [reboot] - - - name: Wait for node to return post-reboot - ansible.builtin.wait_for_connection: - delay: 10 - timeout: 600 - when: reboot_check.stdout | trim == 'reboot_needed' - tags: [reboot] - - - name: Wait for Docker daemon to be ready after reboot - ansible.builtin.command: docker info - register: docker_ready - until: docker_ready.rc == 0 - retries: 18 - delay: 10 - changed_when: false - check_mode: false - when: reboot_check.stdout | trim == 'reboot_needed' - tags: [reboot] - - - name: "Restore: re-activate {{ inventory_hostname }} in the swarm" - tags: [drain] - when: not ansible_check_mode - block: - - name: Set node availability back to active - ansible.builtin.command: > - docker node update --availability active {{ inventory_hostname }} - delegate_to: "{{ swarm_delegate }}" - become: false - changed_when: true - - - name: Wait for node to be Ready and Active - ansible.builtin.shell: > - docker node ls --filter name={{ inventory_hostname }} - delegate_to: "{{ swarm_delegate }}" - become: false - register: node_ls - until: "'Ready' in node_ls.stdout and 'Active' in node_ls.stdout" - retries: 12 - delay: 10 - changed_when: false - - - name: Confirm node status after update - ansible.builtin.assert: - that: - - "'Ready' in node_ls.stdout" - - "'Active' in node_ls.stdout" - fail_msg: >- - ⛔ {{ inventory_hostname }} is not Ready+Active after update. - Investigate before proceeding to the next node. - docker node ls output: - {{ node_ls.stdout }} - success_msg: "✅ {{ inventory_hostname }} updated — Ready + Active. Proceeding." 
diff --git a/ansible/archive/playbooks/generate_inventory.yml b/ansible/archive/playbooks/generate_inventory.yml deleted file mode 100644 index 2d2abd9..0000000 --- a/ansible/archive/playbooks/generate_inventory.yml +++ /dev/null @@ -1,28 +0,0 @@ ---- -# Generate `ansible/inventory/hosts.ini` from the central YAML SoT -# Run locally: `ansible-playbook ansible/playbooks/generate_inventory.yml --connection=local` -- name: Generate inventory from central source of truth - hosts: localhost - connection: local - gather_facts: false - vars: - sod_file: "../group_vars/all.yml" - inventory_dest: "../inventory/hosts.ini" - - tasks: - - name: Generate inventory file using local script - ansible.builtin.command: "python3 ../scripts/generate_inventory.py --sot {{ sod_file }} --out /tmp/generated_hosts.ini" - args: - chdir: "{{ playbook_dir }}" - changed_when: false - - - name: Install generated inventory with backup - ansible.builtin.copy: - src: /tmp/generated_hosts.ini - dest: "{{ inventory_dest }}" - mode: '0644' - backup: true - - - name: Show result path - ansible.builtin.debug: - msg: "Wrote inventory to {{ inventory_dest }} (backup created if present)" diff --git a/ansible/archive/playbooks/monitoring/deploy_swarm_monitoring.yml b/ansible/archive/playbooks/monitoring/deploy_swarm_monitoring.yml deleted file mode 100644 index b7ddaab..0000000 --- a/ansible/archive/playbooks/monitoring/deploy_swarm_monitoring.yml +++ /dev/null @@ -1,441 +0,0 @@ ---- -# playbooks/monitoring/deploy_swarm_monitoring.yml -# Complete observability stack deployment for Docker Swarm cluster + standalone hosts -# -# === ARCHITECTURE OVERVIEW === -# This playbook deploys a three-tier monitoring solution: -# -# TIER 1: Data Collection (Swarm Nodes + Standalone Docker Hosts) -# - node-exporter: Host metrics (CPU, RAM, disk, network) on swarm nodes and standalone hosts -# - cAdvisor: Container metrics (per-container resource usage) on swarm nodes only -# -# TIER 2: Aggregation & Storage (Watchtower) 
-# - Prometheus: Metrics time-series database -# - Loki: Log aggregation and indexing -# -# TIER 3: Visualization & Alerting (Watchtower) -# - Grafana: Dashboards and data exploration -# - Uptime Kuma: HTTP health checks -# - Dozzle: Real-time log viewer -# -# === PREREQUISITES === -# - Docker Swarm cluster is initialized and running -# - All nodes are accessible via SSH -# - Docker is installed on all nodes (swarm + standalone hosts) -# - Authentik token is set in group_vars (for Dozzle auth) -# -# === USAGE === -# Deploy full stack (swarm nodes, standalone hosts, and watchtower): -# ansible-playbook -i inventory/hosts.ini playbooks/monitoring/deploy_swarm_monitoring.yml -# -# Deploy only to swarm nodes: -# ansible-playbook -i inventory/hosts.ini playbooks/monitoring/deploy_swarm_monitoring.yml --tags swarm -# -# Deploy only to standalone docker hosts: -# ansible-playbook -i inventory/hosts.ini playbooks/monitoring/deploy_swarm_monitoring.yml --tags docker-hosts -# -# Deploy only watchtower stack: -# ansible-playbook -i inventory/hosts.ini playbooks/monitoring/deploy_swarm_monitoring.yml --tags watchtower - -- name: Deploy monitoring exporters on swarm nodes - hosts: swarm_hosts - become: false - gather_facts: true - tags: ['swarm', 'exporters'] - - pre_tasks: - - name: Verify Docker is installed - ansible.builtin.command: docker --version - register: docker_check - changed_when: false - failed_when: docker_check.rc != 0 - - - name: Display deployment target - ansible.builtin.debug: - msg: - - "🎯 Deploying monitoring exporters to: {{ inventory_hostname }}" - - " Role: {{ 'Manager' if inventory_hostname in groups['swarm_managers'] else 'Worker' }}" - - " IP: {{ ansible_host }}" - - roles: - - role: swarm_node_exporter - tags: ['node-exporter'] - - - role: swarm_cadvisor - tags: ['cadvisor'] - -- name: Deploy Dozzle swarm agents - hosts: swarm_managers - become: false - gather_facts: false - tags: ['swarm', 'dozzle-agent'] - - tasks: - - name: Deploy and validate 
dozzle-agent service from primary manager - ansible.builtin.include_role: - name: swarm_dozzle_agent - when: inventory_hostname == groups['swarm_managers'][0] - - post_tasks: - - name: Validate exporter endpoints - ansible.builtin.uri: - url: "{{ item.url }}" - method: GET - status_code: 200 - loop: - - { name: "node-exporter", url: "http://localhost:9100/metrics" } - - { name: "cAdvisor", url: "http://localhost:8080/metrics" } - loop_control: - label: "{{ item.name }}" - register: endpoint_check - retries: 3 - delay: 5 - - - name: Display exporter status - ansible.builtin.debug: - msg: "✅ {{ inventory_hostname }}: All exporters are healthy" - -- name: Deploy node-exporter on standalone docker hosts - hosts: docker_hosts - become: false - gather_facts: true - tags: ['docker-hosts', 'exporters', 'node-exporter'] - - pre_tasks: - - name: Verify Docker is installed - ansible.builtin.command: docker --version - register: docker_check - changed_when: false - failed_when: docker_check.rc != 0 - - - name: Display deployment target - ansible.builtin.debug: - msg: - - "🎯 Deploying node-exporter to standalone docker host: {{ inventory_hostname }}" - - " IP: {{ ansible_host }}" - - " Purpose: Hardware and software metrics collection" - - tasks: - - name: Deploy node-exporter role with elevated privileges - ansible.builtin.include_role: - name: swarm_node_exporter - apply: - become: true - tags: ['node-exporter'] - - post_tasks: - - name: Validate node-exporter endpoint - ansible.builtin.uri: - url: "http://localhost:9100/metrics" - method: GET - status_code: 200 - retries: 3 - delay: 5 - register: exporter_check - - - name: Display node-exporter status - ansible.builtin.debug: - msg: "✅ {{ inventory_hostname }}: node-exporter deployed and healthy on port 9100" - -- name: Deploy monitoring stack on Watchtower - hosts: watchtower - connection: local - become: false - gather_facts: true - tags: ['watchtower', 'stack'] - - vars: - # Canonical encrypted vars location (ADR-008) 
- vault_encrypted_vars_file: "{{ playbook_dir }}/../../group_vars/vault/all.yml" - - pre_tasks: - - name: Check vault encrypted vars file state - ansible.builtin.stat: - path: "{{ vault_encrypted_vars_file }}" - register: vault_vars_file_state - - - name: Load encrypted vars when present - ansible.builtin.include_vars: - file: "{{ vault_encrypted_vars_file }}" - name: vault_vars - when: vault_vars_file_state.stat.exists - no_log: true - - - name: Resolve monitoring secrets from vault or environment fallback - ansible.builtin.set_fact: - grafana_admin_password: >- - {{ - ( - vault_vars.vault_grafana_admin_password - if (vault_vars is defined and 'vault_grafana_admin_password' in vault_vars) - else (grafana_admin_password | default('')) - ) | default('', true) - }} - authentik_outpost_dozzle_token: >- - {{ - ( - vault_vars.vault_authentik_outpost_dozzle_token - if (vault_vars is defined and 'vault_authentik_outpost_dozzle_token' in vault_vars) - else ( - secrets.AUTHENTIK_OUTPOST_DOZZLE_TOKEN - if (secrets is defined and 'AUTHENTIK_OUTPOST_DOZZLE_TOKEN' in secrets) - else lookup('env', 'AUTHENTIK_OUTPOST_DOZZLE_TOKEN') - ) - ) | default('', true) - }} - pve_exporter_token: >- - {{ - ( - vault_vars.vault_pve_exporter_token - if (vault_vars is defined and 'vault_pve_exporter_token' in vault_vars) - else lookup('env', 'PVE_EXPORTER_TOKEN') - ) | default('', true) - }} - no_log: true - - - name: Verify Docker Compose V2 is available - ansible.builtin.command: docker compose version - register: compose_check - changed_when: false - failed_when: compose_check.rc != 0 - - - name: Display Watchtower deployment info - ansible.builtin.debug: - msg: - - "🏗️ Deploying monitoring stack to Watchtower" - - " Swarm targets: {{ groups['swarm_managers'] | length }} managers + {{ groups['swarm_workers'] | length }} workers" - - " Standalone hosts: {{ groups['docker_hosts'] | length }} (node-exporter)" - - " Total monitored nodes: {{ groups['swarm_hosts'] | length + 
groups['docker_hosts'] | length + 1 }} (including Watchtower)" - - roles: - - role: monitoring_stack - - post_tasks: - - name: Wait for Prometheus to be ready - ansible.builtin.uri: - url: "http://{{ watchtower_ip }}:{{ prometheus_host_port }}/-/ready" - method: GET - status_code: 200 - retries: 10 - delay: 5 - register: prometheus_ready - when: not (monitoring_focus_mode | default(false) | bool) or (monitoring_focus_service | default('') == 'prometheus') - - - name: Verify Prometheus can scrape all targets - ansible.builtin.uri: - url: "http://{{ watchtower_ip }}:{{ prometheus_host_port }}/api/v1/targets" - method: GET - return_content: true - register: prometheus_targets - retries: 3 - delay: 10 - when: not (monitoring_focus_mode | default(false) | bool) or (monitoring_focus_service | default('') == 'prometheus') - - - name: Build watchtower edge route backend reconciliation list - ansible.builtin.set_fact: - watchtower_edge_route_backends: >- - {{ - [ - {'name': 'grafana', 'url': 'http://' ~ watchtower_ip ~ ':' ~ (grafana_port | string)}, - {'name': 'uptime', 'url': 'http://' ~ watchtower_ip ~ ':' ~ (uptime_kuma_port | string)} - ] - + - ( - [ - {'name': 'dozzle', 'url': 'http://' ~ watchtower_ip ~ ':' ~ (dozzle_port | string)} - ] - if (monitoring_enable_dozzle | default(false) | bool) and (dozzle_expose_via_traefik | default(false) | bool) - else [] - ) - + - ( - [ - {'name': 'authentik-outpost-dozzle', 'url': 'http://' ~ watchtower_ip ~ ':' ~ (authentik_outpost_port | string)} - ] - if monitoring_enable_authentik_outpost | default(false) | bool - else [] - ) - + - [ - {'name': 'portainer', 'url': 'http://' ~ watchtower_ip ~ ':' ~ (portainer_http_port | string)} - ] - }} - - - name: Reconcile watchtower service backends in Redis edge routing - ansible.builtin.command: >- - ssh {{ (edge_routing | default({})).get('edge_host', {}).get('ip', '10.0.0.151') }} - sudo docker exec redis redis-cli SET - traefik/http/services/{{ item.name }}/loadBalancer/servers/0/url 
- {{ item.url }} - changed_when: true - loop: "{{ watchtower_edge_route_backends }}" - loop_control: - label: "{{ item.name }} -> {{ item.url }}" - - - name: Verify reconciled watchtower service backends in Redis - ansible.builtin.command: >- - ssh {{ (edge_routing | default({})).get('edge_host', {}).get('ip', '10.0.0.151') }} - sudo docker exec redis redis-cli GET - traefik/http/services/{{ item.name }}/loadBalancer/servers/0/url - register: watchtower_route_backend_reads - changed_when: false - loop: "{{ watchtower_edge_route_backends }}" - loop_control: - label: "{{ item.name }}" - - - name: Assert watchtower service backends are reconciled to host IP routes - ansible.builtin.assert: - that: - - item.stdout == item.item.url - fail_msg: >- - Edge route drift persisted for {{ item.item.name }}. - Expected {{ item.item.url }}, got {{ item.stdout | default('') }}. - success_msg: >- - Edge route {{ item.item.name }} correctly reconciled to {{ item.item.url }}. - loop: "{{ watchtower_route_backend_reads.results }}" - loop_control: - label: "{{ item.item.name }}" - - - name: Display monitoring stack summary - ansible.builtin.debug: - msg: - - "╔════════════════════════════════════════════════════════╗" - - "║ 🎉 SWARM MONITORING STACK DEPLOYED SUCCESSFULLY! ║" - - "╚════════════════════════════════════════════════════════╝" - - "" - - "📊 METRICS & DASHBOARDS:" - - " Prometheus: http://{{ watchtower_ip }}:{{ prometheus_host_port }}" - - " Grafana: https://{{ grafana_domain }}" - - "" - - "📋 LOGS:" - - " Dozzle: https://{{ dozzle_domain }}" - - " Loki API: http://{{ watchtower_ip }}:{{ loki_port }}" - - "" - - "✅ UPTIME:" - - " Uptime Kuma: https://{{ uptime_domain }}" - - "" - - "🔍 NEXT STEPS:" - - " 1. Open Grafana: https://{{ grafana_domain }}" - - " 2. 
Verify provisioned data sources: {{ grafana_prometheus_datasource_name }} + {{ grafana_loki_datasource_name }}" - - " 3. Review the provisioned dashboard folder: {{ grafana_dashboards_folder }}" - - " 4. Optionally import extra dashboards: 1860, 893, 13639, 10347" - - " 5. Configure Uptime Kuma health checks for swarm services" - - "" - - "📚 CONCEPTS YOU LEARNED:" - - " ✓ Multi-tier monitoring architecture" - - " ✓ Prometheus service discovery & scraping" - - " ✓ Loki label-based log indexing" - - " ✓ Ansible roles for modular infrastructure" - - " ✓ Idempotent deployment (run this playbook anytime!)" - when: not (monitoring_focus_mode | default(false) | bool) or (monitoring_focus_service | default('') == 'prometheus') - - - name: Display focused deployment summary - ansible.builtin.debug: - msg: - - "Focused deployment completed" - - "Service: {{ monitoring_focus_service | default('not-set') }}" - - "Mode: additive (existing running services preserved)" - when: monitoring_focus_mode | default(false) | bool and (monitoring_focus_service | default('') != 'prometheus') - -- name: Generate monitoring documentation - hosts: localhost - connection: local - gather_facts: false - tags: ['docs'] - run_once: true - - tasks: - - name: Create monitoring quick-reference guide - ansible.builtin.copy: - dest: "{{ playbook_dir }}/../../documentation/swarm-monitoring-guide.md" - mode: '0644' - content: | - # Docker Swarm Monitoring Guide - - **Deployed:** {{ ansible_date_time.iso8601 }} - **Cluster:** {{ groups['swarm_hosts'] | length }} nodes ({{ groups['swarm_managers'] | length }} managers, {{ groups['swarm_workers'] | length }} workers) - - ## Quick Access - - | Service | URL | Purpose | - |---------|-----|---------| - | Prometheus | http://{{ hostvars['localhost'].watchtower_ip }}:{{ hostvars['localhost'].prometheus_port }} | Metrics storage & query | - | Grafana | https://{{ hostvars['localhost'].grafana_domain }} | Dashboards & visualization | - | Loki | 
http://{{ hostvars['localhost'].watchtower_ip }}:{{ hostvars['localhost'].loki_port }} | Log aggregation | - | Dozzle | https://{{ hostvars['localhost'].dozzle_domain }} | Real-time log viewer | - | Uptime Kuma | https://{{ hostvars['localhost'].uptime_domain }} | Service uptime tracking | - - ## Monitored Nodes - - ### Managers - {% for host in groups['swarm_managers'] %} - - **{{ host }}** ({{ hostvars[host].ansible_host }}) - - node-exporter: http://{{ hostvars[host].ansible_host }}:9100/metrics - - cAdvisor: http://{{ hostvars[host].ansible_host }}:8080/metrics - {% endfor %} - - ### Workers - {% for host in groups['swarm_workers'] %} - - **{{ host }}** ({{ hostvars[host].ansible_host }}) - - node-exporter: http://{{ hostvars[host].ansible_host }}:9100/metrics - - cAdvisor: http://{{ hostvars[host].ansible_host }}:8080/metrics - {% endfor %} - - ## Useful Prometheus Queries - - ```promql - # Total cluster CPU usage - 100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) - - # Memory usage per node - (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 - - # Container count per node - count(container_last_seen) by (instance) - - # Network traffic by node - rate(node_network_receive_bytes_total[5m]) - ``` - - ## Troubleshooting - - ### Exporter not reachable - ```bash - # Check if container is running - ansible swarm_hosts -i inventory/hosts.ini -a "docker ps | grep exporter" - - # Check firewall - ansible swarm_hosts -i inventory/hosts.ini -a "ss -tlnp | grep -E '9100|8080'" - ``` - - ### Prometheus shows target down - ```bash - # Test from Watchtower - curl http://:9100/metrics - curl http://:8080/metrics - ``` - - ## Maintenance - - ### Update all monitoring components - ```bash - cd /home/chester/homelab/ansible - ansible-playbook -i inventory/hosts.ini playbooks/monitoring/deploy_swarm_monitoring.yml - ``` - - ### View Prometheus configuration - ```bash - cat /opt/stacks/watchtower/prometheus-config/prometheus.yml - ``` - - 
### Check alert rules - ```bash - cat /opt/stacks/watchtower/prometheus-config/alerts/homelab.yml - ``` - - register: docs_created - - - name: Display documentation location - ansible.builtin.debug: - msg: "📚 Monitoring guide created at: {{ docs_created.dest }}" - when: docs_created.changed diff --git a/ansible/archive/playbooks/network/baseline_config.yml b/ansible/archive/playbooks/network/baseline_config.yml deleted file mode 100644 index 0f38f23..0000000 --- a/ansible/archive/playbooks/network/baseline_config.yml +++ /dev/null @@ -1,14 +0,0 @@ ---- -# playbooks/baseline_network_config.yml -# Baseline network config for static IPs and VLAN prep -- name: Baseline network configuration - hosts: all - gather_facts: false - tasks: - - name: Document static IP assignments - ansible.builtin.debug: - msg: "[Info] Ensure static IPs match contracts for {{ inventory_hostname }}." - - - name: Review VLAN segmentation readiness - ansible.builtin.debug: - msg: "[Info] VLANs not yet in use, but config should be reviewed periodically." diff --git a/ansible/archive/playbooks/network/create_swarm_proxy_overlay.yml b/ansible/archive/playbooks/network/create_swarm_proxy_overlay.yml deleted file mode 100644 index c7d516a..0000000 --- a/ansible/archive/playbooks/network/create_swarm_proxy_overlay.yml +++ /dev/null @@ -1,22 +0,0 @@ ---- -# Create the shared Swarm overlay network used by edge-routed services. - -- name: Create proxy overlay network on swarm manager - hosts: swarm_managers - become: false - gather_facts: false - - vars: - # Mirrors the current standalone Docker bridge values from migration inputs. 
- swarm_overlay_network_name: "proxy-net" - swarm_overlay_network_subnet: "172.20.0.0/24" - swarm_overlay_network_gateway: "172.20.0.1" - swarm_overlay_network_attachable: true - swarm_overlay_network_internal: false - swarm_overlay_network_mtu: "1500" - - tasks: - - name: Run network creation only once from the primary manager - ansible.builtin.include_role: - name: swarm_overlay_network - when: inventory_hostname == groups['swarm_managers'][0] diff --git a/ansible/archive/playbooks/network/omada_api_smoke_test.yml b/ansible/archive/playbooks/network/omada_api_smoke_test.yml deleted file mode 100644 index 2b7718a..0000000 --- a/ansible/archive/playbooks/network/omada_api_smoke_test.yml +++ /dev/null @@ -1,74 +0,0 @@ ---- -- name: Omada Open API smoke test - hosts: localhost - connection: local - gather_facts: false - - vars_files: - - "../../group_vars/all.yml" - - "../../group_vars/vault/all.yml" - - vars: - omada_validate_certs: false - - tasks: - - name: Verify required Omada variables are present - ansible.builtin.assert: - that: - - omada_base_url is defined - - omada_id is defined - - omada_client_id is defined - - omada_client_secret is defined - - omada_base_url | length > 0 - - omada_id | length > 0 - - omada_client_id | length > 0 - - omada_client_secret | length > 0 - fail_msg: "Missing Omada variables. Check group_vars/all.yml and group_vars/vault/all.yml." 
- - - name: Request Omada access token (client credentials) - ansible.builtin.uri: - url: "{{ omada_base_url }}/openapi/authorize/token?grant_type=client_credentials" - method: POST - validate_certs: "{{ omada_validate_certs }}" - headers: - Content-Type: application/json - body_format: json - body: - omadacId: "{{ omada_id }}" - client_id: "{{ omada_client_id }}" - client_secret: "{{ omada_client_secret }}" - return_content: true - status_code: 200 - register: omada_token_response - no_log: true - failed_when: - - omada_token_response.json is not defined - - omada_token_response.json.errorCode | default(-1) != 0 - - - name: Save access token from auth response - ansible.builtin.set_fact: - omada_access_token: "{{ omada_token_response.json.result.accessToken }}" - no_log: true - - - name: Query Omada sites - ansible.builtin.uri: - url: "{{ omada_base_url }}/openapi/v1/{{ omada_id }}/sites?page=1&pageSize=20" - method: GET - validate_certs: "{{ omada_validate_certs }}" - headers: - Content-Type: application/json - Authorization: "AccessToken={{ omada_access_token }}" - return_content: true - status_code: 200 - register: omada_sites_response - no_log: true - failed_when: - - omada_sites_response.json is not defined - - omada_sites_response.json.errorCode | default(-1) != 0 - - - name: Report site inventory summary - ansible.builtin.debug: - msg: - - "Omada API auth OK" - - "Sites discovered: {{ omada_sites_response.json.result.totalRows | default(0) }}" - - "First site name: {{ (omada_sites_response.json.result.data | default([]) | first).name | default('n/a') }}" \ No newline at end of file diff --git a/ansible/archive/playbooks/network/omada_find_hue_hub.yml b/ansible/archive/playbooks/network/omada_find_hue_hub.yml deleted file mode 100644 index 260232c..0000000 --- a/ansible/archive/playbooks/network/omada_find_hue_hub.yml +++ /dev/null @@ -1,208 +0,0 @@ ---- -# Phase 1: Query Omada client table and find the Philips Hue hub by MAC OUI. 
-# Phase 2: Probe the discovered IP via the Hue Bridge local API. -# -# Philips Hue / Signify known MAC OUI prefixes: -# 00:17:88 — classic Hue Bridge and bulbs -# EC:B5:FA — newer Hue Bridge v2 and Hue products -# -# Usage: -# ansible-playbook playbooks/network/omada_find_hue_hub.yml --ask-vault-pass - -- name: Find and probe Philips Hue hub via Omada client table - hosts: localhost - connection: local - gather_facts: false - - vars_files: - - "../../group_vars/all.yml" - - "../../group_vars/vault/all.yml" - - vars: - omada_validate_certs: false - omada_page_size: 200 - # Signify / Philips Hue MAC OUI prefixes (lowercase, colon-separated) - hue_mac_ouis: - - "00:17:88" - - "ec:b5:fa" - hue_probe_validate_certs: false - - tasks: - # ----------------------------------------------------------------------- - # PHASE 1 — Omada token - # ----------------------------------------------------------------------- - - name: Request Omada access token (client credentials) - ansible.builtin.uri: - url: "{{ omada_base_url }}/openapi/authorize/token?grant_type=client_credentials" - method: POST - validate_certs: "{{ omada_validate_certs }}" - headers: - Content-Type: application/json - body_format: json - body: - omadacId: "{{ omada_id }}" - client_id: "{{ omada_client_id }}" - client_secret: "{{ omada_client_secret }}" - return_content: true - status_code: 200 - register: omada_token_response - no_log: true - failed_when: - - omada_token_response.json is not defined - - omada_token_response.json.errorCode | default(-1) != 0 - - - name: Save access token - ansible.builtin.set_fact: - omada_access_token: "{{ omada_token_response.json.result.accessToken }}" - no_log: true - - # ----------------------------------------------------------------------- - # PHASE 1 — Get site list - # ----------------------------------------------------------------------- - - name: Query Omada sites - ansible.builtin.uri: - url: "{{ omada_base_url }}/openapi/v1/{{ omada_id 
}}/sites?page=1&pageSize={{ omada_page_size }}" - method: GET - validate_certs: "{{ omada_validate_certs }}" - headers: - Content-Type: application/json - Authorization: "AccessToken={{ omada_access_token }}" - return_content: true - status_code: 200 - register: omada_sites_response - no_log: true - failed_when: - - omada_sites_response.json is not defined - - omada_sites_response.json.errorCode | default(-1) != 0 - - - name: Save site list - ansible.builtin.set_fact: - omada_sites: "{{ omada_sites_response.json.result.data | default([]) }}" - - # ----------------------------------------------------------------------- - # PHASE 1 — Query clients per site - # ----------------------------------------------------------------------- - - name: Query all clients per site - ansible.builtin.uri: - url: "{{ omada_base_url }}/openapi/v1/{{ omada_id }}/sites/{{ item.siteId }}/clients?page=1&pageSize={{ omada_page_size }}" - method: GET - validate_certs: "{{ omada_validate_certs }}" - headers: - Content-Type: application/json - Authorization: "AccessToken={{ omada_access_token }}" - return_content: true - status_code: 200 - loop: "{{ omada_sites }}" - loop_control: - label: "{{ item.name | default(item.siteId) }}" - register: omada_clients_by_site - no_log: true - failed_when: false - - # ----------------------------------------------------------------------- - # PHASE 1 — Filter for Hue hub by OUI - # ----------------------------------------------------------------------- - - name: Collect all clients from all sites into flat list - ansible.builtin.set_fact: - all_clients: "{{ omada_clients_by_site.results - | selectattr('json', 'defined') - | selectattr('json.errorCode', 'equalto', 0) - | map(attribute='json.result.data') - | flatten }}" - - - name: Filter clients by Hue MAC OUI prefixes - ansible.builtin.set_fact: - hue_candidates: "{{ all_clients | selectattr('mac', 'defined') - | selectattr('ip', 'defined') - | selectattr('mac', 'search', hue_mac_ouis | join('|'), 
ignorecase=True) - | list }}" - - - name: Report Omada client search results - ansible.builtin.debug: - msg: - - "Total clients scanned: {{ all_clients | length }}" - - "Hue hub candidates found: {{ hue_candidates | length }}" - - "Candidates: {{ hue_candidates | map(attribute='mac') | list }}" - - - name: Abort with clear message if no Hue hub found - ansible.builtin.fail: - msg: > - No Philips Hue hub found in Omada client table. - Verify the hub is powered on and connected to a monitored VLAN. - Expected MAC OUI prefixes: {{ hue_mac_ouis | join(', ') }}. - when: hue_candidates | length == 0 - - - name: Save first Hue hub candidate IP and MAC - ansible.builtin.set_fact: - hue_ip: "{{ hue_candidates[0].ip }}" - hue_mac: "{{ hue_candidates[0].mac }}" - hue_hostname: "{{ hue_candidates[0].hostname | default('unknown') }}" - - - name: Report discovered Hue hub - ansible.builtin.debug: - msg: - - "Hue hub MAC : {{ hue_mac }}" - - "Hue hub IP : {{ hue_ip }}" - - "Hue hub hostname: {{ hue_hostname }}" - - # ----------------------------------------------------------------------- - # PHASE 2 — Probe Hue Bridge local API - # ----------------------------------------------------------------------- - - name: Probe Hue Bridge local discovery endpoint - ansible.builtin.uri: - url: "http://{{ hue_ip }}/api/config" - method: GET - validate_certs: "{{ hue_probe_validate_certs }}" - return_content: true - timeout: 5 - status_code: 200 - register: hue_config_response - failed_when: false - - - name: Probe Hue Bridge HTTPS clip v2 endpoint (newer firmware) - ansible.builtin.uri: - url: "https://{{ hue_ip }}/clip/v2/resource/bridge" - method: GET - validate_certs: "{{ hue_probe_validate_certs }}" - return_content: true - timeout: 5 - status_code: [200, 401, 403] - register: hue_clipv2_response - failed_when: false - - - name: Build Hue adoption readiness summary - ansible.builtin.set_fact: - hue_adoption_summary: - mac: "{{ hue_mac }}" - ip: "{{ hue_ip }}" - hostname: "{{ 
hue_hostname }}" - local_api_v1: - reachable: "{{ hue_config_response.status | default('n/a') == 200 }}" - http_status: "{{ hue_config_response.status | default('n/a') }}" - bridge_id: "{{ hue_config_response.json.bridgeid | default('n/a') }}" - model_id: "{{ hue_config_response.json.modelid | default('n/a') }}" - sw_version: "{{ hue_config_response.json.swversion | default('n/a') }}" - api_version: "{{ hue_config_response.json.apiversion | default('n/a') }}" - name: "{{ hue_config_response.json.name | default('n/a') }}" - local_api_v2_clip: - reachable: "{{ hue_clipv2_response.status | default('n/a') in [200, 401, 403] }}" - http_status: "{{ hue_clipv2_response.status | default('n/a') }}" - note: >- - {{ - 'CLIP v2 available (needs app_key header for full access)' if hue_clipv2_response.status | default(0) in [401, 403] - else 'CLIP v2 accessible' if hue_clipv2_response.status | default(0) == 200 - else 'CLIP v2 not reachable' - }} - ansible_adoption: - method: "ansible.builtin.uri (REST API — no SSH required)" - auth_required: "Hue app_key token (generate by pressing bridge button + POST /api)" - next_step: >- - {{ - 'Bridge is reachable. Press the physical button on the bridge, then run omada_adopt_hue_hub.yml to generate an app_key.' - if (hue_config_response.status | default(0) == 200 or hue_clipv2_response.status | default(0) in [200, 401, 403]) - else 'Bridge API not responding. Check firewall rules for VLAN ' ~ hue_hostname ~ ' → Ansible control node.' 
- }} - - - name: Print Hue adoption readiness summary - ansible.builtin.debug: - var: hue_adoption_summary diff --git a/ansible/archive/playbooks/network/omada_health_inventory.yml b/ansible/archive/playbooks/network/omada_health_inventory.yml deleted file mode 100644 index 5a49abe..0000000 --- a/ansible/archive/playbooks/network/omada_health_inventory.yml +++ /dev/null @@ -1,162 +0,0 @@ ---- -- name: Omada read-only health inventory - hosts: localhost - connection: local - gather_facts: false - - vars_files: - - "../../group_vars/all.yml" - - "../../group_vars/vault/all.yml" - - vars: - omada_validate_certs: false - omada_page_size: 200 - - tasks: - - name: Verify required Omada variables are present - ansible.builtin.assert: - that: - - omada_base_url is defined - - omada_id is defined - - omada_client_id is defined - - omada_client_secret is defined - - omada_base_url | length > 0 - - omada_id | length > 0 - - omada_client_id | length > 0 - - omada_client_secret | length > 0 - fail_msg: "Missing Omada variables. Check group_vars/all.yml and group_vars/vault/all.yml." 
- - - name: Request Omada access token (client credentials) - ansible.builtin.uri: - url: "{{ omada_base_url }}/openapi/authorize/token?grant_type=client_credentials" - method: POST - validate_certs: "{{ omada_validate_certs }}" - headers: - Content-Type: application/json - body_format: json - body: - omadacId: "{{ omada_id }}" - client_id: "{{ omada_client_id }}" - client_secret: "{{ omada_client_secret }}" - return_content: true - status_code: 200 - register: omada_token_response - no_log: true - failed_when: - - omada_token_response.json is not defined - - omada_token_response.json.errorCode | default(-1) != 0 - - - name: Save access token from auth response - ansible.builtin.set_fact: - omada_access_token: "{{ omada_token_response.json.result.accessToken }}" - no_log: true - - - name: Query Omada sites - ansible.builtin.uri: - url: "{{ omada_base_url }}/openapi/v1/{{ omada_id }}/sites?page=1&pageSize={{ omada_page_size }}" - method: GET - validate_certs: "{{ omada_validate_certs }}" - headers: - Content-Type: application/json - Authorization: "AccessToken={{ omada_access_token }}" - return_content: true - status_code: 200 - register: omada_sites_response - no_log: true - failed_when: - - omada_sites_response.json is not defined - - omada_sites_response.json.errorCode | default(-1) != 0 - - - name: Save site list - ansible.builtin.set_fact: - omada_sites: "{{ omada_sites_response.json.result.data | default([]) }}" - - - name: Gather device summary per site - ansible.builtin.uri: - url: "{{ omada_base_url }}/openapi/v1/{{ omada_id }}/sites/{{ item.siteId }}/devices?page=1&pageSize={{ omada_page_size }}" - method: GET - validate_certs: "{{ omada_validate_certs }}" - headers: - Content-Type: application/json - Authorization: "AccessToken={{ omada_access_token }}" - return_content: true - status_code: 200 - loop: "{{ omada_sites }}" - loop_control: - label: "{{ item.name | default(item.siteId) }}" - register: omada_devices_by_site - no_log: true - failed_when: false 
- - - name: Gather client summary per site - ansible.builtin.uri: - url: "{{ omada_base_url }}/openapi/v1/{{ omada_id }}/sites/{{ item.siteId }}/clients?page=1&pageSize={{ omada_page_size }}" - method: GET - validate_certs: "{{ omada_validate_certs }}" - headers: - Content-Type: application/json - Authorization: "AccessToken={{ omada_access_token }}" - return_content: true - status_code: 200 - loop: "{{ omada_sites }}" - loop_control: - label: "{{ item.name | default(item.siteId) }}" - register: omada_clients_by_site - no_log: true - failed_when: false - - - name: Gather event summary per site - ansible.builtin.uri: - url: "{{ omada_base_url }}/openapi/v1/{{ omada_id }}/sites/{{ item.siteId }}/events?page=1&pageSize=50" - method: GET - validate_certs: "{{ omada_validate_certs }}" - headers: - Content-Type: application/json - Authorization: "AccessToken={{ omada_access_token }}" - return_content: true - status_code: 200 - loop: "{{ omada_sites }}" - loop_control: - label: "{{ item.name | default(item.siteId) }}" - register: omada_events_by_site - no_log: true - failed_when: false - - - name: Build human-readable health summary - ansible.builtin.set_fact: - omada_health_summary: "{{ omada_health_summary | default([]) + [ { - 'site_name': item.name | default(item.siteId), - 'site_id': item.siteId, - 'devices_total': ( - (omada_devices_by_site.results[ansible_loop.index0].json.result.totalRows | default(0)) - if (omada_devices_by_site.results[ansible_loop.index0].json is defined and omada_devices_by_site.results[ansible_loop.index0].json.errorCode | default(-1) == 0) - else 'n/a' - ), - 'devices_http_status': omada_devices_by_site.results[ansible_loop.index0].status | default('n/a'), - 'devices_error_code': omada_devices_by_site.results[ansible_loop.index0].json.errorCode | default('n/a'), - 'devices_error_msg': omada_devices_by_site.results[ansible_loop.index0].json.msg | default('n/a'), - 'clients_total': ( - 
(omada_clients_by_site.results[ansible_loop.index0].json.result.totalRows | default(0)) - if (omada_clients_by_site.results[ansible_loop.index0].json is defined and omada_clients_by_site.results[ansible_loop.index0].json.errorCode | default(-1) == 0) - else 'n/a' - ), - 'clients_http_status': omada_clients_by_site.results[ansible_loop.index0].status | default('n/a'), - 'clients_error_code': omada_clients_by_site.results[ansible_loop.index0].json.errorCode | default('n/a'), - 'clients_error_msg': omada_clients_by_site.results[ansible_loop.index0].json.msg | default('n/a'), - 'events_page_rows': ( - (omada_events_by_site.results[ansible_loop.index0].json.result.currentSize | default(0)) - if (omada_events_by_site.results[ansible_loop.index0].json is defined and omada_events_by_site.results[ansible_loop.index0].json.errorCode | default(-1) == 0) - else 'n/a' - ), - 'events_http_status': omada_events_by_site.results[ansible_loop.index0].status | default('n/a'), - 'events_error_code': omada_events_by_site.results[ansible_loop.index0].json.errorCode | default('n/a'), - 'events_error_msg': omada_events_by_site.results[ansible_loop.index0].json.msg | default('n/a') - } ] }}" - loop: "{{ omada_sites }}" - loop_control: - extended: true - label: "{{ item.name | default(item.siteId) }}" - - - name: Print Omada health inventory summary - ansible.builtin.debug: - var: omada_health_summary \ No newline at end of file diff --git a/ansible/archive/playbooks/onboarding/ai_workstation.yml b/ansible/archive/playbooks/onboarding/ai_workstation.yml deleted file mode 100644 index 096afdd..0000000 --- a/ansible/archive/playbooks/onboarding/ai_workstation.yml +++ /dev/null @@ -1,348 +0,0 @@ ---- -# ============================================================================ -# AI WORKSTATION BOOTSTRAP PLAYBOOK -# ============================================================================ -# Purpose: Prepare fresh Ubuntu installations for AI/ML workloads -# Targets: ai_grid inventory 
group (NVIDIA GPU-equipped machines) -# ============================================================================ - -- name: Bootstrap AI workstation (GPU + Ollama + Storage) - hosts: ai_grid - become: true - - vars: - # Ollama network configuration - ollama_host: "0.0.0.0:11434" # Listen on all interfaces - ollama_port: 11434 - - # Essential packages for AI workstations - essential_packages: - - build-essential # Compiler and build tools - - git # Version control - - curl # HTTP client - - wget # Download utility - - htop # System monitoring - - nvtop # GPU monitoring (NVIDIA) - - python3-pip # Python package manager - - python3-venv # Python virtual environments - - net-tools # Network utilities - - nfs-common # NFS client support - - tasks: - # ======================================================================== - # PHASE 1: SYSTEM BASELINE - # ======================================================================== - - - name: Update apt cache - ansible.builtin.apt: - update_cache: true - cache_valid_time: 3600 - tags: [baseline, update] - - - name: Upgrade all installed packages - ansible.builtin.apt: - upgrade: dist - autoremove: true - autoclean: true - register: upgrade_result - tags: [baseline, update] - - # ======================================================================== - # PHASE 2: ESSENTIAL UTILITIES - # ======================================================================== - - - name: Install essential utilities and development tools - ansible.builtin.apt: - name: "{{ essential_packages }}" - state: present - tags: [baseline, utilities] - - # ======================================================================== - # PHASE 2.5: IDENTITY MANAGEMENT - # ======================================================================== - # Purpose: Ensure the 'chester' admin user exists with proper access - # Why: Allows the playbook to bootstrap from a fresh Ubuntu install - # without manual user creation - # 
======================================================================== - - - name: Create chester identity and access - block: - - name: Install sudo package - ansible.builtin.apt: - name: sudo - state: present - update_cache: false - - - name: Ensure chester group exists - ansible.builtin.group: - name: chester - state: present - - - name: Create chester user with sudo access - ansible.builtin.user: - name: chester - group: chester - groups: sudo - shell: /bin/bash - password: '!' - password_lock: true - comment: "Homelab Administrator" - - - name: Deploy SSH key to chester user - ansible.posix.authorized_key: - user: chester - state: present - key: "{{ lookup('file', '~/.ssh/id_ed25519.pub') }}" - - - name: Allow chester to use sudo without password - ansible.builtin.copy: - dest: /etc/sudoers.d/chester - content: "chester ALL=(ALL) NOPASSWD: ALL\n" - mode: '0440' - owner: root - group: root - validate: '/usr/sbin/visudo -cf %s' - - tags: [identity, baseline] - - # ======================================================================== - # PHASE 3: NVIDIA DRIVERS - # ======================================================================== - - - name: Install ubuntu-drivers-common package - ansible.builtin.apt: - name: ubuntu-drivers-common - state: present - tags: [gpu, nvidia] - - - name: Detect and install recommended NVIDIA drivers - ansible.builtin.command: ubuntu-drivers autoinstall - args: - creates: /usr/bin/nvidia-smi - register: nvidia_install - changed_when: false - tags: [gpu, nvidia] - - - name: Verify NVIDIA driver installation - ansible.builtin.command: nvidia-smi - register: nvidia_check - failed_when: false - changed_when: false - tags: [gpu, nvidia, verify] - - - name: Display NVIDIA driver status - ansible.builtin.debug: - msg: "{{ nvidia_check.stdout_lines }}" - when: nvidia_check.rc == 0 - tags: [gpu, nvidia, verify] - - # ======================================================================== - # PHASE 3.5: LAPTOP TUNING & SAFETY - # 
======================================================================== - - - name: Configure GRUB for ASPM & Intel hybrid cores - ansible.builtin.lineinfile: - path: /etc/default/grub - regexp: '^GRUB_CMDLINE_LINUX_DEFAULT=' - line: 'GRUB_CMDLINE_LINUX_DEFAULT="quiet pcie_aspm=force intel_pstate=passive"' - notify: Update Grub - tags: [laptop, tuning] - - - name: Configure logind to ignore lid-close events - ansible.builtin.lineinfile: - path: /etc/systemd/logind.conf - regexp: "^#?{{ item.key }}=" - line: "{{ item.key }}={{ item.value }}" - loop: - - { key: "HandleLidSwitch", value: "ignore" } - - { key: "HandleLidSwitchExternalPower", value: "ignore" } - notify: Restart Logind - tags: [laptop, safety] - - - name: Mask sleep targets to keep workloads running - ansible.builtin.systemd: - name: "{{ item }}" - masked: true - loop: - - sleep.target - - suspend.target - - hibernate.target - - hybrid-sleep.target - tags: [laptop, safety] - - - name: Disable swap to protect NVMe under sustained load - ansible.builtin.shell: | - swapoff -a - sed -i '/ swap / s/^\(.*\)$/#\1/g' /etc/fstab - when: ansible_swaptotal_mb > 0 - changed_when: false - tags: [storage, tuning] - - - name: Check Intel Thread Director support messages - ansible.builtin.shell: "dmesg | grep -i 'Hardware Feedback Interface'" - register: hfi_check - failed_when: false - changed_when: false - tags: [verify, laptop] - - # ======================================================================== - # PHASE 4: OLLAMA INSTALLATION - # ======================================================================== - - - name: Check if Ollama is already installed - ansible.builtin.stat: - path: /usr/local/bin/ollama - register: ollama_binary - tags: [ollama] - - - name: Download Ollama installation script - ansible.builtin.get_url: - url: https://ollama.ai/install.sh - dest: /tmp/ollama-install.sh - mode: '0755' - when: not ollama_binary.stat.exists - tags: [ollama] - - - name: Install Ollama - 
ansible.builtin.command: /tmp/ollama-install.sh - when: not ollama_binary.stat.exists - changed_when: false - tags: [ollama] - - - name: Create systemd override directory for Ollama - ansible.builtin.file: - path: /etc/systemd/system/ollama.service.d - state: directory - mode: '0755' - tags: [ollama, network] - - - name: Configure Ollama to listen on all network interfaces - ansible.builtin.copy: - dest: /etc/systemd/system/ollama.service.d/override.conf - content: | - [Service] - Environment="OLLAMA_HOST={{ ollama_host }}" - mode: '0644' - notify: Restart ollama - tags: [ollama, network] - - - name: Ensure Ollama service is enabled and started - ansible.builtin.systemd: - name: ollama - state: started - enabled: true - daemon_reload: true - tags: [ollama] - - - name: Apply pending Ollama handler changes before readiness check - ansible.builtin.meta: flush_handlers - tags: [ollama] - - - name: Restart Ollama to apply network binding - ansible.builtin.systemd: - name: ollama - state: restarted - daemon_reload: true - tags: [ollama] - - - name: Wait for Ollama service to be ready - ansible.builtin.wait_for: - host: "{{ ansible_host }}" - port: "{{ ollama_port }}" - delay: 5 - timeout: 30 - tags: [ollama, verify] - - # ======================================================================== - # PHASE 5: NFS STORAGE MOUNTS (TODO) - # ======================================================================== - # Instructions: - # 1. Define NFS server variables in group_vars/ai_grid.yml: - # nfs_server: "10.0.0.249" - # nfs_export: "/volume1/ai-datasets" - # nfs_mount_point: "/mnt/ai-datasets" - # - # 2. 
Uncomment the tasks below and customize paths - # ======================================================================== - - # - name: Create NFS mount point directory - # ansible.builtin.file: - # path: "{{ nfs_mount_point }}" - # state: directory - # owner: "{{ ansible_user }}" - # group: "{{ ansible_user }}" - # mode: '0755' - # tags: [storage, nfs] - # - # - name: Mount NFS share for AI datasets - # ansible.posix.mount: - # path: "{{ nfs_mount_point }}" - # src: "{{ nfs_server }}:{{ nfs_export }}" - # fstype: nfs - # opts: defaults,nfsvers=4 - # state: mounted - # tags: [storage, nfs] - # - # - name: Verify NFS mount is accessible - # ansible.builtin.command: "ls -la {{ nfs_mount_point }}" - # register: nfs_verify - # changed_when: false - # tags: [storage, nfs, verify] - - # ======================================================================== - # PHASE 6: POST-INSTALL VERIFICATION - # ======================================================================== - - - name: Check if system reboot is required - ansible.builtin.stat: - path: /var/run/reboot-required - register: reboot_required - tags: [verify, reboot] - - - name: Display reboot notification if needed - ansible.builtin.debug: - msg: | - ╔════════════════════════════════════════════════════════════════╗ - ║ WARNING: System reboot is required to complete installation ║ - ║ Reason: Kernel or driver updates ║ - ║ Action: Please reboot this host manually ║ - ╚════════════════════════════════════════════════════════════════╝ - when: reboot_required.stat.exists - tags: [verify, reboot] - - - name: Display bootstrap completion summary - ansible.builtin.debug: - msg: - - "╔════════════════════════════════════════════════════════════════╗" - - "║ AI Workstation Bootstrap Complete!
║" - - "╠════════════════════════════════════════════════════════════════╣" - - "║ ✓ System updated and essential utilities installed ║" - - "║ ✓ NVIDIA drivers installed (verify with nvidia-smi) ║" - - "║ ✓ Ollama installed and network-accessible ║" - - "║ → Ollama API: http://{{ ansible_host }}:{{ ollama_port }} ║" - - "╠════════════════════════════════════════════════════════════════╣" - - "║ Next Steps: ║" - - "║ 1. Reboot if required (check above) ║" - - "║ 2. Pull models: ollama pull llama3.1:8b ║" - - "║ 3. Configure NFS mounts (see Phase 5 in playbook) ║" - - "╚════════════════════════════════════════════════════════════════╝" - tags: [verify] - - # ========================================================================== - # HANDLERS - # ========================================================================== - handlers: - - name: Restart ollama - ansible.builtin.systemd: - name: ollama - state: restarted - daemon_reload: true - - - name: Update Grub - ansible.builtin.command: update-grub - changed_when: false - - - name: Restart Logind - ansible.builtin.systemd: - name: systemd-logind - state: restarted diff --git a/ansible/archive/playbooks/onboarding/generic_host.yml b/ansible/archive/playbooks/onboarding/generic_host.yml deleted file mode 100644 index f9c0cc5..0000000 --- a/ansible/archive/playbooks/onboarding/generic_host.yml +++ /dev/null @@ -1,343 +0,0 @@ ---- -# playbooks/onboarding/generic_host.yml -# Bootstrap non-Proxmox hosts for Ansible management -# Supports both existing production hosts and net-new hosts via onboard_profile.
- -- name: Onboard non-Proxmox host to Ansible management - hosts: "{{ target_host | default('onboarding_target_undefined') }}" - gather_facts: false - vars: - ansible_user: "{{ onboard_user | default(lab_ansible_user | default('chester')) }}" - local_ssh_key: "{{ lookup('env', 'HOME') }}/.ssh/id_ed25519.pub" - onboard_profile: "{{ onboarding_profile }}" - onboard_enable_security_hardening: "{{ onboard_profile == 'new' }}" - required_packages: - - python3 - - python3-apt - - sudo - - curl - - git - - vim - - htop - - tasks: - - name: Validate required runtime variables - ansible.builtin.assert: - that: - - target_host is defined - - target_host | length > 0 - - onboarding_profile is defined - - onboarding_profile | length > 0 - - target_host not in ['all', '*', 'ubuntu_lab'] - fail_msg: "Invalid onboarding scope. Use explicit non-broad target_host (for example docker_hosts) and onboarding_profile. Example: -e 'target_host=docker_hosts onboarding_profile=existing'" - success_msg: "Required runtime variables provided." - run_once: true - delegate_to: localhost - tags: ['connectivity', 'setup'] - - - name: Validate onboarding profile - ansible.builtin.assert: - that: - - onboard_profile in ['new', 'existing'] - fail_msg: "Invalid onboarding_profile='{{ onboard_profile }}'. Use 'new' or 'existing'." - success_msg: "Onboarding profile '{{ onboard_profile }}' selected." - tags: ['connectivity', 'setup'] - - # ======================================== - # SECTION 1: Connectivity Test - # ======================================== - - name: Initial Connectivity Check - block: - - name: Test raw connection (no Python required) - ansible.builtin.raw: echo "Connection successful" - register: connection_test - changed_when: false - - - name: Display connection status - ansible.builtin.debug: - msg: "✅ Successfully connected to {{ inventory_hostname }}" - - rescue: - - name: Connection failed - ansible.builtin.fail: - msg: "❌ Cannot connect to {{ inventory_hostname }}.
Check SSH credentials and network connectivity." - - tags: ['connectivity', 'test'] - - # ======================================== - # SECTION 2: SSH Key Setup - # ======================================== - - name: Configure SSH Key Authentication - block: - - name: Check if .ssh directory exists - ansible.builtin.raw: test -d ~/.ssh && echo "exists" || echo "missing" - register: ssh_dir_check - changed_when: false - - - name: Create .ssh directory if missing - ansible.builtin.raw: mkdir -p ~/.ssh && chmod 700 ~/.ssh - when: "'missing' in ssh_dir_check.stdout" - - - name: Read local SSH public key - ansible.builtin.set_fact: - ssh_public_key: "{{ lookup('file', local_ssh_key) }}" - delegate_to: localhost - - - name: Check if SSH key is already authorized - ansible.builtin.raw: grep -F "{{ ssh_public_key.split()[1][:30] }}" ~/.ssh/authorized_keys - register: key_check - failed_when: false - changed_when: false - - - name: Add SSH public key to authorized_keys - ansible.builtin.raw: | - echo "{{ ssh_public_key }}" >> ~/.ssh/authorized_keys - chmod 600 ~/.ssh/authorized_keys - when: key_check.rc != 0 - - - name: Verify SSH key authentication - ansible.builtin.ping: - vars: - ansible_ssh_pass: "" - ignore_errors: true - register: ssh_key_test - - - name: Display SSH key status - ansible.builtin.debug: - msg: "✅ SSH key authentication configured successfully" - when: ssh_key_test is succeeded - - tags: ['ssh', 'setup'] - - # ======================================== - # SECTION 3: Python & Prerequisites - # ======================================== - - name: Install Python and Prerequisites - block: - - name: Check if Python3 is installed - ansible.builtin.raw: which python3 || which python - register: python_check - failed_when: false - changed_when: false - - - name: Install Python3 if missing (Debian/Ubuntu) - ansible.builtin.raw: | - export DEBIAN_FRONTEND=noninteractive - sudo apt-get update -qq - sudo apt-get install -y python3 python3-apt - when: python_check.rc
!= 0 - args: - executable: /bin/bash - - - name: Gather facts (now that Python is available) - ansible.builtin.setup: - - - name: Display system information - ansible.builtin.debug: - msg: - - "OS: {{ ansible_distribution }} {{ ansible_distribution_version }}" - - "Hostname: {{ ansible_hostname }}" - - "Architecture: {{ ansible_architecture }}" - - "Python: {{ ansible_python_version }}" - - tags: ['python', 'setup', 'prerequisites'] - - # ======================================== - # SECTION 4: Passwordless Sudo - # ======================================== - - name: Configure Passwordless Sudo - become: true - block: - - name: Check current sudo configuration - ansible.builtin.command: sudo -n true - register: sudo_check - failed_when: false - changed_when: false - become: false - - - name: Create sudoers.d entry for passwordless sudo - ansible.builtin.copy: - content: "{{ ansible_user }} ALL=(ALL) NOPASSWD:ALL\n" - dest: "/etc/sudoers.d/{{ ansible_user }}" - mode: '0440' - owner: root - group: root - validate: 'visudo -cf %s' - when: sudo_check.rc != 0 - - - name: Verify passwordless sudo - ansible.builtin.command: sudo -n whoami - register: sudo_verify - changed_when: false - become: false - - - name: Display sudo status - ansible.builtin.debug: - msg: "✅ Passwordless sudo configured for {{ ansible_user }}" - when: sudo_verify.stdout == "root" - - tags: ['sudo', 'setup'] - - # ======================================== - # SECTION 5: System Packages - # ======================================== - - name: Install Essential Packages - become: true - block: - - name: Update apt cache - ansible.builtin.apt: - update_cache: true - cache_valid_time: 3600 - when: ansible_os_family == "Debian" - - - name: Install required packages - ansible.builtin.apt: - name: "{{ required_packages }}" - state: present - when: ansible_os_family == "Debian" - - - name: Display installed packages - ansible.builtin.debug: - msg: "✅ Essential packages installed" - - tags: ['packages',
'setup'] - - # ======================================== - # SECTION 6: Basic Security - # ======================================== - - name: Apply Basic Security Hardening - become: true - when: onboard_enable_security_hardening | bool - block: - - name: Disable root SSH login - ansible.builtin.lineinfile: - path: /etc/ssh/sshd_config - regexp: '^#?PermitRootLogin' - line: 'PermitRootLogin no' - state: present - validate: 'sshd -t -f %s' - notify: Restart SSH - - - name: Disable password authentication (SSH keys only) - ansible.builtin.lineinfile: - path: /etc/ssh/sshd_config - regexp: '^#?PasswordAuthentication' - line: 'PasswordAuthentication no' - state: present - validate: 'sshd -t -f %s' - notify: Restart SSH - when: ssh_key_test is succeeded - - - name: Enable UFW firewall (allow SSH) - block: - - name: Check if ufw is installed - ansible.builtin.command: command -v ufw - register: ufw_check - changed_when: false - failed_when: false - - - name: Allow SSH via UFW when available - community.general.ufw: - rule: allow - port: '22' - proto: tcp - when: ufw_check.rc == 0 - - - name: Skip UFW configuration when ufw is unavailable - ansible.builtin.debug: - msg: "UFW not installed; skipping firewall onboarding step." - when: ufw_check.rc != 0 - when: ansible_os_family == "Debian" - - tags: ['security', 'hardening'] - - - name: Security hardening skipped notice - ansible.builtin.debug: - msg: - - "⚠️ Security hardening skipped (existing profile)." - - " Set -e onboarding_profile=new to enforce SSH hardening controls." 
- when: not (onboard_enable_security_hardening | bool) - tags: ['security', 'hardening'] - - # ======================================== - # SECTION 7: Final Validation - # ======================================== - - name: Validate Onboarding - block: - - name: Test passwordless authentication - ansible.builtin.ping: - vars: - ansible_ssh_pass: "" - ansible_become_pass: "" - - - name: Test passwordless sudo - ansible.builtin.command: sudo -n whoami - register: final_sudo_test - changed_when: false - - - name: Gather final system state - ansible.builtin.setup: - - - name: Display onboarding summary - ansible.builtin.debug: - msg: - - "════════════════════════════════════════════════" - - "✅ HOST ONBOARDING COMPLETE" - - "════════════════════════════════════════════════" - - "Profile: {{ onboard_profile }}" - - "Host: {{ inventory_hostname }} ({{ ansible_hostname }})" - - "IP: {{ ansible_default_ipv4.address }}" - - "OS: {{ ansible_distribution }} {{ ansible_distribution_version }}" - - "Python: {{ ansible_python_version }}" - - "SSH Key Auth: ✅ Enabled" - - "Passwordless Sudo: ✅ Enabled" - - "Security Hardening: {{ '✅ Applied' if (onboard_enable_security_hardening | bool) else '⚠️ Skipped' }}" - - "Ansible User: {{ ansible_user }}" - - "════════════════════════════════════════════════" - - "Next Steps:" - - " • Add host to appropriate inventory groups" - - " • Run role-specific playbooks" - - " • Configure host-specific variables" - - " • (Swarm nodes) Run: ansible-playbook playbooks/docker/bootstrap_swarm.yml --limit {{ inventory_hostname }}" - - "════════════════════════════════════════════════" - - tags: ['validate', 'summary'] - - # ======================================== - # SECTION 8: Disk grow (Swarm nodes only) - # WHY before storage mounts: root filesystem must have space before - # nfs-common installs or Docker pulls images.
Cloud templates ship - # with a ~2.4G root; Proxmox resizes the virtual disk via qm resize - # but the in-guest partition and filesystem need explicit expansion. - # WHY conditional: non-Swarm hosts manage their own disk sizing. - # ======================================== - - name: Grow root disk partition and filesystem (Swarm nodes only) - ansible.builtin.include_role: - name: disk_grow - apply: - become: true - when: inventory_hostname in (groups['swarm_hosts'] | default([])) - tags: ['disk'] - - # ======================================== - # SECTION 9: Storage mounts (Swarm nodes only) - # WHY here: NFS mounts must exist before any Swarm stack deploy runs. - # Wiring this into onboarding ensures a freshly provisioned Swarm VM - # is storage-ready in a single playbook run with no manual follow-up. - # WHY conditional: non-Swarm hosts (Proxmox, watchtower, heimdall) do not - # need /mnt/homelab or /mnt/media; skip silently for those targets. - # ======================================== - - name: Configure NFS storage mounts (Swarm nodes only) - ansible.builtin.include_role: - name: storage_mounts - apply: - become: true - when: inventory_hostname in (groups['swarm_hosts'] | default([])) - tags: ['storage'] - - handlers: - - name: Restart SSH - ansible.builtin.systemd: - name: "{{ 'ssh' if ansible_os_family == 'Debian' else 'sshd' }}" - state: restarted - become: true diff --git a/ansible/archive/playbooks/onboarding/generic_host_conversational.yml b/ansible/archive/playbooks/onboarding/generic_host_conversational.yml deleted file mode 100644 index c416fa6..0000000 --- a/ansible/archive/playbooks/onboarding/generic_host_conversational.yml +++ /dev/null @@ -1,107 +0,0 @@ ---- -- name: Intelligent VM Provisioner & Onboarder - hosts: localhost - gather_facts: false - - vars: - # 1. THE GOLD IMAGE PRESETS - default_spec: - cpu: 2 - ram: 4096 - disk: "40GB" - os: "ubuntu-22.04" - - # 2. 
THE RESOURCE GUARD LIMITS - max_safe_ram: 16384 # 16GB Safety Ceiling - max_safe_cpu: 8 - - vars_prompt: - - name: target_host_ip - prompt: "Step 1: Enter the target IP address" - private: false - - - name: target_hostname - prompt: "Step 2: Enter the desired Hostname (e.g., prod-web-01)" - private: false - - tasks: - # ======================================== - # PHASE 1: CONVERSATIONAL OVERRIDES - # ======================================== - - name: Display Target Identity - ansible.builtin.debug: - msg: "Preparing to build [ {{ target_hostname }} ] at [ {{ target_host_ip }} ]" - - - name: Prompt for Modification - ansible.builtin.pause: - prompt: "Current Default: {{ default_spec.cpu }} CPU / {{ default_spec.ram }}MB RAM. Modify? (y/n)" - register: modify_request - - - name: Interactive Override Block - block: - - name: Set New CPU - ansible.builtin.pause: - prompt: "Enter CPU count" - register: user_cpu - - - name: Set New RAM - ansible.builtin.pause: - prompt: "Enter RAM in MB" - register: user_ram - - - name: Consolidate Final Specs - ansible.builtin.set_fact: - final_spec: - cpu: "{{ user_cpu.user_input | default(default_spec.cpu, true) | int }}" - ram: "{{ user_ram.user_input | default(default_spec.ram, true) | int }}" - when: modify_request.user_input | lower == 'y' - - - name: Default Fallback - ansible.builtin.set_fact: - final_spec: - cpu: "{{ default_spec.cpu | int }}" - ram: "{{ default_spec.ram | int }}" - when: modify_request.user_input | lower != 'y' - - # ======================================== - # PHASE 2: THE RESOURCE GUARD (VALIDATION) - # ======================================== - - name: Validate Resource Request - ansible.builtin.assert: - that: - - final_spec.ram <= max_safe_ram - - final_spec.cpu <= max_safe_cpu - fail_msg: "❌ RESOURCE OVERLOAD: Requested {{ final_spec.ram }}MB RAM exceeds the safety limit of {{ max_safe_ram }}MB." - success_msg: "✅ Resource request validated within safety parameters."
- - - name: Final Provisioning Gate - ansible.builtin.pause: - prompt: | - CONFIRMATION REQUIRED: - ID: {{ target_hostname }} ({{ target_host_ip }}) - SPEC: {{ final_spec.cpu }} vCPU / {{ final_spec.ram }} MB RAM - - Press ENTER to deploy, or Ctrl+C to cancel. - - # ======================================== - # PHASE 3: EXECUTION (DELEGATED TO REMOTE) - # ======================================== - - name: Apply Identity and Config - delegate_to: "{{ target_host_ip }}" - become: true - block: - - name: Set System Hostname - ansible.builtin.hostname: - name: "{{ target_hostname }}" - - - name: Update /etc/hosts - ansible.builtin.lineinfile: - path: /etc/hosts - regexp: '^127.0.1.1' - line: "127.0.1.1 {{ target_hostname }}" - - - name: Install Base Packages - ansible.builtin.apt: - name: [python3, curl, git, htop] - state: present - update_cache: yes \ No newline at end of file diff --git a/ansible/archive/playbooks/onboarding/proxmox_host.yml b/ansible/archive/playbooks/onboarding/proxmox_host.yml deleted file mode 100644 index 2978baf..0000000 --- a/ansible/archive/playbooks/onboarding/proxmox_host.yml +++ /dev/null @@ -1,331 +0,0 @@ ---- -# playbooks/onboarding/proxmox_onboarding.yml -# Complete Proxmox host onboarding for 12th Gen Intel laptops -# -# What this does: -# 1. Creates the operational user (lab_ansible_user) with SSH key auth and passwordless sudo -# 2. Configures Proxmox repos (removes enterprise, adds no-subscription) -# 3. Applies 12th Gen Intel hybrid core optimizations (intel_pstate=passive, pcie_aspm=force) -# 4. Hardens laptop for headless operation (disables lid-close suspend) -# 5. Disables swap to protect NVMe lifespan -# 6. Removes Proxmox subscription nag from web/mobile UI -# 7. Optionally disables HA/Corosync for standalone-only nodes -# 8. 
Runs dist-upgrade to ensure all packages current -# -# Prerequisites: -# - SSH access to Proxmox host as root (with SSH keys) -# - Host defined in inventory under [proxmox_cluster] -# - SSH public key at ~/.ssh/id_ed25519_homelab.pub on control machine -# -# Usage: -# # All hosts in proxmox_cluster: -# ansible-playbook -i inventory/hosts.ini playbooks/onboarding/proxmox_onboarding.yml -# -# # Single host: -# ansible-playbook -i inventory/hosts.ini playbooks/onboarding/proxmox_onboarding.yml --limit pve01 -# -# # New host without SSH keys (first run only): -# ansible-playbook -i inventory/hosts.ini playbooks/onboarding/proxmox_onboarding.yml \ -# --limit pve01 -e "ansible_user=root" --ask-pass -# -# # Standalone prep mode (default): -# ansible-playbook -i inventory/hosts.ini playbooks/onboarding/proxmox_onboarding.yml --limit pve01 -# -# # Cluster-intent prep (skip disabling HA/Corosync): -# ansible-playbook -i inventory/hosts.ini playbooks/onboarding/proxmox_onboarding.yml --limit pve01 -e standalone_mode=false -# -# ───────────────────────────────────────────────────────────────────────────── -# This playbook is for DAY-0 ONBOARDING ONLY (first-time provisioning). -# It runs dist-upgrade and may reboot. Do not use for routine enforcement. -# -# For ongoing drift enforcement: playbooks/proxmox/pve_baseline.yml -# For rolling package updates: playbooks/proxmox/pve_update.yml -# For cross-node consistency audit: playbooks/proxmox/pve_audit.yml -# ───────────────────────────────────────────────────────────────────────────── -# -# After first run, update inventory to use lab_ansible_user (see group_vars/all.yml): -# [proxmox_cluster:vars] -# ansible_user= -# ansible_become=true - -- name: Proxmox host onboarding and laptop hardening - hosts: proxmox_cluster - become: true - - vars: - is_laptop: true - # standalone_mode controls whether pve-ha-lrm, pve-ha-crm, and corosync are stopped. - # DEFAULT IS FALSE - pve01/pve02/pve03 always run as a 3-node cluster.
- # Only set true with -e standalone_mode=true for a truly isolated single-node PVE install. - standalone_mode: false - # Operational user to create. References group_vars/all.yml; override with -e lab_ansible_user=otheruser. - lab_user: "{{ lab_ansible_user | default('chester') }}" - controller_ssh_pubkey_candidates: - - "{{ lookup('env', 'HOME') }}/.ssh/id_ed25519_homelab.pub" - - "{{ lookup('env', 'HOME') }}/.ssh/id_ed25519.pub" - - tasks: - - name: "0. Identity Management: Create User '{{ lab_user }}'" - block: - - name: "Install sudo package" - ansible.builtin.apt: - name: sudo - state: present - update_cache: false - - - name: "Ensure group '{{ lab_user }}' exists" - ansible.builtin.group: - name: "{{ lab_user }}" - state: present - - - name: "Create user '{{ lab_user }}' with sudo access" - ansible.builtin.user: - name: "{{ lab_user }}" - group: "{{ lab_user }}" - groups: sudo - shell: /bin/bash - password: '!' - password_lock: true - - - name: "Locate SSH public key on control machine" - ansible.builtin.set_fact: - controller_ssh_pubkey_path: >- - {{ lookup('ansible.builtin.first_found', {'files': controller_ssh_pubkey_candidates, 'skip': true}) }} - delegate_to: localhost - become: false - - - name: "Fail early if SSH public key is missing" - ansible.builtin.fail: - msg: >- - SSH public key not found on the control machine. 
- Checked: - {{ controller_ssh_pubkey_candidates | join(', ') }} - Generate one with: - ssh-keygen -t ed25519 -f ~/.ssh/id_ed25519 - when: controller_ssh_pubkey_path | default('') | length == 0 - - - name: "Deploy SSH Key to {{ lab_user }} user" - ansible.posix.authorized_key: - user: "{{ lab_user }}" - state: present - key: "{{ lookup('file', controller_ssh_pubkey_path) }}" - - - name: "Allow '{{ lab_user }}' to use sudo without password" - ansible.builtin.copy: - dest: "/etc/sudoers.d/{{ lab_user }}" - content: "{{ lab_user }} ALL=(ALL) NOPASSWD: ALL\n" - mode: '0440' - owner: root - group: root - validate: '/usr/sbin/visudo -cf %s' - - - name: "1. Repository & Package Optimization" - block: - - name: "Check if /etc/apt/sources.list exists" - ansible.builtin.stat: - path: /etc/apt/sources.list - register: apt_sources_list_stat - - - name: "Remove Proxmox enterprise repo files (.list/.sources)" - ansible.builtin.file: - path: "{{ item }}" - state: absent - loop: - - /etc/apt/sources.list.d/pve-enterprise.list - - /etc/apt/sources.list.d/pve-enterprise.sources - - /etc/apt/sources.list.d/ceph.list - - /etc/apt/sources.list.d/ceph.sources - - /etc/apt/sources.list.d/ceph-enterprise.list - - /etc/apt/sources.list.d/ceph-enterprise.sources - - - name: "Remove enterprise.proxmox.com entries from /etc/apt/sources.list" - ansible.builtin.lineinfile: - path: /etc/apt/sources.list - regexp: '^.*enterprise\.proxmox\.com.*$' - state: absent - when: apt_sources_list_stat.stat.exists - - - name: "Add Proxmox no-subscription repository" - ansible.builtin.apt_repository: - repo: "deb http://download.proxmox.com/debian/pve {{ ansible_distribution_release }} pve-no-subscription" - filename: pve-no-subscription - state: present - - - name: "Add Proxmox Ceph no-subscription repository" - ansible.builtin.apt_repository: - repo: "deb http://download.proxmox.com/debian/ceph-squid {{ ansible_distribution_release }} no-subscription" - filename: ceph-no-subscription - state: present - - - 
name: "Ensure Latest Intel Microcode (Required for Hybrid Cores)" - ansible.builtin.apt: - name: [intel-microcode, htop, nvme-cli, lm-sensors] - state: present - update_cache: true - - - name: "Run apt dist-upgrade" - ansible.builtin.apt: - upgrade: dist - update_cache: false - register: dist_upgrade_result - - - name: "Reboot if kernel was updated" - ansible.builtin.reboot: - msg: "Rebooting after kernel upgrade - initiated by Ansible" - reboot_timeout: 300 - when: dist_upgrade_result is changed and - dist_upgrade_result.stdout is search('linux-image') - - - name: "2. Kernel Tuning (12th Gen & Power)" - block: - - name: "Configure GRUB for ASPM & Power Savings" - ansible.builtin.lineinfile: - path: /etc/default/grub - regexp: '^GRUB_CMDLINE_LINUX_DEFAULT=' - line: 'GRUB_CMDLINE_LINUX_DEFAULT="quiet pcie_aspm=force intel_pstate=passive"' - notify: Update Grub - - - name: "3. Laptop Safety: Disable Lid-Close Suspend" - when: is_laptop | default(false) - block: - - name: "Configure logind.conf to ignore lid switch" - ansible.builtin.lineinfile: - path: /etc/systemd/logind.conf - regexp: "^#?{{ item.key }}=" - line: "{{ item.key }}={{ item.value }}" - loop: - - { key: "HandleLidSwitch", value: "ignore" } - - { key: "HandleLidSwitchExternalPower", value: "ignore" } - notify: Restart Logind - - - name: "Mask Sleep/Suspend Targets (Hardware Lock)" - ansible.builtin.systemd: - name: "{{ item }}" - masked: true - loop: - - sleep.target - - suspend.target - - hibernate.target - - hybrid-sleep.target - - - name: "4. Storage & SSD Health" - block: - - name: "Disable Swap (Protect NVMe Lifespan)" - ansible.builtin.shell: | - swapoff -a - sed -i '/ swap / s/^\(.*\)$/#\1/g' /etc/fstab - when: ansible_swaptotal_mb > 0 - changed_when: ansible_swaptotal_mb > 0 - - - name: "5. Intel Thread Director Support Check" - ansible.builtin.shell: "dmesg | grep -i 'Hardware Feedback Interface'" - register: hfi_check - failed_when: false - changed_when: false - - - name: "6.
Proxmox Web UI: Remove Subscription Nag" - block: - - name: "Deploy subscription nag removal script" - ansible.builtin.copy: - dest: /usr/local/bin/pve-remove-nag.sh - owner: root - group: root - mode: '0755' - content: | - #!/bin/sh - WEB_JS=/usr/share/javascript/proxmox-widget-toolkit/proxmoxlib.js - if [ -s "$WEB_JS" ] && ! grep -q NoMoreNagging "$WEB_JS"; then - echo "Patching Web UI nag..." - sed -i -e "/data\.status/ s/!//" -e "/data\.status/ s/active/NoMoreNagging/" "$WEB_JS" - fi - - MOBILE_TPL=/usr/share/pve-yew-mobile-gui/index.html.tpl - MARKER="" - if [ -f "$MOBILE_TPL" ] && ! grep -q "$MARKER" "$MOBILE_TPL"; then - echo "Patching Mobile UI nag..." - printf "%s\n" \ - "$MARKER" \ - "" \ - "" >>"$MOBILE_TPL" - fi - - - name: "Configure dpkg hook to auto-run nag removal after upgrades" - ansible.builtin.copy: - dest: /etc/apt/apt.conf.d/no-nag-script - owner: root - group: root - mode: '0644' - content: | - DPkg::Post-Invoke { "/usr/local/bin/pve-remove-nag.sh"; }; - - - name: "Run nag removal script immediately" - ansible.builtin.command: /usr/local/bin/pve-remove-nag.sh - register: nag_removal_output - changed_when: "'Patching' in nag_removal_output.stdout" - - - name: "Reinstall proxmox-widget-toolkit to ensure nag patches apply" - ansible.builtin.apt: - name: proxmox-widget-toolkit - state: present - register: widget_reinstall - failed_when: false - - - name: "7. 
Standalone Optimization: Disable HA/Corosync Services" - when: standalone_mode | bool - block: - - name: "Stop and disable pve-ha-lrm service" - ansible.builtin.systemd: - name: pve-ha-lrm - state: stopped - enabled: false - failed_when: false - - - name: "Stop and disable pve-ha-crm service" - ansible.builtin.systemd: - name: pve-ha-crm - state: stopped - enabled: false - failed_when: false - - - name: "Stop and disable Corosync service" - ansible.builtin.systemd: - name: corosync - state: stopped - enabled: false - failed_when: false - - handlers: - - name: Update Grub - ansible.builtin.command: update-grub - register: grub_update_result - changed_when: grub_update_result.rc == 0 - - - name: Restart Logind - ansible.builtin.systemd: - name: systemd-logind - state: restarted diff --git a/ansible/archive/playbooks/onboarding/setup_ansible_secrets.yml b/ansible/archive/playbooks/onboarding/setup_ansible_secrets.yml deleted file mode 100644 index cf3c371..0000000 --- a/ansible/archive/playbooks/onboarding/setup_ansible_secrets.yml +++ /dev/null @@ -1,32 +0,0 @@ ---- -# Onboarding playbook: bootstrap Ansible Vault infrastructure for secrets management -# Concept: This is the entry point for beginners to safely set up vault on the control node. -# It runs on localhost (control node) and prepares directories, validates prerequisites, -# and provides guidance for encrypting the first secret. 
-# -# Usage: -# First run (setup only): -# ansible-playbook playbooks/onboarding/setup_ansible_secrets.yml --tags bootstrap -# -# Validation (check infrastructure health): -# ansible-playbook playbooks/onboarding/setup_ansible_secrets.yml --tags validate -# -# With vault password prompts (instead of password file): -# ansible-playbook playbooks/onboarding/setup_ansible_secrets.yml --ask-vault-pass -# -# Example creation (for self-learning): -# ansible-playbook playbooks/onboarding/setup_ansible_secrets.yml --tags example --extra-vars create_example_vault=true - -- name: Bootstrap Ansible Vault for secrets management - hosts: localhost - gather_facts: false - vars: - # Override these to customize vault paths or behavior - # Example: ansible-playbook ... --extra-vars vault_base_dir=/etc/ansible/vault - vault_base_dir: "{{ lookup('env', 'HOME') }}/.ansible/vault" - vault_password_file: "{{ vault_base_dir }}/password" - vault_vars_dir: "{{ playbook_dir }}/../group_vars/vault" - roles: - - secrets_onboarding - tags: - - always diff --git a/ansible/archive/playbooks/onboarding/watchtower_audit.yml b/ansible/archive/playbooks/onboarding/watchtower_audit.yml deleted file mode 100644 index b8b2420..0000000 --- a/ansible/archive/playbooks/onboarding/watchtower_audit.yml +++ /dev/null @@ -1,200 +0,0 @@ ---- -# playbooks/onboarding/watchtower_audit.yml -# Read-only audit for the Ansible control node (Watchtower). -# Safe to schedule. Makes no changes to any host. 
-# -# What this asserts: -# - Kernel / distro / swap / swappiness / bridge netfilter / ip_forward -# - Docker daemon log rotation configured -# - Python venv exists -# - Ansible version meets minimum (>= 2.18.0) -# - SSH private key present -# -# Usage: -# ansible-playbook -i inventory/hosts.ini playbooks/onboarding/watchtower_audit.yml -# -# Output: -# outputs/watchtower_audit_.md (repo root) - -- name: "Play 1: Gather Watchtower state" - hosts: watchtower - become: true - gather_facts: true - - vars: - lab_user: "{{ lab_ansible_user | default('chester') }}" - homelab_root: "/home/{{ lab_user }}/homelab" - venv_path: "{{ homelab_root }}/.venv" - ssh_key_path: "/home/{{ lab_user }}/.ssh/id_ed25519" - min_ansible_version: "2.18.0" - - tasks: - - name: Read sysctl values - ansible.builtin.shell: "sysctl -n {{ item }} 2>/dev/null || echo 0" - register: sysctl_raw - loop: - - vm.swappiness - - net.bridge.bridge-nf-call-iptables - - net.bridge.bridge-nf-call-ip6tables - - net.ipv4.ip_forward - changed_when: false - check_mode: false - - - name: Read Docker daemon.json - ansible.builtin.command: cat /etc/docker/daemon.json - register: daemon_json_content - changed_when: false - failed_when: false - check_mode: false - - - name: Check Python venv exists - ansible.builtin.stat: - path: "{{ venv_path }}/bin/activate" - register: venv_stat - - - name: Get Ansible version from venv - ansible.builtin.command: "{{ venv_path }}/bin/ansible --version" - register: ansible_version_raw - changed_when: false - failed_when: false - check_mode: false - become: false - - - name: Check SSH private key exists - ansible.builtin.stat: - path: "{{ ssh_key_path }}" - register: ssh_key_stat - become: false - - - name: Stash audit facts - ansible.builtin.set_fact: - wt_audit: - kernel: "{{ ansible_kernel }}" - arch: "{{ ansible_architecture }}" - distro: "{{ ansible_distribution }}" - distro_version: "{{ ansible_distribution_version }}" - swap_mb: "{{ ansible_swaptotal_mb }}" - swappiness: "{{ 
(sysctl_raw.results | selectattr('item', 'equalto', 'vm.swappiness') | first).stdout | trim }}" - bridge_iptables: "{{ (sysctl_raw.results | selectattr('item', 'equalto', 'net.bridge.bridge-nf-call-iptables') | first).stdout | trim }}" - bridge_ip6tables: "{{ (sysctl_raw.results | selectattr('item', 'equalto', 'net.bridge.bridge-nf-call-ip6tables') | first).stdout | trim }}" - ip_forward: "{{ (sysctl_raw.results | selectattr('item', 'equalto', 'net.ipv4.ip_forward') | first).stdout | trim }}" - daemon_json: "{{ daemon_json_content.stdout | default('{}') }}" - log_rotation_configured: "{{ 'max-size' in (daemon_json_content.stdout | default('{}')) }}" - venv_exists: "{{ venv_stat.stat.exists }}" - ansible_version_raw: "{{ ansible_version_raw.stdout_lines[0] | default('unknown') }}" - ansible_version_number: "{{ ansible_version_raw.stdout | regex_search('ansible \\[core ([0-9.]+)\\]', '\\1') | first | default('0.0.0') }}" - ansible_version_ok: "{{ (ansible_version_raw.stdout | regex_search('ansible \\[core ([0-9.]+)\\]', '\\1') | first | default('0.0.0')) is version(min_ansible_version, '>=') }}" - ssh_key_present: "{{ ssh_key_stat.stat.exists }}" - - -- name: "Play 2: Assertions and drift report" - hosts: localhost - gather_facts: false - - vars: - audit_timestamp: "{{ lookup('pipe', 'date +%Y%m%dT%H%M%S') }}" - report_path: "{{ playbook_dir }}/../../../outputs/watchtower_audit_{{ audit_timestamp }}.md" - wt: "{{ hostvars['localhost']['wt_audit'] }}" - - tasks: - - name: Ensure outputs directory exists - ansible.builtin.file: - path: "{{ playbook_dir }}/../../../outputs" - state: directory - mode: '0755' - - - name: Write drift report - ansible.builtin.copy: - dest: "{{ report_path }}" - mode: '0644' - content: | - # Watchtower Control Node Audit Report - - Generated: {{ audit_timestamp }} - - ## System - - | Property | Value | - |----------|-------| - | Kernel | `{{ wt.kernel }}` | - | Architecture | {{ wt.arch }} | - | Distro | {{ wt.distro }} {{ wt.distro_version 
}} | - | Swap | {{ wt.swap_mb }}MB | - - ## Sysctl - - | Parameter | Value | Expected | - |-----------|-------|----------| - | vm.swappiness | {{ wt.swappiness }} | 0 | - | net.bridge.bridge-nf-call-iptables | {{ wt.bridge_iptables }} | 1 | - | net.bridge.bridge-nf-call-ip6tables | {{ wt.bridge_ip6tables }} | 1 | - | net.ipv4.ip_forward | {{ wt.ip_forward }} | 1 | - - ## Toolchain - - | Check | Status | - |-------|--------| - | Python venv | {{ '✅ exists' if wt.venv_exists | bool else '❌ missing' }} | - | Ansible version | {{ wt.ansible_version_raw }} | - | Ansible version ok (>= 2.18.0) | {{ '✅' if wt.ansible_version_ok | bool else '❌' }} | - | SSH private key | {{ '✅ present' if wt.ssh_key_present | bool else '❌ missing' }} | - - ## Docker - - | Check | Status | - |-------|--------| - | Log rotation configured | {{ '✅' if wt.log_rotation_configured | bool else '❌' }} | - - - name: Assert swap is disabled - ansible.builtin.assert: - that: wt.swap_mb | int == 0 - fail_msg: "❌ Swap enabled: {{ wt.swap_mb }}MB -- run watchtower_baseline.yml --tags storage" - success_msg: "✅ Watchtower: swap disabled" - - - name: Assert vm.swappiness=0 - ansible.builtin.assert: - that: wt.swappiness | int == 0 - fail_msg: "❌ vm.swappiness={{ wt.swappiness }} -- run watchtower_baseline.yml --tags sysctl" - success_msg: "✅ Watchtower: vm.swappiness=0" - - - name: Assert bridge netfilter enabled - ansible.builtin.assert: - that: - - wt.bridge_iptables | int == 1 - - wt.bridge_ip6tables | int == 1 - fail_msg: >- - ❌ Bridge netfilter not fully enabled: - bridge-nf-call-iptables={{ wt.bridge_iptables }} - bridge-nf-call-ip6tables={{ wt.bridge_ip6tables }} - Run watchtower_baseline.yml --tags sysctl. 
- success_msg: "✅ Watchtower: bridge netfilter enabled" - - - name: Assert ip_forward enabled - ansible.builtin.assert: - that: wt.ip_forward | int == 1 - fail_msg: "❌ net.ipv4.ip_forward={{ wt.ip_forward }} -- run watchtower_baseline.yml --tags sysctl" - success_msg: "✅ Watchtower: ip_forward=1" - - - name: Assert Docker log rotation configured - ansible.builtin.assert: - that: wt.log_rotation_configured | bool - fail_msg: "❌ Docker log rotation not configured -- run watchtower_baseline.yml --tags docker" - success_msg: "✅ Watchtower: Docker log rotation configured" - - - name: Assert Python venv exists - ansible.builtin.assert: - that: wt.venv_exists | bool - fail_msg: "❌ Python venv missing -- run watchtower_baseline.yml --tags toolchain" - success_msg: "✅ Watchtower: Python venv present" - - - name: Assert Ansible version meets minimum - ansible.builtin.assert: - that: wt.ansible_version_ok | bool - fail_msg: "❌ Ansible version {{ wt.ansible_version_number }} is below minimum 2.18.0 -- run watchtower_baseline.yml --tags toolchain" - success_msg: "✅ Watchtower: Ansible {{ wt.ansible_version_number }} >= 2.18.0" - - - name: Assert SSH private key present - ansible.builtin.assert: - that: wt.ssh_key_present | bool - fail_msg: "❌ SSH private key missing -- generate with: ssh-keygen -t ed25519 -f ~/.ssh/id_ed25519" - success_msg: "✅ Watchtower: SSH private key present" diff --git a/ansible/archive/playbooks/onboarding/watchtower_baseline.yml b/ansible/archive/playbooks/onboarding/watchtower_baseline.yml deleted file mode 100644 index 0feaf07..0000000 --- a/ansible/archive/playbooks/onboarding/watchtower_baseline.yml +++ /dev/null @@ -1,215 +0,0 @@ ---- -# playbooks/onboarding/watchtower_baseline.yml -# Idempotent baseline enforcement for the Ansible control node (Watchtower). -# -# ───────────────────────────────────────────────────────────────────────────── -# PURPOSE: Ongoing drift enforcement -- safe to run any time, safe to schedule. 
-# Does NOT upgrade packages. Does NOT reboot. -# For OS updates: use playbooks/onboarding/watchtower_update.yml -# For control node audit: use playbooks/onboarding/watchtower_audit.yml -# ───────────────────────────────────────────────────────────────────────────── -# -# What this enforces (all idempotent): -# 0. Packages: Required system packages present (git, curl, python3, python3-venv, etc.) -# 1. Storage: Swap disabled (swapoff -a + fstab commented) -# 2. Sysctl: vm.swappiness=0, bridge netfilter, ip_forward (Docker on control node) -# 3. Docker: /etc/docker/daemon.json with log rotation -# 4. Toolchain: Python venv exists, Ansible Galaxy collections installed -# 5. SSH: Ansible SSH key present on control node -# -# NOTE: Watchtower connects to itself via ansible_connection=local. -# Tasks that would SSH to a remote host are not needed here. -# -# Usage: -# ansible-playbook -i inventory/hosts.ini playbooks/onboarding/watchtower_baseline.yml -# -# # Dry-run: -# ansible-playbook -i inventory/hosts.ini playbooks/onboarding/watchtower_baseline.yml --check --diff -# -# # Target a specific section only: -# ansible-playbook -i inventory/hosts.ini playbooks/onboarding/watchtower_baseline.yml --tags toolchain -# ansible-playbook -i inventory/hosts.ini playbooks/onboarding/watchtower_baseline.yml --tags docker - -- name: Watchtower control node baseline enforcement - hosts: watchtower - become: true - - vars: - lab_user: "{{ lab_ansible_user | default('chester') }}" - homelab_root: "/home/{{ lab_user }}/homelab" - venv_path: "{{ homelab_root }}/.venv" - ssh_key_path: "/home/{{ lab_user }}/.ssh/id_ed25519" - galaxy_requirements: "{{ homelab_root }}/ansible/requirements.yml" - - handlers: - - name: Restart Docker - ansible.builtin.service: - name: docker - state: restarted - - tasks: - - name: "0. 
Packages: ensure required system packages are present" - tags: [packages, baseline] - ansible.builtin.apt: - name: - - git - - curl - - htop - - python3 - - python3-pip - - python3-venv - - python3-apt - - nfs-common - - ca-certificates - - docker-ce - - docker-ce-cli - - containerd.io - state: present - update_cache: true - - - name: "1. Storage: disable swap" - tags: [storage, baseline] - block: - - name: Disable swap immediately (covers traditional + zram) - ansible.builtin.command: swapoff -a - when: ansible_swaptotal_mb > 0 - changed_when: ansible_swaptotal_mb > 0 - - - name: Comment out swap entries in /etc/fstab - ansible.builtin.replace: - path: /etc/fstab - regexp: '^([^#].*\s+swap\s+.*)$' - replace: '# \1' - - - name: Remove zram-generator config to prevent zram swap at boot - ansible.builtin.copy: - dest: /etc/systemd/zram-generator.conf - owner: root - group: root - mode: '0644' - content: | - # Managed by Ansible -- watchtower_baseline.yml - # Empty config disables zram swap on Ubuntu 24.04. - - - name: Stop and mask systemd-zram-generator service if present - ansible.builtin.systemd: - name: systemd-zram-generator - state: stopped - enabled: false - masked: true - failed_when: false - - - name: Swapoff zram devices explicitly - ansible.builtin.shell: | - for dev in $(ls /dev/zram* 2>/dev/null); do - swapoff "$dev" 2>/dev/null || true - done - changed_when: false - - - name: "2. 
Sysctl: Docker networking parameters" - tags: [sysctl, baseline] - block: - - name: Ensure br_netfilter module is loaded - community.general.modprobe: - name: br_netfilter - state: present - - - name: Persist br_netfilter module load at boot - ansible.builtin.copy: - dest: /etc/modules-load.d/br_netfilter.conf - content: "br_netfilter\n" - owner: root - group: root - mode: '0644' - - - name: Apply and persist sysctl parameters - ansible.posix.sysctl: - name: "{{ item.key }}" - value: "{{ item.value }}" - sysctl_file: /etc/sysctl.d/90-watchtower.conf - state: present - reload: true - loop: - - { key: vm.swappiness, value: "0" } - - { key: net.bridge.bridge-nf-call-iptables, value: "1" } - - { key: net.bridge.bridge-nf-call-ip6tables, value: "1" } - - { key: net.ipv4.ip_forward, value: "1" } - - - name: "3. Docker: daemon configuration and log rotation" - tags: [docker, baseline] - block: - - name: Ensure /etc/docker directory exists - ansible.builtin.file: - path: /etc/docker - state: directory - owner: root - group: root - mode: '0755' - - - name: Deploy Docker daemon.json with log rotation - ansible.builtin.copy: - dest: /etc/docker/daemon.json - owner: root - group: root - mode: '0644' - content: | - { - "log-driver": "json-file", - "log-opts": { - "max-size": "10m", - "max-file": "3" - } - } - notify: Restart Docker - - - name: Ensure '{{ lab_user }}' is in the docker group - ansible.builtin.user: - name: "{{ lab_user }}" - groups: docker - append: true - - - name: "4. 
Toolchain: Python venv and Ansible Galaxy collections" - tags: [toolchain, baseline] - become: false - block: - - name: Ensure Python venv exists at {{ venv_path }} - ansible.builtin.command: "python3 -m venv {{ venv_path }}" - args: - creates: "{{ venv_path }}/bin/activate" - - - name: Ensure pip is up to date in venv - ansible.builtin.pip: - name: pip - state: latest - virtualenv: "{{ venv_path }}" - - - name: Ensure Ansible is installed in venv - ansible.builtin.pip: - name: ansible - state: present - virtualenv: "{{ venv_path }}" - - - name: Install Ansible Galaxy collections from requirements.yml - ansible.builtin.command: > - {{ venv_path }}/bin/ansible-galaxy collection install - -r {{ galaxy_requirements }} - --upgrade - register: galaxy_install - changed_when: "'Installing' in galaxy_install.stdout" - - - name: "5. SSH: verify Ansible SSH key is present" - tags: [ssh, baseline] - become: false - block: - - name: Check SSH private key exists - ansible.builtin.stat: - path: "{{ ssh_key_path }}" - register: ssh_key_stat - - - name: Fail if SSH private key is missing - ansible.builtin.fail: - msg: >- - SSH private key not found at {{ ssh_key_path }}. - Generate one with: ssh-keygen -t ed25519 -f {{ ssh_key_path }} - then run: ssh-copy-id -i {{ ssh_key_path }}.pub chester@ - when: not ssh_key_stat.stat.exists diff --git a/ansible/archive/playbooks/onboarding/watchtower_update.yml b/ansible/archive/playbooks/onboarding/watchtower_update.yml deleted file mode 100644 index a922be8..0000000 --- a/ansible/archive/playbooks/onboarding/watchtower_update.yml +++ /dev/null @@ -1,73 +0,0 @@ ---- -# playbooks/onboarding/watchtower_update.yml -# OS package update for the Ansible control node (Watchtower). -# -# ───────────────────────────────────────────────────────────────────────────── -# ⚠️ HUMAN-TRIGGERED ONLY -- do not automate or schedule. -# Watchtower is not a Swarm member, so no drain/restore is needed. 
-# Reboot will briefly interrupt Ansible control plane connectivity. -# ───────────────────────────────────────────────────────────────────────────── -# -# What this does: -# 1. Runs apt dist-upgrade -# 2. Reboots if a newer kernel was installed and waits for return -# -# NOTE: Because Watchtower connects to itself via ansible_connection=local, -# the reboot module uses localhost semantics (the node reboots and -# Ansible waits for the local machine to come back). -# -# Usage: -# ansible-playbook -i inventory/hosts.ini playbooks/onboarding/watchtower_update.yml -# -# # Dry-run: -# ansible-playbook -i inventory/hosts.ini playbooks/onboarding/watchtower_update.yml --check -# -# # Update packages but skip reboot even if kernel changed: -# ansible-playbook -i inventory/hosts.ini playbooks/onboarding/watchtower_update.yml --skip-tags reboot - -- name: Watchtower control node OS update - hosts: watchtower - become: true - - tasks: - - name: Update apt cache - ansible.builtin.apt: - update_cache: true - cache_valid_time: 0 - - - name: Run apt dist-upgrade - ansible.builtin.apt: - upgrade: dist - update_cache: false - register: dist_upgrade_result - tags: [update] - - - name: Check if a newer kernel is installed but not yet booted - ansible.builtin.shell: | - LATEST=$(ls /boot/vmlinuz-* | sort -V | tail -1 | sed 's|/boot/vmlinuz-||') - RUNNING=$(uname -r) - if [ "$LATEST" != "$RUNNING" ]; then echo "reboot_needed"; fi - register: reboot_check - changed_when: false - check_mode: false - tags: [reboot] - - - name: Reboot if a newer kernel is installed - ansible.builtin.reboot: - msg: "Rebooting into updated kernel -- initiated by watchtower_update.yml" - reboot_timeout: 300 - when: reboot_check.stdout | trim == 'reboot_needed' - tags: [reboot] - - - name: Wait for Watchtower to return post-reboot - ansible.builtin.wait_for_connection: - delay: 10 - timeout: 300 - when: reboot_check.stdout | trim == 'reboot_needed' - tags: [reboot] - - - name: Report result - 
ansible.builtin.debug: - msg: >- - ✅ Watchtower updated. - {{ 'Rebooted into new kernel.' if reboot_check.stdout | trim == 'reboot_needed' else 'No kernel change -- reboot not required.' }} diff --git a/ansible/archive/playbooks/preflight/capture_heimdall_baseline.yml b/ansible/archive/playbooks/preflight/capture_heimdall_baseline.yml deleted file mode 100644 index a831f15..0000000 --- a/ansible/archive/playbooks/preflight/capture_heimdall_baseline.yml +++ /dev/null @@ -1,507 +0,0 @@ ---- -# playbooks/preflight/capture_heimdall_baseline.yml -# -# Purpose: -# Snapshot Heimdall's (10.0.0.151) full running configuration before -# Ansible migration begins. Captured state is the authoritative input -# for all subsequent role defaults, vars, and templates. -# -# MUST be run and artifacts reviewed before any heimdall_edge role changes. -# -# Usage: -# ansible-playbook -i inventory/hosts.ini \ -# playbooks/preflight/capture_heimdall_baseline.yml -# -# Dry-run (no writes): -# ansible-playbook -i inventory/hosts.ini \ -# playbooks/preflight/capture_heimdall_baseline.yml --check -# -# Output: outputs/heimdall-baseline-/ -# manifest.yml -- What was found vs missing (machine-readable) -# host_facts.yml -- OS, kernel, CPU, RAM, IP facts -# docker_info.yml -- Docker daemon config and version -# containers.yml -- Running container inspect data -# networks_and_volumes.yml -- Docker network and volume inventory -# compose_files/ -- Fetched compose files -# env_keys/ -- Env KEY inventory (values REDACTED) -# firewall_rules.txt -- UFW + iptables state -# systemd_units.txt -- Loaded systemd service units -# -# Idempotency: -# All tasks are read-only on the remote host. No state changes to Heimdall. -# Safe to re-run at any time without side effects. 
-# -# Modules: -# ansible.builtin.setup -- Host facts; WHY: native, structured, zero-install -# community.docker.docker_host_info -- Docker daemon and container list; WHY: builtin, structured -# community.docker.docker_container_info -- Full inspect per container; WHY: richer than docker ps -# ansible.builtin.find -- Locate compose/env files; WHY: idempotent path discovery -# ansible.builtin.fetch -- Retrieve files to controller; WHY: preserves structure -# ansible.builtin.slurp -- Read remote file content; WHY: no shell, structured b64 output - -- name: Capture Heimdall baseline configuration - hosts: heimdall - become: true - gather_facts: true - - vars: - heimdall_stack_search_paths: - - /opt/stacks - - /opt/docker - - /home/chester/traefik - - /home/chester - - pre_tasks: - - name: Freeze capture timestamp (evaluated once, used by all tasks) - ansible.builtin.set_fact: - capture_dir: "{{ playbook_dir }}/../../outputs/heimdall-baseline-{{ ansible_date_time.iso8601_basic_short }}" - tags: [always] - # WHY no delegate_to here: set_fact must store capture_dir on heimdall's variable scope. - # All subsequent tasks -- even those with delegate_to: localhost -- resolve variables - # in the context of their original host (heimdall). Storing on localhost's scope means - # heimdall never sees the frozen value, so {{ capture_dir }} re-evaluates - # ansible_date_time after task 1.1 (setup) refreshes it. 
- - - name: Ensure capture output directories exist on control node - ansible.builtin.file: - path: "{{ item }}" - state: directory - mode: '0755' - loop: - - "{{ capture_dir }}/compose_files" - - "{{ capture_dir }}/env_keys" - - "{{ capture_dir }}/traefik_configs" - delegate_to: localhost - become: false - run_once: true - tags: [always] - - tasks: - - # -------------------------------------------------- - # SECTION 1: Host facts - # -------------------------------------------------- - - - name: "1.1 Gather all system facts" - ansible.builtin.setup: - tags: [facts, always] - - - name: "1.2 Compile host fact summary" - ansible.builtin.set_fact: - baseline_host_facts: - hostname: "{{ ansible_hostname }}" - fqdn: "{{ ansible_fqdn }}" - os_family: "{{ ansible_os_family }}" - distribution: "{{ ansible_distribution }}" - distribution_version: "{{ ansible_distribution_version }}" - distribution_release: "{{ ansible_distribution_release }}" - kernel: "{{ ansible_kernel }}" - architecture: "{{ ansible_architecture }}" - uptime_seconds: "{{ ansible_uptime_seconds }}" - memory_total_mb: "{{ ansible_memtotal_mb }}" - memory_free_mb: "{{ ansible_memfree_mb }}" - cpu_vcpus: "{{ ansible_processor_vcpus }}" - default_ipv4: "{{ ansible_default_ipv4 }}" - interfaces: "{{ ansible_interfaces }}" - python_version: "{{ ansible_python_version }}" - ansible_user: "{{ ansible_user_id }}" - tags: [facts] - - - name: "1.3 Save host facts to control node" - ansible.builtin.copy: - content: "{{ baseline_host_facts | to_nice_yaml }}" - dest: "{{ capture_dir }}/host_facts.yml" - mode: '0640' - delegate_to: localhost - become: false - tags: [facts] - - # -------------------------------------------------- - # SECTION 2: Docker daemon state - # -------------------------------------------------- - - - name: "2.1 Verify Docker service is active" - ansible.builtin.systemd: - name: docker - register: _docker_service - failed_when: _docker_service.status.ActiveState != 'active' - changed_when: false - 
tags: [docker] - - - name: "2.2 Query Docker daemon info" - community.docker.docker_host_info: - register: _docker_host_info - tags: [docker] - - - name: "2.3 Read Docker daemon config" - ansible.builtin.slurp: - src: /etc/docker/daemon.json - register: _daemon_json - failed_when: false - tags: [docker] - - - name: "2.4 Compile Docker info summary" - ansible.builtin.set_fact: - baseline_docker_info: - server_version: "{{ _docker_host_info.host_info.ServerVersion | default('unknown') }}" - storage_driver: "{{ _docker_host_info.host_info.Driver | default('unknown') }}" - logging_driver: "{{ _docker_host_info.host_info.LoggingDriver | default('unknown') }}" - cgroup_driver: "{{ _docker_host_info.host_info.CgroupDriver | default('unknown') }}" - swarm_state: "{{ _docker_host_info.host_info.Swarm.LocalNodeState | default('inactive') }}" - containers_running: "{{ _docker_host_info.host_info.ContainersRunning | default(0) }}" - containers_total: "{{ _docker_host_info.host_info.Containers | default(0) }}" - daemon_config: "{{ (_daemon_json.content | b64decode | from_json) if _daemon_json.content is defined else {} }}" - tags: [docker] - - - name: "2.5 Save Docker info to control node" - ansible.builtin.copy: - content: "{{ baseline_docker_info | to_nice_yaml }}" - dest: "{{ capture_dir }}/docker_info.yml" - mode: '0640' - delegate_to: localhost - become: false - tags: [docker] - - # -------------------------------------------------- - # SECTION 3: Running containers - # -------------------------------------------------- - - - name: "3.1 Get all container summaries (running + stopped)" - community.docker.docker_host_info: - containers: true - containers_all: true - register: _all_containers - tags: [containers] - - - name: "3.2 Inspect each container for full configuration" - community.docker.docker_container_info: - name: "{{ item.Names[0] | regex_replace('^/', '') }}" - loop: "{{ _all_containers.containers }}" - loop_control: - label: "{{ item.Names[0] | 
regex_replace('^/', '') }}" - register: _container_inspects - when: _all_containers.containers | length > 0 - tags: [containers] - - - name: "3.3 Save container inspections to control node" - ansible.builtin.copy: - content: "{{ _container_inspects.results | map(attribute='container') | select() | list | to_nice_yaml }}" - dest: "{{ capture_dir }}/containers.yml" - mode: '0640' - delegate_to: localhost - become: false - when: _all_containers.containers | length > 0 - tags: [containers] - - - name: "3.3b Save empty container file when no containers found" - ansible.builtin.copy: - content: "# No containers found on {{ inventory_hostname }}\n" - dest: "{{ capture_dir }}/containers.yml" - mode: '0640' - delegate_to: localhost - become: false - when: _all_containers.containers | length == 0 - tags: [containers] - - # -------------------------------------------------- - # SECTION 4: Docker networks and volumes - # -------------------------------------------------- - - - name: "4.1 Query Docker networks" - community.docker.docker_host_info: - networks: true - register: _docker_networks - tags: [networks] - - - name: "4.2 Query Docker volumes" - community.docker.docker_host_info: - volumes: true - register: _docker_volumes - tags: [networks] - - - name: "4.3 Save network and volume inventory" - ansible.builtin.copy: - content: | - --- - # Docker network and volume inventory - # Host: {{ inventory_hostname }} | Captured: {{ ansible_date_time.iso8601 }} - - networks: - {{ _docker_networks.networks | default([]) | to_nice_yaml | indent(2) }} - volumes: - {{ _docker_volumes.volumes | default([]) | to_nice_yaml | indent(2) }} - dest: "{{ capture_dir }}/networks_and_volumes.yml" - mode: '0640' - delegate_to: localhost - become: false - tags: [networks] - - # -------------------------------------------------- - # SECTION 5: Compose and env files - # -------------------------------------------------- - - - name: "5.1 Locate docker compose files" - ansible.builtin.find: - paths: 
"{{ heimdall_stack_search_paths }}" - patterns: - - "docker-compose.yml" - - "docker-compose.yaml" - - "compose.yml" - - "compose.yaml" - recurse: true - register: _compose_files - tags: [compose] - - - name: "5.2 Locate env files" - ansible.builtin.find: - paths: "{{ heimdall_stack_search_paths }}" - patterns: - - ".env" - - "*.env" - hidden: true - recurse: true - register: _env_files - tags: [compose] - - - name: "5.3 Fetch compose files to control node" - ansible.builtin.fetch: - src: "{{ item.path }}" - dest: "{{ capture_dir }}/compose_files/{{ item.path | replace('/', '_') }}" - flat: true - loop: "{{ _compose_files.files }}" - loop_control: - label: "{{ item.path }}" - tags: [compose] - - - name: "5.4 Read env files for key extraction" - ansible.builtin.slurp: - src: "{{ item.path }}" - loop: "{{ _env_files.files }}" - loop_control: - label: "{{ item.path }}" - register: _env_contents - tags: [compose] - - - name: "5.5 Save redacted env key inventory to control node" - ansible.builtin.copy: - content: | - # Env key inventory β€” values REDACTED for security - # Source: {{ item.item.path }} - # Host: {{ inventory_hostname }} | Captured: {{ ansible_date_time.iso8601 }} - # - # To restore secrets: ansible-vault encrypt_string '' --name '' - {% for line in (item.content | b64decode).splitlines() %} - {% if line | regex_search('^[A-Za-z_][A-Za-z0-9_]*=') %} - {% set key = line.split('=')[0] %} - {{ key }}= - {% elif line.startswith('#') or line | length == 0 %} - {{ line }} - {% else %} - {{ line }} - {% endif %} - {% endfor %} - dest: "{{ capture_dir }}/env_keys/{{ item.item.path | replace('/', '_') }}.redacted" - mode: '0640' - delegate_to: localhost - become: false - loop: "{{ _env_contents.results }}" - loop_control: - label: "{{ item.item.path }}" - when: item.content is defined - tags: [compose] - - # -------------------------------------------------- - # SECTION 5.6: Traefik-specific configuration files - # 
-------------------------------------------------- - - - name: "5.6 Stat Traefik static and dynamic config files" - ansible.builtin.stat: - path: "{{ item }}" - loop: - - /home/chester/traefik/traefik.yml - - /home/chester/traefik/traefik-data/dynamic/middleware.yml - - /home/chester/traefik/traefik-data/dynamic/static-backends.yml - register: _traefik_config_stats - tags: [compose, traefik] - - - name: "5.7 Fetch Traefik config files that exist" - ansible.builtin.fetch: - src: "{{ item.stat.path }}" - dest: "{{ capture_dir }}/traefik_configs/{{ item.stat.path | basename }}" - flat: true - loop: "{{ _traefik_config_stats.results }}" - loop_control: - label: "{{ item.item }}" - when: item.stat.exists - tags: [compose, traefik] - - # -------------------------------------------------- - # SECTION 6: Firewall and systemd state - # -------------------------------------------------- - - - name: "6.1 Capture UFW status" - ansible.builtin.command: ufw status verbose - register: _ufw_status - changed_when: false - failed_when: false - tags: [security] - - - name: "6.2 Capture iptables rules" - ansible.builtin.command: iptables -L -n --line-numbers - register: _iptables_rules - changed_when: false - failed_when: false - tags: [security] - - - name: "6.3 Save firewall state" - ansible.builtin.copy: - content: | - # Firewall state on {{ inventory_hostname }} - # Captured: {{ ansible_date_time.iso8601 }} - - ## UFW STATUS - {{ _ufw_status.stdout | default('ufw not available or not active') }} - - ## IPTABLES (reference) - {{ _iptables_rules.stdout | default('iptables not available') }} - dest: "{{ capture_dir }}/firewall_rules.txt" - mode: '0640' - delegate_to: localhost - become: false - tags: [security] - - - name: "6.4 Query loaded systemd service units" - ansible.builtin.command: systemctl list-units --type=service --state=loaded --no-pager - register: _systemd_units - changed_when: false - tags: [systemd] - - - name: "6.5 Save systemd unit list" - ansible.builtin.copy: - 
content: "{{ _systemd_units.stdout }}" - dest: "{{ capture_dir }}/systemd_units.txt" - mode: '0640' - delegate_to: localhost - become: false - tags: [systemd] - - # -------------------------------------------------- - # SECTION 7: Critical path inventory - # -------------------------------------------------- - - - name: "7.1 Stat critical stack paths" - ansible.builtin.stat: - path: "{{ item }}" - loop: - - /opt/stacks/heimdall - - /opt/stacks/heimdall/docker-compose.yml - - /opt/stacks/heimdall/.env - - /opt/stacks/heimdall/traefik-certs - - /opt/stacks/heimdall/traefik-certs/acme.json - - /opt/stacks/heimdall/redis-data - - /opt/stacks/heimdall/runner-data - - /home/chester/traefik - - /home/chester/traefik/docker-compose.yml - - /home/chester/traefik/.env - - /home/chester/traefik/traefik.yml - - /home/chester/traefik/traefik-data/dynamic/middleware.yml - - /home/chester/traefik/traefik-data/dynamic/static-backends.yml - - /home/chester/traefik/traefik-data/certs/acme.json - - /etc/docker/daemon.json - register: _critical_path_stats - tags: [paths] - - - name: "7.2 Build critical path presence map" - ansible.builtin.set_fact: - manifest_paths: >- - {{ - dict( - _critical_path_stats.results | map(attribute='item') | list - | zip( - _critical_path_stats.results | map(attribute='stat') | map(attribute='exists') | list - ) - ) - }} - tags: [paths] - - # -------------------------------------------------- - # SECTION 8: Write manifest and validate - # -------------------------------------------------- - - - name: "8.1 Write machine-readable capture manifest" - ansible.builtin.copy: - content: | - --- - --- - # Heimdall baseline capture manifest - # Generated: {{ ansible_date_time.iso8601 }} - # Host: {{ inventory_hostname }} ({{ ansible_host }}) - # Review this file before proceeding to heimdall_edge role refactor. 
- - capture_timestamp: "{{ ansible_date_time.iso8601 }}" - capture_dir: "{{ capture_dir }}" - - host: - hostname: "{{ ansible_hostname }}" - ip: "{{ ansible_host }}" - os: "{{ ansible_distribution }} {{ ansible_distribution_version }}" - kernel: "{{ ansible_kernel }}" - - docker: - version: "{{ baseline_docker_info.server_version }}" - storage_driver: "{{ baseline_docker_info.storage_driver }}" - swarm_state: "{{ baseline_docker_info.swarm_state }}" - containers_running: {{ baseline_docker_info.containers_running }} - containers_total: {{ baseline_docker_info.containers_total }} - - inventory: - containers_found: {{ _all_containers.containers | length }} - compose_files_found: {{ _compose_files.files | length }} - env_files_found: {{ _env_files.files | length }} - - critical_paths: - {{ manifest_paths | to_nice_yaml | indent(2) }} - compose_file_paths: - {{ _compose_files.files | map(attribute='path') | list | to_nice_yaml | indent(2) }} - env_file_paths: - {{ _env_files.files | map(attribute='path') | list | to_nice_yaml | indent(2) }} - containers_running: - {{ _all_containers.containers | map(attribute='Names') | map('first') | map('regex_replace', '^/', '') | list | to_nice_yaml | indent(2) }} - validation: - compose_files_present: {{ _compose_files.files | length > 0 }} - containers_present: {{ _all_containers.containers | length > 0 }} - stack_dir_present: {{ manifest_paths['/opt/stacks/heimdall'] | default(false) }} - compose_present: {{ manifest_paths['/opt/stacks/heimdall/docker-compose.yml'] | default(false) }} - env_present: {{ manifest_paths['/opt/stacks/heimdall/.env'] | default(false) }} - dest: "{{ capture_dir }}/manifest.yml" - mode: '0640' - delegate_to: localhost - become: false - tags: [manifest, always] - - - name: "8.2 Assert compose file is present somewhere on host" - ansible.builtin.assert: - that: - - _compose_files.files | length > 0 - fail_msg: >- - No docker-compose files found on {{ inventory_hostname }} under - {{ 
heimdall_stack_search_paths | join(', ') }}. - Review {{ capture_dir }}/manifest.yml for the full path inventory. - success_msg: "Compose file(s) found: {{ _compose_files.files | map(attribute='path') | list | join(', ') }}" - tags: [validate, always] - - - name: "8.3 Report capture summary" - ansible.builtin.debug: - msg: - - "==========================================" - - "Heimdall baseline capture complete." - - "==========================================" - - "Output dir : {{ capture_dir }}" - - "Containers : {{ _all_containers.containers | length }} found" - - "Compose files: {{ _compose_files.files | length }} found" - - "Env files : {{ _env_files.files | length }} found" - - "------------------------------------------" - - "Next step: review artifacts, then run:" - - " ansible-playbook .../self-heal/heimdall.yml --check" - - "==========================================" - tags: [always] diff --git a/ansible/archive/playbooks/preflight/gather_hardware_facts.yml b/ansible/archive/playbooks/preflight/gather_hardware_facts.yml deleted file mode 100644 index ed9a5e7..0000000 --- a/ansible/archive/playbooks/preflight/gather_hardware_facts.yml +++ /dev/null @@ -1,320 +0,0 @@ ---- -# playbooks/preflight/gather_hardware_facts.yml -# -# Purpose: -# Gather comprehensive hardware specifications from Proxmox hosts for -# capacity planning and topology analysis (e.g., Docker Swarm 3-node cluster) -# -# Usage: -# ansible-playbook -i inventory/hosts.ini playbooks/preflight/gather_hardware_facts.yml \ -# -e "target_hosts=proxmox_cluster" \ -# -e "output_file=hardware_comparison_$(date +%Y%m%d).yml" -# -# Modules Explained: -# - ansible.builtin.gather_facts: Collects system facts (CPU, RAM, network, OS) -# WHY: Native, zero-install, idempotent. Beats shell commands because it's structured. 
-# - ansible.builtin.setup: Explicit fact-gathering (redundant but explicit docs) -# - ansible.builtin.command: For Proxmox-specific queries (pvesh, pveversion) -# WHY: Cleaner than shell for simple commands; no shell interpretation risk -# - ansible.builtin.copy: Save structured YAML report locally -# WHY: Idempotent, checksummed, handles permissions correctly -# -# Idempotency: -# All tasks are read-only (gather facts, query status). No state changes. -# Safe to run multiple times without side effects. -# -# Safety Notes: -# - Does NOT modify Proxmox configuration or cluster state -# - Requires root SSH access (standard Proxmox assumption) -# - Network scanning is passive (no stress-testing) - -- name: "Gather Hardware Facts from Proxmox Hosts" - hosts: "{{ target_hosts | default('proxmox_cluster') }}" - gather_facts: true - become: true - vars: - output_dir: "{{ playbook_dir }}/../../outputs" - output_file: "{{ output_dir }}/hardware_facts_{{ ansible_date_time.iso8601_basic_short }}.yml" - - pre_tasks: - - name: "Validate target hosts are Proxmox nodes" - ansible.builtin.assert: - that: - - inventory_hostname.startswith('pve') - fail_msg: "This playbook is for Proxmox nodes (pve*). 
Target: {{ inventory_hostname }}" - tags: - - validate - - - name: "Ensure output directory exists" - ansible.builtin.file: - path: "{{ output_dir }}" - state: directory - mode: '0755' - delegate_to: localhost - run_once: true - tags: - - setup - - tasks: - # ------------------------------------------------------------------------- - # SECTION 1: SYSTEM & OS FACTS - # ------------------------------------------------------------------------- - - - name: "1.1 Gather all facts (native Ansible)" - ansible.builtin.setup: - tags: - - facts - - always - - - name: "1.2 Extract CPU model and frequencies" - ansible.builtin.set_fact: - hw_cpu_model: "{{ ansible_processor | first | default('unknown') }}" - hw_cpu_cores: "{{ ansible_processor_vcpus | default('unknown') }}" - hw_cpu_cores_per_socket: "{{ ansible_processor_cores | default('unknown') }}" - hw_cpu_count: "{{ ansible_processor_count | default(1) }}" - tags: - - facts - - cpu - - - name: "1.3 Query CPU frequency (max freq)" - ansible.builtin.shell: "grep 'cpu MHz' /proc/cpuinfo | head -1" - register: _cpu_freq_output - changed_when: false - failed_when: false - tags: - - facts - - cpu - - - name: "1.4 Parse CPU frequency" - ansible.builtin.set_fact: - hw_cpu_freq_mhz: "{{ (_cpu_freq_output.stdout | regex_search(':\\s+([0-9.]+)') | regex_replace('[^0-9.]', '')).split('.')[0] | float if _cpu_freq_output.stdout else 'unknown' }}" - tags: - - facts - - cpu - - - name: "1.5 Extract RAM information" - ansible.builtin.set_fact: - hw_ram_total_mb: "{{ ansible_memtotal_mb | default('unknown') }}" - hw_ram_free_mb: "{{ ansible_memfree_mb | default('unknown') }}" - tags: - - facts - - ram - - - name: "1.6 Extract OS and kernel" - ansible.builtin.set_fact: - hw_os_family: "{{ ansible_os_family }}" - hw_os_distro: "{{ ansible_distribution }}" - hw_os_release: "{{ ansible_distribution_release }}" - hw_kernel: "{{ ansible_kernel }}" - hw_uptime_seconds: "{{ ansible_uptime_seconds }}" - tags: - - facts - - os - - # 
------------------------------------------------------------------------- - # SECTION 2: STORAGE FACTS - # ------------------------------------------------------------------------- - - - name: "2.1 Gather disk information" - ansible.builtin.set_fact: - hw_disks: "{{ ansible_devices | default({}) | dict2items | map(attribute='key') | list }}" - tags: - - facts - - storage - - - name: "2.2 Gather mount points and usage" - ansible.builtin.shell: | - df -h | tail -n +2 | awk '{print $1 " (" $4 " available of " $2 ")"}' - register: _mount_output - changed_when: false - tags: - - facts - - storage - - - name: "2.3 Parse mount data" - ansible.builtin.set_fact: - hw_mounts: "{{ _mount_output.stdout_lines | default([]) }}" - tags: - - facts - - storage - - # ------------------------------------------------------------------------- - # SECTION 3: NETWORK FACTS - # ------------------------------------------------------------------------- - - - name: "3.1 Gather network interface details" - ansible.builtin.set_fact: - hw_network_interfaces: "{{ ansible_interfaces | default([]) }}" - tags: - - facts - - network - - # ----------------------------------------------------------------------- - # SECTION 4: PROXMOX-SPECIFIC FACTS - # ----------------------------------------------------------------------- - - - name: "4.1 Query Proxmox version" - ansible.builtin.command: "pveversion" - register: _pveversion_output - changed_when: false - tags: - - facts - - proxmox - - - name: "4.2 Parse Proxmox version" - ansible.builtin.set_fact: - hw_proxmox_version: "{{ _pveversion_output.stdout | regex_search('proxmox-ve\\s+([0-9.]+)') | regex_replace('[^0-9.]', '') }}" - hw_pveversion_full: "{{ _pveversion_output.stdout }}" - tags: - - facts - - proxmox - - - name: "4.3 Query Proxmox cluster status" - ansible.builtin.command: "pvesh get /cluster/resources --output-format json" - register: _cluster_resources - changed_when: false - failed_when: false - tags: - - facts - - proxmox - - - name: 
"4.4 Count VMs and containers on this host" - ansible.builtin.set_fact: - hw_vm_count: "{{ (_cluster_resources.stdout | from_json | selectattr('node', 'equalto', inventory_hostname) | selectattr('type', 'match', 'qemu|lxc') | list | length) if _cluster_resources.rc == 0 else 'unknown' }}" - tags: - - facts - - proxmox - - - name: "4.5 Query Proxmox cluster info" - ansible.builtin.command: "pvesh get /cluster/status --output-format json" - register: _cluster_status - changed_when: false - failed_when: false - tags: - - facts - - proxmox - - - name: "4.6 Parse cluster membership" - ansible.builtin.set_fact: - hw_cluster_name: "{{ (_cluster_status.stdout | from_json | first).cluster | default('not-clustered') if _cluster_status.rc == 0 and (_cluster_status.stdout | from_json | length) > 0 else 'not-clustered' }}" - hw_cluster_nodes: "{{ (_cluster_status.stdout | from_json | map(attribute='name') | list) if _cluster_status.rc == 0 else [] }}" - tags: - - facts - - proxmox - - - name: "4.7 Check if node is clustered" - ansible.builtin.set_fact: - hw_is_clustered: "{{ inventory_hostname in hw_cluster_nodes }}" - tags: - - facts - - proxmox - - # ----------------------------------------------------------------------- - # SECTION 5: SYSTEM LOAD & CAPACITY - # ----------------------------------------------------------------------- - - - name: "5.1 Gather system load" - ansible.builtin.set_fact: - hw_load_1min: "{{ ansible_load | default([0, 0, 0]) | first }}" - hw_load_5min: "{{ (ansible_load | default([0, 0, 0]))[1] }}" - hw_load_15min: "{{ (ansible_load | default([0, 0, 0]))[2] }}" - tags: - - facts - - load - - - name: "5.2 Calculate CPU usage percentage" - ansible.builtin.set_fact: - hw_cpu_load_percent: "{{ ((hw_load_1min | float / hw_cpu_cores | int) * 100) | int if hw_load_1min != 0 else 0 }}" - tags: - - facts - - load - - post_tasks: - - name: "Build hardware summary fact" - ansible.builtin.set_fact: - hardware_summary: - hostname: "{{ inventory_hostname }}" - 
ip_address: "{{ ansible_default_ipv4.address }}" - fqdn: "{{ ansible_fqdn }}" - timestamp: "{{ ansible_date_time.iso8601 }}" - - system: - os: "{{ hw_os_distro }} {{ hw_os_release }}" - kernel: "{{ hw_kernel }}" - uptime_days: "{{ (hw_uptime_seconds / 86400) | int }}" - - cpu: - model: "{{ hw_cpu_model }}" - sockets: "{{ hw_cpu_count }}" - cores_per_socket: "{{ hw_cpu_cores_per_socket }}" - total_cores: "{{ hw_cpu_cores }}" - max_frequency_mhz: "{{ hw_cpu_freq_mhz | int if hw_cpu_freq_mhz != 'unknown' else 'unknown' }}" - current_1min_load: "{{ hw_load_1min }}" - cpu_load_percent: "{{ hw_cpu_load_percent }}%" - - memory: - total_mb: "{{ hw_ram_total_mb }}" - total_gb: "{{ (hw_ram_total_mb | int / 1024) | int if hw_ram_total_mb != 'unknown' else 'unknown' }}" - free_mb: "{{ hw_ram_free_mb }}" - free_gb: "{{ (hw_ram_free_mb | int / 1024) | int if hw_ram_free_mb != 'unknown' else 'unknown' }}" - - storage: - disks_detected: "{{ hw_disks | length }}" - disk_list: "{{ hw_disks }}" - mounts_summary: "{{ hw_mounts }}" - - network: - interfaces_count: "{{ hw_network_interfaces | length }}" - interface_list: "{{ hw_network_interfaces }}" - - proxmox: - version: "{{ hw_proxmox_version }}" - version_full: "{{ hw_pveversion_full }}" - cluster_name: "{{ hw_cluster_name | default('not-clustered') }}" - is_clustered: "{{ hw_is_clustered }}" - cluster_members: "{{ hw_cluster_nodes | default([]) }}" - vms_and_containers: "{{ hw_vm_count }}" - tags: - - summary - - always - - - name: "Display hardware summary" - ansible.builtin.debug: - msg: "{{ hardware_summary }}" - tags: - - summary - - always - - - name: "Collect all hardware facts for report" - ansible.builtin.set_fact: - all_hardware_facts: "{{ all_hardware_facts | default({}) | combine({inventory_hostname: hardware_summary}) }}" - tags: - - report - - - name: "Write hardware comparison report" - ansible.builtin.copy: - content: | - --- - # Hardware Facts Report - # Generated: {{ ansible_date_time.iso8601 }} - # Hosts 
Analyzed: {{ groups[target_hosts | default('proxmox_cluster')] | length }} - # - # Usage: - # This report compares hardware specifications for Docker Swarm topology planning. - # See README in documentation/architecture/ for capacity analysis. - - {{ all_hardware_facts | to_nice_yaml }} - dest: "{{ output_file }}" - mode: '0644' - delegate_to: localhost - run_once: true - tags: - - report - - - name: "Report output file location" - ansible.builtin.debug: - msg: "✓ Hardware facts saved to: {{ output_file }}" - delegate_to: localhost - run_once: true - tags: - - report diff --git a/ansible/archive/playbooks/preflight/reconcile_edge_route.yml b/ansible/archive/playbooks/preflight/reconcile_edge_route.yml deleted file mode 100644 index ed12445..0000000 --- a/ansible/archive/playbooks/preflight/reconcile_edge_route.yml +++ /dev/null @@ -1,92 +0,0 @@ ---- -# Reconcile external Traefik Redis route keys for a single service. -# -# Purpose: -# Codify emergency Redis route edits into repeatable automation so -# route state can be restored without manual redis-cli commands.
-# -# Usage: -# cd /home/chester/homelab/ansible -# ansible-playbook -i inventory/hosts.ini playbooks/preflight/reconcile_edge_route.yml \ -# -e "route_name=gitea" \ -# -e "route_fqdn=git.castaldifamily.com" \ -# -e "route_backend_url=http://10.0.0.211:8251" - -- name: Reconcile edge route keys in Redis - hosts: watchtower - gather_facts: false - vars_files: - - ../../group_vars/all.yml - - vars: - route_name: gitea - route_fqdn: git.castaldifamily.com - route_backend_url: "http://{{ edge_routing.swarm.bind_ip }}:8251" - route_entrypoint: websecure - route_cert_resolver: cloudflare - redis_container_name: redis - - tasks: - - name: Validate required route inputs - ansible.builtin.assert: - that: - - route_name | trim | length > 0 - - route_fqdn | trim | length > 0 - - route_backend_url | trim | length > 0 - - edge_routing.edge_host.name | length > 0 - fail_msg: "Missing required route reconciliation inputs." - - - name: Build route key map - ansible.builtin.set_fact: - edge_route_pairs: - - key: "traefik/http/routers/{{ route_name }}/rule" - value: "Host(`{{ route_fqdn }}`)" - - key: "traefik/http/routers/{{ route_name }}/service" - value: "{{ route_name }}" - - key: "traefik/http/routers/{{ route_name }}/entryPoints/0" - value: "{{ route_entrypoint }}" - - key: "traefik/http/routers/{{ route_name }}/tls/certResolver" - value: "{{ route_cert_resolver }}" - - key: "traefik/http/services/{{ route_name }}/loadBalancer/servers/0/url" - value: "{{ route_backend_url }}" - - key: "traefik/http/services/{{ route_name }}/loadBalancer/passHostHeader" - value: "true" - - - name: Read existing route key values - ansible.builtin.command: >- - docker exec {{ redis_container_name }} redis-cli GET {{ item.key }} - delegate_to: "{{ edge_routing.edge_host.name }}" - become: true - loop: "{{ edge_route_pairs }}" - register: edge_route_existing_values - changed_when: false - failed_when: false - - - name: Write route keys when drift is detected - ansible.builtin.command: >- - docker 
exec {{ redis_container_name }} redis-cli SET {{ item.item.key }} {{ item.item.value | quote }} - delegate_to: "{{ edge_routing.edge_host.name }}" - become: true - loop: "{{ edge_route_existing_values.results }}" - when: (item.stdout | default('')) != item.item.value - register: edge_route_set_results - changed_when: true - - - name: Verify reconciled backend URL - ansible.builtin.command: >- - docker exec {{ redis_container_name }} redis-cli GET - traefik/http/services/{{ route_name }}/loadBalancer/servers/0/url - delegate_to: "{{ edge_routing.edge_host.name }}" - become: true - register: edge_route_backend_verify - changed_when: false - - - name: Assert backend URL matches expected value - ansible.builtin.assert: - that: - - edge_route_backend_verify.stdout | trim == route_backend_url - fail_msg: >- - Redis backend URL for {{ route_name }} is '{{ edge_route_backend_verify.stdout | trim }}' - but expected '{{ route_backend_url }}'. - success_msg: >- - Edge route '{{ route_name }}' reconciled to {{ route_backend_url }}. diff --git a/ansible/archive/playbooks/preflight/validate_control_node.yml b/ansible/archive/playbooks/preflight/validate_control_node.yml deleted file mode 100644 index 15b1d2a..0000000 --- a/ansible/archive/playbooks/preflight/validate_control_node.yml +++ /dev/null @@ -1,11 +0,0 @@ ---- -# playbooks/preflight/validate_control_node.yml -# Run control node system/environment/sanity/reality checks. - -- name: Validate Ansible control node readiness - hosts: localhost - connection: local - gather_facts: false - - roles: - - role: control_node_sanity diff --git a/ansible/archive/playbooks/preflight/validate_edge_ingress.yml b/ansible/archive/playbooks/preflight/validate_edge_ingress.yml deleted file mode 100644 index 55de75d..0000000 --- a/ansible/archive/playbooks/preflight/validate_edge_ingress.yml +++ /dev/null @@ -1,80 +0,0 @@ ---- -# Validate edge ingress readiness for an externally-routed Swarm service. 
-# Usage: -# ansible-playbook -i inventory/hosts.ini playbooks/preflight/validate_edge_ingress.yml \ -# -e "service_fqdn=git.castaldifamily.com" \ -# -e "backend_port=8251" - -- name: Validate external Traefik ingress path - hosts: localhost - connection: local - gather_facts: false - vars_files: - - ../../group_vars/all.yml - - vars: - service_fqdn: "git.castaldifamily.com" - backend_port: 8251 - # backend_host controls which IP Heimdall probes for the backend. - # Default: swarm.bind_ip -- correct for Swarm services (routing mesh exposes published - # ports on all nodes). Override with edge_routing.integration.bind_ip for services - # running on Watchtower (Grafana, Dozzle, Uptime Kuma, etc.). - backend_host: "{{ edge_routing.swarm.bind_ip }}" - allowed_external_http_codes: - - "200" - - "301" - - "302" - - "401" - - "403" - - tasks: - - name: Build derived probe URLs - ansible.builtin.set_fact: - backend_url: "http://{{ backend_host }}:{{ backend_port }}" - external_url: "https://{{ service_fqdn }}" - primary_swarm_manager: "{{ groups['swarm_managers'][0] }}" - - - name: Validate required variables - ansible.builtin.assert: - that: - - edge_routing.edge_host.name | length > 0 - - edge_routing.integration.bind_ip | length > 0 - - edge_routing.integration.redis_addr | length > 0 - - service_fqdn | length > 0 - fail_msg: "Missing required edge routing or service probe inputs."
- - - name: Probe service backend from edge host - ansible.builtin.command: >- - curl -sS -o /dev/null -w %{http_code} --max-time 6 {{ backend_url }} - delegate_to: "{{ edge_routing.edge_host.name }}" - register: edge_backend_probe - changed_when: false - failed_when: edge_backend_probe.stdout == "000" - - - name: Probe public service endpoint from controller - ansible.builtin.command: >- - curl -sS -k -o /dev/null -w %{http_code} --max-time 10 {{ external_url }} - register: external_probe - changed_when: false - - - name: Check external endpoint health code - ansible.builtin.assert: - that: - - external_probe.stdout in allowed_external_http_codes - fail_msg: >- - External endpoint {{ external_url }} returned HTTP {{ external_probe.stdout }}. - Expected one of {{ allowed_external_http_codes | join(', ') }}. - - - name: Capture traefik-kop logs for publication hints - ansible.builtin.command: docker service logs traefik-kop_traefik-kop --tail 120 - delegate_to: "{{ primary_swarm_manager }}" - register: traefik_kop_logs - changed_when: false - failed_when: false - - - name: Report ingress validation summary - ansible.builtin.debug: - msg: - - "Edge backend probe (from {{ edge_routing.edge_host.name }}): {{ backend_url }} -> HTTP {{ edge_backend_probe.stdout }}" - - "External probe (from controller): {{ external_url }} -> HTTP {{ external_probe.stdout }}" - - "Traefik-kop log sample lines: {{ (traefik_kop_logs.stdout_lines | default([]))[:8] }}" diff --git a/ansible/archive/playbooks/proxmox/deploy_standalone_ubuntu_vm.yml b/ansible/archive/playbooks/proxmox/deploy_standalone_ubuntu_vm.yml deleted file mode 100644 index cafbae9..0000000 --- a/ansible/archive/playbooks/proxmox/deploy_standalone_ubuntu_vm.yml +++ /dev/null @@ -1,56 +0,0 @@ ---- -# Deploy one standalone Ubuntu 24.04 VM on a specific Proxmox host. 
-# -# Usage: -# ansible-playbook -i inventory/hosts.ini playbooks/proxmox/deploy_standalone_ubuntu_vm.yml \ -# -e "target_host=pve02 standalone_vm_vmid=5210" - -- name: Deploy standalone Ubuntu VM on Proxmox - hosts: "{{ target_host | default('standalone_vm_host_undefined') }}" - gather_facts: true - vars: - standalone_vm_name: "statler" - standalone_vm_memory_mb: 10240 - standalone_vm_cores: 2 - standalone_vm_disk_size: "32G" - standalone_vm_ip_cidr: "10.0.0.210/24" - standalone_vm_gateway: "10.0.0.2" - standalone_vm_dns_servers: - - "10.0.0.2" - - "8.8.8.8" - - pre_tasks: - - name: Validate standalone VM playbook scope - ansible.builtin.assert: - that: - - target_host is defined - - target_host | length > 0 - - target_host not in ['all', '*', 'proxmox_cluster', 'standalone_vm_host_undefined'] - - standalone_vm_vmid is defined - - (standalone_vm_vmid | int) > 0 - fail_msg: "Provide an explicit Proxmox host and VMID. Example: -e 'target_host=pve02 standalone_vm_vmid=5210'" - success_msg: "Standalone VM playbook scope validated." - run_once: true - delegate_to: localhost - tags: ['preflight', 'validate'] - - roles: - - role: statler - vars: - statler_vm_name: "{{ standalone_vm_name }}" - statler_vm_vmid: "{{ standalone_vm_vmid | int }}" - statler_vm_memory_mb: "{{ standalone_vm_memory_mb | int }}" - statler_vm_cores: "{{ standalone_vm_cores | int }}" - statler_vm_disk_size: "{{ standalone_vm_disk_size }}" - statler_vm_ip_cidr: "{{ standalone_vm_ip_cidr }}" - statler_vm_gateway: "{{ standalone_vm_gateway }}" - statler_vm_dns_servers: "{{ standalone_vm_dns_servers }}" - - post_tasks: - - name: Show follow-up onboarding command - ansible.builtin.debug: - msg: - - "Provisioning finished for {{ standalone_vm_name }} at {{ standalone_vm_ip_cidr }}." 
- "Next: ansible-playbook -i inventory/hosts.ini playbooks/onboarding/generic_host.yml -e 'target_host={{ standalone_vm_name }} onboarding_profile=new'" - run_once: true - tags: ['summary'] diff --git a/ansible/archive/playbooks/proxmox/grow_vm_disks.yml b/ansible/archive/playbooks/proxmox/grow_vm_disks.yml deleted file mode 100644 index 41ff6fa..0000000 --- a/ansible/archive/playbooks/proxmox/grow_vm_disks.yml +++ /dev/null @@ -1,188 +0,0 @@ ---- -# playbooks/proxmox/grow_vm_disks.yml -# -# Purpose: -# Idempotently ensures all Swarm VM disks are sized to vm_disk_target on the -# Proxmox layer (Play 1), reboots affected VMs so the guest kernel reads the -# new block device geometry (Play 2), then grows the in-guest partition and -# filesystem to match (Play 3). -# -# Architecture: -# Play 1 -- proxmox_cluster: checks the actual LVM volume size via `lvs` (NOT -# `qm config`, which can be out of sync) and uses `lvextend` if below target. -# WHY lvs not qm config: qm resize updates Proxmox metadata but can silently -# fail to grow the LVM when the VM is running. lvs shows ground truth. -# Play 2 -- proxmox_cluster: reboots only the VMs whose LVs were just extended. -# Play 3 -- swarm_hosts: waits for SSH, then runs disk_grow role (growpart + -# resize2fs). WHY reboot required: virtio-scsi guests on this kernel do not -# honour /sys/class/block/sda/device/rescan or scsi_host scans while running. -# Only a cold re-read of block device geometry at boot is reliable.
-# -# VMID scheme: manager = (node_index * 100) + 1, worker = (node_index * 100) + 2 -# pve01 → 101/102, pve02 → 201/202, pve03 → 301/302 -# LV path: /dev/pve/vm-{vmid}-disk-0 (standard local-lvm layout) -# -# Pre-requisites: -# - SSH access to proxmox_cluster and swarm_hosts -# - LVM tools available on Proxmox nodes (standard PVE install) -# - cloud-guest-utils will be installed by disk_grow role if absent -# -# Usage: -# Fix all Swarm VMs across all PVE nodes: -# ansible-playbook -i inventory/hosts.ini playbooks/proxmox/grow_vm_disks.yml -# -# Fix a single node end-to-end (all three plays, one guest): -# ansible-playbook -i inventory/hosts.ini playbooks/proxmox/grow_vm_disks.yml \ -# -e "target_vmids=101" --limit pve01 # Play 1+2 on pve01, Play 3 on swarm-manager-1 -# -# In-guest grow only (disk already extended, VM already rebooted): -# ansible-playbook -i inventory/hosts.ini playbooks/proxmox/grow_vm_disks.yml \ -# --limit swarm-manager-1 --tags in_guest -# -# Validate only (no changes): -# ansible-playbook -i inventory/hosts.ini playbooks/proxmox/grow_vm_disks.yml \ -# --check -# -# Verification after run: -# ansible swarm_hosts -i inventory/hosts.ini -m shell -a "df -h /" --become - -# ============================================================ -# PLAY 1: Proxmox layer -- extend LVM volume for each Swarm VM -# Source of truth: lvs (actual LVM size), NOT qm config (metadata only) -# ============================================================ -- name: Extend Swarm VM LVM volumes on Proxmox hosts - hosts: proxmox_cluster - become: false - gather_facts: false - tags: [proxmox_resize] - - vars: - vm_disk_target: "32G" - vm_disk_target_gb: "{{ vm_disk_target | regex_replace('[^0-9]', '') | int }}" - vm_lv_vg: "pve" - - tasks: - - name: Derive VM IDs and LV names for this PVE node - ansible.builtin.set_fact: - disk_grow_manager_vmid: "{{ (inventory_hostname | regex_replace('[^0-9]', '') | int) * 100 + 1 }}" - disk_grow_worker_vmid: "{{ (inventory_hostname |
regex_replace('[^0-9]', '') | int) * 100 + 2 }}" - - # Manager VM ------------------------------------------------------- - # WHY lvs not qm config: qm resize updates Proxmox metadata but silently - # fails to grow the LVM when the VM is already running. lvs is ground truth. - - - name: Get actual LVM size for manager VM {{ disk_grow_manager_vmid }} - ansible.builtin.shell: | - lvs --noheadings --units g -o lv_size \ - /dev/{{ vm_lv_vg }}/vm-{{ disk_grow_manager_vmid }}-disk-0 2>/dev/null \ - | tr -d ' ' | sed 's/g$//' | cut -d. -f1 \ - || echo "absent" - args: - executable: /bin/bash - register: disk_grow_manager_lv_size - changed_when: false - - - name: Extend manager VM LV to {{ vm_disk_target }} if below target - ansible.builtin.shell: | - lvextend -L {{ vm_disk_target }} \ - /dev/{{ vm_lv_vg }}/vm-{{ disk_grow_manager_vmid }}-disk-0 - args: - executable: /bin/bash - when: - - disk_grow_manager_lv_size.stdout | trim != 'absent' - - (disk_grow_manager_lv_size.stdout | trim | int) < (vm_disk_target_gb | int) - register: disk_grow_manager_extend_result - changed_when: disk_grow_manager_extend_result.rc == 0 - - - name: Report manager VM LV state - ansible.builtin.debug: - msg: >- - Manager VM {{ disk_grow_manager_vmid }} LV: - {{ disk_grow_manager_lv_size.stdout | trim }}G - → {{ vm_disk_target }} - ({{ 'extended -- reboot required' if (disk_grow_manager_extend_result is not skipped) - else 'already at target or absent' }}) - when: disk_grow_manager_lv_size.stdout | trim != 'absent' - - # Worker VM -------------------------------------------------------- - - - name: Get actual LVM size for worker VM {{ disk_grow_worker_vmid }} - ansible.builtin.shell: | - lvs --noheadings --units g -o lv_size \ - /dev/{{ vm_lv_vg }}/vm-{{ disk_grow_worker_vmid }}-disk-0 2>/dev/null \ - | tr -d ' ' | sed 's/g$//' | cut -d.
-f1 \ - || echo "absent" - args: - executable: /bin/bash - register: disk_grow_worker_lv_size - changed_when: false - - - name: Extend worker VM LV to {{ vm_disk_target }} if below target - ansible.builtin.shell: | - lvextend -L {{ vm_disk_target }} \ - /dev/{{ vm_lv_vg }}/vm-{{ disk_grow_worker_vmid }}-disk-0 - args: - executable: /bin/bash - when: - - disk_grow_worker_lv_size.stdout | trim != 'absent' - - (disk_grow_worker_lv_size.stdout | trim | int) < (vm_disk_target_gb | int) - register: disk_grow_worker_extend_result - changed_when: disk_grow_worker_extend_result.rc == 0 - - - name: Report worker VM LV state - ansible.builtin.debug: - msg: >- - Worker VM {{ disk_grow_worker_vmid }} LV: - {{ disk_grow_worker_lv_size.stdout | trim }}G - → {{ vm_disk_target }} - ({{ 'extended -- reboot required' if (disk_grow_worker_extend_result is not skipped) - else 'already at target or absent' }}) - when: disk_grow_worker_lv_size.stdout | trim != 'absent' - - # Reboot any VMs whose LV was just extended --------------------------- - # WHY here not in Play 2: qm reboot runs on the PVE host, not the guest. - # We only reboot VMs that were actually extended this run.
- - - name: Reboot manager VM {{ disk_grow_manager_vmid }} to expose new disk size to guest kernel - ansible.builtin.shell: qm reboot {{ disk_grow_manager_vmid }} - args: - executable: /bin/bash - when: - - disk_grow_manager_extend_result is not skipped - - disk_grow_manager_extend_result.changed - changed_when: true - - - name: Reboot worker VM {{ disk_grow_worker_vmid }} to expose new disk size to guest kernel - ansible.builtin.shell: qm reboot {{ disk_grow_worker_vmid }} - args: - executable: /bin/bash - when: - - disk_grow_worker_extend_result is not skipped - - disk_grow_worker_extend_result.changed - changed_when: true - -# ============================================================ -# PLAY 2: Wait for rebooted Swarm nodes to come back -# ============================================================ -- name: Wait for Swarm nodes to return after reboot - hosts: swarm_hosts - become: false - gather_facts: false - tags: [proxmox_resize, in_guest] - - tasks: - - name: Wait for SSH to become available (up to 2 minutes) - ansible.builtin.wait_for_connection: - delay: 10 - timeout: 120 - -# ============================================================ -# PLAY 3: In-guest layer -- grow partition and filesystem -# ============================================================ -- name: Grow in-guest root partition and filesystem on all Swarm nodes - hosts: swarm_hosts - become: true - gather_facts: true - tags: [in_guest] - roles: - - disk_grow diff --git a/ansible/archive/playbooks/proxmox/list_vms.yml b/ansible/archive/playbooks/proxmox/list_vms.yml deleted file mode 100644 index 04a4bc0..0000000 --- a/ansible/archive/playbooks/proxmox/list_vms.yml +++ /dev/null @@ -1,76 +0,0 @@ ---- -- name: List VMs and Containers on Proxmox hosts - hosts: proxmox_cluster - gather_facts: false - vars: - proxmox_user: "{{ lookup('env','PROXMOX_USER') }}" - proxmox_password: "{{ lookup('env','PROXMOX_PASSWORD') }}" - proxmox_verify_ssl: false - tasks: - - name: Ensure credentials are
provided - ansible.builtin.assert: - that: - - proxmox_user is defined - - proxmox_password is defined - fail_msg: "Set PROXMOX_USER and PROXMOX_PASSWORD environment variables before running." - - - name: Build Proxmox Authorization header (use token if available) - ansible.builtin.set_fact: - proxmox_auth_header: >- - {% if lookup('env','PROXMOX_TOKEN_ID') and lookup('env','PROXMOX_TOKEN_SECRET') %} - PVEAPIToken={{ lookup('env','PROXMOX_TOKEN_ID') }}={{ lookup('env','PROXMOX_TOKEN_SECRET') }} - {% elif lookup('env','PROXMOX_USER') and '!' in lookup('env','PROXMOX_USER') and lookup('env','PROXMOX_PASSWORD') %} - PVEAPIToken={{ lookup('env','PROXMOX_USER') }}={{ lookup('env','PROXMOX_PASSWORD') }} - {% else %} - - {% endif %} - - - - name: Query nodes on the host - ansible.builtin.uri: - url: "https://{{ ansible_host }}:8006/api2/json/nodes" - method: GET - validate_certs: "{{ proxmox_verify_ssl }}" - headers: "{{ {'Authorization': proxmox_auth_header} if proxmox_auth_header|length > 0 else {} }}" - return_content: true - register: nodes_resp - failed_when: false - - - name: Set list of nodes - ansible.builtin.set_fact: - proxmox_nodes: "{{ nodes_resp.json.data | map(attribute='node') | list }}" - when: nodes_resp.status == 200 and nodes_resp.json is defined and nodes_resp.json.data is defined - - - name: Query QEMU VMs for each node - ansible.builtin.uri: - url: "https://{{ ansible_host }}:8006/api2/json/nodes/{{ item }}/qemu" - method: GET - validate_certs: "{{ proxmox_verify_ssl }}" - headers: "{{ {'Authorization': proxmox_auth_header} if proxmox_auth_header|length > 0 else {} }}" - return_content: true - loop: "{{ proxmox_nodes | default([]) }}" - register: qemu_results - failed_when: false - when: proxmox_nodes is defined - - - name: Query LXC containers for each node - ansible.builtin.uri: - url: "https://{{ ansible_host }}:8006/api2/json/nodes/{{ item }}/lxc" - method: GET - validate_certs: "{{ proxmox_verify_ssl }}" - headers: "{{ {'Authorization': 
proxmox_auth_header} if proxmox_auth_header|length > 0 else {} }}" - return_content: true - loop: "{{ proxmox_nodes | default([]) }}" - register: lxc_results - failed_when: false - when: proxmox_nodes is defined - - - name: Compute counts - ansible.builtin.set_fact: - qemu_count: "{{ qemu_results.results | map(attribute='json.data') | map('length') | sum(default=0) }}" - lxc_count: "{{ lxc_results.results | map(attribute='json.data') | map('length') | sum(default=0) }}" - when: proxmox_nodes is defined - - - name: Report VMs/CTs on host - ansible.builtin.debug: - msg: "Host {{ inventory_hostname }} ({{ ansible_host }}) nodes={{ proxmox_nodes | default([]) }} qemu={{ qemu_count | default(0) }} lxc={{ lxc_count | default(0) }} total={{ (qemu_count|default(0)) + (lxc_count|default(0)) }}" diff --git a/ansible/archive/playbooks/proxmox/provision_swarm_vms.yml b/ansible/archive/playbooks/proxmox/provision_swarm_vms.yml deleted file mode 100644 index f75a00e..0000000 --- a/ansible/archive/playbooks/proxmox/provision_swarm_vms.yml +++ /dev/null @@ -1,542 +0,0 @@ ---- -# playbooks/proxmox/provision_swarm_vms.yml -# Provisions Ubuntu 24.04 VMs on Proxmox hosts for Docker Swarm -# -# Prerequisites: -# - community.general collection installed (ansible-galaxy collection install community.general) -# - Ubuntu 24.04 cloud image downloaded to Proxmox storage -# - API token or root SSH access to Proxmox host -# -# Usage: -# ansible-playbook -i inventory/hosts.ini playbooks/proxmox/provision_swarm_vms.yml -e target_host=pve01 -# -# Variables (can be overridden via -e or group_vars): -# - pve_node_id: extracted from inventory (1-5) -# - vm_template_name: base cloud-init template -# - vm_storage: storage pool for VM disks -# - vm_bridge: network bridge for VM NICs - -- name: Provision Swarm VMs on Proxmox - hosts: "{{ target_host | default('proxmox_cluster') }}" - gather_facts: true - vars: - # VM specifications (from standards doc) - vm_disk_size: "32G" - vm_memory_mb: 4096 - 
vm_cores: 2 - vm_storage: "local-lvm" - vm_bridge: "vmbr0" - - # Cloud image settings - cloud_image_url: "https://cloud-images.ubuntu.com/noble/current/noble-server-cloudimg-amd64.img" - cloud_image_name: "noble-server-cloudimg-amd64.img" - cloud_image_path: "/var/lib/vz/template/iso/{{ cloud_image_name }}" - vm_template_vmid: "{{ 9000 + (node_index | int) }}" - vm_template_name: "ubuntu-24.04-cloud-template-{{ node_index }}" - - # Derive node index from hostname (pve01 -> 1, pve02 -> 2, etc.) and coerce to integer - node_index: "{{ (pve_node_id | default(inventory_hostname | regex_replace('[^0-9]', '')) ) | int }}" - - # VM IDs (unique per node: 101/102 for node 1, 201/202 for node 2, etc.) - manager_vmid: "{{ (node_index | int) * 100 + 1 }}" - worker_vmid: "{{ (node_index | int) * 100 + 2 }}" - - # VM names - manager_name: "swarm-manager-{{ node_index }}" - worker_name: "swarm-worker-{{ node_index }}" - - # Static IPs (from inventory scheme: managers .211-.215, workers .221-.225) - manager_ip: "10.0.0.{{ 210 + (node_index | int) }}" - worker_ip: "10.0.0.{{ 220 + (node_index | int) }}" - network_cidr: "24" - gateway_ip: "10.0.0.2" - dns_primary: "10.0.0.2" - dns_secondary: "8.8.8.8" - - # Cloud-init user - vm_user: "chester" - vm_ssh_key: "{{ lookup('file', lookup('env', 'HOME') + '/.ssh/id_ed25519.pub') }}" - - tasks: - # ======================================== - # SECTION 1: Download Cloud Image - # ======================================== - - name: Check if cloud image already exists - ansible.builtin.stat: - path: "{{ cloud_image_path }}" - register: cloud_image_stat - tags: ['template', 'download'] - - - name: Download Ubuntu 24.04 cloud image - ansible.builtin.get_url: - url: "{{ cloud_image_url }}" - dest: "{{ cloud_image_path }}" - mode: '0644' - when: not cloud_image_stat.stat.exists - tags: ['template', 'download'] - - # ======================================== - # SECTION 2: Create VM Template - # ======================================== - - name: 
Check if VM template already exists - ansible.builtin.shell: | - qm status {{ vm_template_vmid }} 2>/dev/null && echo "exists" || echo "missing" - register: template_check - changed_when: false - failed_when: false - tags: ['template'] - - - name: Create VM template from cloud image - when: "'missing' in template_check.stdout" - tags: ['template'] - block: - - name: Create base VM for template - ansible.builtin.shell: | - qm create {{ vm_template_vmid }} \ - --name {{ vm_template_name }} \ - --memory 2048 \ - --cores 2 \ - --net0 virtio,bridge={{ vm_bridge }} \ - --scsihw virtio-scsi-pci - register: create_vm - changed_when: false - - - name: Import cloud image as disk - ansible.builtin.shell: | - qm importdisk {{ vm_template_vmid }} {{ cloud_image_path }} {{ vm_storage }} - register: import_disk - changed_when: false - - - name: Attach imported disk to VM - ansible.builtin.shell: | - qm set {{ vm_template_vmid }} \ - --scsi0 {{ vm_storage }}:vm-{{ vm_template_vmid }}-disk-0 \ - --boot c \ - --bootdisk scsi0 - changed_when: false - - - name: Add cloud-init drive - ansible.builtin.shell: | - qm set {{ vm_template_vmid }} --ide2 {{ vm_storage }}:cloudinit - changed_when: false - - - name: Configure serial console for cloud-init - ansible.builtin.shell: | - qm set {{ vm_template_vmid }} --serial0 socket --vga serial0 - changed_when: false - - - name: Convert VM to template - ansible.builtin.shell: | - qm template {{ vm_template_vmid }} - changed_when: false - - # ======================================== - # SECTION 3: Clone and Configure Manager VM - # ======================================== - - name: Check if manager VM already exists - ansible.builtin.shell: | - qm status {{ manager_vmid }} 2>/dev/null && echo "exists" || echo "missing" - register: manager_check - changed_when: false - failed_when: false - tags: ['provision', 'manager'] - - - name: Provision Swarm Manager VM - when: "'missing' in manager_check.stdout" - tags: ['provision', 'manager'] - block: - - 
name: Clone template to manager VM - ansible.builtin.shell: | - qm clone {{ vm_template_vmid }} {{ manager_vmid }} \ - --name {{ manager_name }} \ - --full - changed_when: false - - - name: Resize manager disk to {{ vm_disk_size }} - ansible.builtin.shell: | - qm resize {{ manager_vmid }} scsi0 {{ vm_disk_size }} - changed_when: false - - - name: Configure manager VM resources - ansible.builtin.shell: | - qm set {{ manager_vmid }} \ - --memory {{ vm_memory_mb }} \ - --cores {{ vm_cores }} \ - --onboot 1 \ - --agent enabled=1 - changed_when: false - - - name: Write SSH public key for manager - ansible.builtin.copy: - content: "{{ vm_ssh_key }}" - dest: "/tmp/sshkey_{{ manager_vmid }}.pub" - mode: '0644' - - - name: Configure manager cloud-init - ansible.builtin.shell: | - qm set {{ manager_vmid }} \ - --ciuser {{ vm_user }} \ - --sshkeys /tmp/sshkey_{{ manager_vmid }}.pub \ - --ipconfig0 ip={{ manager_ip }}/{{ network_cidr }},gw={{ gateway_ip }} \ - --nameserver {{ dns_primary }} \ - --searchdomain local - changed_when: false - - - name: Start manager VM - ansible.builtin.shell: | - qm start {{ manager_vmid }} - changed_when: false - - - name: Display manager VM info - ansible.builtin.debug: - msg: "Manager VM {{ manager_name }} (ID: {{ manager_vmid }}) configured with IP {{ manager_ip }}" - tags: ['provision', 'manager'] - - # ======================================== - # SECTION 4: Clone and Configure Worker VM - # ======================================== - - name: Check if worker VM already exists - ansible.builtin.shell: | - qm status {{ worker_vmid }} 2>/dev/null && echo "exists" || echo "missing" - register: worker_check - changed_when: false - failed_when: false - tags: ['provision', 'worker'] - - - name: Provision Swarm Worker VM - when: "'missing' in worker_check.stdout" - tags: ['provision', 'worker'] - block: - - name: Clone template to worker VM - ansible.builtin.shell: | - qm clone {{ vm_template_vmid }} {{ worker_vmid }} \ - --name {{ worker_name }} \ - 
--full - changed_when: false - - - name: Resize worker disk to {{ vm_disk_size }} - ansible.builtin.shell: | - qm resize {{ worker_vmid }} scsi0 {{ vm_disk_size }} - changed_when: false - - - name: Configure worker VM resources - ansible.builtin.shell: | - qm set {{ worker_vmid }} \ - --memory {{ vm_memory_mb }} \ - --cores {{ vm_cores }} \ - --onboot 1 \ - --agent enabled=1 - changed_when: false - - - name: Write SSH public key for worker - ansible.builtin.copy: - content: "{{ vm_ssh_key }}" - dest: "/tmp/sshkey_{{ worker_vmid }}.pub" - mode: '0644' - - - name: Configure worker cloud-init - ansible.builtin.shell: | - qm set {{ worker_vmid }} \ - --ciuser {{ vm_user }} \ - --sshkeys /tmp/sshkey_{{ worker_vmid }}.pub \ - --ipconfig0 ip={{ worker_ip }}/{{ network_cidr }},gw={{ gateway_ip }} \ - --nameserver {{ dns_primary }} \ - --searchdomain local - changed_when: false - - - name: Start worker VM - ansible.builtin.shell: | - qm start {{ worker_vmid }} - changed_when: false - - - name: Display worker VM info - ansible.builtin.debug: - msg: "Worker VM {{ worker_name }} (ID: {{ worker_vmid }}) configured with IP {{ worker_ip }}" - tags: ['provision', 'worker'] - - # ======================================== - # SECTION 5: Idempotent Proxmox disk resize - # WHY unconditional: the Provision blocks only run when a VM is absent. - # An existing VM that predates vm_disk_size being set would be left - # undersized. These tasks run on every invocation and are no-ops when - # the disk is already at or above the target size. - # WHY numeric comparison: qm resize cannot shrink; comparing parsed GB - # values prevents an error when the disk is already correct. 
- # ======================================== - - - name: Get current manager VM disk size - ansible.builtin.shell: | - qm config {{ manager_vmid }} | grep "^scsi0:" | grep -oP 'size=\K[^,\s]+' - register: disk_grow_manager_current - changed_when: false - tags: ['provision', 'disks'] - - - name: Resize manager disk to {{ vm_disk_size }} if below target - ansible.builtin.shell: | - qm resize {{ manager_vmid }} scsi0 {{ vm_disk_size }} - when: > - (disk_grow_manager_current.stdout | regex_replace('[^0-9]', '') | int) - < (vm_disk_size | regex_replace('[^0-9]', '') | int) - tags: ['provision', 'disks'] - - - name: Get current worker VM disk size - ansible.builtin.shell: | - qm config {{ worker_vmid }} | grep "^scsi0:" | grep -oP 'size=\K[^,\s]+' - register: disk_grow_worker_current - changed_when: false - tags: ['provision', 'disks'] - - - name: Resize worker disk to {{ vm_disk_size }} if below target - ansible.builtin.shell: | - qm resize {{ worker_vmid }} scsi0 {{ vm_disk_size }} - when: > - (disk_grow_worker_current.stdout | regex_replace('[^0-9]', '') | int) - < (vm_disk_size | regex_replace('[^0-9]', '') | int) - tags: ['provision', 'disks'] - - # ======================================== - # SECTION 6: Wait for VMs to be ready - # ======================================== - - name: Wait for manager VM to be reachable via SSH - ansible.builtin.wait_for: - host: "{{ manager_ip }}" - port: 22 - delay: 30 - timeout: 300 - state: started - tags: ['provision', 'wait'] - - - name: Wait for worker VM to be reachable via SSH - ansible.builtin.wait_for: - host: "{{ worker_ip }}" - port: 22 - delay: 30 - timeout: 300 - state: started - tags: ['provision', 'wait'] - - - name: VM provisioning complete - ansible.builtin.debug: - msg: | - ✅ VMs provisioned successfully on {{ inventory_hostname }}: - - {{ manager_name }}: {{ manager_ip }} (VMID {{ manager_vmid }}) - - {{ worker_name }}: {{ worker_ip }} (VMID {{ worker_vmid }}) - - Next steps: add VMs to in-memory inventory,
install Docker, initialize Docker Swarm, and verify connectivity - tags: ['provision'] - - - name: Add manager VMs to in-memory inventory - ansible.builtin.add_host: - name: "swarm-manager-{{ item | regex_replace('[^0-9]', '') | int }}" - ansible_host: "10.0.0.{{ 210 + (item | regex_replace('[^0-9]', '') | int) }}" - ansible_user: "{{ vm_user }}" - groups: "swarm_managers,swarm_hosts" - loop: "{{ groups['proxmox_cluster'] }}" - run_once: true - tags: ['provision'] - - - name: Add worker VMs to in-memory inventory - ansible.builtin.add_host: - name: "swarm-worker-{{ item | regex_replace('[^0-9]', '') | int }}" - ansible_host: "10.0.0.{{ 220 + (item | regex_replace('[^0-9]', '') | int) }}" - ansible_user: "{{ vm_user }}" - groups: "swarm_workers,swarm_hosts" - loop: "{{ groups['proxmox_cluster'] }}" - run_once: true - tags: ['provision'] - -# ======================================== -# SECTION 6: Install Docker on VMs -# ======================================== -- name: Install Docker Engine (Docker CE) from official repo - hosts: swarm_hosts - become: true - gather_facts: true - vars: - vm_user: chester - - tasks: - - name: Install prerequisites for Docker - ansible.builtin.apt: - name: - - ca-certificates - - curl - - gnupg - - lsb-release - - python3-jsondiff - state: present - update_cache: true - tags: ['docker'] - - - name: Add Docker GPG key (dearmored) - ansible.builtin.shell: | - curl -fsSL https://download.docker.com/linux/ubuntu/gpg | gpg --dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg - args: - creates: /usr/share/keyrings/docker-archive-keyring.gpg - tags: ['docker'] - - - name: Add Docker APT repository - ansible.builtin.apt_repository: - repo: "deb [arch=amd64 signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] https://download.docker.com/linux/ubuntu {{ ansible_distribution_release }} stable" - filename: docker - state: present - tags: ['docker'] - - - name: Update apt cache after adding Docker repo - ansible.builtin.apt: - 
update_cache: true - tags: ['docker'] - - - name: Install Docker CE, CLI, containerd and compose plugin - ansible.builtin.apt: - name: - - docker-ce - - docker-ce-cli - - containerd.io - - docker-compose-plugin - state: present - update_cache: false - tags: ['docker'] - - - name: Ensure Docker service is started and enabled - ansible.builtin.systemd: - name: docker - state: started - enabled: true - tags: ['docker'] - - - name: Add '{{ vm_user }}' to docker group - ansible.builtin.user: - name: "{{ vm_user }}" - groups: docker - append: true - tags: ['docker'] - - - name: Ensure /opt/stacks exists and is owned by '{{ vm_user }}' - ansible.builtin.file: - path: /opt/stacks - state: directory - owner: "{{ vm_user }}" - group: "{{ vm_user }}" - mode: '0755' - tags: ['docker'] - -# ======================================== -# SECTION 7: Initialize Docker Swarm and Join Nodes -# ======================================== -- name: Initialize Docker Swarm on manager VMs - hosts: swarm_managers - become: true - gather_facts: false - - tasks: - - name: Initialize swarm on primary manager (run once on first manager) - ansible.builtin.command: > - docker swarm init --advertise-addr {{ hostvars[groups['swarm_managers'][0]]['ansible_host'] }} - delegate_to: "{{ groups['swarm_managers'][0] }}" - run_once: true - register: swarm_init - failed_when: false - changed_when: false - - - name: Get worker join token from leader - ansible.builtin.command: docker swarm join-token -q worker - delegate_to: "{{ groups['swarm_managers'][0] }}" - run_once: true - register: swarm_worker_token - changed_when: false - - - name: Get manager join token from leader - ansible.builtin.command: docker swarm join-token -q manager - delegate_to: "{{ groups['swarm_managers'][0] }}" - run_once: true - register: swarm_manager_token - changed_when: false - - - name: Join secondary managers as managers - ansible.builtin.shell: > - docker swarm join --token {{ swarm_manager_token.stdout }} {{ 
hostvars[groups['swarm_managers'][0]]['ansible_host'] }}:2377 - when: inventory_hostname != groups['swarm_managers'][0] - changed_when: false - -# Join workers (use tokens fetched from leader) -- name: Join worker VMs to Docker Swarm - hosts: swarm_workers - become: true - gather_facts: false - - tasks: - - name: Fetch worker token from leader (delegated) - ansible.builtin.command: docker swarm join-token -q worker - delegate_to: "{{ groups['swarm_managers'][0] }}" - run_once: true - register: swarm_worker_token - changed_when: false - - - name: Check if node is already part of a swarm - ansible.builtin.command: docker info --format '{{"{{.Swarm.LocalNodeState}}"}}' - register: swarm_state - failed_when: false - changed_when: false - - - name: Join this VM to swarm as worker - ansible.builtin.shell: > - docker swarm join --token {{ swarm_worker_token.stdout }} {{ hostvars[groups['swarm_managers'][0]]['ansible_host'] }}:2377 - when: swarm_state.stdout not in ['active','pending'] - changed_when: false - -- name: Verify Swarm Cluster from leader - hosts: "{{ groups.get('swarm_managers', ['localhost'])[0] }}" - become: true - gather_facts: false - - tasks: - - block: - - name: Show docker nodes on leader - ansible.builtin.command: docker node ls - register: node_list - failed_when: false - changed_when: false - - - name: Debug node list - ansible.builtin.debug: - var: node_list.stdout_lines - when: inventory_hostname in groups.get('swarm_managers', []) - -# ======================================== -# SECTION 8: Connectivity Verification (All permutations) -# ======================================== -- name: Verify network connectivity between all Proxmox hosts and VMs - hosts: proxmox_cluster,swarm_hosts - gather_facts: false - become: true - - tasks: - - name: Build list of target IPs (run once) - run_once: true - ansible.builtin.set_fact: - all_targets: > - {{ (groups['proxmox_cluster'] | map('extract', hostvars, 'ansible_host') | list) + (groups['swarm_hosts'] | 
map('extract', hostvars, 'ansible_host') | list) }} - - - name: Check connectivity to all targets - vars: - target: "{{ item }}" - ansible.builtin.command: ping -c 1 -W 1 {{ item }} - register: ping_result - failed_when: false - changed_when: false - loop: "{{ all_targets }}" - - - name: Report connectivity failures - ansible.builtin.debug: - msg: | - From {{ inventory_hostname }} -> {{ item.item }} : rc={{ item.rc }} - loop: "{{ ping_result.results }}" - when: item.rc != 0 - failed_when: false - - - name: Fail if any critical connectivity missing (optional) - ansible.builtin.fail: - msg: "Connectivity failures detected from {{ inventory_hostname }}" - when: ping_result.results | selectattr('rc','ne',0) | list | length > 0 - failed_when: false diff --git a/ansible/archive/playbooks/proxmox/pve_audit.yml b/ansible/archive/playbooks/proxmox/pve_audit.yml deleted file mode 100644 index f9fa759..0000000 --- a/ansible/archive/playbooks/proxmox/pve_audit.yml +++ /dev/null @@ -1,217 +0,0 @@ ---- -# playbooks/proxmox/pve_audit.yml -# Read-only cross-node consistency audit for the Proxmox cluster. -# Safe to schedule. Makes no changes to any host. 
-# -# What this does: -# Play 1 — Gathers key state from all proxmox_cluster nodes (kernel, repos, -# swap, nag script, GRUB cmdline, HA services, cluster quorum) -# Play 2 — Asserts consistency across all 3 nodes and writes a markdown -# drift report to outputs/pve_audit_.md -# -# Usage: -# ansible-playbook -i inventory/hosts.ini playbooks/proxmox/pve_audit.yml -# -# Output: -# outputs/pve_audit_.md (repo root) - -- name: "Play 1: Gather Proxmox cluster node state" - hosts: proxmox_cluster - become: true - gather_facts: true - - tasks: - - name: Check nag removal script presence - ansible.builtin.stat: - path: /usr/local/bin/pve-remove-nag.sh - register: nag_script_stat - - - name: Read GRUB cmdline - ansible.builtin.command: grep '^GRUB_CMDLINE_LINUX_DEFAULT=' /etc/default/grub - register: grub_cmdline - changed_when: false - check_mode: false - - - name: Check enterprise repo files absent - ansible.builtin.stat: - path: "{{ item }}" - loop: - - /etc/apt/sources.list.d/pve-enterprise.list - - /etc/apt/sources.list.d/pve-enterprise.sources - - /etc/apt/sources.list.d/ceph.list - - /etc/apt/sources.list.d/ceph.sources - register: enterprise_repo_stat - - - name: Get cluster quorum status - ansible.builtin.command: pvecm status - register: pvecm_status - changed_when: false - check_mode: false - failed_when: false - - - name: Check HA and cluster service states - ansible.builtin.command: "systemctl is-active {{ item }}" - register: service_active_check - changed_when: false - check_mode: false - failed_when: false - loop: - - corosync - - pve-ha-lrm - - pve-ha-crm - - - name: Check PermitRootLogin effective setting - ansible.builtin.command: sshd -T - register: sshd_config_dump - changed_when: false - check_mode: false - failed_when: false - - - name: Stash per-node audit facts for cross-node comparison - ansible.builtin.set_fact: - pve_audit: - kernel: "{{ ansible_kernel }}" - distro_version: "{{ ansible_distribution_version }}" - swap_mb: "{{
ansible_swaptotal_mb }}" - nag_script_present: "{{ nag_script_stat.stat.exists }}" - grub_cmdline: "{{ grub_cmdline.stdout }}" - enterprise_repos_absent: >- - {{ enterprise_repo_stat.results | selectattr('stat.exists', 'equalto', true) | list | length == 0 }} - pvecm_output: "{{ pvecm_status.stdout | default('(pvecm not available)') }}" - quorate: >- - {{ 'Quorate:' in (pvecm_status.stdout | default('')) and - 'Yes' in ((pvecm_status.stdout | default('')) | regex_search('Quorate:.*') | default('')) }} - corosync_active: >- - {{ (service_active_check.results | selectattr('item', 'equalto', 'corosync') | first).stdout == 'active' }} - ha_lrm_active: >- - {{ (service_active_check.results | selectattr('item', 'equalto', 'pve-ha-lrm') | first).stdout == 'active' }} - ha_crm_active: >- - {{ (service_active_check.results | selectattr('item', 'equalto', 'pve-ha-crm') | first).stdout == 'active' }} - permit_root_login: >- - {{ 'permitrootlogin yes' in (sshd_config_dump.stdout | default('') | lower) }} - - -- name: "Play 2: Cross-node consistency assertions and drift report" - hosts: localhost - gather_facts: false - - vars: - pve_nodes: "{{ groups['proxmox_cluster'] }}" - audit_timestamp: "{{ lookup('pipe', 'date +%Y%m%dT%H%M%S') }}" - report_path: "{{ playbook_dir }}/../../../outputs/pve_audit_{{ audit_timestamp }}.md" - - tasks: - - name: Ensure outputs directory exists - ansible.builtin.file: - path: "{{ playbook_dir }}/../../../outputs" - state: directory - mode: '0755' - - - name: Write drift report - ansible.builtin.copy: - dest: "{{ report_path }}" - mode: '0644' - content: | - # Proxmox Cluster Audit Report - - Generated: {{ audit_timestamp }} - Nodes audited: {{ pve_nodes | join(', ') }} - - ## Node Summary - - | Node | Kernel | Distro | Swap | Nag Script | Enterprise Repos | Quorate | Corosync | HA-LRM | HA-CRM | - |------|--------|--------|------|------------|------------------|---------|----------|--------|--------| - {% for node in pve_nodes %} - | {{ node }} | 
`{{ hostvars[node]['pve_audit']['kernel'] }}` | {{ hostvars[node]['pve_audit']['distro_version'] }} | {{ hostvars[node]['pve_audit']['swap_mb'] }}MB | {{ '✅' if hostvars[node]['pve_audit']['nag_script_present'] | bool else '❌' }} | {{ '✅ absent' if hostvars[node]['pve_audit']['enterprise_repos_absent'] | bool else '❌ present' }} | {{ '✅' if hostvars[node]['pve_audit']['quorate'] | bool else '❌' }} | {{ '✅' if hostvars[node]['pve_audit']['corosync_active'] | bool else '❌' }} | {{ '✅' if hostvars[node]['pve_audit']['ha_lrm_active'] | bool else '❌' }} | {{ '✅' if hostvars[node]['pve_audit']['ha_crm_active'] | bool else '❌' }} | - {% endfor %} - - ## GRUB Cmdline - - {% for node in pve_nodes %} - - **{{ node }}**: `{{ hostvars[node]['pve_audit']['grub_cmdline'] }}` - {% endfor %} - - ## Cluster Quorum Status - - {% for node in pve_nodes %} - ### {{ node }} - - ``` - {{ hostvars[node]['pve_audit']['pvecm_output'] }} - ``` - - {% endfor %} - - - name: Assert kernel consistency across all nodes - ansible.builtin.assert: - that: - - hostvars[item]['pve_audit']['kernel'] == hostvars[pve_nodes[0]]['pve_audit']['kernel'] - fail_msg: >- - ❌ Kernel drift: {{ item }} has {{ hostvars[item]['pve_audit']['kernel'] }} - but {{ pve_nodes[0] }} has {{ hostvars[pve_nodes[0]]['pve_audit']['kernel'] }} - success_msg: "✅ {{ item }}: kernel {{ hostvars[item]['pve_audit']['kernel'] }}" - loop: "{{ pve_nodes }}" - - - name: Assert distro version consistency across all nodes - ansible.builtin.assert: - that: - - hostvars[item]['pve_audit']['distro_version'] == hostvars[pve_nodes[0]]['pve_audit']['distro_version'] - fail_msg: >- - ❌ Distro version drift: {{ item }} has {{ hostvars[item]['pve_audit']['distro_version'] }} - but {{ pve_nodes[0] }} has {{ hostvars[pve_nodes[0]]['pve_audit']['distro_version'] }} - success_msg: "✅ {{ item }}: distro {{ hostvars[item]['pve_audit']['distro_version'] }}" - loop: "{{ pve_nodes }}" - - - name: Assert swap is disabled on all nodes -
ansible.builtin.assert: - that: - - hostvars[item]['pve_audit']['swap_mb'] | int == 0 - fail_msg: "❌ Swap is enabled on {{ item }}: {{ hostvars[item]['pve_audit']['swap_mb'] }}MB — run pve_baseline.yml --tags storage" - success_msg: "✅ {{ item }}: swap disabled" - loop: "{{ pve_nodes }}" - - - name: Assert nag removal script present on all nodes - ansible.builtin.assert: - that: - - hostvars[item]['pve_audit']['nag_script_present'] | bool - fail_msg: "❌ Nag removal script missing on {{ item }} — run pve_baseline.yml --tags nag" - success_msg: "✅ {{ item }}: nag script present" - loop: "{{ pve_nodes }}" - - - name: Assert enterprise repos absent on all nodes - ansible.builtin.assert: - that: - - hostvars[item]['pve_audit']['enterprise_repos_absent'] | bool - fail_msg: "❌ Enterprise repo still present on {{ item }} — run pve_baseline.yml --tags repos" - success_msg: "✅ {{ item }}: enterprise repos absent" - loop: "{{ pve_nodes }}" - - - name: Assert cluster is quorate - ansible.builtin.assert: - that: - - hostvars[item]['pve_audit']['quorate'] | bool - fail_msg: "❌ {{ item }} reports cluster NOT quorate — investigate immediately" - success_msg: "✅ {{ item }}: cluster quorate" - loop: "{{ pve_nodes }}" - - - name: Assert HA and Corosync services running on all nodes - ansible.builtin.assert: - that: - - hostvars[item]['pve_audit']['corosync_active'] | bool - - hostvars[item]['pve_audit']['ha_lrm_active'] | bool - - hostvars[item]['pve_audit']['ha_crm_active'] | bool - fail_msg: >- - ❌ HA/Corosync degraded on {{ item }}: - corosync={{ hostvars[item]['pve_audit']['corosync_active'] }} - pve-ha-lrm={{ hostvars[item]['pve_audit']['ha_lrm_active'] }} - pve-ha-crm={{ hostvars[item]['pve_audit']['ha_crm_active'] }} - success_msg: "✅ {{ item }}: corosync + HA services active" - loop: "{{ pve_nodes }}" - - - name: Assert PermitRootLogin is enabled on all nodes - ansible.builtin.assert: - that: - - hostvars[item]['pve_audit']['permit_root_login'] | bool -
fail_msg: "❌ PermitRootLogin is not 'yes' on {{ item }} — run pve_baseline.yml --tags ssh to fix" - success_msg: "✅ {{ item }}: PermitRootLogin yes" - loop: "{{ pve_nodes }}" diff --git a/ansible/archive/playbooks/proxmox/pve_baseline.yml b/ansible/archive/playbooks/proxmox/pve_baseline.yml deleted file mode 100644 index 4c66388..0000000 --- a/ansible/archive/playbooks/proxmox/pve_baseline.yml +++ /dev/null @@ -1,330 +0,0 @@ ---- -# playbooks/proxmox/pve_baseline.yml -# Idempotent Proxmox cluster baseline enforcement for 12th Gen Intel laptops. -# -# ───────────────────────────────────────────────────────────────────────────── -# PURPOSE: Ongoing drift enforcement — safe to run any time, safe to schedule. -# Does NOT upgrade packages. Does NOT reboot. -# For day-0 first-time provisioning: use playbooks/onboarding/proxmox_host.yml -# For rolling package updates: use playbooks/proxmox/pve_update.yml -# For cross-node consistency audit: use playbooks/proxmox/pve_audit.yml -# ───────────────────────────────────────────────────────────────────────────── -# -# What this enforces (all idempotent): -# 0. Identity: Operational user, SSH key, passwordless sudo -# 1. Repos: Enterprise repos removed, no-subscription repos present -# 2. Kernel: GRUB cmdline (ASPM, intel_pstate) -# 3. Laptop: Lid-switch suppression, suspend targets masked -# 4. Storage: Swap disabled -# 5. HFI Check: Intel Thread Director detection (read-only) -# 6. Nag removal: Subscription nag script + dpkg hook deployed -# 7.
HA gate: HA/Corosync left running (standalone_mode: false for cluster) -# -# Usage: -# # All cluster nodes: -# ansible-playbook -i inventory/hosts.ini playbooks/proxmox/pve_baseline.yml -# -# # Single node: -# ansible-playbook -i inventory/hosts.ini playbooks/proxmox/pve_baseline.yml --limit pve01 -# -# # Dry-run: -# ansible-playbook -i inventory/hosts.ini playbooks/proxmox/pve_baseline.yml --check --diff -# -# # Target a specific section only: -# ansible-playbook -i inventory/hosts.ini playbooks/proxmox/pve_baseline.yml --tags repos -# ansible-playbook -i inventory/hosts.ini playbooks/proxmox/pve_baseline.yml --tags identity -# ansible-playbook -i inventory/hosts.ini playbooks/proxmox/pve_baseline.yml --tags nag - -- name: Proxmox cluster baseline enforcement - hosts: proxmox_cluster - become: true - - vars: - is_laptop: true - # standalone_mode: false = HA/Corosync left running (correct for 3-node cluster). - # Override with -e standalone_mode=true only for a truly isolated single-node install. - standalone_mode: false - lab_user: "{{ lab_ansible_user | default('chester') }}" - controller_ssh_pubkey_candidates: - - "{{ lookup('env', 'HOME') }}/.ssh/id_ed25519_homelab.pub" - - "{{ lookup('env', 'HOME') }}/.ssh/id_ed25519.pub" - - tasks: - - name: "0. Identity Management: Ensure user '{{ lab_user }}' is present" - tags: [identity, baseline] - block: - - name: Install sudo package - ansible.builtin.apt: - name: sudo - state: present - update_cache: false - - - name: "Ensure group '{{ lab_user }}' exists" - ansible.builtin.group: - name: "{{ lab_user }}" - state: present - - - name: "Create user '{{ lab_user }}' with sudo access" - ansible.builtin.user: - name: "{{ lab_user }}" - group: "{{ lab_user }}" - groups: sudo - shell: /bin/bash - password: '!' 
- password_lock: true - - - name: Locate SSH public key on control machine - ansible.builtin.set_fact: - controller_ssh_pubkey_path: >- - {{ lookup('ansible.builtin.first_found', {'files': controller_ssh_pubkey_candidates, 'skip': true}) }} - delegate_to: localhost - become: false - - - name: Fail early if SSH public key is missing - ansible.builtin.fail: - msg: >- - SSH public key not found on the control machine. - Checked: {{ controller_ssh_pubkey_candidates | join(', ') }} - Generate one with: ssh-keygen -t ed25519 -f ~/.ssh/id_ed25519 - when: controller_ssh_pubkey_path | default('') | length == 0 - - - name: "Deploy SSH key to {{ lab_user }} user" - ansible.posix.authorized_key: - user: "{{ lab_user }}" - state: present - key: "{{ lookup('file', controller_ssh_pubkey_path) }}" - - - name: "Allow '{{ lab_user }}' to use sudo without password" - ansible.builtin.copy: - dest: "/etc/sudoers.d/{{ lab_user }}" - content: "{{ lab_user }} ALL=(ALL) NOPASSWD: ALL\n" - mode: '0440' - owner: root - group: root - validate: '/usr/sbin/visudo -cf %s' - - - name: "1. 
Repository configuration" - tags: [repos, baseline] - block: - - name: Check if /etc/apt/sources.list exists - ansible.builtin.stat: - path: /etc/apt/sources.list - register: apt_sources_list_stat - - - name: Remove Proxmox enterprise repo files (.list/.sources) - ansible.builtin.file: - path: "{{ item }}" - state: absent - loop: - - /etc/apt/sources.list.d/pve-enterprise.list - - /etc/apt/sources.list.d/pve-enterprise.sources - - /etc/apt/sources.list.d/ceph.list - - /etc/apt/sources.list.d/ceph.sources - - /etc/apt/sources.list.d/ceph-enterprise.list - - /etc/apt/sources.list.d/ceph-enterprise.sources - - - name: Remove enterprise.proxmox.com entries from /etc/apt/sources.list - ansible.builtin.lineinfile: - path: /etc/apt/sources.list - regexp: '^.*enterprise\.proxmox\.com.*$' - state: absent - when: apt_sources_list_stat.stat.exists - - - name: Add Proxmox no-subscription repository - ansible.builtin.apt_repository: - repo: "deb http://download.proxmox.com/debian/pve {{ ansible_distribution_release }} pve-no-subscription" - filename: pve-no-subscription - state: present - - - name: Add Proxmox Ceph no-subscription repository - ansible.builtin.apt_repository: - repo: "deb http://download.proxmox.com/debian/ceph-squid {{ ansible_distribution_release }} no-subscription" - filename: ceph-no-subscription - state: present - - - name: Ensure required packages present (intel-microcode, htop, nvme-cli, lm-sensors) - ansible.builtin.apt: - name: [intel-microcode, htop, nvme-cli, lm-sensors] - state: present - update_cache: true - - - name: "2. Kernel tuning (12th Gen & power)" - tags: [kernel, baseline] - block: - - name: Configure GRUB for ASPM & power savings - ansible.builtin.lineinfile: - path: /etc/default/grub - regexp: '^GRUB_CMDLINE_LINUX_DEFAULT=' - line: 'GRUB_CMDLINE_LINUX_DEFAULT="quiet pcie_aspm=force intel_pstate=passive"' - notify: Update Grub - - - name: "2b. 
SSH hardening: enforce PermitRootLogin for Proxmox cluster management" - tags: [ssh, baseline] - block: - - name: Deploy drop-in sshd config to enforce PermitRootLogin yes - ansible.builtin.copy: - dest: /etc/ssh/sshd_config.d/90-pve-root.conf - owner: root - group: root - mode: '0600' - content: | - # Managed by Ansible - pve_baseline.yml - # Proxmox cluster management requires root SSH access. - # This drop-in overrides any openssh-server package default that sets PermitRootLogin no. - PermitRootLogin yes - notify: Restart SSHD - - - name: "3. Laptop safety: disable lid-close suspend" - when: is_laptop | default(false) - tags: [laptop, baseline] - block: - - name: Configure logind.conf to ignore lid switch - ansible.builtin.lineinfile: - path: /etc/systemd/logind.conf - regexp: "^#?{{ item.key }}=" - line: "{{ item.key }}={{ item.value }}" - loop: - - { key: "HandleLidSwitch", value: "ignore" } - - { key: "HandleLidSwitchExternalPower", value: "ignore" } - notify: Restart Logind - - - name: Mask sleep/suspend targets (hardware lock) - ansible.builtin.systemd: - name: "{{ item }}" - masked: true - loop: - - sleep.target - - suspend.target - - hibernate.target - - hybrid-sleep.target - - - name: "4. Storage & SSD health" - tags: [storage, baseline] - block: - - name: Disable swap (protect NVMe lifespan) - ansible.builtin.shell: | - swapoff -a - sed -i '/ swap / s/^\(.*\)$/#\1/g' /etc/fstab - when: ansible_swaptotal_mb > 0 - changed_when: ansible_swaptotal_mb > 0 - - - name: "5. Intel Thread Director support check" - tags: [baseline] - ansible.builtin.shell: "dmesg | grep -i 'Hardware Feedback Interface'" - register: hfi_check - failed_when: false - changed_when: false - - - name: "6.
Proxmox Web UI: subscription nag removal" - tags: [nag, baseline] - block: - - name: Deploy subscription nag removal script - ansible.builtin.copy: - dest: /usr/local/bin/pve-remove-nag.sh - owner: root - group: root - mode: '0755' - content: | - #!/bin/sh - WEB_JS=/usr/share/javascript/proxmox-widget-toolkit/proxmoxlib.js - if [ -s "$WEB_JS" ] && ! grep -q NoMoreNagging "$WEB_JS"; then - echo "Patching Web UI nag..." - sed -i -e "/data\.status/ s/!//" -e "/data\.status/ s/active/NoMoreNagging/" "$WEB_JS" - fi - - MOBILE_TPL=/usr/share/pve-yew-mobile-gui/index.html.tpl - MARKER="" - if [ -f "$MOBILE_TPL" ] && ! grep -q "$MARKER" "$MOBILE_TPL"; then - echo "Patching Mobile UI nag..." - printf "%s\n" \ - "$MARKER" \ - "" \ - "" >>"$MOBILE_TPL" - fi - - - name: Configure dpkg hook to auto-run nag removal after upgrades - ansible.builtin.copy: - dest: /etc/apt/apt.conf.d/no-nag-script - owner: root - group: root - mode: '0644' - content: | - DPkg::Post-Invoke { "/usr/local/bin/pve-remove-nag.sh"; }; - - - name: Run nag removal script immediately - ansible.builtin.command: /usr/local/bin/pve-remove-nag.sh - register: nag_removal_output - changed_when: "'Patching' in nag_removal_output.stdout" - - - name: Reinstall proxmox-widget-toolkit to ensure nag patches apply - ansible.builtin.apt: - name: proxmox-widget-toolkit - state: present - register: widget_reinstall - failed_when: false - - - name: "7. 
Standalone optimization: HA/Corosync gate" - when: standalone_mode | bool - tags: [ha, baseline] - block: - - name: Stop and disable pve-ha-lrm service - ansible.builtin.systemd: - name: pve-ha-lrm - state: stopped - enabled: false - failed_when: false - - - name: Stop and disable pve-ha-crm service - ansible.builtin.systemd: - name: pve-ha-crm - state: stopped - enabled: false - failed_when: false - - - name: Stop and disable Corosync service - ansible.builtin.systemd: - name: corosync - state: stopped - enabled: false - failed_when: false - - handlers: - - name: Update Grub - ansible.builtin.command: update-grub - register: grub_update_result - changed_when: grub_update_result.rc == 0 - - - name: Restart Logind - ansible.builtin.systemd: - name: systemd-logind - state: restarted - - - name: Restart SSHD - ansible.builtin.systemd: - name: ssh - state: restarted diff --git a/ansible/archive/playbooks/proxmox/pve_update.yml b/ansible/archive/playbooks/proxmox/pve_update.yml deleted file mode 100644 index 3912390..0000000 --- a/ansible/archive/playbooks/proxmox/pve_update.yml +++ /dev/null @@ -1,111 +0,0 @@ ---- -# playbooks/proxmox/pve_update.yml -# Rolling Proxmox cluster package update with conditional kernel reboot. -# -# ───────────────────────────────────────────────────────────────────────────── -# ⚠️ HUMAN-TRIGGERED ONLY - do not automate or schedule. -# serial: 1 ensures one node is updated at a time to protect cluster quorum. -# ───────────────────────────────────────────────────────────────────────────── -# -# What this does: -# 1. Pre-checks cluster quorum - fails fast if quorum is degraded -# 2. Runs apt dist-upgrade on the target node -# 3. Reboots if a kernel update was applied (tags: reboot) -# 4. Waits for the node to return online (tags: reboot) -# 5.
Re-verifies cluster quorum before proceeding to the next node -# -# Usage: -# # All nodes (rolling - pve01 → pve02 → pve03): -# ansible-playbook -i inventory/hosts.ini playbooks/proxmox/pve_update.yml -# -# # Single node: -# ansible-playbook -i inventory/hosts.ini playbooks/proxmox/pve_update.yml --limit pve01 -# -# # Dry-run (confirms serial order and reboot conditions without modifying): -# ansible-playbook -i inventory/hosts.ini playbooks/proxmox/pve_update.yml --check -# -# # Update packages but skip reboot even if kernel changed: -# ansible-playbook -i inventory/hosts.ini playbooks/proxmox/pve_update.yml --skip-tags reboot - -- name: Rolling Proxmox cluster update - hosts: proxmox_cluster - become: true - serial: 1 - - tasks: - - name: "Pre-flight: verify cluster quorum before updating this node" - block: - - name: Check cluster quorum status - ansible.builtin.command: pvecm status - register: pvecm_pre - changed_when: false - check_mode: false - - - name: Fail if cluster is not quorate before touching this node - ansible.builtin.assert: - that: - - "'Quorate:' in pvecm_pre.stdout" - - "'Quorate:' in pvecm_pre.stdout and 'Yes' in (pvecm_pre.stdout | regex_search('Quorate:.*') | default(''))" - fail_msg: | - ⛔ Cluster quorum is NOT healthy before updating {{ inventory_hostname }}. - Fix quorum before proceeding.
- pvecm status: - {{ pvecm_pre.stdout }} - success_msg: "✅ Cluster quorate - safe to update {{ inventory_hostname }}" - - - name: "Update packages" - block: - - name: Update apt cache - ansible.builtin.apt: - update_cache: true - cache_valid_time: 0 - - - name: Run apt dist-upgrade - ansible.builtin.apt: - upgrade: dist - update_cache: false - register: dist_upgrade_result - tags: [update] - - - name: Check if a newer kernel is installed but not yet booted - ansible.builtin.shell: | - LATEST=$(ls /boot/vmlinuz-* | sort -V | tail -1 | sed 's|/boot/vmlinuz-||') - RUNNING=$(uname -r) - if [ "$LATEST" != "$RUNNING" ]; then echo "reboot_needed"; fi - register: reboot_check - changed_when: false - check_mode: false - tags: [reboot] - - - name: Reboot if a newer kernel is installed - ansible.builtin.reboot: - msg: "Rebooting into {{ reboot_check.stdout | trim }} - initiated by pve_update.yml" - reboot_timeout: 600 - when: reboot_check.stdout | trim == 'reboot_needed' - tags: [reboot] - - - name: Wait for node to return post-reboot - ansible.builtin.wait_for_connection: - delay: 10 - timeout: 600 - when: reboot_check.stdout | trim == 'reboot_needed' - tags: [reboot] - - - name: "Post-flight: re-verify cluster quorum after node returns" - block: - - name: Check cluster quorum status post-update - ansible.builtin.command: pvecm status - register: pvecm_post - changed_when: false - check_mode: false - - - name: Assert cluster is quorate after update - ansible.builtin.assert: - that: - - "'Quorate:' in pvecm_post.stdout and 'Yes' in (pvecm_post.stdout | regex_search('Quorate:.*') | default(''))" - fail_msg: | - ⛔ Cluster quorum is degraded after updating {{ inventory_hostname }}. - Investigate before proceeding to the next node. - pvecm status: - {{ pvecm_post.stdout }} - success_msg: "✅ {{ inventory_hostname }} updated - cluster quorum verified. Proceeding."
diff --git a/ansible/archive/playbooks/proxmox/reconcile_cluster.yml b/ansible/archive/playbooks/proxmox/reconcile_cluster.yml deleted file mode 100644 index fe40bd6..0000000 --- a/ansible/archive/playbooks/proxmox/reconcile_cluster.yml +++ /dev/null @@ -1,187 +0,0 @@ ---- -# playbooks/proxmox/reconcile_cluster.yml -# Re-enable cluster services and reconcile Proxmox cluster membership. -# -# What this playbook does: -# 1. Ensures pve-cluster is running on all nodes -# 2. Creates a cluster on the primary node if missing -# 3. Joins remaining nodes if they are not yet members -# 4. Re-enables Corosync and HA services -# 5. Prints final cluster membership from the primary node -# -# Usage: -# ansible-playbook -i inventory/hosts.ini playbooks/proxmox/reconcile_cluster.yml -# -# Optional overrides: -# -e pve_cluster_name=homelab -# -e pve_primary_node=pve01 -# -e cluster_mode=auto|primary|join -# -e pve_existing_cluster_ip=10.0.0.201 - -# ======================================== -# PLAY 1: Setup root SSH trust (parallel) -# ======================================== -- name: Setup root SSH trust for cluster operations - hosts: proxmox_cluster - become: true - gather_facts: false - - tasks: - - name: Ensure root SSH key exists - ansible.builtin.stat: - path: /root/.ssh/id_rsa - register: root_ssh_key - - - name: Generate root SSH key if missing - ansible.builtin.command: ssh-keygen -t ed25519 -f /root/.ssh/id_ed25519 -N "" - args: - creates: /root/.ssh/id_ed25519 - when: not root_ssh_key.stat.exists - - - name: Fetch root's public SSH key - ansible.builtin.slurp: - src: "{{ '/root/.ssh/id_rsa.pub' if root_ssh_key.stat.exists else '/root/.ssh/id_ed25519.pub' }}" - register: root_pubkey - - - name: Distribute root SSH keys across all cluster nodes - ansible.builtin.authorized_key: - user: root - key: "{{ hostvars[item].root_pubkey.content | b64decode }}" - state: present - loop: "{{ groups['proxmox_cluster'] }}" - when: hostvars[item].root_pubkey is defined - -# 
======================================== -# PLAY 2: Cluster reconciliation (serial) -# ======================================== -- name: Reconcile Proxmox cluster state - hosts: proxmox_cluster - become: true - gather_facts: true - serial: 1 - - vars: - pve_cluster_name: "homelab" - pve_primary_node: "{{ groups['proxmox_cluster'][0] }}" - pve_primary_ip: "{{ hostvars[pve_primary_node].ansible_host | default(pve_primary_node) }}" - # auto: create if needed on primary and join others - # primary: force primary-init behavior on target host(s) - # join: force join behavior on target host(s) - cluster_mode: "auto" - pve_existing_cluster_ip: "" - - tasks: - - name: Validate inventory has Proxmox nodes - ansible.builtin.assert: - that: - - groups['proxmox_cluster'] | length >= 1 - fail_msg: "Inventory group 'proxmox_cluster' is empty or undefined." - - - name: Validate cluster_mode input - ansible.builtin.assert: - that: - - cluster_mode in ['auto', 'primary', 'join'] - fail_msg: "cluster_mode must be one of: auto, primary, join" - - - name: Resolve join target IP - ansible.builtin.set_fact: - pve_join_target_ip: "{{ pve_existing_cluster_ip | default('') | trim if (pve_existing_cluster_ip | default('') | trim | length > 0) else pve_primary_ip }}" - - - name: Show reconcile plan - ansible.builtin.debug: - msg: - - "Primary node: {{ pve_primary_node }} ({{ pve_primary_ip }})" - - "Cluster name: {{ pve_cluster_name }}" - - "Cluster mode: {{ cluster_mode }}" - - "Join target IP: {{ pve_join_target_ip }}" - - "Target nodes: {{ groups['proxmox_cluster'] | join(', ') }}" - run_once: true - - - name: Ensure pve-cluster service is enabled and running - ansible.builtin.systemd: - name: pve-cluster - enabled: true - state: started - - - name: Check whether this node is already clustered - ansible.builtin.stat: - path: /etc/pve/corosync.conf - register: corosync_conf - - - name: Create cluster on primary node when missing - ansible.builtin.command: "pvecm create {{ pve_cluster_name 
}}" - register: pvecm_create - changed_when: pvecm_create.rc == 0 - when: - - cluster_mode in ['auto', 'primary'] - - inventory_hostname == pve_primary_node or cluster_mode == 'primary' - - not corosync_conf.stat.exists - - - name: Wait for corosync config to appear on primary - ansible.builtin.wait_for: - path: /etc/pve/corosync.conf - timeout: 60 - when: inventory_hostname == pve_primary_node - - - name: Test root SSH connectivity to primary node - ansible.builtin.command: "ssh -o BatchMode=yes root@{{ pve_join_target_ip }} hostname" - changed_when: false - failed_when: false - register: ssh_test - when: - - inventory_hostname != pve_primary_node - - not corosync_conf.stat.exists - - - name: Warn if root SSH test failed - ansible.builtin.debug: - msg: "WARNING: Root SSH to {{ pve_join_target_ip }} failed. Cluster join may hang. Error: {{ ssh_test.stderr }}" - when: - - ssh_test is defined - - ssh_test.rc is defined - - ssh_test.rc != 0 - - - name: Join non-primary node to cluster when missing - ansible.builtin.command: "pvecm add {{ pve_join_target_ip }} --use_ssh 1" - register: pvecm_add - changed_when: pvecm_add.rc == 0 - when: - - cluster_mode in ['auto', 'join'] - - inventory_hostname != pve_primary_node or cluster_mode == 'join' - - not corosync_conf.stat.exists - - - name: Re-check cluster membership config after create/join - ansible.builtin.stat: - path: /etc/pve/corosync.conf - register: corosync_conf_after - - - name: Ensure Corosync service is enabled and running on clustered nodes - ansible.builtin.systemd: - name: corosync - enabled: true - state: started - when: corosync_conf_after.stat.exists - - - name: Ensure pve-ha-lrm service is enabled and running on clustered nodes - ansible.builtin.systemd: - name: pve-ha-lrm - enabled: true - state: started - when: corosync_conf_after.stat.exists - - - name: Ensure pve-ha-crm service is enabled and running on clustered nodes - ansible.builtin.systemd: - name: pve-ha-crm - enabled: true - state: started - 
when: corosync_conf_after.stat.exists - - - name: Show cluster membership from primary - ansible.builtin.command: pvecm nodes - changed_when: false - register: pvecm_nodes - when: inventory_hostname == pve_primary_node - - - name: Print cluster membership output - ansible.builtin.debug: - var: pvecm_nodes.stdout_lines - when: inventory_hostname == pve_primary_node diff --git a/ansible/archive/playbooks/proxmox/reconcile_cluster_v2.yml b/ansible/archive/playbooks/proxmox/reconcile_cluster_v2.yml deleted file mode 100644 index e6addde..0000000 --- a/ansible/archive/playbooks/proxmox/reconcile_cluster_v2.yml +++ /dev/null @@ -1,21 +0,0 @@ ---- -# playbooks/proxmox/reconcile_cluster_v2.yml -# Clean, idempotent Proxmox cluster reconciliation workflow. -# -# Usage examples: -# Validate only: -# ansible-playbook -i inventory/hosts.ini playbooks/proxmox/reconcile_cluster_v2.yml \ -# -e "cluster_mode=validate" -# -# Join pve01 to existing cluster via pve02 anchor: -# ansible-playbook -i inventory/hosts.ini playbooks/proxmox/reconcile_cluster_v2.yml \ -# --limit localhost \ -# -e "cluster_mode=join" \ -# -e "join_node=pve01" \ -# -e "join_target_host=pve02" - -- name: Reconcile Proxmox cluster membership (v2) - hosts: localhost - gather_facts: false - roles: - - role: proxmox_cluster_reconcile_v2 diff --git a/ansible/archive/playbooks/proxmox/replace_proxmox_node_interactive.yml b/ansible/archive/playbooks/proxmox/replace_proxmox_node_interactive.yml deleted file mode 100644 index 059c560..0000000 --- a/ansible/archive/playbooks/proxmox/replace_proxmox_node_interactive.yml +++ /dev/null @@ -1,105 +0,0 @@ ---- -# playbooks/proxmox/replace_proxmox_node_interactive.yml -# Interactive wrapper for proxmox node replacement workflow. 
-# -# Usage: -# ansible-playbook -i inventory/hosts.ini playbooks/proxmox/replace_proxmox_node_interactive.yml - -- name: Interactive Proxmox physical node replacement - hosts: localhost - gather_facts: false - - vars_prompt: - - name: replacement_project_name - prompt: "Project name" - private: false - default: "node-replacement-{{ lookup('pipe', 'date +%Y%m%d') }}" - - - name: replacement_old_logical_host - prompt: "Logical host identity to preserve (example: pve01)" - private: false - default: "pve01" - - - name: replacement_old_ip - prompt: "Current logical host IP" - private: false - default: "10.0.0.201" - - - name: replacement_new_physical_host - prompt: "Physical donor host to take over identity (example: pve04)" - private: false - default: "pve04" - - - name: replacement_new_physical_ip - prompt: "Current donor host IP" - private: false - default: "10.0.0.204" - - - name: replacement_swarm_manager_name - prompt: "Swarm manager identity tied to the logical host" - private: false - default: "swarm-manager-1" - - - name: replacement_swarm_worker_name - prompt: "Swarm worker identity tied to the logical host" - private: false - default: "swarm-worker-1" - - - name: replacement_execute_cutover - prompt: "Enable execution mode? (true/false)" - private: false - default: "false" - - - name: replacement_phase2_rebuild_and_rejoin - prompt: "Run phase 2 rebuild and swarm rejoin? (true/false)" - private: false - default: "false" - - - name: replacement_phase3_identity_cutover - prompt: "Run phase 3 source-of-truth cutover? (true/false)" - private: false - default: "false" - - - name: replacement_phase4_validate_cutover - prompt: "Run phase 4 validation gates? (true/false)" - private: false - default: "true" - - - name: replacement_overwrite_existing_vmids - prompt: "Allow overwrite of conflicting VMIDs on target host? (true/false)" - private: false - default: "false" - - - name: replacement_poweroff_old_host - prompt: "Power off old logical host at the end? 
(true/false)" - private: false - default: "false" - - - name: replacement_confirm_phrase - prompt: "Confirmation phrase for execution mode (must be EXECUTE_NODE_REPLACEMENT)" - private: false - default: "" - - pre_tasks: - - name: Normalize interactive boolean inputs - ansible.builtin.set_fact: - replacement_execute_cutover: "{{ replacement_execute_cutover | bool }}" - replacement_phase2_rebuild_and_rejoin: "{{ replacement_phase2_rebuild_and_rejoin | bool }}" - replacement_phase3_identity_cutover: "{{ replacement_phase3_identity_cutover | bool }}" - replacement_phase4_validate_cutover: "{{ replacement_phase4_validate_cutover | bool }}" - replacement_overwrite_existing_vmids: "{{ replacement_overwrite_existing_vmids | bool }}" - replacement_poweroff_old_host: "{{ replacement_poweroff_old_host | bool }}" - - - name: Show interactive replacement selections - ansible.builtin.debug: - msg: - - "Project: {{ replacement_project_name }}" - - "Logical identity: {{ replacement_old_logical_host }} ({{ replacement_old_ip }})" - - "Donor physical host: {{ replacement_new_physical_host }} ({{ replacement_new_physical_ip }})" - - "Swarm identities: {{ replacement_swarm_manager_name }}, {{ replacement_swarm_worker_name }}" - - "Execute cutover: {{ replacement_execute_cutover }}" - - "Phases: 2={{ replacement_phase2_rebuild_and_rejoin }}, 3={{ replacement_phase3_identity_cutover }}, 4={{ replacement_phase4_validate_cutover }}" - - "Poweroff old host: {{ replacement_poweroff_old_host }}" - - roles: - - role: proxmox_node_replacement diff --git a/ansible/archive/playbooks/proxmox/replace_pve01_with_pve04.yml b/ansible/archive/playbooks/proxmox/replace_pve01_with_pve04.yml deleted file mode 100644 index 14b49ba..0000000 --- a/ansible/archive/playbooks/proxmox/replace_pve01_with_pve04.yml +++ /dev/null @@ -1,30 +0,0 @@ ---- -# playbooks/proxmox/replace_pve01_with_pve04.yml -# Framework playbook for replacing physical host backing logical pve01. 
-# -# Usage (preflight + baseline only): -# ansible-playbook -i inventory/hosts.ini playbooks/proxmox/replace_pve01_with_pve04.yml \ -# -e "replacement_project_name=node-replacement-2026" -# -# Usage (enable guarded execution mode): -# ansible-playbook -i inventory/hosts.ini playbooks/proxmox/replace_pve01_with_pve04.yml \ -# -e "replacement_project_name=node-replacement-2026" \ -# -e "replacement_execute_cutover=true" \ -# -e "replacement_confirm_phrase=EXECUTE_NODE_REPLACEMENT" -# -# Optional hard shutdown gate (only with execution mode): -# -e "replacement_poweroff_old_host=true" - -- name: Replace physical pve01 host with pve04 hardware - hosts: localhost - gather_facts: false - vars: - # Keep logical identity for pve01; move physical backing host. - replacement_old_logical_host: "pve01" - replacement_old_ip: "10.0.0.201" - replacement_new_physical_host: "pve04" - replacement_new_physical_ip: "10.0.0.204" - replacement_swarm_manager_name: "swarm-manager-1" - replacement_swarm_worker_name: "swarm-worker-1" - roles: - - role: proxmox_node_replacement diff --git a/ansible/archive/playbooks/security/distribute_keys.yml b/ansible/archive/playbooks/security/distribute_keys.yml deleted file mode 100644 index e7b5418..0000000 --- a/ansible/archive/playbooks/security/distribute_keys.yml +++ /dev/null @@ -1,17 +0,0 @@ ---- -- name: Distribute admin SSH keys - hosts: all_nodes - become: true - tasks: - - name: Ensure chester public key is present on all hosts - ansible.posix.authorized_key: - user: chester - state: present - key: "{{ lookup('file', '/home/chester/.ssh/id_ed25519.pub') }}" - exclusive: false - - - name: Trust the watchtower host key - ansible.posix.authorized_key: - user: chester - state: present - key: "ssh-ed25519 AAAAC3Nza... (Pi's public key) ... 
chester@watchtower" \ No newline at end of file diff --git a/ansible/archive/playbooks/security/enforce_access.yml b/ansible/archive/playbooks/security/enforce_access.yml deleted file mode 100644 index efc7945..0000000 --- a/ansible/archive/playbooks/security/enforce_access.yml +++ /dev/null @@ -1,14 +0,0 @@ ---- -# playbooks/enforce_access_identity.yml -# Enforces access, SSO, and MFA policies -- name: Enforce access and identity policies - hosts: all - gather_facts: false - tasks: - - name: Remind to use Authentik SSO for all services - ansible.builtin.debug: - msg: "[Reminder] All new services must integrate with Authentik SSO." - - - name: Remind to enforce MFA and strong passwords - ansible.builtin.debug: - msg: "[Reminder] MFA required for all admin/operator accounts. Use password vaults." diff --git a/ansible/archive/playbooks/self-heal/heimdall.yml b/ansible/archive/playbooks/self-heal/heimdall.yml deleted file mode 100644 index f1ae955..0000000 --- a/ansible/archive/playbooks/self-heal/heimdall.yml +++ /dev/null @@ -1,105 +0,0 @@ ---- -- name: "Heimdall" - hosts: heimdall # Targeted via your inventory - become: true - - vars: - stack_dir: "/opt/stacks/heimdall" - chester_user: "chester" - # Replace with Heimdall's actual static IP - heimdall_ip: "10.0.0.145" - cf_token: "{{ secrets.CF_HIEMDALL }}" - - tasks: - - name: "Gate -2: Install Docker & Tools (Ubuntu)" - apt: - name: [curl, git, jq, docker.io, docker-compose-v2, python3-pip] - state: present - update_cache: true - - - name: "Gate -1: Add chester to docker group" - user: - name: "{{ chester_user }}" - groups: docker - append: true - - - name: "Gate 0: Infrastructure Setup" - file: - path: "{{ item }}" - state: directory - owner: "{{ chester_user }}" - group: "{{ chester_user }}" - mode: '0755' - loop: - - "{{ stack_dir }}" - - "{{ stack_dir }}/traefik-certs" - - "{{ stack_dir }}/redis-data" - - "{{ stack_dir }}/runner-data" - - - name: "Gate 1: Deploy Heimdall Stack" - copy: - dest: "{{ stack_dir 
}}/docker-compose.yml" - owner: "{{ chester_user }}" - group: "{{ chester_user }}" - mode: '0644' - content: | - services: - redis: - image: redis:7-alpine - container_name: redis - restart: unless-stopped - volumes: - - ./redis-data:/data - command: redis-server --appendonly yes - healthcheck: - test: ["CMD", "redis-cli", "ping"] - - traefik: - image: traefik:v3.0 - container_name: traefik - restart: unless-stopped - ports: - - "80:80" - - "443:443" - - "8080:8080" - environment: - - CF_DNS_API_TOKEN=${CF_TOKEN} - volumes: - - /var/run/docker.sock:/var/run/docker.sock:ro - - ./traefik-certs:/letsencrypt - command: - - "--api.dashboard=true" - - "--providers.docker=true" - - "--providers.redis=true" - - "--providers.redis.endpoints=redis:6379" - - "--entrypoints.web.address=:80" - - "--entrypoints.websecure.address=:443" - - "--certificatesresolvers.myresolver.acme.dnschallenge.provider=cloudflare" - - "--certificatesresolvers.myresolver.acme.email=admin@castaldifamily.com" - - "--certificatesresolvers.myresolver.acme.storage=/letsencrypt/acme.json" - - traefik-kop: - image: ghcr.io/jittering/traefik-kop:latest - container_name: traefik-kop-edge - restart: unless-stopped - volumes: - - /var/run/docker.sock:/var/run/docker.sock:ro - environment: - - REDIS_ADDR=redis:6379 - - BIND_IP={{ heimdall_ip }} # reports Beelink's IP - - gitea-runner: - image: gitea/act_runner:latest - container_name: gitea-runner-heimdall - restart: always - volumes: - - ./runner-data:/data - - /var/run/docker.sock:/var/run/docker.sock - environment: - - GITEA_INSTANCE_URL=https://git.castaldifamily.com - - GITEA_RUNNER_REGISTRATION_TOKEN={{ secrets.HEIMDALL_GITEA_TOKEN }} - - - name: "Gate 2: Launch Stack" - community.docker.docker_compose_v2: - project_src: "{{ stack_dir }}" - state: present \ No newline at end of file diff --git a/ansible/archive/playbooks/self-heal/watchtower.yml b/ansible/archive/playbooks/self-heal/watchtower.yml deleted file mode 100644 index 7c82b2d..0000000 --- 
a/ansible/archive/playbooks/self-heal/watchtower.yml +++ /dev/null @@ -1,97 +0,0 @@ ---- -- name: Setup Watchtower - hosts: localhost - connection: local - # become: true <-- Removed: Rootless Docker prefers running as the user 'chester' - - vars: - stack_dir: "/mnt/appdata/watchtower" - chester_user: "chester" - heimdall_redis: "10.0.0.151:6379" - pi_ip: "10.0.0.200" - - tasks: - - name: Create monitoring directories - become: true - block: - - name: Ensure monitoring directories exist - ansible.builtin.file: - path: "{{ item }}" - state: directory - owner: "{{ chester_user }}" - group: "{{ chester_user }}" - mode: '0755' - loop: - - "{{ stack_dir }}" - - "{{ stack_dir }}/portainer-data" - - "{{ stack_dir }}/vscode-data" - - - name: Render compose specification - ansible.builtin.copy: - dest: "{{ stack_dir }}/docker-compose.yml" - owner: "{{ chester_user }}" - group: "{{ chester_user }}" - mode: '0644' - content: | - services: - traefik-kop: - image: ghcr.io/jittering/traefik-kop:latest - container_name: traefik-kop-agent - restart: unless-stopped - volumes: - - /var/run/docker.sock:/var/run/docker.sock:ro - environment: - - REDIS_ADDR={{ heimdall_redis }} - - BIND_IP={{ pi_ip }} - - portainer: - image: portainer/portainer-ce:latest - container_name: portainer - restart: unless-stopped - ports: - - "9443:9443" - volumes: - - /var/run/docker.sock:/var/run/docker.sock:ro - - {{ stack_dir }}/portainer-data:/data - labels: - - "traefik.enable=true" - - "traefik.http.routers.portainer.rule=Host(`portainer.castaldifamily.com`)" - - "traefik.http.routers.portainer.entrypoints=websecure" - - "traefik.http.routers.portainer.tls.certresolver=cloudflare" - - "traefik.http.services.portainer.loadbalancer.server.port=9443" - - "traefik.http.services.portainer.loadbalancer.server.scheme=https" - code-server: - image: lscr.io/linuxserver/code-server:latest - container_name: code-server - environment: - - PUID=1000 - - PGID=1000 - - TZ=Etc/UTC - - PASSWORD=password #optional - - 
HASHED_PASSWORD= #optional - - SUDO_PASSWORD=password #optional - - SUDO_PASSWORD_HASH= #optional - - PROXY_DOMAIN=code-server.my.domain #optional - - DEFAULT_WORKSPACE=/config/workspace #optional - - PWA_APPNAME=code-server #optional - volumes: - - {{ stack_dir }}/vscode-data:/config - ports: - - 8443:8443 - restart: unless-stopped - - # - name: Render watchtower environment file - # ansible.builtin.copy: - # dest: "{{ stack_dir }}/.env" - # owner: "{{ chester_user }}" - # group: "{{ chester_user }}" - # mode: '0600' - # content: | - # AUTHENTIK_OUTPOST_DOZZLE_TOKEN={{ authentik_outpost_dozzle_token }} - - - name: Launch stack - community.docker.docker_compose_v2: - project_src: "{{ stack_dir }}" - state: present - pull: always - docker_host: "unix:///var/run/docker.sock" diff --git a/ansible/archive/playbooks/storage/configure_nas.yml b/ansible/archive/playbooks/storage/configure_nas.yml deleted file mode 100644 index b33f7fc..0000000 --- a/ansible/archive/playbooks/storage/configure_nas.yml +++ /dev/null @@ -1,20 +0,0 @@ ---- -# playbooks/configure_nas_and_backups.yml -# Configures NAS devices and automates backup jobs -- name: Configure NAS devices and backups - hosts: storage - become: false - tasks: - - name: Ensure NAS is reachable - ansible.builtin.debug: - msg: "[Placeholder] Ping NAS device {{ inventory_hostname }}" - - - name: Configure rsync backup jobs - ansible.builtin.debug: - msg: "[Placeholder] Set up rsync from TerraMaster to Synology" - when: inventory_hostname == 'terramaster' - - - name: Configure cloud sync on Synology - ansible.builtin.debug: - msg: "[Placeholder] Set up daily cloud sync on Synology" - when: inventory_hostname == 'synology' diff --git a/ansible/archive/playbooks/storage/mount_nfs_shares.yml b/ansible/archive/playbooks/storage/mount_nfs_shares.yml deleted file mode 100644 index e7b8150..0000000 --- a/ansible/archive/playbooks/storage/mount_nfs_shares.yml +++ /dev/null @@ -1,20 +0,0 @@ ---- -# 
playbooks/storage/mount_nfs_shares.yml -# -# Thin wrapper -- delegates all logic to the storage_mounts role. -# Mount definitions and NFS server are in roles/storage_mounts/defaults/main.yml. -# Override per-host via host_vars/.yml (storage_nfs_mounts list). -# -# Usage: -# All Swarm nodes: -# ansible-playbook -i inventory/hosts.ini playbooks/storage/mount_nfs_shares.yml -# -# Single node: -# ansible-playbook -i inventory/hosts.ini playbooks/storage/mount_nfs_shares.yml \ -# --limit swarm-manager-1 - -- name: Configure NFS mounts on Swarm and standalone Docker nodes - hosts: swarm_hosts:docker_hosts - become: true - roles: - - storage_mounts diff --git a/ansible/archive/playbooks/storage/terramaster_deploy_ssh_key.yml b/ansible/archive/playbooks/storage/terramaster_deploy_ssh_key.yml deleted file mode 100644 index b1240e6..0000000 --- a/ansible/archive/playbooks/storage/terramaster_deploy_ssh_key.yml +++ /dev/null @@ -1,49 +0,0 @@ ---- -# One-time run to deploy Watchtower's SSH public key to TerraMaster. -# After this succeeds, --ask-pass is no longer needed for terramaster playbooks. 
-# -# Usage: -# ansible-playbook playbooks/storage/terramaster_deploy_ssh_key.yml --ask-pass - -- name: Deploy SSH public key to TerraMaster - hosts: terramaster - gather_facts: false - become: false - - vars: - ssh_public_key_path: "/home/chester/.ssh/id_ed25519.pub" - - tasks: - - name: Verify public key file exists on control node - ansible.builtin.stat: - path: "{{ ssh_public_key_path }}" - register: pubkey_stat - delegate_to: localhost - failed_when: not pubkey_stat.stat.exists - - - name: Read public key content from control node - ansible.builtin.slurp: - src: "{{ ssh_public_key_path }}" - register: pubkey_content - delegate_to: localhost - - - name: Ensure ~/.ssh directory exists on TerraMaster - ansible.builtin.raw: "mkdir -p ~/.ssh && chmod 700 ~/.ssh" - changed_when: false - - - name: Deploy public key to TerraMaster authorized_keys - ansible.builtin.raw: | - key="{{ pubkey_content.content | b64decode | trim }}" - if ! grep -qF "$key" ~/.ssh/authorized_keys 2>/dev/null; then - echo "$key" >> ~/.ssh/authorized_keys - chmod 600 ~/.ssh/authorized_keys - echo "KEY_ADDED" - else - echo "KEY_ALREADY_PRESENT" - fi - register: key_deploy_result - changed_when: "'KEY_ADDED' in key_deploy_result.stdout" - - - name: Report key deployment result - ansible.builtin.debug: - msg: "{{ key_deploy_result.stdout | trim }}" diff --git a/ansible/archive/playbooks/storage/terramaster_health_inventory.yml b/ansible/archive/playbooks/storage/terramaster_health_inventory.yml deleted file mode 100644 index 580ef79..0000000 --- a/ansible/archive/playbooks/storage/terramaster_health_inventory.yml +++ /dev/null @@ -1,82 +0,0 @@ ---- -- name: TerraMaster NAS read-only health inventory - hosts: terramaster - gather_facts: false - become: false - - vars: - nas_ssh_timeout_seconds: 10 - - tasks: - - name: Verify SSH port is reachable from control node - ansible.builtin.wait_for: - host: "{{ ansible_host | default(inventory_hostname) }}" - port: 22 - timeout: "{{ nas_ssh_timeout_seconds 
}}" - connect_timeout: 3 - delegate_to: localhost - - - name: Verify command execution over SSH - ansible.builtin.raw: "echo NAS_SSH_OK" - register: nas_ssh_check - changed_when: false - failed_when: false - - - name: Collect operating system summary - ansible.builtin.raw: "uname -a" - register: nas_uname - changed_when: false - failed_when: false - - - name: Collect uptime summary - ansible.builtin.raw: "uptime" - register: nas_uptime - changed_when: false - failed_when: false - - - name: Collect root filesystem utilization - ansible.builtin.raw: "df -h /" - register: nas_root_df - changed_when: false - failed_when: false - - - name: Collect memory utilization summary - ansible.builtin.raw: "free -m" - register: nas_memory - changed_when: false - failed_when: false - - - name: Collect failed systemd unit count (if systemd exists) - ansible.builtin.raw: "systemctl --failed --no-legend --no-pager 2>/dev/null | wc -l || true" - register: nas_failed_units - changed_when: false - failed_when: false - - - name: Build TerraMaster health summary - ansible.builtin.set_fact: - terramaster_health_summary: - host: "{{ inventory_hostname }}" - address: "{{ ansible_host | default(inventory_hostname) }}" - ssh_check: - rc: "{{ nas_ssh_check.rc | default('n/a') }}" - stdout: "{{ nas_ssh_check.stdout | default('') | trim }}" - stderr: "{{ nas_ssh_check.stderr | default('') | trim }}" - os_summary: - rc: "{{ nas_uname.rc | default('n/a') }}" - stdout: "{{ nas_uname.stdout | default('') | trim }}" - uptime_summary: - rc: "{{ nas_uptime.rc | default('n/a') }}" - stdout: "{{ nas_uptime.stdout | default('') | trim }}" - root_filesystem: - rc: "{{ nas_root_df.rc | default('n/a') }}" - stdout: "{{ nas_root_df.stdout | default('') | trim }}" - memory_summary: - rc: "{{ nas_memory.rc | default('n/a') }}" - stdout: "{{ nas_memory.stdout | default('') | trim }}" - failed_units_count: - rc: "{{ nas_failed_units.rc | default('n/a') }}" - stdout: "{{ nas_failed_units.stdout | default('') | trim 
}}" - - - name: Print TerraMaster health summary - ansible.builtin.debug: - var: terramaster_health_summary \ No newline at end of file diff --git a/ansible/archive/requirements-dev.txt b/ansible/archive/requirements-dev.txt deleted file mode 100644 index a4f164f..0000000 --- a/ansible/archive/requirements-dev.txt +++ /dev/null @@ -1,3 +0,0 @@ -ansible-core>=2.16,<2.19 -ansible-lint>=24.7.0 -yamllint>=1.35.0 diff --git a/ansible/archive/requirements.yml b/ansible/archive/requirements.yml deleted file mode 100644 index 805adcc..0000000 --- a/ansible/archive/requirements.yml +++ /dev/null @@ -1,33 +0,0 @@ ---- -# Ansible Galaxy requirements -# Install with: ansible-galaxy install -r requirements.yml -# -# This file tracks all external collections and roles required by this repository. -# Version pinning ensures reproducible deployments. -# -# Last updated: 2026-01-10 - -collections: - # Community General Collection - # Used for: proxmox modules, docker modules, general utilities - # Docs: https://docs.ansible.com/ansible/latest/collections/community/general/ - - name: community.general - version: ">=8.0.0" - - # Community Docker Collection - # Used for: docker_swarm, docker_container, docker_network modules - # Docs: https://docs.ansible.com/ansible/latest/collections/community/docker/ - - name: community.docker - version: ">=3.0.0" - - # Ansible POSIX Collection - # Used for: authorized_key, synchronize, sysctl modules - # Docs: https://docs.ansible.com/ansible/latest/collections/ansible/posix/ - - name: ansible.posix - version: ">=1.5.0" - -# roles: - # Add external roles here as needed - # Example: - # - name: geerlingguy.docker - # version: "6.1.0" diff --git a/ansible/archive/roles/README.md b/ansible/archive/roles/README.md deleted file mode 100644 index 8a123e5..0000000 --- a/ansible/archive/roles/README.md +++ /dev/null @@ -1,253 +0,0 @@ -# Ansible Monitoring Roles - -## Overview - -This directory contains modular Ansible roles for deploying a complete 
observability stack across a Docker Swarm cluster. The architecture follows the **separation of concerns** principle, with each role handling a specific monitoring component. - -## Architecture - -``` -┌──────────────────────────────────────────────────────────────┐ -│ WATCHTOWER (Controller) │ -│ ┌──────────────┐ ┌──────────┐ ┌───────┐ ┌──────────────┐ │ -│ │ Prometheus │ │ Grafana │ │ Loki │ │ Uptime Kuma │ │ -│ │ (Metrics DB) │ │ (Dashboards)│(Logs) │ │ (Health) │ │ -│ └──────────────┘ └──────────┘ └───────┘ └──────────────┘ │ -│ ▲ ▲ ▲ │ -└─────────┼──────────────┼────────────┼────────────────────────┘ - │ Scrape │ Query │ Push - │ │ │ -┌─────────┴──────────────┴────────────┴────────────────────────┐ -│ SWARM CLUSTER │ -│ ┌──────────────────┐ ┌──────────────────┐ │ -│ │ Manager Node 1 │ │ Worker Node 1 │ │ -│ │ ┌──────────────┐ │ │ ┌──────────────┐ │ [... 
more] │ -│ │ │node-exporter │ │ │ │node-exporter │ │ │ -│ │ │ (Host CPU, │ │ │ │ (Host CPU, │ │ │ -│ │ │ RAM, Disk) │ │ │ │ RAM, Disk) │ │ │ -│ │ └──────────────┘ │ │ └──────────────┘ │ │ -│ │ ┌──────────────┐ │ │ ┌──────────────┐ │ │ -│ │ │ cAdvisor │ │ │ │ cAdvisor │ │ │ -│ │ │ (Container │ │ │ │ (Container │ │ │ -│ │ │ Metrics) │ │ │ │ Metrics) │ │ │ -│ │ └──────────────┘ │ │ └──────────────┘ │ │ -│ └──────────────────┘ └──────────────────┘ │ -└──────────────────────────────────────────────────────────────┘ -``` - -## Roles - -### 1. `swarm_node_exporter` - -**Purpose:** Deploys Prometheus node-exporter on each swarm node to collect host-level metrics. - -**Metrics Collected:** -- CPU usage (per-core and aggregate) -- Memory usage (total, available, cached) -- Disk I/O and space -- Network traffic -- System load averages - -**Configuration:** -- Port: 9100 (default) -- Network: Host mode (for full system visibility) -- Security: Read-only filesystem, dropped capabilities - -**Files:** -- `defaults/main.yml`: Configurable variables -- `tasks/main.yml`: Deployment logic - -### 2. `swarm_cadvisor` - -**Purpose:** Deploys cAdvisor (Container Advisor) on each node to collect container-level resource usage. 
- -**Metrics Collected:** -- Per-container CPU usage -- Per-container memory usage -- Container network I/O -- Container disk I/O -- Container restart counts - -**Configuration:** -- Port: 8080 (default) -- Requires: Privileged mode (for cgroup access) - -**Why cAdvisor + node-exporter?** -- node-exporter: "The host used 80% CPU" (host-level aggregates) -- cAdvisor: "Container X used 60% of that CPU" (per-container breakdown) - -### 3. `monitoring_stack` - -**Purpose:** Deploys the complete monitoring infrastructure on Watchtower (controller node). - -**Components:** -- **Prometheus:** Metrics time-series database with service discovery -- **Grafana:** Visualization and dashboarding -- **Loki:** Log aggregation and indexing -- **Promtail:** Log shipper (sends logs from Docker to Loki) -- **Uptime Kuma:** HTTP/TCP health monitoring -- **Dozzle:** Real-time Docker log viewer -- **traefik-kop:** Traefik configuration sync - -**Key Features:** -- **Dynamic Target Discovery:** Prometheus scrape configs are generated from Ansible inventory -- **Alert Rules:** Pre-configured alerts for CPU, memory, disk, and node availability -- **Security:** Dozzle protected by Authentik SSO -- **Retention:** Configurable data retention policies - -**Configuration:** -- `defaults/main.yml`: Ports, domains, retention periods -- `templates/prometheus.yml.j2`: Scrape configuration with inventory loop -- `templates/alert-rules.yml.j2`: Alerting rules -- `templates/loki-config.yml.j2`: Log retention and indexing -- `templates/docker-compose.yml.j2`: Complete stack definition - -## Usage - -### Deploy Complete Monitoring Stack - -```bash -cd /home/chester/homelab/ansible -ansible-playbook -i inventory/hosts.ini playbooks/monitoring/deploy_swarm_monitoring.yml -``` - -### Deploy Only to Swarm Nodes - -```bash -ansible-playbook -i inventory/hosts.ini playbooks/monitoring/deploy_swarm_monitoring.yml --tags swarm -``` - -### Deploy Only Watchtower Stack - -```bash -ansible-playbook -i 
inventory/hosts.ini playbooks/monitoring/deploy_swarm_monitoring.yml --tags watchtower -``` - -### Update Prometheus Configuration - -```bash -ansible-playbook -i inventory/hosts.ini playbooks/monitoring/deploy_swarm_monitoring.yml --tags watchtower -``` - -## Key Concepts - -### 1. Idempotency - -All roles are **idempotent**: running the playbook multiple times produces the same result. This is achieved by: -- Using `docker_container` module with `state: started` (not `state: restarted`) -- Using handlers for configuration changes -- Checking for existing resources before creation - -### 2. Service Discovery - -Instead of hardcoding IP addresses, Prometheus discovers targets dynamically: - -```yaml -# Static approach (bad - manual updates required) -targets: ['10.0.0.211:9100', '10.0.0.212:9100'] - -# Dynamic approach (good - auto-scales with inventory) -{% for host in groups['swarm_managers'] %} - - '{{ hostvars[host].ansible_host }}:9100' -{% endfor %} -``` - -### 3. Security Hardening - -- **Read-only filesystems:** Exporters can't modify system files -- **Dropped capabilities:** Containers run with minimal permissions -- **No new privileges:** Prevents privilege escalation -- **SSO integration:** Dozzle protected by Authentik - -### 4. Desired State - -Docker Compose defines the **desired state**. Docker continuously reconciles: -- **Actual State:** "Container X crashed" -- **Desired State:** "Container X should be running" -- **Reconciliation:** Docker restarts container X - -## Troubleshooting - -### Exporter Not Reachable - -```bash -# Check if exporters are running (the shell module is needed because the command uses a pipe) -ansible swarm_hosts -i inventory/hosts.ini -m ansible.builtin.shell -a "docker ps | grep -E 'node-exporter|cadvisor'" - -# Test from Watchtower -curl http://10.0.0.211:9100/metrics -curl http://10.0.0.211:8080/metrics -``` - -### Prometheus Shows Target Down - -1. Check firewall rules -2. Verify exporter is running: `docker ps` -3. 
Check exporter logs: `docker logs node-exporter` -4. 
Test connectivity: `curl http://:9100/metrics` - -### Grafana Can't Connect to Prometheus - -Grafana runs inside Docker, so use Docker DNS: -- ✅ Data source URL: `http://prometheus:9090` -- ❌ Don't use: `http://localhost:9090` - -### Loki Not Receiving Logs - -1. Check Promtail is running: `docker ps | grep promtail` -2. Check Promtail logs: `docker logs promtail` -3. Verify Loki connectivity: `curl http://localhost:3100/ready` - -## Maintenance - -### Add New Swarm Node - -1. Add node to `inventory/hosts.ini` under `[swarm_managers]` or `[swarm_workers]` -2. Run the playbook: `ansible-playbook ... deploy_swarm_monitoring.yml` -3. Prometheus will automatically discover the new node - -### Update Monitoring Stack - -```bash -# Pull latest images and restart -ansible-playbook -i inventory/hosts.ini playbooks/monitoring/deploy_swarm_monitoring.yml -``` - -### View Current Configuration - -```bash -# Prometheus config -cat /opt/stacks/watchtower/prometheus-config/prometheus.yml - -# Alert rules -cat /opt/stacks/watchtower/prometheus-config/alerts/homelab.yml - -# Docker Compose -cat /opt/stacks/watchtower/docker-compose.yml -``` - -## Recommended Grafana Dashboards - -Import these dashboards by ID in Grafana: - -| ID | Name | Purpose | -|----|------|---------| -| 1860 | Node Exporter Full | Complete host metrics | -| 893 | Docker & System Monitoring | Container resource usage | -| 13639 | Loki Dashboard | Log exploration | -| 14282 | cAdvisor | Detailed container metrics | - -## Best Practices - -1. **Never hardcode secrets:** Use `ansible-vault` or environment variables -2. **Use labels extensively:** Makes filtering in Prometheus/Loki easier -3. **Set resource limits:** Prevent monitoring from consuming excessive resources -4. **Test before deploying:** Use `--check` mode to preview changes -5. 
**Version control everything:** Commit all configuration changes - -## Further Reading - -- [Prometheus Documentation](https://prometheus.io/docs/) -- [Grafana Loki](https://grafana.com/docs/loki/latest/) -- [Ansible Best Practices](https://docs.ansible.com/ansible/latest/user_guide/playbooks_best_practices.html) -- [Docker Swarm Monitoring](https://docs.docker.com/engine/swarm/swarm-tutorial/) diff --git a/ansible/archive/roles/control_node_sanity/defaults/main.yml b/ansible/archive/roles/control_node_sanity/defaults/main.yml deleted file mode 100644 index 905fec0..0000000 --- a/ansible/archive/roles/control_node_sanity/defaults/main.yml +++ /dev/null @@ -1,8 +0,0 @@ ---- -# roles/control_node_sanity/defaults/main.yml -# Default minimum versions and behavior for control node preflight checks. - -control_node_sanity_min_ansible_version: "2.18.0" -control_node_sanity_expected_python_major: 3 -control_node_sanity_expected_python_minor_min: 11 -control_node_sanity_require_lint: false diff --git a/ansible/archive/roles/control_node_sanity/tasks/main.yml b/ansible/archive/roles/control_node_sanity/tasks/main.yml deleted file mode 100644 index 81e09a5..0000000 --- a/ansible/archive/roles/control_node_sanity/tasks/main.yml +++ /dev/null @@ -1,140 +0,0 @@ ---- -# roles/control_node_sanity/tasks/main.yml -# Non-invasive control node checks for Ansible runtime health. 
- -- name: Collect kernel information - ansible.builtin.command: uname -a - register: sanity_uname - changed_when: false - -- name: Compute repository root path - ansible.builtin.set_fact: - sanity_repo_root: "{{ playbook_dir | dirname | dirname }}" - -- name: Collect distribution metadata - ansible.builtin.command: cat /etc/os-release - register: sanity_os_release - changed_when: false - -- name: Gather ansible core version details - ansible.builtin.command: ansible --version - register: sanity_ansible_version - changed_when: false - -- name: Gather ansible-playbook version details - ansible.builtin.command: ansible-playbook --version - register: sanity_ansible_playbook_version - changed_when: false - -- name: Gather python version details - ansible.builtin.command: python3 --version - register: sanity_python_version - changed_when: false - -- name: Check if ansible-lint is available - ansible.builtin.command: ansible-lint --version - register: sanity_ansible_lint - changed_when: false - failed_when: false - -- name: Determine ansible.cfg source path - ansible.builtin.command: ansible --version - register: sanity_cfg_source - changed_when: false - args: - chdir: "{{ sanity_repo_root }}" - -- name: Capture effective Ansible config overrides - ansible.builtin.command: ansible-config dump --only-changed - register: sanity_config_dump - changed_when: false - args: - chdir: "{{ sanity_repo_root }}" - -- name: Validate inventory graph parses - ansible.builtin.command: ansible-inventory -i inventory/hosts.ini --graph - register: sanity_inventory_graph - changed_when: false - args: - chdir: "{{ sanity_repo_root }}" - -- name: Validate onboarding playbook syntax - ansible.builtin.command: >- - ansible-playbook -i inventory/hosts.ini - playbooks/onboarding/generic_host.yml --syntax-check - register: sanity_syntax_generic_host - changed_when: false - args: - chdir: "{{ sanity_repo_root }}" - -- name: Validate docker management playbook syntax - ansible.builtin.command: >- - 
ansible-playbook -i inventory/hosts.ini - playbooks/docker/manage_containers.yml --syntax-check - register: sanity_syntax_manage_containers - changed_when: false - args: - chdir: "{{ sanity_repo_root }}" - -- name: Parse ansible version number from output - ansible.builtin.set_fact: - sanity_ansible_version_number: "{{ ansible_version.full | default('0.0.0') }}" - -- name: Normalize python version text - ansible.builtin.set_fact: - sanity_python_version_text: >- - {{ - (sanity_python_version.stdout | default('') | trim) - if (sanity_python_version.stdout | default('') | trim | length > 0) - else (sanity_python_version.stderr | default('') | trim) - }} - -- name: Split python version parts - ansible.builtin.set_fact: - sanity_python_major: "{{ sanity_python_version_text | regex_search('([0-9]+)\\.([0-9]+)', '\\1') | first | default('0') | int }}" - sanity_python_minor: "{{ sanity_python_version_text | regex_search('([0-9]+)\\.([0-9]+)', '\\2') | first | default('0') | int }}" - -- name: Set status flags - ansible.builtin.set_fact: - sanity_ansible_ok: "{{ sanity_ansible_version_number is version(control_node_sanity_min_ansible_version, '>=') }}" - sanity_python_ok: >- - {{ - (sanity_python_major == control_node_sanity_expected_python_major) - and - (sanity_python_minor >= control_node_sanity_expected_python_minor_min) - }} - sanity_lint_ok: "{{ sanity_ansible_lint.rc == 0 }}" - sanity_cfg_loaded: "{{ 'config file = ' in sanity_cfg_source.stdout and 'config file = None' not in sanity_cfg_source.stdout }}" - -- name: Optionally enforce ansible-lint availability - ansible.builtin.assert: - that: - - sanity_lint_ok - fail_msg: "ansible-lint is required but not installed on this control node" - success_msg: "ansible-lint is installed" - when: control_node_sanity_require_lint | bool - -- name: Assert minimum sanity gates - ansible.builtin.assert: - that: - - sanity_ansible_ok - - sanity_python_ok - - sanity_cfg_loaded - - sanity_inventory_graph.rc == 0 - - 
sanity_syntax_generic_host.rc == 0 - - sanity_syntax_manage_containers.rc == 0 - fail_msg: "Control node sanity gates failed. Review summary output." - success_msg: "Control node sanity gates passed" - -- name: Print control node sanity summary - ansible.builtin.debug: - msg: - - "System: {{ sanity_uname.stdout }}" - - "Ansible core: {{ sanity_ansible_version_number }} (min {{ control_node_sanity_min_ansible_version }})" - - "Python: {{ sanity_python_version.stdout }}" - - "Ansible config loaded: {{ sanity_cfg_loaded }}" - - "Inventory parse: {{ sanity_inventory_graph.rc == 0 }}" - - "Syntax generic_host.yml: {{ sanity_syntax_generic_host.rc == 0 }}" - - "Syntax manage_containers.yml: {{ sanity_syntax_manage_containers.rc == 0 }}" - - "ansible-lint installed: {{ sanity_lint_ok }}" - - "Reality note: host_key_checking is {{ 'disabled' if ('HOST_KEY_CHECKING' in sanity_config_dump.stdout and ' = False' in sanity_config_dump.stdout) else 'not explicitly disabled' }}" diff --git a/ansible/archive/roles/disk_grow/defaults/main.yml b/ansible/archive/roles/disk_grow/defaults/main.yml deleted file mode 100644 index 2acf8e2..0000000 --- a/ansible/archive/roles/disk_grow/defaults/main.yml +++ /dev/null @@ -1,23 +0,0 @@ ---- -# roles/disk_grow/defaults/main.yml -# -# Defaults for in-guest root disk expansion. -# Override per-host in host_vars/.yml if the device path differs -# (e.g. /dev/vda on KVM hosts with virtio-blk, /dev/sda on virtio-scsi). - -# Block device backing the root filesystem -disk_grow_device: "/dev/sda" - -# Partition number to grow (the root partition) -disk_grow_partition_number: "1" - -# Filesystem block device (usually device + partition number) -disk_grow_filesystem: "/dev/sda1" - -# Mount point to verify after grow -disk_grow_mount_point: "/" - -# Minimum acceptable root filesystem size in GB after grow. 
-# Assertion fails if the filesystem is still below this after growpart + resize2fs, -# which indicates qm resize was not run on the Proxmox host first. -disk_grow_min_gb: 20 diff --git a/ansible/archive/roles/disk_grow/tasks/main.yml b/ansible/archive/roles/disk_grow/tasks/main.yml deleted file mode 100644 index 813f806..0000000 --- a/ansible/archive/roles/disk_grow/tasks/main.yml +++ /dev/null @@ -1,90 +0,0 @@ ---- -# roles/disk_grow/tasks/main.yml -# -# Idempotently grows the root partition and filesystem to fill the allocated -# virtual disk. Safe to re-run on already-expanded nodes (no-op). -# -# Prerequisites: -# The Proxmox virtual disk must already be resized via `qm resize` before -# this role runs. This role only handles the in-guest partition/fs layer. -# -# Idempotency contract: -# - growpart: exits 0 + prints CHANGED when it grew; exits 1 + prints NOCHANGE -# when already at disk boundary. failed_when excludes NOCHANGE. -# - resize2fs: prints "Nothing to do!" when filesystem already fills partition. -# Always exits 0. changed_when detects that message. - -# -------------------------------------------------- -# STEP 1: Install growpart (ships in cloud-guest-utils) -# WHY check first: on a 100%-full root disk, apt-get update writes to -# /var/cache/apt and will fail before installing anything. If growpart -# is already present (cloud images ship it), skip the apt step entirely. -# This avoids a chicken-and-egg failure where we need disk space to -# install the tool that creates disk space. 
-# -------------------------------------------------- - -- name: Check if growpart is already present - ansible.builtin.command: which growpart - register: disk_grow_growpart_check - changed_when: false - failed_when: false - tags: [disk_grow, packages] - -- name: Install cloud-guest-utils if growpart is missing - ansible.builtin.apt: - name: cloud-guest-utils - state: present - when: disk_grow_growpart_check.rc != 0 - tags: [disk_grow, packages] - -# -------------------------------------------------- -# STEP 2: Grow partition to fill allocated disk -# WHY growpart not parted module: growpart is the canonical safe tool for -# live root partition expansion without unmounting. -# -------------------------------------------------- - -- name: Grow root partition to fill disk boundary - ansible.builtin.command: > - growpart {{ disk_grow_device }} {{ disk_grow_partition_number }} - register: disk_grow_growpart_result - changed_when: "'CHANGED' in disk_grow_growpart_result.stdout" - failed_when: > - disk_grow_growpart_result.rc != 0 - and 'NOCHANGE' not in disk_grow_growpart_result.stdout - tags: [disk_grow, partition] - -# -------------------------------------------------- -# STEP 3: Grow filesystem to fill the now-expanded partition -# WHY always run: resize2fs is safe on a live ext4 root filesystem and -# does nothing ("Nothing to do!") when already at full size. -# -------------------------------------------------- - -- name: Grow ext4 filesystem to fill partition - ansible.builtin.command: resize2fs {{ disk_grow_filesystem }} - register: disk_grow_resize2fs_result - changed_when: "'Nothing to do' not in disk_grow_resize2fs_result.stderr" - tags: [disk_grow, filesystem] - -# -------------------------------------------------- -# STEP 4: Assert result meets minimum size threshold -# WHY assert not debug: a still-small root means qm resize was skipped; -# hard fail prevents downstream tasks from hitting disk-full errors. 
-# -------------------------------------------------- - -- name: Get root filesystem total size - ansible.builtin.command: df -BG {{ disk_grow_mount_point }} --output=size - register: disk_grow_df_result - changed_when: false - tags: [disk_grow, verify] - -- name: Assert root filesystem meets minimum size ({{ disk_grow_min_gb }}G) - ansible.builtin.assert: - that: - - disk_grow_df_result.stdout_lines[1] | regex_replace('[^0-9]', '') | int >= disk_grow_min_gb - fail_msg: >- - Root filesystem at {{ disk_grow_mount_point }} is - {{ disk_grow_df_result.stdout_lines[1] | trim }} -- below the minimum {{ disk_grow_min_gb }}G. - Run 'qm resize scsi0 32G' on the Proxmox host first, then re-run this playbook. - success_msg: >- - Root filesystem: {{ disk_grow_df_result.stdout_lines[1] | trim }} ✓ - tags: [disk_grow, verify] diff --git a/ansible/archive/roles/monitoring_stack/defaults/main.yml b/ansible/archive/roles/monitoring_stack/defaults/main.yml deleted file mode 100644 index 0761938..0000000 --- a/ansible/archive/roles/monitoring_stack/defaults/main.yml +++ /dev/null @@ -1,116 +0,0 @@ ---- -# roles/monitoring_stack/defaults/main.yml -# Watchtower monitoring stack configuration -# Environment-specific values should be defined in group_vars or inventory - -# === DEPLOYMENT SETTINGS === -stack_dir: "/opt/stacks/watchtower" -chester_user: "{{ (monitoring | default({})).get('stack_user', 'chester') }}" - -# Focused rollout controls: deploy one service at a time when enabled. 
-monitoring_focus_mode: false -monitoring_focus_service: "prometheus" - -# === NETWORK CONFIGURATION === -heimdall_redis: "{{ (monitoring | default({})).get('heimdall_redis', '10.0.0.151:6379') }}" -watchtower_ip: "{{ (monitoring | default({})).get('watchtower_ip', '10.0.0.200') }}" - -# === PROMETHEUS SETTINGS === -prometheus_retention: "15d" -prometheus_scrape_interval: "15s" -prometheus_port: 9090 -prometheus_host_port: 9091 - -# === GRAFANA SETTINGS === -grafana_port: 3000 -grafana_domain: "{{ (monitoring | default({})).get('grafana_domain', 'grafana.castaldifamily.com') }}" -grafana_admin_user: "admin" -# grafana_admin_password: MUST be defined in inventory (vault-encrypted recommended) -grafana_prometheus_datasource_name: "Prometheus" -grafana_prometheus_datasource_uid: "fffcnxoznd2bkc" -grafana_prometheus_url: "http://prometheus:9090" -grafana_loki_datasource_name: "Loki" -grafana_loki_datasource_uid: "loki-homelab" -grafana_loki_url: "http://loki:3100" -grafana_dashboards_folder: "Homelab" - -# === LOKI SETTINGS (Log Aggregation) === -loki_port: 3100 -loki_retention: "168h" # 7 days - -# === BLACKBOX SETTINGS (Endpoint / Network Probing) === -blackbox_port: 9115 -blackbox_exporter_image: "prom/blackbox-exporter:latest" - -# Targets probed from Watchtower for network and service reachability. 
-# Scheme examples: -# - ICMP: 10.0.0.2 -# - TCP: 10.0.0.151:443 -# - HTTP: https://grafana.castaldifamily.com -monitoring_probe_targets: - - name: omada-er7212pc-gateway - module: icmp - target: "10.0.0.2" - - name: edge-traefik-https - module: tcp_connect - target: "10.0.0.151:443" - - name: watchtower-http-prometheus - module: http_2xx - target: "http://{{ watchtower_ip }}:{{ prometheus_host_port }}/-/ready" - # === PROXMOX CLUSTER REACHABILITY === - - name: pve01-icmp - module: icmp - target: "10.0.0.201" - - name: pve02-icmp - module: icmp - target: "10.0.0.202" - - name: pve03-icmp - module: icmp - target: "10.0.0.203" - - name: pve01-web - module: http_2xx - target: "https://10.0.0.201:8006" - - name: pve02-web - module: http_2xx - target: "https://10.0.0.202:8006" - - name: pve03-web - module: http_2xx - target: "https://10.0.0.203:8006" - -# === PROXMOX API EXPORTER SETTINGS === -pve_exporter_port: 9221 -pve_exporter_config_dir: "{{ stack_dir }}/pve-exporter-config" -pve_exporter_token_name: "monitoring" -# Resolved in playbook pre_tasks from vault_vars.vault_pve_exporter_token -# (or PVE_EXPORTER_TOKEN environment variable fallback). -pve_exporter_token: "" -pve_exporter_verify_ssl: false - -# === UPTIME-KUMA SETTINGS === -uptime_kuma_port: 3001 -uptime_domain: "{{ (monitoring | default({})).get('uptime_domain', 'status.castaldifamily.com') }}" - -# === DOZZLE SETTINGS === -dozzle_port: 8080 -dozzle_domain: "{{ (monitoring | default({})).get('dozzle_domain', 'logs.castaldifamily.com') }}" -dozzle_agent_port: 7007 -monitoring_enable_dozzle: true -# Temporary operating mode: Authentik is offline, so keep outpost disabled. -monitoring_enable_authentik_outpost: false -# Keep Dozzle externally reachable while Authentik is unavailable. 
-dozzle_expose_via_traefik: true - -# === SECURITY: Authentik Integration === -authentik_host: "{{ (monitoring | default({})).get('authentik_host', 'https://sso.castaldifamily.com') }}" -authentik_outpost_port: 9000 -authentik_outpost_dozzle_token: "" # Set via group_vars or environment variable - -# === PORTAINER SETTINGS === -portainer_http_port: 9000 -portainer_https_port: 9443 -portainer_edge_port: 8000 -portainer_domain: "{{ (monitoring | default({})).get('portainer_domain', 'portainer.castaldifamily.com') }}" - -# === PRO-TIP: Scrape Target Discovery === -# We'll dynamically generate Prometheus targets from Ansible inventory -# This eliminates manual IP management and enables auto-scaling diff --git a/ansible/archive/roles/monitoring_stack/handlers/main.yml b/ansible/archive/roles/monitoring_stack/handlers/main.yml deleted file mode 100644 index 9c6acb8..0000000 --- a/ansible/archive/roles/monitoring_stack/handlers/main.yml +++ /dev/null @@ -1,10 +0,0 @@ ---- -# roles/monitoring_stack/handlers/main.yml -# Handlers for monitoring stack updates - -- name: Restart monitoring stack - community.docker.docker_compose_v2: - project_src: "{{ stack_dir }}" - state: restarted - docker_host: "unix:///run/user/1000/docker.sock" - listen: Restart monitoring stack diff --git a/ansible/archive/roles/monitoring_stack/tasks/main.yml b/ansible/archive/roles/monitoring_stack/tasks/main.yml deleted file mode 100644 index 7d8d315..0000000 --- a/ansible/archive/roles/monitoring_stack/tasks/main.yml +++ /dev/null @@ -1,246 +0,0 @@ ---- -# roles/monitoring_stack/tasks/main.yml -# Deploy and configure the complete monitoring stack on Watchtower - -- name: Resolve focused deployment selection - ansible.builtin.set_fact: - monitoring_selected_services: >- - {{ - [monitoring_focus_service] - if (monitoring_focus_mode | bool) - else [ - 'traefik-kop', - 'prometheus', - 'grafana', - 'uptime-kuma', - 'node-exporter', - 'watchtower-cadvisor', - 'blackbox-exporter', - 'dozzle', - 
'authentik-outpost-dozzle', - 'loki', - 'promtail', - 'portainer' - ] - }} - -- name: Show selected monitoring services - ansible.builtin.debug: - msg: - - "Focus mode: {{ monitoring_focus_mode | bool }}" - - "Selected service set: {{ monitoring_selected_services }}" - -- name: Validate supported focused service target - ansible.builtin.assert: - that: - - monitoring_focus_service in ['prometheus', 'node-exporter', 'watchtower-cadvisor', 'blackbox-exporter'] - fail_msg: >- - Unsupported monitoring_focus_service='{{ monitoring_focus_service }}'. - Supported focused services: prometheus, node-exporter, watchtower-cadvisor, blackbox-exporter. - when: monitoring_focus_mode | bool - -- name: Validate Grafana admin password is defined - ansible.builtin.assert: - that: - - grafana_admin_password is defined - - grafana_admin_password | length > 0 - - grafana_admin_password not in ['change-me-now', 'changeme', 'admin', 'password'] - fail_msg: "grafana_admin_password must be defined in inventory with a secure value (not a default placeholder)" - success_msg: "Grafana password validation passed" - when: "'grafana' in monitoring_selected_services" - -- name: Validate Authentik outpost token is defined - ansible.builtin.assert: - that: - - authentik_outpost_dozzle_token is defined - - authentik_outpost_dozzle_token | trim | length > 0 - - authentik_outpost_dozzle_token != 'your-authentik-token-here' - fail_msg: "authentik_outpost_dozzle_token is required (vault or environment) and cannot be empty" - success_msg: "Authentik outpost token validation passed" - when: monitoring_enable_authentik_outpost | bool - and 'authentik-outpost-dozzle' in monitoring_selected_services - no_log: true - -- name: Create monitoring directories - become: true - ansible.builtin.file: - path: "{{ item }}" - state: directory - owner: "{{ chester_user }}" - group: "{{ chester_user }}" - mode: '0755' - loop: - - "{{ stack_dir }}" - - "{{ stack_dir }}/prometheus-data" - - "{{ stack_dir 
}}/prometheus-config" - - "{{ stack_dir }}/prometheus-config/alerts" - - "{{ stack_dir }}/grafana-data" - - "{{ stack_dir }}/grafana-provisioning" - - "{{ stack_dir }}/grafana-provisioning/datasources" - - "{{ stack_dir }}/grafana-provisioning/plugins" - - "{{ stack_dir }}/grafana-provisioning/alerting" - - "{{ stack_dir }}/grafana-provisioning/dashboards" - - "{{ stack_dir }}/grafana-provisioning/dashboards/homelab" - - "{{ stack_dir }}/uptime-kuma-data" - - "{{ stack_dir }}/dozzle-data" - - "{{ stack_dir }}/loki-data" - - "{{ stack_dir }}/loki-config" - - "{{ stack_dir }}/promtail-data" - - "{{ stack_dir }}/promtail-config" - - "{{ stack_dir }}/blackbox-config" - - "{{ stack_dir }}/portainer-data" - - "{{ pve_exporter_config_dir }}" - -- name: Render Prometheus configuration - ansible.builtin.template: - src: prometheus.yml.j2 - dest: "{{ stack_dir }}/prometheus-config/prometheus.yml" - owner: "{{ chester_user }}" - group: "{{ chester_user }}" - mode: '0644' - notify: Restart monitoring stack - -- name: Render Prometheus alert rules - ansible.builtin.template: - src: alert-rules.yml.j2 - dest: "{{ stack_dir }}/prometheus-config/alerts/homelab.yml" - owner: "{{ chester_user }}" - group: "{{ chester_user }}" - mode: '0644' - notify: Restart monitoring stack - -- name: Render Grafana datasource provisioning - ansible.builtin.template: - src: grafana-datasource.yml.j2 - dest: "{{ stack_dir }}/grafana-provisioning/datasources/prometheus.yml" - owner: "{{ chester_user }}" - group: "{{ chester_user }}" - mode: '0644' - notify: Restart monitoring stack - -- name: Render Grafana dashboard provider provisioning - ansible.builtin.template: - src: grafana-dashboard-provider.yml.j2 - dest: "{{ stack_dir }}/grafana-provisioning/dashboards/homelab-provider.yml" - owner: "{{ chester_user }}" - group: "{{ chester_user }}" - mode: '0644' - notify: Restart monitoring stack - -- name: Render Grafana homelab overview dashboard - ansible.builtin.template: - src: 
grafana-homelab-overview.json.j2 - dest: "{{ stack_dir }}/grafana-provisioning/dashboards/homelab/homelab-overview.json" - owner: "{{ chester_user }}" - group: "{{ chester_user }}" - mode: '0644' - notify: Restart monitoring stack - -- name: Render Grafana swarm health dashboard - ansible.builtin.template: - src: grafana-swarm-health.json.j2 - dest: "{{ stack_dir }}/grafana-provisioning/dashboards/homelab/swarm-health.json" - owner: "{{ chester_user }}" - group: "{{ chester_user }}" - mode: '0644' - notify: Restart monitoring stack - -- name: Render Grafana blackbox reachability dashboard - ansible.builtin.template: - src: grafana-blackbox-reachability.json.j2 - dest: "{{ stack_dir }}/grafana-provisioning/dashboards/homelab/blackbox-reachability.json" - owner: "{{ chester_user }}" - group: "{{ chester_user }}" - mode: '0644' - notify: Restart monitoring stack - -- name: Render Grafana monitoring coverage dashboard - ansible.builtin.template: - src: grafana-monitoring-coverage.json.j2 - dest: "{{ stack_dir }}/grafana-provisioning/dashboards/homelab/monitoring-coverage.json" - owner: "{{ chester_user }}" - group: "{{ chester_user }}" - mode: '0644' - notify: Restart monitoring stack - -- name: Render Loki configuration - ansible.builtin.template: - src: loki-config.yml.j2 - dest: "{{ stack_dir }}/loki-config/loki-config.yml" - owner: "{{ chester_user }}" - group: "{{ chester_user }}" - mode: '0644' - notify: Restart monitoring stack - -- name: Render Promtail configuration - ansible.builtin.template: - src: promtail-config.yml.j2 - dest: "{{ stack_dir }}/promtail-config/promtail-config.yml" - owner: "{{ chester_user }}" - group: "{{ chester_user }}" - mode: '0644' - notify: Restart monitoring stack - -- name: Render Blackbox exporter configuration - ansible.builtin.template: - src: blackbox.yml.j2 - dest: "{{ stack_dir }}/blackbox-config/blackbox.yml" - owner: "{{ chester_user }}" - group: "{{ chester_user }}" - mode: '0644' - notify: Restart monitoring stack - -- 
name: Render pve-exporter configuration - ansible.builtin.template: - src: pve-exporter.yml.j2 - dest: "{{ pve_exporter_config_dir }}/pve.yml" - owner: "{{ chester_user }}" - group: "{{ chester_user }}" - mode: '0600' - no_log: true - notify: Restart monitoring stack - -- name: Render watchtower compose specification - ansible.builtin.template: - src: "{{ 'docker-compose.focus.j2' if (monitoring_focus_mode | bool) else 'docker-compose.yml.j2' }}" - dest: "{{ stack_dir }}/docker-compose.yml" - owner: "{{ chester_user }}" - group: "{{ chester_user }}" - mode: '0644' - notify: Restart monitoring stack - -- name: Render watchtower environment file - ansible.builtin.template: - src: env.j2 - dest: "{{ stack_dir }}/.env" - owner: "{{ chester_user }}" - group: "{{ chester_user }}" - mode: '0600' - no_log: true - notify: Restart monitoring stack - -- name: Launch watchtower monitoring stack - community.docker.docker_compose_v2: - project_src: "{{ stack_dir }}" - state: present - pull: always - docker_host: "unix:///run/user/1000/docker.sock" - remove_orphans: false - register: compose_result - -- name: Display deployed services - ansible.builtin.debug: - msg: - - "🎯 Monitoring Stack Deployed Successfully!" - - " Selected services: {{ monitoring_selected_services }}" - - " 📊 Prometheus: http://{{ watchtower_ip }}:{{ prometheus_host_port }}" - - " 📈 Grafana: {{ 'enabled' if 'grafana' in monitoring_selected_services else 'skipped in focus mode' }}" - - " ✅ Uptime Kuma: {{ 'enabled' if 'uptime-kuma' in monitoring_selected_services else 'skipped in focus mode' }}" - - " 📋 Dozzle: {{ 'enabled' if 'dozzle' in monitoring_selected_services else 'skipped in focus mode' }}" - - " 📝 Loki: {{ 'enabled' if 'loki' in monitoring_selected_services else 'skipped in focus mode' }}" - - " 🌐 Blackbox: {{ 'enabled' if 'blackbox-exporter' in monitoring_selected_services else 'skipped in focus mode' }}" - - "" - - "🔍 Next Steps:" - - " 1.
Access Grafana and verify Prometheus + Loki datasources" - - " 2. Review the '{{ grafana_dashboards_folder }}' dashboard folder" - - " 3. Configure Uptime Kuma health checks" diff --git a/ansible/archive/roles/monitoring_stack/templates/alert-rules.yml.j2 b/ansible/archive/roles/monitoring_stack/templates/alert-rules.yml.j2 deleted file mode 100644 index 75a343f..0000000 --- a/ansible/archive/roles/monitoring_stack/templates/alert-rules.yml.j2 +++ /dev/null @@ -1,111 +0,0 @@ ---- -# roles/monitoring_stack/templates/alert-rules.yml.j2 -# Prometheus alerting rules for homelab monitoring - -# Jinja2 escaping: Prometheus template syntax is wrapped in raw blocks -{% raw %} -groups: - - name: node_health - interval: 30s - rules: - # === ALERT: High CPU Usage === - - alert: HighCPUUsage - expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80 - for: 5m - labels: - severity: warning - annotations: - summary: "High CPU usage on {{ $labels.instance }}" - description: "CPU usage is above 80% for 5 minutes (current: {{ $value }}%)" - - # === ALERT: High Memory Usage === - - alert: HighMemoryUsage - expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85 - for: 5m - labels: - severity: warning - annotations: - summary: "High memory usage on {{ $labels.instance }}" - description: "Memory usage is above 85% (current: {{ $value }}%)" - - # === ALERT: Low Disk Space === - - alert: LowDiskSpace - expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 < 15 - for: 5m - labels: - severity: critical - annotations: - summary: "Low disk space on {{ $labels.instance }}" - description: "Root filesystem has less than 15% free space (current: {{ $value }}%)" - - # === ALERT: Node Down === - - alert: NodeDown - expr: up{job=~"swarm-.*|watchtower-node"} == 0 - for: 2m - labels: - severity: critical - annotations: - summary: "Node {{ $labels.instance }} is down" - description: "The 
node has been unreachable for 2 minutes" - - - name: swarm_health - interval: 30s - rules: - # === ALERT: Swarm Manager Down === - - alert: SwarmManagerDown - expr: up{job="swarm-managers-node"} == 0 - for: 1m - labels: - severity: critical - annotations: - summary: "Swarm manager {{ $labels.instance }} is down" - description: "A swarm manager node has been unreachable for 1 minute. Check cluster quorum!" - - # === ALERT: High Container Memory Usage === - - alert: HighContainerMemory - expr: (container_memory_usage_bytes / container_spec_memory_limit_bytes) * 100 > 90 - for: 5m - labels: - severity: warning - annotations: - summary: "Container {{ $labels.name }} high memory usage" - description: "Container is using over 90% of its memory limit (current: {{ $value }}%)" - - - name: proxmox_health - interval: 30s - rules: - # === ALERT: Proxmox node unreachable via pve_exporter === - - alert: ProxmoxNodeDown - expr: up{job="proxmox"} == 0 - for: 2m - labels: - severity: critical - annotations: - summary: "Proxmox node {{ $labels.instance }} unreachable" - description: "pve_exporter cannot reach {{ $labels.instance }} for 2 minutes. Verify API token and network path." - - # === ALERT: QEMU VM in non-running state === - - alert: ProxmoxVMDown - expr: pve_up{type="qemu"} == 0 - for: 5m - labels: - severity: warning - annotations: - summary: "VM {{ $labels.name }} is stopped ({{ $labels.id }})" - description: "A QEMU VM has been in a non-running state for 5 minutes on cluster pve." - - # === ALERT: Proxmox datastore filling up === - - alert: ProxmoxStorageFull - expr: pve_disk_usage_bytes / pve_disk_size_bytes > 0.85 - for: 5m - labels: - severity: warning - annotations: - summary: "Proxmox storage {{ $labels.id }} is above 85% full" - description: "Datastore {{ $labels.id }} usage is at {{ $value | humanizePercentage }}. Review and prune old backups/snapshots." 
-{% endraw %} - -# === PRO-TIP: Alert Routing === -# Connect these alerts to Alertmanager for notifications -# (Email, Slack, PagerDuty, etc.) -# See: https://prometheus.io/docs/alerting/latest/configuration/ diff --git a/ansible/archive/roles/monitoring_stack/templates/blackbox.yml.j2 b/ansible/archive/roles/monitoring_stack/templates/blackbox.yml.j2 deleted file mode 100644 index 80c7bf3..0000000 --- a/ansible/archive/roles/monitoring_stack/templates/blackbox.yml.j2 +++ /dev/null @@ -1,19 +0,0 @@ -modules: - http_2xx: - prober: http - timeout: 5s - http: - method: GET - preferred_ip_protocol: ip4 - - tcp_connect: - prober: tcp - timeout: 5s - tcp: - preferred_ip_protocol: ip4 - - icmp: - prober: icmp - timeout: 5s - icmp: - preferred_ip_protocol: ip4 diff --git a/ansible/archive/roles/monitoring_stack/templates/docker-compose.focus.j2 b/ansible/archive/roles/monitoring_stack/templates/docker-compose.focus.j2 deleted file mode 100644 index 759b33c..0000000 --- a/ansible/archive/roles/monitoring_stack/templates/docker-compose.focus.j2 +++ /dev/null @@ -1,98 +0,0 @@ -# Focused single-service compose for stepwise monitoring rollout. -# Generated when monitoring_focus_mode=true. 
- -services: -{% if monitoring_focus_service == 'prometheus' %} - prometheus: - image: prom/prometheus:latest - container_name: prometheus - user: "0:0" - restart: unless-stopped - ports: - - "{{ prometheus_host_port }}:9090" - volumes: - - ./prometheus-config:/etc/prometheus - - ./prometheus-data:/prometheus - command: - - '--config.file=/etc/prometheus/prometheus.yml' - - '--storage.tsdb.retention.time={{ prometheus_retention }}' - - '--storage.tsdb.path=/prometheus' - - '--web.enable-lifecycle' - networks: - - monitoring - labels: - - "traefik.enable=false" -{% elif monitoring_focus_service == 'node-exporter' %} - node-exporter: - image: prom/node-exporter:latest - container_name: node-exporter - restart: unless-stopped - command: - - '--path.procfs=/host/proc' - - '--path.sysfs=/host/sys' - - '--path.rootfs=/rootfs' - - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)' - volumes: - - /proc:/host/proc:ro - - /sys:/host/sys:ro - - /:/rootfs:ro - ports: - - "9100:9100" - networks: - - monitoring - security_opt: - - no-new-privileges:true - read_only: true -{% elif monitoring_focus_service == 'watchtower-cadvisor' %} - watchtower-cadvisor: - image: gcr.io/cadvisor/cadvisor:latest - container_name: watchtower-cadvisor - restart: unless-stopped - command: - - '--housekeeping_interval=30s' - - '--docker_only=true' - - '--store_container_labels=false' - - '--disable_metrics=advtcp,udp,process,sched,referenced_memory,resctrl' - volumes: - - /:/rootfs:ro - - /var/run:/var/run:ro - - /sys:/sys:ro - - /var/lib/docker:/var/lib/docker:ro - - /dev/disk:/dev/disk:ro - ports: - - "18080:8080" - networks: - - monitoring - labels: - - "traefik.enable=false" -{% elif monitoring_focus_service == 'blackbox-exporter' %} - blackbox-exporter: - image: {{ blackbox_exporter_image }} - container_name: blackbox-exporter - restart: unless-stopped - command: - - '--config.file=/etc/blackbox_exporter/blackbox.yml' - ports: - - "{{ blackbox_port }}:9115" - volumes: 
- - ./blackbox-config:/etc/blackbox_exporter:ro - networks: - - monitoring - labels: - - "traefik.enable=false" -{% else %} - # Unsupported focus target selected. - # Keep compose valid and fail early with an obvious no-op service. - focus-placeholder: - image: alpine:3.20 - container_name: focus-placeholder - command: ["sh", "-c", "echo Unsupported monitoring_focus_service='{{ monitoring_focus_service }}'; sleep 3600"] - restart: "no" - networks: - - monitoring -{% endif %} - -networks: - monitoring: - driver: bridge - name: monitoring diff --git a/ansible/archive/roles/monitoring_stack/templates/docker-compose.yml.j2 b/ansible/archive/roles/monitoring_stack/templates/docker-compose.yml.j2 deleted file mode 100644 index 950e79b..0000000 --- a/ansible/archive/roles/monitoring_stack/templates/docker-compose.yml.j2 +++ /dev/null @@ -1,294 +0,0 @@ -# roles/monitoring_stack/templates/docker-compose.yml.j2 -# Complete Watchtower monitoring stack with swarm observability - -# === CONCEPT: Docker Compose for Orchestration === -# This file defines the DESIRED STATE of our monitoring infrastructure -# Docker will continuously reconcile to maintain this state - -services: - # === TRAFFIC ROUTER: traefik-kop === - # Syncs Traefik configuration from Heimdall Redis KV store - traefik-kop: - image: ghcr.io/jittering/traefik-kop:latest - container_name: traefik-kop-agent - restart: unless-stopped - volumes: - - /run/user/1000/docker.sock:/var/run/docker.sock:ro - environment: - - REDIS_ADDR={{ heimdall_redis }} - - BIND_IP={{ watchtower_ip }} - networks: - - monitoring - - # === METRICS STORAGE: Prometheus === - prometheus: - image: prom/prometheus:latest - container_name: prometheus - user: "0:0" - restart: unless-stopped - ports: - - "{{ prometheus_host_port }}:9090" - volumes: - - ./prometheus-config:/etc/prometheus - - ./prometheus-data:/prometheus - command: - - '--config.file=/etc/prometheus/prometheus.yml' - - '--storage.tsdb.retention.time={{ prometheus_retention }}' - - 
'--storage.tsdb.path=/prometheus' - - '--web.enable-lifecycle' - networks: - - monitoring - labels: - - "traefik.enable=false" - - # === VISUALIZATION: Grafana === - grafana: - image: grafana/grafana-oss:latest - container_name: grafana - user: "0:0" - restart: unless-stopped - ports: - - "{{ grafana_port }}:3000" - environment: - - GF_SERVER_ROOT_URL=https://{{ grafana_domain }} - - GF_SECURITY_ADMIN_USER=${GF_ADMIN_USER:-admin} - - GF_SECURITY_ADMIN_PASSWORD=${GF_ADMIN_PASSWORD:-admin} - volumes: - - ./grafana-data:/var/lib/grafana - - ./grafana-provisioning:/etc/grafana/provisioning:ro - networks: - - monitoring - labels: - - "traefik.enable=true" - - "traefik.http.routers.grafana.rule=Host(`{{ grafana_domain }}`)" - - "traefik.http.routers.grafana.entrypoints=websecure" - - "traefik.http.routers.grafana.tls.certresolver=myresolver" - - "traefik.http.services.grafana.loadbalancer.server.port={{ grafana_port }}" - - # === UPTIME MONITORING: Uptime Kuma === - uptime-kuma: - image: louislam/uptime-kuma:1 - container_name: uptime-kuma - user: "0:0" - restart: unless-stopped - ports: - - "{{ uptime_kuma_port }}:3001" - volumes: - - ./uptime-kuma-data:/app/data - networks: - - monitoring - labels: - - "traefik.enable=true" - - "traefik.http.routers.uptime.rule=Host(`{{ uptime_domain }}`)" - - "traefik.http.routers.uptime.entrypoints=websecure" - - "traefik.http.routers.uptime.tls.certresolver=myresolver" - - "traefik.http.services.uptime.loadbalancer.server.port={{ uptime_kuma_port }}" - - # === HOST METRICS: Node Exporter === - # Collects metrics from Watchtower itself - node-exporter: - image: prom/node-exporter:latest - container_name: node-exporter - restart: unless-stopped - command: - - '--path.procfs=/host/proc' - - '--path.sysfs=/host/sys' - - '--path.rootfs=/rootfs' - - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)' - volumes: - - /proc:/host/proc:ro - - /sys:/host/sys:ro - - /:/rootfs:ro - ports: - - "9100:9100" - networks: - - 
monitoring - security_opt: - - no-new-privileges:true - read_only: true - - # === WATCHTOWER CONTAINER METRICS: cAdvisor === - # Captures local container resource usage on Watchtower itself. - watchtower-cadvisor: - image: gcr.io/cadvisor/cadvisor:latest - container_name: watchtower-cadvisor - restart: unless-stopped - command: - - '--housekeeping_interval=30s' - - '--docker_only=true' - - '--store_container_labels=false' - - '--disable_metrics=advtcp,udp,process,sched,referenced_memory,resctrl' - volumes: - - /:/rootfs:ro - - /var/run:/var/run:ro - - /sys:/sys:ro - - /var/lib/docker:/var/lib/docker:ro - - /dev/disk:/dev/disk:ro - ports: - - "18080:8080" - networks: - - monitoring - labels: - - "traefik.enable=false" - - # === NETWORK/ENDPOINT PROBING: Blackbox Exporter === - blackbox-exporter: - image: {{ blackbox_exporter_image }} - container_name: blackbox-exporter - restart: unless-stopped - command: - - '--config.file=/etc/blackbox_exporter/blackbox.yml' - ports: - - "{{ blackbox_port }}:9115" - volumes: - - ./blackbox-config:/etc/blackbox_exporter:ro - networks: - - monitoring - labels: - - "traefik.enable=false" - - # === PROXMOX API METRICS: pve_exporter === - # Authenticates to Proxmox API via read-only token and exposes VM/node/storage metrics. - # Credentials are stored in pve-exporter-config/pve.yml (mode 0600, vault-sourced). 
- pve-exporter: - image: prompve/prometheus-pve-exporter:latest - container_name: pve-exporter - user: "0:0" - restart: unless-stopped - ports: - - "{{ pve_exporter_port }}:9221" - volumes: - - ./pve-exporter-config:/etc/prometheus:ro - networks: - - monitoring - labels: - - "traefik.enable=false" - - # === LOG VIEWER: Dozzle === -{% if monitoring_enable_dozzle | bool %} - dozzle: - image: amir20/dozzle:v9.0.1 - container_name: dozzle - user: "0:0" - restart: unless-stopped - ports: - - "{{ dozzle_port }}:8080" - read_only: true - volumes: - - /run/user/1000/docker.sock:/var/run/docker.sock:ro - - ./dozzle-data:/data - environment: - - TZ=America/New_York - - "DOZZLE_REMOTE_AGENT={% for host in groups['swarm_hosts'] %}{{ hostvars[host].ansible_host }}:{{ dozzle_agent_port }}{% if not loop.last %},{% endif %}{% endfor %}" - logging: - driver: "json-file" - options: - max-size: "10m" - max-file: "3" - tmpfs: - - /tmp - security_opt: - - no-new-privileges:true - cap_drop: - - ALL - networks: - - monitoring - labels: - - "traefik.enable={{ 'true' if dozzle_expose_via_traefik | bool else 'false' }}" -{% if dozzle_expose_via_traefik | bool %} - - "traefik.http.routers.dozzle.rule=Host(`{{ dozzle_domain }}`)" - - "traefik.http.routers.dozzle.entrypoints=websecure" - - "traefik.http.routers.dozzle.tls.certresolver=myresolver" - - "traefik.http.services.dozzle.loadbalancer.server.port={{ dozzle_port }}" -{% if monitoring_enable_authentik_outpost | bool %} - - "traefik.http.routers.dozzle.middlewares=authentik-outpost-dozzle@redis" -{% endif %} -{% endif %} - - # === AUTHENTICATION: Authentik Outpost === -{% if monitoring_enable_authentik_outpost | bool %} - authentik-outpost-dozzle: - image: ghcr.io/goauthentik/proxy:2025.10.3 - container_name: authentik-outpost-dozzle - restart: unless-stopped - ports: - - "{{ authentik_outpost_port }}:9000" - environment: - - AUTHENTIK_HOST={{ authentik_host }} - - AUTHENTIK_INSECURE=false - - 
AUTHENTIK_TOKEN=${AUTHENTIK_OUTPOST_DOZZLE_TOKEN} - - AUTHENTIK_HOST_BROWSER={{ authentik_host }} - networks: - - monitoring - labels: - - "traefik.enable=true" - - "traefik.http.routers.authentik-outpost-dozzle.rule=Host(`{{ dozzle_domain }}`) && PathPrefix(`/outpost.goauthentik.io/`)" - - "traefik.http.routers.authentik-outpost-dozzle.entrypoints=websecure" - - "traefik.http.routers.authentik-outpost-dozzle.tls.certresolver=myresolver" - - "traefik.http.middlewares.authentik-outpost-dozzle.forwardauth.address=http://{{ watchtower_ip }}:{{ authentik_outpost_port }}/outpost.goauthentik.io/auth/traefik" - - "traefik.http.middlewares.authentik-outpost-dozzle.forwardauth.trustforwardheader=true" - - "traefik.http.middlewares.authentik-outpost-dozzle.forwardauth.authresponseheaders=X-authentik-username,X-authentik-groups,X-authentik-email,X-authentik-name,X-authentik-uid" - - "traefik.http.services.authentik-outpost-dozzle.loadbalancer.server.port={{ authentik_outpost_port }}" -{% endif %} -{% endif %} - - # === CONTAINER MANAGEMENT: Portainer === - portainer: - image: portainer/portainer-ce:latest - container_name: portainer - restart: unless-stopped - ports: - - "{{ portainer_http_port }}:9000" - - "{{ portainer_https_port }}:9443" - - "{{ portainer_edge_port }}:8000" - volumes: - - /run/user/1000/docker.sock:/var/run/docker.sock:ro - - ./portainer-data:/data - networks: - - monitoring - labels: - - "traefik.enable=true" - - "traefik.http.routers.portainer.rule=Host(`{{ portainer_domain }}`)" - - "traefik.http.routers.portainer.entrypoints=websecure" - - "traefik.http.routers.portainer.tls.certresolver=myresolver" - - "traefik.http.services.portainer.loadbalancer.server.port={{ portainer_http_port }}" - - # === LOG AGGREGATION: Loki === - # "Prometheus for logs" - indexes labels, not content - loki: - image: grafana/loki:latest - container_name: loki - user: "0:0" - restart: unless-stopped - ports: - - "{{ loki_port }}:3100" - volumes: - - ./loki-config:/etc/loki - - 
./loki-data:/loki - command: -config.file=/etc/loki/loki-config.yml - networks: - - monitoring - labels: - - "traefik.enable=false" - - # === LOG SHIPPER: Promtail === - # Reads Docker logs and ships to Loki - promtail: - image: grafana/promtail:latest - container_name: promtail - restart: unless-stopped - volumes: - - ./promtail-config:/etc/promtail - - /var/run/docker.sock:/var/run/docker.sock:ro - command: -config.file=/etc/promtail/promtail-config.yml - networks: - - monitoring - depends_on: - - loki - -# === BEST PRACTICE: Dedicated Network === -# Isolates monitoring traffic from production workloads -networks: - monitoring: - driver: bridge - name: monitoring diff --git a/ansible/archive/roles/monitoring_stack/templates/env.j2 b/ansible/archive/roles/monitoring_stack/templates/env.j2 deleted file mode 100644 index 591c4f2..0000000 --- a/ansible/archive/roles/monitoring_stack/templates/env.j2 +++ /dev/null @@ -1,5 +0,0 @@ -{% if monitoring_enable_authentik_outpost | bool %} -AUTHENTIK_OUTPOST_DOZZLE_TOKEN={{ authentik_outpost_dozzle_token }} -{% endif %} -GF_ADMIN_USER={{ grafana_admin_user }} -GF_ADMIN_PASSWORD={{ grafana_admin_password }} diff --git a/ansible/archive/roles/monitoring_stack/templates/grafana-blackbox-reachability.json.j2 b/ansible/archive/roles/monitoring_stack/templates/grafana-blackbox-reachability.json.j2 deleted file mode 100644 index d42f2c9..0000000 --- a/ansible/archive/roles/monitoring_stack/templates/grafana-blackbox-reachability.json.j2 +++ /dev/null @@ -1,276 +0,0 @@ -{ - "annotations": { - "list": [ - { - "builtIn": 1, - "datasource": { - "type": "grafana", - "uid": "-- Grafana --" - }, - "enable": true, - "hide": true, - "iconColor": "rgba(0, 211, 255, 1)", - "name": "Annotations & Alerts", - "type": "dashboard" - } - ] - }, - "editable": true, - "fiscalYearStartMonth": 0, - "graphTooltip": 0, - "id": null, - "links": [], - "liveNow": false, - "panels": [ - { - "datasource": { - "type": "prometheus", - "uid": "{{ 
grafana_prometheus_datasource_uid }}" - }, - "fieldConfig": { - "defaults": { - "color": { - "mode": "thresholds" - }, - "thresholds": { - "mode": "absolute", - "steps": [ - { - "color": "red", - "value": null - }, - { - "color": "green", - "value": 1 - } - ] - } - }, - "overrides": [] - }, - "gridPos": { - "h": 5, - "w": 12, - "x": 0, - "y": 0 - }, - "id": 1, - "options": { - "colorMode": "background", - "graphMode": "none", - "justifyMode": "center", - "orientation": "auto", - "reduceOptions": { - "calcs": [ - "lastNotNull" - ], - "fields": "", - "values": false - }, - "textMode": "value" - }, - "pluginVersion": "11.0.0", - "targets": [ - { - "expr": "sum(probe_success{job=\"blackbox-probes\"})", - "instant": true, - "refId": "A" - } - ], - "title": "Successful probes", - "type": "stat" - }, - { - "datasource": { - "type": "prometheus", - "uid": "{{ grafana_prometheus_datasource_uid }}" - }, - "fieldConfig": { - "defaults": { - "color": { - "mode": "thresholds" - }, - "thresholds": { - "mode": "absolute", - "steps": [ - { - "color": "green", - "value": null - }, - { - "color": "red", - "value": 1 - } - ] - } - }, - "overrides": [] - }, - "gridPos": { - "h": 5, - "w": 12, - "x": 12, - "y": 0 - }, - "id": 2, - "options": { - "colorMode": "background", - "graphMode": "none", - "justifyMode": "center", - "orientation": "auto", - "reduceOptions": { - "calcs": [ - "lastNotNull" - ], - "fields": "", - "values": false - }, - "textMode": "value" - }, - "pluginVersion": "11.0.0", - "targets": [ - { - "expr": "count(probe_success{job=\"blackbox-probes\"} == 0)", - "instant": true, - "refId": "A" - } - ], - "title": "Failed probes", - "type": "stat" - }, - { - "datasource": { - "type": "prometheus", - "uid": "{{ grafana_prometheus_datasource_uid }}" - }, - "fieldConfig": { - "defaults": { - "color": { - "mode": "palette-classic" - }, - "unit": "ms" - }, - "overrides": [] - }, - "gridPos": { - "h": 10, - "w": 24, - "x": 0, - "y": 5 - }, - "id": 3, - "options": { - "legend": { 
- "displayMode": "list", - "placement": "bottom", - "showLegend": true - }, - "tooltip": { - "mode": "multi", - "sort": "none" - } - }, - "pluginVersion": "11.0.0", - "targets": [ - { - "expr": "probe_duration_seconds{job=\"blackbox-probes\"} * 1000", - "legendFormat": "{{ '{{' }}instance{{ '}}' }}", - "refId": "A" - } - ], - "title": "Probe duration", - "type": "timeseries" - }, - { - "datasource": { - "type": "prometheus", - "uid": "{{ grafana_prometheus_datasource_uid }}" - }, - "fieldConfig": { - "defaults": { - "mappings": [], - "thresholds": { - "mode": "absolute", - "steps": [ - { - "color": "green", - "value": null - } - ] - } - }, - "overrides": [] - }, - "gridPos": { - "h": 9, - "w": 24, - "x": 0, - "y": 15 - }, - "id": 4, - "options": { - "cellHeight": "sm", - "footer": { - "countRows": false, - "fields": "", - "reducer": [ - "sum" - ], - "show": false - }, - "showHeader": true - }, - "pluginVersion": "11.0.0", - "targets": [ - { - "editorMode": "code", - "expr": "probe_success{job=\"blackbox-probes\"}", - "format": "table", - "instant": true, - "legendFormat": "", - "refId": "A" - } - ], - "title": "Current probe state", - "transformations": [ - { - "id": "organize", - "options": { - "excludeByName": { - "Time": true, - "Value #A": false, - "__name__": true, - "job": true - }, - "renameByName": { - "Value #A": "success", - "instance": "target" - } - } - } - ], - "type": "table" - } - ], - "refresh": "30s", - "schemaVersion": 39, - "style": "dark", - "tags": [ - "homelab", - "blackbox", - "reachability" - ], - "templating": { - "list": [] - }, - "time": { - "from": "now-6h", - "to": "now" - }, - "timepicker": {}, - "timezone": "browser", - "title": "Blackbox Reachability", - "uid": "blackbox-reachability", - "version": 1, - "weekStart": "" -} diff --git a/ansible/archive/roles/monitoring_stack/templates/grafana-dashboard-provider.yml.j2 b/ansible/archive/roles/monitoring_stack/templates/grafana-dashboard-provider.yml.j2 deleted file mode 100644 index 
a56e074..0000000 --- a/ansible/archive/roles/monitoring_stack/templates/grafana-dashboard-provider.yml.j2 +++ /dev/null @@ -1,13 +0,0 @@ -apiVersion: 1 - -providers: - - name: homelab - orgId: 1 - folder: {{ grafana_dashboards_folder }} - folderUid: homelab - type: file - disableDeletion: false - allowUiUpdates: true - updateIntervalSeconds: 30 - options: - path: /etc/grafana/provisioning/dashboards/homelab diff --git a/ansible/archive/roles/monitoring_stack/templates/grafana-datasource.yml.j2 b/ansible/archive/roles/monitoring_stack/templates/grafana-datasource.yml.j2 deleted file mode 100644 index 10d2696..0000000 --- a/ansible/archive/roles/monitoring_stack/templates/grafana-datasource.yml.j2 +++ /dev/null @@ -1,37 +0,0 @@ -# roles/monitoring_stack/templates/grafana-datasource.yml.j2 -# Provision Grafana datasources for deterministic dashboard wiring. - -apiVersion: 1 -prune: true - -deleteDatasources: - - name: Prometheus - orgId: 1 - - name: prometheus - orgId: 1 - - name: Loki - orgId: 1 - - name: loki - orgId: 1 - -datasources: - - name: {{ grafana_prometheus_datasource_name }} - uid: {{ grafana_prometheus_datasource_uid }} - type: prometheus - access: proxy - url: {{ grafana_prometheus_url }} - isDefault: true - editable: true - jsonData: - httpMethod: POST - timeInterval: {{ prometheus_scrape_interval }} - - - name: {{ grafana_loki_datasource_name }} - uid: {{ grafana_loki_datasource_uid }} - type: loki - access: proxy - url: {{ grafana_loki_url }} - isDefault: false - editable: true - jsonData: - maxLines: 1000 diff --git a/ansible/archive/roles/monitoring_stack/templates/grafana-homelab-overview.json.j2 b/ansible/archive/roles/monitoring_stack/templates/grafana-homelab-overview.json.j2 deleted file mode 100644 index 412945e..0000000 --- a/ansible/archive/roles/monitoring_stack/templates/grafana-homelab-overview.json.j2 +++ /dev/null @@ -1,720 +0,0 @@ -{ - "annotations": { - "list": [ - { - "builtIn": 1, - "datasource": { - "type": "grafana", - "uid": "-- 
Grafana --" - }, - "enable": true, - "hide": true, - "iconColor": "rgba(0, 211, 255, 1)", - "name": "Annotations & Alerts", - "type": "dashboard" - } - ] - }, - "editable": true, - "fiscalYearStartMonth": 0, - "graphTooltip": 0, - "id": null, - "links": [], - "liveNow": false, - "panels": [ - { - "datasource": { - "type": "prometheus", - "uid": "{{ grafana_prometheus_datasource_uid }}" - }, - "fieldConfig": { - "defaults": { - "color": { - "mode": "thresholds" - }, - "mappings": [], - "thresholds": { - "mode": "absolute", - "steps": [ - { - "color": "red", - "value": null - }, - { - "color": "green", - "value": 1 - } - ] - } - }, - "overrides": [] - }, - "gridPos": { - "h": 5, - "w": 6, - "x": 0, - "y": 0 - }, - "id": 1, - "options": { - "colorMode": "background", - "graphMode": "none", - "justifyMode": "center", - "orientation": "auto", - "reduceOptions": { - "calcs": [ - "lastNotNull" - ], - "fields": "", - "values": false - }, - "textMode": "value" - }, - "pluginVersion": "11.0.0", - "targets": [ - { - "datasource": { - "type": "prometheus", - "uid": "{{ grafana_prometheus_datasource_uid }}" - }, - "expr": "count(count by (instance) (up{job=~\"watchtower-node|docker-hosts-node|swarm-.*-node|proxmox\"}))", - "instant": true, - "legendFormat": "", - "refId": "A" - } - ], - "title": "Monitored hosts", - "type": "stat" - }, - { - "datasource": { - "type": "prometheus", - "uid": "{{ grafana_prometheus_datasource_uid }}" - }, - "fieldConfig": { - "defaults": { - "color": { - "mode": "thresholds" - }, - "mappings": [], - "thresholds": { - "mode": "absolute", - "steps": [ - { - "color": "red", - "value": null - }, - { - "color": "green", - "value": 1 - } - ] - } - }, - "overrides": [] - }, - "gridPos": { - "h": 5, - "w": 6, - "x": 6, - "y": 0 - }, - "id": 2, - "options": { - "colorMode": "background", - "graphMode": "none", - "justifyMode": "center", - "orientation": "auto", - "reduceOptions": { - "calcs": [ - "lastNotNull" - ], - "fields": "", - "values": false - }, - 
"textMode": "value" - }, - "pluginVersion": "11.0.0", - "targets": [ - { - "datasource": { - "type": "prometheus", - "uid": "{{ grafana_prometheus_datasource_uid }}" - }, - "expr": "sum(pve_up{id=~\"node/.*\"})", - "instant": true, - "legendFormat": "", - "refId": "A" - } - ], - "title": "Proxmox nodes up", - "type": "stat" - }, - { - "datasource": { - "type": "prometheus", - "uid": "{{ grafana_prometheus_datasource_uid }}" - }, - "fieldConfig": { - "defaults": { - "color": { - "mode": "thresholds" - }, - "mappings": [], - "thresholds": { - "mode": "absolute", - "steps": [ - { - "color": "orange", - "value": null - }, - { - "color": "green", - "value": 1 - } - ] - } - }, - "overrides": [] - }, - "gridPos": { - "h": 5, - "w": 6, - "x": 12, - "y": 0 - }, - "id": 3, - "options": { - "colorMode": "background", - "graphMode": "none", - "justifyMode": "center", - "orientation": "auto", - "reduceOptions": { - "calcs": [ - "lastNotNull" - ], - "fields": "", - "values": false - }, - "textMode": "value" - }, - "pluginVersion": "11.0.0", - "targets": [ - { - "datasource": { - "type": "prometheus", - "uid": "{{ grafana_prometheus_datasource_uid }}" - }, - "expr": "count(pve_up{id=~\"qemu/.*\"})", - "instant": true, - "legendFormat": "", - "refId": "A" - } - ], - "title": "Tracked Proxmox VMs", - "type": "stat" - }, - { - "datasource": { - "type": "prometheus", - "uid": "{{ grafana_prometheus_datasource_uid }}" - }, - "fieldConfig": { - "defaults": { - "color": { - "mode": "thresholds" - }, - "mappings": [], - "thresholds": { - "mode": "absolute", - "steps": [ - { - "color": "green", - "value": null - }, - { - "color": "red", - "value": 1 - } - ] - } - }, - "overrides": [] - }, - "gridPos": { - "h": 5, - "w": 6, - "x": 18, - "y": 0 - }, - "id": 4, - "options": { - "colorMode": "background", - "graphMode": "none", - "justifyMode": "center", - "orientation": "auto", - "reduceOptions": { - "calcs": [ - "lastNotNull" - ], - "fields": "", - "values": false - }, - "textMode": "value" 
- }, - "pluginVersion": "11.0.0", - "targets": [ - { - "datasource": { - "type": "prometheus", - "uid": "{{ grafana_prometheus_datasource_uid }}" - }, - "expr": "count(probe_success{job=\"blackbox-probes\"} == 0)", - "instant": true, - "legendFormat": "", - "refId": "A" - } - ], - "title": "Probe failures", - "type": "stat" - }, - { - "datasource": { - "type": "prometheus", - "uid": "{{ grafana_prometheus_datasource_uid }}" - }, - "fieldConfig": { - "defaults": { - "color": { - "mode": "palette-classic" - }, - "custom": { - "axisBorderShow": false, - "axisCenteredZero": false, - "axisColorMode": "text", - "axisLabel": "Percent", - "axisPlacement": "auto", - "barAlignment": 0, - "drawStyle": "line", - "fillOpacity": 10, - "gradientMode": "none", - "hideFrom": { - "legend": false, - "tooltip": false, - "viz": false - }, - "insertNulls": false, - "lineInterpolation": "linear", - "lineWidth": 2, - "pointSize": 4, - "scaleDistribution": { - "type": "linear" - }, - "showPoints": "never", - "spanNulls": false, - "stacking": { - "group": "A", - "mode": "none" - }, - "thresholdsStyle": { - "mode": "off" - } - }, - "mappings": [], - "max": 100, - "min": 0, - "thresholds": { - "mode": "absolute", - "steps": [ - { - "color": "green", - "value": null - }, - { - "color": "orange", - "value": 75 - }, - { - "color": "red", - "value": 90 - } - ] - }, - "unit": "percent" - }, - "overrides": [] - }, - "gridPos": { - "h": 8, - "w": 12, - "x": 0, - "y": 5 - }, - "id": 5, - "options": { - "legend": { - "calcs": [], - "displayMode": "list", - "placement": "bottom", - "showLegend": true - }, - "tooltip": { - "mode": "multi", - "sort": "none" - } - }, - "pluginVersion": "11.0.0", - "targets": [ - { - "datasource": { - "type": "prometheus", - "uid": "{{ grafana_prometheus_datasource_uid }}" - }, - "expr": "100 * pve_disk_usage_bytes{id=~\"storage/.*\"} / pve_disk_size_bytes{id=~\"storage/.*\"}", - "legendFormat": "{{ '{{' }}id{{ '}}' }}", - "range": true, - "refId": "A" - } - ], - "title": 
"Proxmox storage utilization", - "type": "timeseries" - }, - { - "datasource": { - "type": "prometheus", - "uid": "{{ grafana_prometheus_datasource_uid }}" - }, - "fieldConfig": { - "defaults": { - "color": { - "mode": "palette-classic" - }, - "custom": { - "axisBorderShow": false, - "axisCenteredZero": false, - "axisColorMode": "text", - "axisLabel": "Percent", - "axisPlacement": "auto", - "barAlignment": 0, - "drawStyle": "line", - "fillOpacity": 10, - "gradientMode": "none", - "hideFrom": { - "legend": false, - "tooltip": false, - "viz": false - }, - "insertNulls": false, - "lineInterpolation": "linear", - "lineWidth": 2, - "pointSize": 4, - "scaleDistribution": { - "type": "linear" - }, - "showPoints": "never", - "spanNulls": false, - "stacking": { - "group": "A", - "mode": "none" - }, - "thresholdsStyle": { - "mode": "off" - } - }, - "mappings": [], - "max": 100, - "min": 0, - "thresholds": { - "mode": "absolute", - "steps": [ - { - "color": "green", - "value": null - }, - { - "color": "orange", - "value": 70 - }, - { - "color": "red", - "value": 90 - } - ] - }, - "unit": "percent" - }, - "overrides": [] - }, - "gridPos": { - "h": 8, - "w": 12, - "x": 12, - "y": 5 - }, - "id": 6, - "options": { - "legend": { - "calcs": [], - "displayMode": "list", - "placement": "bottom", - "showLegend": true - }, - "tooltip": { - "mode": "multi", - "sort": "none" - } - }, - "pluginVersion": "11.0.0", - "targets": [ - { - "datasource": { - "type": "prometheus", - "uid": "{{ grafana_prometheus_datasource_uid }}" - }, - "expr": "100 - (avg by (instance) (rate(node_cpu_seconds_total{job=~\"watchtower-node|docker-hosts-node|swarm-.*-node\", mode=\"idle\"}[5m])) * 100)", - "legendFormat": "{{ '{{' }}instance{{ '}}' }}", - "range": true, - "refId": "A" - } - ], - "title": "Host CPU busy", - "type": "timeseries" - }, - { - "datasource": { - "type": "prometheus", - "uid": "{{ grafana_prometheus_datasource_uid }}" - }, - "fieldConfig": { - "defaults": { - "color": { - "mode": 
"palette-classic" - }, - "custom": { - "axisBorderShow": false, - "axisCenteredZero": false, - "axisColorMode": "text", - "axisLabel": "Percent", - "axisPlacement": "auto", - "barAlignment": 0, - "drawStyle": "line", - "fillOpacity": 10, - "gradientMode": "none", - "hideFrom": { - "legend": false, - "tooltip": false, - "viz": false - }, - "insertNulls": false, - "lineInterpolation": "linear", - "lineWidth": 2, - "pointSize": 4, - "scaleDistribution": { - "type": "linear" - }, - "showPoints": "never", - "spanNulls": false, - "stacking": { - "group": "A", - "mode": "none" - }, - "thresholdsStyle": { - "mode": "off" - } - }, - "mappings": [], - "max": 100, - "min": 0, - "thresholds": { - "mode": "absolute", - "steps": [ - { - "color": "green", - "value": null - }, - { - "color": "orange", - "value": 75 - }, - { - "color": "red", - "value": 90 - } - ] - }, - "unit": "percent" - }, - "overrides": [] - }, - "gridPos": { - "h": 8, - "w": 12, - "x": 0, - "y": 13 - }, - "id": 7, - "options": { - "legend": { - "calcs": [], - "displayMode": "list", - "placement": "bottom", - "showLegend": true - }, - "tooltip": { - "mode": "multi", - "sort": "none" - } - }, - "pluginVersion": "11.0.0", - "targets": [ - { - "datasource": { - "type": "prometheus", - "uid": "{{ grafana_prometheus_datasource_uid }}" - }, - "expr": "100 * (1 - (node_memory_MemAvailable_bytes{job=~\"watchtower-node|docker-hosts-node|swarm-.*-node\"} / node_memory_MemTotal_bytes{job=~\"watchtower-node|docker-hosts-node|swarm-.*-node\"}))", - "legendFormat": "{{ '{{' }}instance{{ '}}' }}", - "range": true, - "refId": "A" - } - ], - "title": "Host memory used", - "type": "timeseries" - }, - { - "datasource": { - "type": "prometheus", - "uid": "{{ grafana_prometheus_datasource_uid }}" - }, - "fieldConfig": { - "defaults": { - "color": { - "mode": "palette-classic" - }, - "custom": { - "axisBorderShow": false, - "axisCenteredZero": false, - "axisColorMode": "text", - "axisLabel": "Success", - "axisPlacement": "auto", - 
"barAlignment": 0, - "drawStyle": "line", - "fillOpacity": 10, - "gradientMode": "none", - "hideFrom": { - "legend": false, - "tooltip": false, - "viz": false - }, - "insertNulls": false, - "lineInterpolation": "stepAfter", - "lineWidth": 2, - "pointSize": 4, - "scaleDistribution": { - "type": "linear" - }, - "showPoints": "never", - "spanNulls": false, - "stacking": { - "group": "A", - "mode": "none" - }, - "thresholdsStyle": { - "mode": "off" - } - }, - "mappings": [], - "max": 1, - "min": 0, - "thresholds": { - "mode": "absolute", - "steps": [ - { - "color": "red", - "value": null - }, - { - "color": "green", - "value": 1 - } - ] - } - }, - "overrides": [] - }, - "gridPos": { - "h": 8, - "w": 12, - "x": 12, - "y": 13 - }, - "id": 8, - "options": { - "legend": { - "calcs": [], - "displayMode": "list", - "placement": "bottom", - "showLegend": true - }, - "tooltip": { - "mode": "multi", - "sort": "none" - } - }, - "pluginVersion": "11.0.0", - "targets": [ - { - "datasource": { - "type": "prometheus", - "uid": "{{ grafana_prometheus_datasource_uid }}" - }, - "expr": "probe_success{job=\"blackbox-probes\"}", - "legendFormat": "{{ '{{' }}instance{{ '}}' }}", - "range": true, - "refId": "A" - } - ], - "title": "Blackbox probe success", - "type": "timeseries" - }, - { - "datasource": null, - "gridPos": { - "h": 4, - "w": 24, - "x": 0, - "y": 21 - }, - "id": 9, - "options": { - "content": "### Logs are now wired into Grafana\nUse Explore with the **{{ grafana_loki_datasource_name }}** datasource to query container logs by labels such as `container`, `project`, and `service`.\n\nStarter query: `{project=~\".+\"}`", - "mode": "markdown" - }, - "pluginVersion": "11.0.0", - "title": "Logs", - "type": "text" - } - ], - "refresh": "30s", - "schemaVersion": 39, - "style": "dark", - "tags": [ - "homelab", - "proxmox", - "swarm", - "logs" - ], - "templating": { - "list": [] - }, - "time": { - "from": "now-6h", - "to": "now" - }, - "timepicker": {}, - "timezone": "browser", - 
"title": "Homelab Platform Overview", - "uid": "homelab-platform-overview", - "version": 1, - "weekStart": "" -} diff --git a/ansible/archive/roles/monitoring_stack/templates/grafana-monitoring-coverage.json.j2 b/ansible/archive/roles/monitoring_stack/templates/grafana-monitoring-coverage.json.j2 deleted file mode 100644 index 3ba10e7..0000000 --- a/ansible/archive/roles/monitoring_stack/templates/grafana-monitoring-coverage.json.j2 +++ /dev/null @@ -1,288 +0,0 @@ -{ - "annotations": { - "list": [ - { - "builtIn": 1, - "datasource": { - "type": "grafana", - "uid": "-- Grafana --" - }, - "enable": true, - "hide": true, - "iconColor": "rgba(0, 211, 255, 1)", - "name": "Annotations & Alerts", - "type": "dashboard" - } - ] - }, - "editable": true, - "fiscalYearStartMonth": 0, - "graphTooltip": 0, - "id": null, - "links": [], - "liveNow": false, - "panels": [ - { - "datasource": { - "type": "prometheus", - "uid": "{{ grafana_prometheus_datasource_uid }}" - }, - "fieldConfig": { - "defaults": { - "color": { - "mode": "thresholds" - }, - "thresholds": { - "mode": "absolute", - "steps": [ - { - "color": "green", - "value": null - }, - { - "color": "red", - "value": 1 - } - ] - } - }, - "overrides": [] - }, - "gridPos": { - "h": 5, - "w": 8, - "x": 0, - "y": 0 - }, - "id": 1, - "options": { - "colorMode": "background", - "graphMode": "none", - "justifyMode": "center", - "orientation": "auto", - "reduceOptions": { - "calcs": [ - "lastNotNull" - ], - "fields": "", - "values": false - }, - "textMode": "value" - }, - "pluginVersion": "11.0.0", - "targets": [ - { - "expr": "count(up{job=~\"watchtower-node|docker-hosts-node|swarm-managers-node|swarm-workers-node|proxmox\"} == 0)", - "instant": true, - "refId": "A" - } - ], - "title": "Host metrics targets down", - "type": "stat" - }, - { - "datasource": { - "type": "prometheus", - "uid": "{{ grafana_prometheus_datasource_uid }}" - }, - "fieldConfig": { - "defaults": { - "color": { - "mode": "thresholds" - }, - "thresholds": { - 
"mode": "absolute", - "steps": [ - { - "color": "green", - "value": null - }, - { - "color": "red", - "value": 1 - } - ] - } - }, - "overrides": [] - }, - "gridPos": { - "h": 5, - "w": 8, - "x": 8, - "y": 0 - }, - "id": 2, - "options": { - "colorMode": "background", - "graphMode": "none", - "justifyMode": "center", - "orientation": "auto", - "reduceOptions": { - "calcs": [ - "lastNotNull" - ], - "fields": "", - "values": false - }, - "textMode": "value" - }, - "pluginVersion": "11.0.0", - "targets": [ - { - "expr": "count(up{job=~\"swarm-managers-containers|swarm-workers-containers|watchtower-containers\"} == 0)", - "instant": true, - "refId": "A" - } - ], - "title": "Container metrics targets down", - "type": "stat" - }, - { - "datasource": { - "type": "prometheus", - "uid": "{{ grafana_prometheus_datasource_uid }}" - }, - "fieldConfig": { - "defaults": { - "color": { - "mode": "thresholds" - }, - "thresholds": { - "mode": "absolute", - "steps": [ - { - "color": "green", - "value": null - }, - { - "color": "red", - "value": 1 - } - ] - } - }, - "overrides": [] - }, - "gridPos": { - "h": 5, - "w": 8, - "x": 16, - "y": 0 - }, - "id": 3, - "options": { - "colorMode": "background", - "graphMode": "none", - "justifyMode": "center", - "orientation": "auto", - "reduceOptions": { - "calcs": [ - "lastNotNull" - ], - "fields": "", - "values": false - }, - "textMode": "value" - }, - "pluginVersion": "11.0.0", - "targets": [ - { - "expr": "count(probe_success{job=\"blackbox-probes\"} == 0)", - "instant": true, - "refId": "A" - } - ], - "title": "Probe targets down", - "type": "stat" - }, - { - "datasource": { - "type": "prometheus", - "uid": "{{ grafana_prometheus_datasource_uid }}" - }, - "fieldConfig": { - "defaults": { - "mappings": [], - "thresholds": { - "mode": "absolute", - "steps": [ - { - "color": "green", - "value": null - } - ] - } - }, - "overrides": [] - }, - "gridPos": { - "h": 10, - "w": 24, - "x": 0, - "y": 5 - }, - "id": 4, - "options": { - "cellHeight": 
"sm", - "footer": { - "countRows": false, - "fields": "", - "reducer": [ - "sum" - ], - "show": false - }, - "showHeader": true - }, - "pluginVersion": "11.0.0", - "targets": [ - { - "editorMode": "code", - "expr": "up{job=~\"watchtower-node|docker-hosts-node|swarm-managers-node|swarm-workers-node|swarm-managers-containers|swarm-workers-containers|watchtower-containers\"}", - "format": "table", - "instant": true, - "refId": "A" - } - ], - "title": "Active scrape targets", - "transformations": [ - { - "id": "organize", - "options": { - "excludeByName": { - "Time": true, - "Value #A": false, - "__name__": true - }, - "renameByName": { - "Value #A": "up" - } - } - } - ], - "type": "table" - } - ], - "refresh": "30s", - "schemaVersion": 39, - "style": "dark", - "tags": [ - "homelab", - "coverage", - "targets" - ], - "templating": { - "list": [] - }, - "time": { - "from": "now-6h", - "to": "now" - }, - "timepicker": {}, - "timezone": "browser", - "title": "Monitoring Coverage", - "uid": "monitoring-coverage", - "version": 1, - "weekStart": "" -} diff --git a/ansible/archive/roles/monitoring_stack/templates/grafana-swarm-health.json.j2 b/ansible/archive/roles/monitoring_stack/templates/grafana-swarm-health.json.j2 deleted file mode 100644 index a853702..0000000 --- a/ansible/archive/roles/monitoring_stack/templates/grafana-swarm-health.json.j2 +++ /dev/null @@ -1,307 +0,0 @@ -{ - "annotations": { - "list": [ - { - "builtIn": 1, - "datasource": { - "type": "grafana", - "uid": "-- Grafana --" - }, - "enable": true, - "hide": true, - "iconColor": "rgba(0, 211, 255, 1)", - "name": "Annotations & Alerts", - "type": "dashboard" - } - ] - }, - "editable": true, - "fiscalYearStartMonth": 0, - "graphTooltip": 0, - "id": null, - "links": [], - "liveNow": false, - "panels": [ - { - "datasource": { - "type": "prometheus", - "uid": "{{ grafana_prometheus_datasource_uid }}" - }, - "fieldConfig": { - "defaults": { - "color": { - "mode": "thresholds" - }, - "thresholds": { - "mode": 
"absolute", - "steps": [ - { - "color": "red", - "value": null - }, - { - "color": "green", - "value": 1 - } - ] - } - }, - "overrides": [] - }, - "gridPos": { - "h": 5, - "w": 8, - "x": 0, - "y": 0 - }, - "id": 1, - "options": { - "colorMode": "background", - "graphMode": "none", - "justifyMode": "center", - "orientation": "auto", - "reduceOptions": { - "calcs": [ - "lastNotNull" - ], - "fields": "", - "values": false - }, - "textMode": "value" - }, - "pluginVersion": "11.0.0", - "targets": [ - { - "expr": "sum(up{job=~\"swarm-managers-node|swarm-workers-node\"})", - "instant": true, - "refId": "A" - } - ], - "title": "Swarm nodes up", - "type": "stat" - }, - { - "datasource": { - "type": "prometheus", - "uid": "{{ grafana_prometheus_datasource_uid }}" - }, - "fieldConfig": { - "defaults": { - "color": { - "mode": "thresholds" - }, - "thresholds": { - "mode": "absolute", - "steps": [ - { - "color": "red", - "value": null - }, - { - "color": "green", - "value": 1 - } - ] - } - }, - "overrides": [] - }, - "gridPos": { - "h": 5, - "w": 8, - "x": 8, - "y": 0 - }, - "id": 2, - "options": { - "colorMode": "background", - "graphMode": "none", - "justifyMode": "center", - "orientation": "auto", - "reduceOptions": { - "calcs": [ - "lastNotNull" - ], - "fields": "", - "values": false - }, - "textMode": "value" - }, - "pluginVersion": "11.0.0", - "targets": [ - { - "expr": "count(up{job=~\"swarm-managers-node|swarm-workers-node\"} == 0)", - "instant": true, - "refId": "A" - } - ], - "title": "Swarm node exporters down", - "type": "stat" - }, - { - "datasource": { - "type": "prometheus", - "uid": "{{ grafana_prometheus_datasource_uid }}" - }, - "fieldConfig": { - "defaults": { - "color": { - "mode": "thresholds" - }, - "thresholds": { - "mode": "absolute", - "steps": [ - { - "color": "green", - "value": null - }, - { - "color": "red", - "value": 1 - } - ] - } - }, - "overrides": [] - }, - "gridPos": { - "h": 5, - "w": 8, - "x": 16, - "y": 0 - }, - "id": 3, - "options": { - 
"colorMode": "background", - "graphMode": "none", - "justifyMode": "center", - "orientation": "auto", - "reduceOptions": { - "calcs": [ - "lastNotNull" - ], - "fields": "", - "values": false - }, - "textMode": "value" - }, - "pluginVersion": "11.0.0", - "targets": [ - { - "expr": "count(up{job=~\"swarm-managers-containers|swarm-workers-containers\"} == 0)", - "instant": true, - "refId": "A" - } - ], - "title": "Swarm cAdvisor targets down", - "type": "stat" - }, - { - "datasource": { - "type": "prometheus", - "uid": "{{ grafana_prometheus_datasource_uid }}" - }, - "fieldConfig": { - "defaults": { - "color": { - "mode": "palette-classic" - }, - "unit": "percent" - }, - "overrides": [] - }, - "gridPos": { - "h": 9, - "w": 12, - "x": 0, - "y": 5 - }, - "id": 4, - "options": { - "legend": { - "displayMode": "list", - "placement": "bottom", - "showLegend": true - }, - "tooltip": { - "mode": "multi", - "sort": "none" - } - }, - "pluginVersion": "11.0.0", - "targets": [ - { - "expr": "100 - (avg by (instance) (rate(node_cpu_seconds_total{job=~\"swarm-managers-node|swarm-workers-node\", mode=\"idle\"}[5m])) * 100)", - "legendFormat": "{{ '{{' }}instance{{ '}}' }}", - "refId": "A" - } - ], - "title": "Swarm host CPU busy", - "type": "timeseries" - }, - { - "datasource": { - "type": "prometheus", - "uid": "{{ grafana_prometheus_datasource_uid }}" - }, - "fieldConfig": { - "defaults": { - "color": { - "mode": "palette-classic" - }, - "unit": "percent" - }, - "overrides": [] - }, - "gridPos": { - "h": 9, - "w": 12, - "x": 12, - "y": 5 - }, - "id": 5, - "options": { - "legend": { - "displayMode": "list", - "placement": "bottom", - "showLegend": true - }, - "tooltip": { - "mode": "multi", - "sort": "none" - } - }, - "pluginVersion": "11.0.0", - "targets": [ - { - "expr": "100 * (1 - (node_memory_MemAvailable_bytes{job=~\"swarm-managers-node|swarm-workers-node\"} / node_memory_MemTotal_bytes{job=~\"swarm-managers-node|swarm-workers-node\"}))", - "legendFormat": "{{ '{{' 
}}instance{{ '}}' }}", - "refId": "A" - } - ], - "title": "Swarm host memory used", - "type": "timeseries" - } - ], - "refresh": "30s", - "schemaVersion": 39, - "style": "dark", - "tags": [ - "homelab", - "swarm", - "health" - ], - "templating": { - "list": [] - }, - "time": { - "from": "now-6h", - "to": "now" - }, - "timepicker": {}, - "timezone": "browser", - "title": "Swarm Health", - "uid": "swarm-health", - "version": 1, - "weekStart": "" -} diff --git a/ansible/archive/roles/monitoring_stack/templates/loki-config.yml.j2 b/ansible/archive/roles/monitoring_stack/templates/loki-config.yml.j2 deleted file mode 100644 index a118ea4..0000000 --- a/ansible/archive/roles/monitoring_stack/templates/loki-config.yml.j2 +++ /dev/null @@ -1,55 +0,0 @@ ---- -# roles/monitoring_stack/templates/loki-config.yml.j2 -# Loki configuration for centralized log aggregation - -# === CONCEPT: Loki vs Traditional Logging === -# Traditional: Index entire log content (expensive, slow) -# Loki: Index only metadata labels (cheap, fast) -# Think of Loki as "Prometheus for logs" - -auth_enabled: false - -server: - http_listen_port: {{ loki_port }} - grpc_listen_port: 9096 - -common: - path_prefix: /loki - storage: - filesystem: - chunks_directory: /loki/chunks - rules_directory: /loki/rules - replication_factor: 1 - ring: - instance_addr: 127.0.0.1 - kvstore: - store: inmemory - -schema_config: - configs: - - from: 2024-01-01 - store: tsdb - object_store: filesystem - schema: v13 - index: - prefix: index_ - period: 24h - -# === RETENTION: Automatic Log Cleanup === -# Keeps logs for {{ loki_retention }} (default: 7 days) -# Prevents disk from filling up with old logs -limits_config: - retention_period: {{ loki_retention }} - reject_old_samples: true - reject_old_samples_max_age: 168h - -# === PRO-TIP: Query Optimization === -# Limit concurrent queries to prevent overload -query_scheduler: - max_outstanding_requests_per_tenant: 100 - -# === LABEL EXTRACTION === -# Automatically extract 
structured fields from logs -# Example: {"level":"error"} becomes label level="error" -ruler: - alertmanager_url: http://localhost:9093 diff --git a/ansible/archive/roles/monitoring_stack/templates/prometheus.yml.j2 b/ansible/archive/roles/monitoring_stack/templates/prometheus.yml.j2 deleted file mode 100644 index b06c2cb..0000000 --- a/ansible/archive/roles/monitoring_stack/templates/prometheus.yml.j2 +++ /dev/null @@ -1,156 +0,0 @@ ---- -# roles/monitoring_stack/templates/prometheus.yml.j2 -# Prometheus configuration with dynamic swarm cluster discovery - -global: - scrape_interval: {{ prometheus_scrape_interval }} - evaluation_interval: {{ prometheus_scrape_interval }} - external_labels: - cluster: 'homelab' - environment: 'production' - -# === BEST PRACTICE: Alerting Rules === -# Separate alert rules into external files for maintainability -rule_files: - - '/etc/prometheus/alerts/*.yml' - -# === CONCEPT: Scrape Configs === -# Each job defines a set of targets to monitor -# Prometheus will scrape /metrics from each endpoint -scrape_configs: - # Monitor Prometheus itself (meta-monitoring) - - job_name: 'prometheus' - static_configs: - - targets: ['localhost:{{ prometheus_port }}'] - labels: - role: 'monitoring' - host: 'watchtower' - - # === WATCHTOWER NODE METRICS === - - job_name: 'watchtower-node' - static_configs: - - targets: ['node-exporter:9100'] - labels: - role: 'controller' - host: 'watchtower' - - # === WATCHTOWER LOCAL CONTAINER METRICS === - - job_name: 'watchtower-containers' - static_configs: - - targets: ['watchtower-cadvisor:8080'] - labels: - role: 'controller' - host: 'watchtower' - metric_source: 'cadvisor' - - # === SWARM MANAGER NODE METRICS === - # Generated dynamically from [swarm_managers] inventory group - - job_name: 'swarm-managers-node' - static_configs: - - targets: -{% for host in groups['swarm_managers'] %} - - '{{ hostvars[host].ansible_host }}:9100' -{% endfor %} - labels: - role: 'manager' - cluster: 'swarm' - - # === SWARM 
WORKER NODE METRICS === - - job_name: 'swarm-workers-node' - static_configs: - - targets: -{% for host in groups['swarm_workers'] %} - - '{{ hostvars[host].ansible_host }}:9100' -{% endfor %} - labels: - role: 'worker' - cluster: 'swarm' - - # === CONTAINER METRICS (cAdvisor) === - - job_name: 'swarm-managers-containers' - static_configs: - - targets: -{% for host in groups['swarm_managers'] %} - - '{{ hostvars[host].ansible_host }}:8080' -{% endfor %} - labels: - role: 'manager' - cluster: 'swarm' - - - job_name: 'swarm-workers-containers' - static_configs: - - targets: -{% for host in groups['swarm_workers'] %} - - '{{ hostvars[host].ansible_host }}:8080' -{% endfor %} - labels: - role: 'worker' - cluster: 'swarm' - - # === PRO-TIP: Docker Hosts === - # Monitor standalone Docker hosts (heimdall, waldorf) -{% if groups['docker_hosts'] is defined %} - - job_name: 'docker-hosts-node' - static_configs: - - targets: -{% for host in groups['docker_hosts'] %} - - '{{ hostvars[host].ansible_host }}:9100' -{% endfor %} - labels: - role: 'standalone' -{% endif %} - - # === BLACKBOX PROBES (NETWORK / ENDPOINT HEALTH) === - - job_name: 'blackbox-probes' - metrics_path: /probe - params: - module: [http_2xx] - static_configs: -{% for probe in monitoring_probe_targets %} - - targets: ['{{ probe.target }}'] - labels: - probe_name: '{{ probe.name }}' - module: '{{ probe.module }}' -{% endfor %} - relabel_configs: - - source_labels: [__address__] - target_label: __param_target - - source_labels: [module] - target_label: __param_module - - source_labels: [__param_target] - target_label: instance - - target_label: __address__ - replacement: 'blackbox-exporter:{{ blackbox_port }}' - - # === PROXMOX CLUSTER METRICS (via pve_exporter) === - # pve_exporter authenticates to the Proxmox API using a read-only PVEAuditor token. - # Each PVE node is passed as ?target= and the request is routed through the exporter. 
- - job_name: 'proxmox' - metrics_path: /pve - params: - module: [default] - static_configs: - - targets: -{% for host in groups['proxmox_cluster'] %} - - '{{ hostvars[host].ansible_host }}' -{% endfor %} - labels: - cluster: 'pve' - relabel_configs: - - source_labels: [__address__] - target_label: __param_target - - source_labels: [__param_target] - target_label: instance - - target_label: __address__ - replacement: 'pve-exporter:9221' - - # === FUTURE: Swarm Service Discovery === - # Uncomment to enable automatic discovery of swarm services - # Requires Docker API to be exposed on managers - # - job_name: 'swarm-services' - # dockerswarm_sd_configs: - # - host: unix:///var/run/docker.sock - # role: tasks - # relabel_configs: - # - source_labels: [__meta_dockerswarm_service_name] - # target_label: service diff --git a/ansible/archive/roles/monitoring_stack/templates/promtail-config.yml.j2 b/ansible/archive/roles/monitoring_stack/templates/promtail-config.yml.j2 deleted file mode 100644 index bed8851..0000000 --- a/ansible/archive/roles/monitoring_stack/templates/promtail-config.yml.j2 +++ /dev/null @@ -1,39 +0,0 @@ ---- -# roles/monitoring_stack/templates/promtail-config.yml.j2 -# Promtail: Log shipper that sends logs to Loki - -server: - http_listen_port: 9080 - grpc_listen_port: 0 - -# === WHERE TO SEND LOGS === -clients: - - url: http://{{ watchtower_ip }}:{{ loki_port }}/loki/api/v1/push - -# === WHAT LOGS TO COLLECT === -scrape_configs: - # Collect logs from all Docker containers - - job_name: docker - docker_sd_configs: - - host: unix:///var/run/docker.sock - refresh_interval: 5s - relabel_configs: - # === LABEL: Extract container name === - - source_labels: ['__meta_docker_container_name'] - target_label: 'container' - # === LABEL: Extract compose project === - - source_labels: ['__meta_docker_container_label_com_docker_compose_project'] - target_label: 'project' - # === LABEL: Extract service name === - - source_labels: 
['__meta_docker_container_label_com_docker_compose_service']
-        target_label: 'service'
-
-  # === PRO-TIP: Add System Logs ===
-  # Uncomment to collect syslog entries
-  # - job_name: syslog
-  #   static_configs:
-  #     - targets:
-  #         - localhost
-  #       labels:
-  #         job: varlogs
-  #         __path__: /var/log/*log
diff --git a/ansible/archive/roles/monitoring_stack/templates/pve-exporter.yml.j2 b/ansible/archive/roles/monitoring_stack/templates/pve-exporter.yml.j2
deleted file mode 100644
index 705407d..0000000
--- a/ansible/archive/roles/monitoring_stack/templates/pve-exporter.yml.j2
+++ /dev/null
@@ -1,16 +0,0 @@
-# roles/monitoring_stack/templates/pve-exporter.yml.j2
-# pve_exporter credentials configuration
-#
-# This file is rendered with mode 0600 (owner-read only).
-# vault_pve_exporter_token must be set in group_vars/vault/all.yml.
-#
-# To create the required read-only API token on any Proxmox node:
-#   pveum user add pve-exporter@pve --comment "Prometheus read-only exporter"
-#   pveum aclmod / -user pve-exporter@pve -role PVEAuditor
-#   pveum user token add pve-exporter@pve monitoring --privsep 0
-
-default:
-  user: pve-exporter@pve
-  token_name: {{ pve_exporter_token_name }}
-  token_value: {{ pve_exporter_token }}
-  verify_ssl: {{ pve_exporter_verify_ssl | lower }}
diff --git a/ansible/archive/roles/proxmox_cluster_reconcile_v2/defaults/main.yml b/ansible/archive/roles/proxmox_cluster_reconcile_v2/defaults/main.yml
deleted file mode 100644
index 9a3e64d..0000000
--- a/ansible/archive/roles/proxmox_cluster_reconcile_v2/defaults/main.yml
+++ /dev/null
@@ -1,23 +0,0 @@
----
-# Role defaults for clean cluster reconcile workflow.
-
-# INITIATING ARCHITECT context
-cluster_project_name: "node-replacement-mar13-2026"
-
-# Mode: validate (read-only), join (join node if needed), auto (join+validate)
-cluster_mode: "auto"
-
-# Target node to join existing cluster and anchor node already in quorum.
-join_node: "pve01"
-join_target_host: "pve02"
-join_target_ip: ""
-
-# Service behavior after successful join.
-cluster_enable_ha_services: true
-cluster_join_allow_existing_guests: true
-
-# Optional force path. Normally keep false for strict idempotency.
-cluster_force_rejoin: false
-
-# Artifact output path on controller.
-cluster_reconcile_output_root: "{{ playbook_dir }}/../../outputs/cluster-reconcile"
diff --git a/ansible/archive/roles/proxmox_cluster_reconcile_v2/tasks/join_node.yml b/ansible/archive/roles/proxmox_cluster_reconcile_v2/tasks/join_node.yml
deleted file mode 100644
index 0d055a8..0000000
--- a/ansible/archive/roles/proxmox_cluster_reconcile_v2/tasks/join_node.yml
+++ /dev/null
@@ -1,75 +0,0 @@
----
-- name: Compute whether join node already appears in anchor node list
-  ansible.builtin.set_fact:
-    proxmox_cluster_reconcile_v2_join_needed: >-
-      {{
-        cluster_force_rejoin | bool or
-        (join_node not in (proxmox_cluster_reconcile_v2_anchor_node_names | default([])))
-      }}
-
-- name: Ensure pve-cluster is enabled and running on join node before add
-  ansible.builtin.systemd:
-    name: pve-cluster
-    enabled: true
-    state: started
-  become: true
-  delegate_to: "{{ join_node }}"
-  when: proxmox_cluster_reconcile_v2_join_needed | bool
-
-- name: Ensure join node hostname resolves to current management IP locally
-  ansible.builtin.lineinfile:
-    path: /etc/hosts
-    regexp: '^\s*\d+\.\d+\.\d+\.\d+\s+{{ join_node }}\.local\s+{{ join_node }}\s*$'
-    line: "{{ proxmox_cluster_reconcile_v2_join_node_ip }} {{ join_node }}.local {{ join_node }}"
-    state: present
-  become: true
-  delegate_to: "{{ join_node }}"
-  when: proxmox_cluster_reconcile_v2_join_needed | bool
-
-- name: Join node to existing cluster via anchor
-  ansible.builtin.command: >-
-    pvecm add {{ proxmox_cluster_reconcile_v2_anchor_ip }} --use_ssh 1
-    {% if cluster_join_allow_existing_guests | bool %} --force 1{% endif %}
-  register: proxmox_cluster_reconcile_v2_join_cmd
-  become: true
-  delegate_to: "{{ join_node }}"
-  changed_when: >-
-    'successfully added node' in
-    ((proxmox_cluster_reconcile_v2_join_cmd.stdout | default('')) | lower)
-  failed_when:
-    - proxmox_cluster_reconcile_v2_join_cmd.rc != 0
-    - "'already defined' not in (proxmox_cluster_reconcile_v2_join_cmd.stderr | default(''))"
-  when: proxmox_cluster_reconcile_v2_join_needed | bool
-
-- name: Wait for corosync configuration to appear after join
-  ansible.builtin.wait_for:
-    path: /etc/pve/corosync.conf
-    timeout: 60
-  become: true
-  delegate_to: "{{ join_node }}"
-  when: proxmox_cluster_reconcile_v2_join_needed | bool
-
-- name: Ensure corosync and pve-ha-lrm are running on joined node
-  ansible.builtin.systemd:
-    name: "{{ item }}"
-    enabled: true
-    state: started
-  become: true
-  delegate_to: "{{ join_node }}"
-  loop:
-    - corosync
-    - pve-ha-lrm
-  when:
-    - cluster_enable_ha_services | bool
-    - proxmox_cluster_reconcile_v2_join_needed | bool
-
-- name: Ensure pve-ha-crm is enabled on joined node
-  ansible.builtin.systemd:
-    name: pve-ha-crm
-    enabled: true
-    state: started
-  become: true
-  delegate_to: "{{ join_node }}"
-  when:
-    - cluster_enable_ha_services | bool
-    - proxmox_cluster_reconcile_v2_join_needed | bool
diff --git a/ansible/archive/roles/proxmox_cluster_reconcile_v2/tasks/main.yml b/ansible/archive/roles/proxmox_cluster_reconcile_v2/tasks/main.yml
deleted file mode 100644
index 1f4f7d6..0000000
--- a/ansible/archive/roles/proxmox_cluster_reconcile_v2/tasks/main.yml
+++ /dev/null
@@ -1,25 +0,0 @@
----
-- name: Validate reconcile mode input
-  ansible.builtin.assert:
-    that:
-      - cluster_mode in ['validate', 'join', 'auto']
-    fail_msg: "cluster_mode must be one of: validate, join, auto"
-
-- name: Build output paths and run context
-  ansible.builtin.set_fact:
-    proxmox_cluster_reconcile_v2_timestamp: "{{ lookup('pipe', 'date +%Y%m%dT%H%M%S') }}"
-    proxmox_cluster_reconcile_v2_output_dir: "{{ cluster_reconcile_output_root }}/{{ cluster_project_name | regex_replace('[^a-zA-Z0-9_-]', '_') }}-{{ lookup('pipe', 'date +%Y%m%dT%H%M%S') }}"
-
-- name: Include preflight checks
-  ansible.builtin.include_tasks: preflight.yml
-
-- name: Include root SSH trust tasks for join path
-  ansible.builtin.include_tasks: root_ssh_trust.yml
-  when: cluster_mode in ['join', 'auto']
-
-- name: Include guarded join tasks
-  ansible.builtin.include_tasks: join_node.yml
-  when: cluster_mode in ['join', 'auto']
-
-- name: Include post-check validation tasks
-  ansible.builtin.include_tasks: validate.yml
diff --git a/ansible/archive/roles/proxmox_cluster_reconcile_v2/tasks/preflight.yml b/ansible/archive/roles/proxmox_cluster_reconcile_v2/tasks/preflight.yml
deleted file mode 100644
index b70f19f..0000000
--- a/ansible/archive/roles/proxmox_cluster_reconcile_v2/tasks/preflight.yml
+++ /dev/null
@@ -1,80 +0,0 @@
----
-- name: Assert proxmox_cluster inventory group is populated
-  ansible.builtin.assert:
-    that:
-      - groups['proxmox_cluster'] is defined
-      - groups['proxmox_cluster'] | length >= 2
-    fail_msg: "Inventory group proxmox_cluster must contain at least two nodes."
-
-- name: Assert join node and anchor node are valid and distinct
-  ansible.builtin.assert:
-    that:
-      - join_node in groups['proxmox_cluster']
-      - join_target_host in groups['proxmox_cluster']
-      - join_node != join_target_host
-    fail_msg: "join_node and join_target_host must both be valid proxmox_cluster hosts and cannot be identical."
-
-- name: Resolve join anchor IP
-  ansible.builtin.set_fact:
-    proxmox_cluster_reconcile_v2_anchor_ip: >-
-      {{
-        join_target_ip | trim
-        if join_target_ip | trim | length > 0
-        else (hostvars[join_target_host].ansible_host | default(join_target_host))
-      }}
-    proxmox_cluster_reconcile_v2_join_node_ip: "{{ hostvars[join_node].ansible_host | default(join_node) }}"
-
-- name: Ensure join anchor SSH is reachable from controller
-  ansible.builtin.wait_for:
-    host: "{{ proxmox_cluster_reconcile_v2_anchor_ip }}"
-    port: 22
-    timeout: 10
-    connect_timeout: 3
-    state: started
-  delegate_to: localhost
-
-- name: Capture cluster status from anchor
-  ansible.builtin.command: pvecm status
-  register: proxmox_cluster_reconcile_v2_anchor_status
-  changed_when: false
-  become: true
-  delegate_to: "{{ join_target_host }}"
-
-- name: Assert anchor cluster is quorate before join
-  ansible.builtin.assert:
-    that:
-      - proxmox_cluster_reconcile_v2_anchor_status.rc == 0
-      - proxmox_cluster_reconcile_v2_anchor_status.stdout is search('Quorate:\\s+Yes')
-    fail_msg: >-
-      Anchor node {{ join_target_host }} is not quorate. Resolve quorum before joining {{ join_node }}.
-
-- name: Capture current cluster nodes from anchor
-  ansible.builtin.command: pvecm nodes
-  register: proxmox_cluster_reconcile_v2_anchor_nodes
-  changed_when: false
-  become: true
-  delegate_to: "{{ join_target_host }}"
-
-- name: Extract current cluster node names from anchor
-  ansible.builtin.shell: "pvecm nodes | awk 'NR>2 && NF>=3 {print $3}'"
-  register: proxmox_cluster_reconcile_v2_anchor_node_names_cmd
-  changed_when: false
-  become: true
-  delegate_to: "{{ join_target_host }}"
-
-- name: Build normalized anchor node name list
-  ansible.builtin.set_fact:
-    proxmox_cluster_reconcile_v2_anchor_node_names: >-
-      {{
-        proxmox_cluster_reconcile_v2_anchor_node_names_cmd.stdout_lines
-        | map('trim')
-        | reject('equalto', '')
-        | list
-      }}
-
-- name: Capture join node cluster membership file state
-  ansible.builtin.stat:
-    path: /etc/pve/corosync.conf
-  register: proxmox_cluster_reconcile_v2_join_corosync_file
-  become: true
-  delegate_to: "{{ join_node }}"
diff --git a/ansible/archive/roles/proxmox_cluster_reconcile_v2/tasks/root_ssh_trust.yml b/ansible/archive/roles/proxmox_cluster_reconcile_v2/tasks/root_ssh_trust.yml
deleted file mode 100644
index d82d92b..0000000
--- a/ansible/archive/roles/proxmox_cluster_reconcile_v2/tasks/root_ssh_trust.yml
+++ /dev/null
@@ -1,31 +0,0 @@
----
-- name: Ensure root SSH key exists on join node
-  ansible.builtin.command: ssh-keygen -t ed25519 -f /root/.ssh/id_ed25519 -N ""
-  args:
-    creates: /root/.ssh/id_ed25519
-  become: true
-  delegate_to: "{{ join_node }}"
-
-- name: Read root public key from join node
-  ansible.builtin.slurp:
-    src: /root/.ssh/id_ed25519.pub
-  register: proxmox_cluster_reconcile_v2_join_root_pubkey
-  become: true
-  delegate_to: "{{ join_node }}"
-
-- name: Authorize join node root public key on anchor node
-  ansible.posix.authorized_key:
-    user: root
-    key: "{{ proxmox_cluster_reconcile_v2_join_root_pubkey.content | b64decode }}"
-    state: present
-  become: true
-  delegate_to: "{{ join_target_host }}"
-
-- name: Ensure join node trusts anchor host key
-  ansible.builtin.known_hosts:
-    path: /root/.ssh/known_hosts
-    name: "{{ proxmox_cluster_reconcile_v2_anchor_ip }}"
-    key: "{{ lookup('pipe', 'ssh-keyscan -H ' ~ proxmox_cluster_reconcile_v2_anchor_ip ~ ' 2>/dev/null') }}"
-    state: present
-  become: true
-  delegate_to: "{{ join_node }}"
diff --git a/ansible/archive/roles/proxmox_cluster_reconcile_v2/tasks/validate.yml b/ansible/archive/roles/proxmox_cluster_reconcile_v2/tasks/validate.yml
deleted file mode 100644
index 491f448..0000000
--- a/ansible/archive/roles/proxmox_cluster_reconcile_v2/tasks/validate.yml
+++ /dev/null
@@ -1,60 +0,0 @@
----
-- name: Capture final cluster node table from anchor
-  ansible.builtin.command: pvecm nodes
-  register: proxmox_cluster_reconcile_v2_final_nodes
-  changed_when: false
-  become: true
-  delegate_to: "{{ join_target_host }}"
-
-- name: Capture final cluster status from anchor
-  ansible.builtin.command: pvecm status
-  register: proxmox_cluster_reconcile_v2_final_status
-  changed_when: false
-  become: true
-  delegate_to: "{{ join_target_host }}"
-
-- name: Assert join node is present and cluster is quorate
-  ansible.builtin.assert:
-    that:
-      - proxmox_cluster_reconcile_v2_final_nodes.stdout is search('\\b' ~ join_node ~ '\\b')
-      - proxmox_cluster_reconcile_v2_final_status.stdout is search('Quorate:\\s+Yes')
-    fail_msg: >-
-      Cluster validation failed. {{ join_node }} is missing or quorum is not healthy.
-    success_msg: "Cluster validation passed."
-
-- name: Capture service state on joined node
-  ansible.builtin.command: systemctl is-active pve-cluster corosync pve-ha-lrm
-  register: proxmox_cluster_reconcile_v2_join_service_state
-  changed_when: false
-  failed_when: false
-  become: true
-  delegate_to: "{{ join_node }}"
-
-- name: Create output directory for reconciliation artifact
-  ansible.builtin.file:
-    path: "{{ proxmox_cluster_reconcile_v2_output_dir }}"
-    state: directory
-    mode: '0755'
-  delegate_to: localhost
-
-- name: Write cluster reconciliation summary artifact
-  ansible.builtin.copy:
-    dest: "{{ proxmox_cluster_reconcile_v2_output_dir }}/cluster-reconcile-summary.txt"
-    mode: '0644'
-    content: |
-      Project: {{ cluster_project_name }}
-      Mode: {{ cluster_mode }}
-      Join node: {{ join_node }}
-      Join anchor host: {{ join_target_host }}
-      Join anchor IP: {{ proxmox_cluster_reconcile_v2_anchor_ip }}
-      Timestamp: {{ proxmox_cluster_reconcile_v2_timestamp }}
-
-      === pvecm nodes (anchor) ===
-      {{ proxmox_cluster_reconcile_v2_final_nodes.stdout | default('') }}
-
-      === pvecm status (anchor) ===
-      {{ proxmox_cluster_reconcile_v2_final_status.stdout | default('') }}
-
-      === service state on join node ===
-      {{ proxmox_cluster_reconcile_v2_join_service_state.stdout | default('unknown') }}
-  delegate_to: localhost
diff --git a/ansible/archive/roles/proxmox_node_replacement/defaults/main.yml b/ansible/archive/roles/proxmox_node_replacement/defaults/main.yml
deleted file mode 100644
index 71aa307..0000000
--- a/ansible/archive/roles/proxmox_node_replacement/defaults/main.yml
+++ /dev/null
@@ -1,56 +0,0 @@
----
-# Safe defaults for physical node replacement workflow.
-replacement_project_name: ""
-
-# Logical identity being preserved.
-replacement_old_logical_host: "pve01"
-replacement_old_ip: "10.0.0.201"
-
-# Physical donor host that will take over pve01 identity.
-replacement_new_physical_host: "pve04"
-replacement_new_physical_ip: "10.0.0.204"
-
-# Swarm node identities tied to current pve01.
-replacement_swarm_manager_name: "swarm-manager-1"
-replacement_swarm_worker_name: "swarm-worker-1"
-
-# Automation behavior switches.
-replacement_capture_baseline: true
-replacement_execute_cutover: false
-replacement_poweroff_old_host: false
-replacement_confirm_phrase: ""
-replacement_skip_runtime_checks: false
-replacement_old_host_may_be_offline: false
-
-# Phase 2 controls: rebuild and swarm rejoin on replacement host.
-replacement_phase2_rebuild_and_rejoin: false
-replacement_manage_existing_swarm_nodes: true
-replacement_overwrite_existing_vmids: false
-replacement_swarm_seed_manager: ""
-
-# Phase 3 controls: source-of-truth identity cutover with rollback backups.
-replacement_phase3_identity_cutover: false
-replacement_remove_new_physical_from_cluster: true
-replacement_inventory_file_path: "{{ playbook_dir }}/../../inventory/hosts.ini"
-replacement_group_vars_file_path: "{{ playbook_dir }}/../../group_vars/all.yml"
-
-# Phase 4 controls: post-cutover validation gates before any shutdown action.
-replacement_phase4_validate_cutover: true
-replacement_phase4_validation_urls: []
-replacement_phase4_url_timeout_seconds: 8
-
-# VM and cloud-init defaults for replacement manager/worker identities.
-replacement_template_vmid: 9004
-replacement_manager_vmid: 101
-replacement_worker_vmid: 102
-replacement_vm_memory_mb: 4096
-replacement_vm_cores: 2
-replacement_vm_user: "chester"
-replacement_vm_ssh_key_path: "/home/chester/.ssh/id_ed25519.pub"
-replacement_network_cidr: "24"
-replacement_gateway_ip: "10.0.0.2"
-replacement_dns_primary: "10.0.0.2"
-replacement_search_domain: "local"
-
-# Controller paths.
-replacement_output_root: "{{ playbook_dir }}/../../outputs/node-replacement"
diff --git a/ansible/archive/roles/proxmox_node_replacement/tasks/main.yml b/ansible/archive/roles/proxmox_node_replacement/tasks/main.yml
deleted file mode 100644
index 6f7bea3..0000000
--- a/ansible/archive/roles/proxmox_node_replacement/tasks/main.yml
+++ /dev/null
@@ -1,206 +0,0 @@
----
-- name: Validate required replacement inputs
-  ansible.builtin.assert:
-    that:
-      - replacement_project_name | trim | length > 0
-      - replacement_old_logical_host in groups['proxmox_cluster']
-      - replacement_phase2_rebuild_and_rejoin | bool == false or replacement_new_physical_host in groups['proxmox_cluster']
-      - replacement_swarm_manager_name in groups['swarm_managers']
-      - replacement_swarm_worker_name in groups['swarm_workers']
-    fail_msg: >-
-      Missing replacement inputs or inventory groups. Ensure project name is set and
-      proxmox/swarm host groups contain the expected hosts.
-    success_msg: "Replacement input validation passed."
-
-- name: Build replacement context values
-  ansible.builtin.set_fact:
-    proxmox_node_replacement_timestamp: "{{ lookup('pipe', 'date +%Y%m%dT%H%M%S') }}"
-    proxmox_node_replacement_output_dir: "{{ replacement_output_root }}/{{ replacement_project_name | regex_replace('[^a-zA-Z0-9_-]', '_') }}-{{ lookup('pipe', 'date +%Y%m%dT%H%M%S') }}"
-
-- name: Print replacement plan summary
-  ansible.builtin.debug:
-    msg:
-      - "Project: {{ replacement_project_name }}"
-      - "Logical identity: {{ replacement_old_logical_host }} ({{ replacement_old_ip }})"
-      - "Replacement hardware: {{ replacement_new_physical_host }} ({{ replacement_new_physical_ip }})"
-      - "Swarm identities: {{ replacement_swarm_manager_name }}, {{ replacement_swarm_worker_name }}"
-      - "Execute cutover: {{ replacement_execute_cutover }}"
-      - "Power off old host: {{ replacement_poweroff_old_host }}"
-
-- name: Preflight network reachability from control node
-  ansible.builtin.wait_for:
-    host: >-
-      {{
-        replacement_new_physical_ip
-        if item == replacement_new_physical_host and (hostvars[item] is not defined)
-        else (hostvars[item].ansible_host | default(item))
-      }}
-    port: 22
-    timeout: 5
-    connect_timeout: 2
-    state: started
-  delegate_to: localhost
-  loop:
-    - "{{ replacement_old_logical_host }}"
-    - "{{ replacement_new_physical_host }}"
-    - "{{ replacement_swarm_manager_name }}"
-    - "{{ replacement_swarm_worker_name }}"
-  when:
-    - not replacement_skip_runtime_checks | bool
-    - item != replacement_old_logical_host or not replacement_old_host_may_be_offline | bool
-    - item != replacement_new_physical_host or replacement_capture_baseline | bool or replacement_phase2_rebuild_and_rejoin | bool
-
-- name: Capture swarm quorum state from manager host
-  ansible.builtin.command: docker node ls
-  register: proxmox_node_replacement_swarm_node_ls
-  changed_when: false
-  become: true
-  delegate_to: "{{ replacement_swarm_manager_name }}"
-  when: not replacement_skip_runtime_checks | bool
-
-- name: Assert swarm quorum output is available
-  ansible.builtin.assert:
-    that:
-      - proxmox_node_replacement_swarm_node_ls.rc == 0
-      - proxmox_node_replacement_swarm_node_ls.stdout is search('Leader|Reachable')
-    fail_msg: "Swarm control plane is not healthy enough for a node replacement cutover."
-    success_msg: "Swarm quorum check passed."
-  when: not replacement_skip_runtime_checks | bool
-
-- name: Create output directory for baseline artifacts
-  ansible.builtin.file:
-    path: "{{ proxmox_node_replacement_output_dir }}"
-    state: directory
-    mode: '0755'
-  delegate_to: localhost
-  when: replacement_capture_baseline | bool or replacement_execute_cutover | bool
-
-- name: Capture old logical host VM list
-  ansible.builtin.command: /usr/sbin/qm list
-  register: proxmox_node_replacement_old_qm_list
-  changed_when: false
-  become: true
-  delegate_to: "{{ replacement_old_logical_host }}"
-  when: replacement_capture_baseline | bool
-
-- name: Capture replacement physical host VM list
-  ansible.builtin.command: /usr/sbin/qm list
-  register: proxmox_node_replacement_new_qm_list
-  changed_when: false
-  become: true
-  delegate_to: "{{ replacement_new_physical_host }}"
-  when: replacement_capture_baseline | bool
-
-- name: Capture old logical host cluster state
-  ansible.builtin.command: pvecm status
-  register: proxmox_node_replacement_old_cluster_status
-  changed_when: false
-  failed_when: false
-  become: true
-  delegate_to: "{{ replacement_old_logical_host }}"
-  when: replacement_capture_baseline | bool
-
-- name: Capture replacement physical host cluster state
-  ansible.builtin.command: pvecm status
-  register: proxmox_node_replacement_new_cluster_status
-  changed_when: false
-  failed_when: false
-  become: true
-  delegate_to: "{{ replacement_new_physical_host }}"
-  when: replacement_capture_baseline | bool
-
-- name: Write baseline artifact to controller
-  ansible.builtin.copy:
-    dest: "{{ proxmox_node_replacement_output_dir }}/baseline-summary.txt"
-    mode: '0644'
-    content: |
-      Project: {{ replacement_project_name }}
-      Timestamp: {{ proxmox_node_replacement_timestamp }}
-      Logical identity host: {{ replacement_old_logical_host }}
-      Logical identity IP: {{ replacement_old_ip }}
-      Replacement physical host: {{ replacement_new_physical_host }}
-      Replacement physical IP: {{ replacement_new_physical_ip }}
-
-      === Swarm node ls (from {{ replacement_swarm_manager_name }}) ===
-      {{ proxmox_node_replacement_swarm_node_ls.stdout | default('') }}
-
-      === QM list ({{ replacement_old_logical_host }}) ===
-      {{ proxmox_node_replacement_old_qm_list.stdout | default('not-captured') }}
-
-      === QM list ({{ replacement_new_physical_host }}) ===
-      {{ proxmox_node_replacement_new_qm_list.stdout | default('not-captured') }}
-
-      === pvecm status ({{ replacement_old_logical_host }}) ===
-      {{ proxmox_node_replacement_old_cluster_status.stdout | default('not-captured') }}
-
-      === pvecm status ({{ replacement_new_physical_host }}) ===
-      {{ proxmox_node_replacement_new_cluster_status.stdout | default('not-captured') }}
-  delegate_to: localhost
-  when: replacement_capture_baseline | bool
-
-- name: Explain cutover execution gate
-  ansible.builtin.debug:
-    msg: >-
-      Cutover actions are disabled. Set replacement_execute_cutover=true and
-      replacement_confirm_phrase=EXECUTE_NODE_REPLACEMENT to continue.
-  when: not replacement_execute_cutover | bool
-
-- name: Enforce explicit confirmation phrase for cutover
-  ansible.builtin.assert:
-    that:
-      - replacement_confirm_phrase == 'EXECUTE_NODE_REPLACEMENT'
-    fail_msg: >-
-      Cutover requested without explicit confirmation phrase.
-      Set replacement_confirm_phrase=EXECUTE_NODE_REPLACEMENT.
-  when: replacement_execute_cutover | bool
-
-- name: Build cutover TODO artifact
-  ansible.builtin.copy:
-    dest: "{{ proxmox_node_replacement_output_dir }}/cutover-todo.txt"
-    mode: '0644'
-    content: |
-      EXECUTION MODE ENABLED
-
-      Phase 2 execution switch:
-      - replacement_phase2_rebuild_and_rejoin={{ replacement_phase2_rebuild_and_rejoin }}
-
-      Phase 3 execution switch:
-      - replacement_phase3_identity_cutover={{ replacement_phase3_identity_cutover }}
-
-      Phase 4 execution switch:
-      - replacement_phase4_validate_cutover={{ replacement_phase4_validate_cutover }}
-
-      Manual steps still required around identity cutover:
-      1. If phase 2 enabled, rebuild and rejoin replacement swarm nodes on {{ replacement_new_physical_host }}.
-      2. If phase 3 enabled, update inventory/group_vars source-of-truth with rollback snapshots.
-      3. If phase 4 enabled, validate swarm quorum and optional service endpoints.
-      4. Move network identity {{ replacement_old_ip }} to replacement physical host.
-      5. If stable and approved, power off old host.
-  delegate_to: localhost
-  when: replacement_execute_cutover | bool
-
-- name: Execute phase 2 rebuild and swarm rejoin on replacement host
-  ansible.builtin.include_tasks: phase2_rebuild_and_rejoin.yml
-  when:
-    - replacement_execute_cutover | bool
-    - replacement_phase2_rebuild_and_rejoin | bool
-
-- name: Execute phase 3 identity cutover updates with rollback snapshots
-  ansible.builtin.include_tasks: phase3_identity_cutover.yml
-  when:
-    - replacement_execute_cutover | bool
-    - replacement_phase3_identity_cutover | bool
-
-- name: Execute phase 4 post-cutover validation gates
-  ansible.builtin.include_tasks: phase4_validate_cutover.yml
-  when:
-    - replacement_execute_cutover | bool
-    - replacement_phase4_validate_cutover | bool
-
-- name: Power off old logical host after explicit approval
-  ansible.builtin.command: systemctl poweroff
-  become: true
-  delegate_to: "{{ replacement_old_logical_host }}"
-  when:
-    - replacement_execute_cutover | bool
-    - replacement_poweroff_old_host | bool
diff --git a/ansible/archive/roles/proxmox_node_replacement/tasks/phase2_rebuild_and_rejoin.yml b/ansible/archive/roles/proxmox_node_replacement/tasks/phase2_rebuild_and_rejoin.yml
deleted file mode 100644
index ae14d15..0000000
--- a/ansible/archive/roles/proxmox_node_replacement/tasks/phase2_rebuild_and_rejoin.yml
+++ /dev/null
@@ -1,323 +0,0 @@
----
-- name: Select swarm seed manager for join operations
-  ansible.builtin.set_fact:
-    proxmox_node_replacement_swarm_seed_manager: >-
-      {{
-        replacement_swarm_seed_manager
-        if replacement_swarm_seed_manager | trim | length > 0
-        else (groups['swarm_managers'] | difference([replacement_swarm_manager_name]) | first)
-      }}
-
-- name: Assert selected swarm seed manager is valid
-  ansible.builtin.assert:
-    that:
-      - proxmox_node_replacement_swarm_seed_manager is defined
-      - proxmox_node_replacement_swarm_seed_manager | trim | length > 0
-      - proxmox_node_replacement_swarm_seed_manager in groups['swarm_managers']
-    fail_msg: >-
-      Unable to determine swarm seed manager. Set replacement_swarm_seed_manager
-      to a healthy manager (example: swarm-manager-2).
-
-- name: Read join target advertise address
-  ansible.builtin.shell: ip -4 route get 1.1.1.1 | awk '{print $7; exit}'
-  args:
-    executable: /bin/bash
-  register: proxmox_node_replacement_seed_advertise_ip
-  changed_when: false
-  become: true
-  delegate_to: "{{ proxmox_node_replacement_swarm_seed_manager }}"
-
-- name: Capture manager join token
-  ansible.builtin.command: docker swarm join-token -q manager
-  register: proxmox_node_replacement_manager_join_token
-  changed_when: false
-  become: true
-  delegate_to: "{{ proxmox_node_replacement_swarm_seed_manager }}"
-
-- name: Capture worker join token
-  ansible.builtin.command: docker swarm join-token -q worker
-  register: proxmox_node_replacement_worker_join_token
-  changed_when: false
-  become: true
-  delegate_to: "{{ proxmox_node_replacement_swarm_seed_manager }}"
-
-- name: Drain and demote existing old manager/worker identities
-  when: replacement_manage_existing_swarm_nodes | bool
-  block:
-    - name: Drain replacement manager identity if present
-      ansible.builtin.command: docker node update --availability drain {{ replacement_swarm_manager_name }}
-      register: proxmox_node_replacement_drain_manager
-      changed_when: proxmox_node_replacement_drain_manager.rc == 0
-      failed_when: false
-      become: true
-      delegate_to: "{{ proxmox_node_replacement_swarm_seed_manager }}"
-
-    - name: Drain replacement worker identity if present
-      ansible.builtin.command: docker node update --availability drain {{ replacement_swarm_worker_name }}
-      register: proxmox_node_replacement_drain_worker
-      changed_when: proxmox_node_replacement_drain_worker.rc == 0
-      failed_when: false
-      become: true
-      delegate_to: "{{ proxmox_node_replacement_swarm_seed_manager }}"
-
-    - name: Demote manager identity before removal
-      ansible.builtin.command: docker node demote {{ replacement_swarm_manager_name }}
-      register: proxmox_node_replacement_demote_manager
-      changed_when: proxmox_node_replacement_demote_manager.rc == 0
-      failed_when: false
-      become: true
-      delegate_to: "{{ proxmox_node_replacement_swarm_seed_manager }}"
-
-    - name: Remove old manager identity from swarm if present
-      ansible.builtin.command: docker node rm --force {{ replacement_swarm_manager_name }}
-      register: proxmox_node_replacement_rm_manager
-      changed_when: proxmox_node_replacement_rm_manager.rc == 0
-      failed_when: false
-      become: true
-      delegate_to: "{{ proxmox_node_replacement_swarm_seed_manager }}"
-
-    - name: Remove old worker identity from swarm if present
-      ansible.builtin.command: docker node rm --force {{ replacement_swarm_worker_name }}
-      register: proxmox_node_replacement_rm_worker
-      changed_when: proxmox_node_replacement_rm_worker.rc == 0
-      failed_when: false
-      become: true
-      delegate_to: "{{ proxmox_node_replacement_swarm_seed_manager }}"
-
-- name: Ensure replacement template exists on new physical host
-  ansible.builtin.command: /usr/sbin/qm status {{ replacement_template_vmid }}
-  register: proxmox_node_replacement_template_check
-  changed_when: false
-  failed_when: proxmox_node_replacement_template_check.rc != 0
-  become: true
-  delegate_to: "{{ replacement_new_physical_host }}"
-
-- name: Check whether replacement manager VM already exists
-  ansible.builtin.command: /usr/sbin/qm status {{ replacement_manager_vmid }}
-  register: proxmox_node_replacement_manager_vmid_check
-  changed_when: false
-  failed_when: false
-  become: true
-  delegate_to: "{{ replacement_new_physical_host }}"
-
-- name: Check whether replacement worker VM already exists
-  ansible.builtin.command: /usr/sbin/qm status {{ replacement_worker_vmid }}
-  register: proxmox_node_replacement_worker_vmid_check
-  changed_when: false
-  failed_when: false
-  become: true
-  delegate_to: "{{ replacement_new_physical_host }}"
-
-- name: Stop existing replacement manager VM when overwrite enabled
-  ansible.builtin.command: /usr/sbin/qm stop {{ replacement_manager_vmid }}
-  register: proxmox_node_replacement_stop_manager
-  changed_when: proxmox_node_replacement_stop_manager.rc == 0
-  failed_when: false
-  become: true
-  delegate_to: "{{ replacement_new_physical_host }}"
-  when:
-    - replacement_overwrite_existing_vmids | bool
-    - proxmox_node_replacement_manager_vmid_check.rc == 0
-
-- name: Stop existing replacement worker VM when overwrite enabled
-  ansible.builtin.command: /usr/sbin/qm stop {{ replacement_worker_vmid }}
-  register: proxmox_node_replacement_stop_worker
-  changed_when: proxmox_node_replacement_stop_worker.rc == 0
-  failed_when: false
-  become: true
-  delegate_to: "{{ replacement_new_physical_host }}"
-  when:
-    - replacement_overwrite_existing_vmids | bool
-    - proxmox_node_replacement_worker_vmid_check.rc == 0
-
-- name: Destroy existing replacement manager VM when overwrite enabled
-  ansible.builtin.command: /usr/sbin/qm destroy {{ replacement_manager_vmid }} --purge 1
-  register: proxmox_node_replacement_destroy_manager
-  changed_when: proxmox_node_replacement_destroy_manager.rc == 0
-  failed_when: false
-  become: true
-  delegate_to: "{{ replacement_new_physical_host }}"
-  when:
-    - replacement_overwrite_existing_vmids | bool
-    - proxmox_node_replacement_manager_vmid_check.rc == 0
-
-- name: Destroy existing replacement worker VM when overwrite enabled
-  ansible.builtin.command: /usr/sbin/qm destroy {{ replacement_worker_vmid }} --purge 1
-  register: proxmox_node_replacement_destroy_worker
-  changed_when: proxmox_node_replacement_destroy_worker.rc == 0
-  failed_when: false
-  become: true
-  delegate_to: "{{ replacement_new_physical_host }}"
-  when:
-    - replacement_overwrite_existing_vmids | bool
-    - proxmox_node_replacement_worker_vmid_check.rc == 0
-
-- name: Guard against existing VMID conflicts when overwrite disabled
-  ansible.builtin.assert:
-    that:
-      - proxmox_node_replacement_manager_vmid_check.rc != 0
-      - proxmox_node_replacement_worker_vmid_check.rc != 0
-    fail_msg: >-
-      VMID conflict on {{ replacement_new_physical_host }}. Enable
-      replacement_overwrite_existing_vmids=true only if you intend to replace
-      existing VMIDs {{ replacement_manager_vmid }} and {{ replacement_worker_vmid }}.
-  when: not replacement_overwrite_existing_vmids | bool
-
-- name: Clone template to replacement manager VM
-  ansible.builtin.command: >-
-    /usr/sbin/qm clone {{ replacement_template_vmid }} {{ replacement_manager_vmid }}
-    --name {{ replacement_swarm_manager_name }} --full
-  register: proxmox_node_replacement_clone_manager
-  changed_when: proxmox_node_replacement_clone_manager.rc == 0
-  become: true
-  delegate_to: "{{ replacement_new_physical_host }}"
-
-- name: Configure replacement manager VM resources and cloud-init
-  ansible.builtin.command: >-
-    /usr/sbin/qm set {{ replacement_manager_vmid }}
-    --memory {{ replacement_vm_memory_mb }}
-    --cores {{ replacement_vm_cores }}
-    --onboot 1
-    --agent enabled=1
-    --ciuser {{ replacement_vm_user }}
-    --sshkeys {{ replacement_vm_ssh_key_path }}
-    --ipconfig0 ip={{ hostvars[replacement_swarm_manager_name].ansible_host }}/{{ replacement_network_cidr }},gw={{ replacement_gateway_ip }}
-    --nameserver {{ replacement_dns_primary }}
-    --searchdomain {{ replacement_search_domain }}
-  register: proxmox_node_replacement_set_manager
-  changed_when: proxmox_node_replacement_set_manager.rc == 0
-  become: true
-  delegate_to: "{{ replacement_new_physical_host }}"
-
-- name: Clone template to replacement worker VM
-  ansible.builtin.command: >-
-    /usr/sbin/qm clone {{ replacement_template_vmid }} {{ replacement_worker_vmid }}
-    --name {{ replacement_swarm_worker_name }} --full
-  register: proxmox_node_replacement_clone_worker
-  changed_when: proxmox_node_replacement_clone_worker.rc == 0
-  become: true
-  delegate_to: "{{ replacement_new_physical_host }}"
-
-- name: Configure replacement worker VM resources and cloud-init
-  ansible.builtin.command: >-
-    /usr/sbin/qm set {{ replacement_worker_vmid }}
-    --memory {{ replacement_vm_memory_mb }}
-    --cores {{ replacement_vm_cores }}
-    --onboot 1
-    --agent enabled=1
-    --ciuser {{ replacement_vm_user }}
-    --sshkeys {{ replacement_vm_ssh_key_path }}
-    --ipconfig0 ip={{ hostvars[replacement_swarm_worker_name].ansible_host }}/{{ replacement_network_cidr }},gw={{ replacement_gateway_ip }}
-    --nameserver {{ replacement_dns_primary }}
-    --searchdomain {{ replacement_search_domain }}
-  register: proxmox_node_replacement_set_worker
-  changed_when: proxmox_node_replacement_set_worker.rc == 0
-  become: true
-  delegate_to: "{{ replacement_new_physical_host }}"
-
-- name: Start replacement manager VM
-  ansible.builtin.command: /usr/sbin/qm start {{ replacement_manager_vmid }}
-  register: proxmox_node_replacement_start_manager
-  changed_when: proxmox_node_replacement_start_manager.rc == 0
-  become: true
-  delegate_to: "{{ replacement_new_physical_host }}"
-
-- name: Start replacement worker VM
-  ansible.builtin.command: /usr/sbin/qm start {{ replacement_worker_vmid }}
-  register: proxmox_node_replacement_start_worker
-  changed_when: proxmox_node_replacement_start_worker.rc == 0
-  become: true
-  delegate_to: "{{ replacement_new_physical_host }}"
-
-- name: Wait for replacement manager SSH to become reachable
-  ansible.builtin.wait_for:
-    host: "{{ hostvars[replacement_swarm_manager_name].ansible_host }}"
-    port: 22
-    timeout: 300
-    connect_timeout: 5
-    state: started
-  delegate_to: localhost
-
-- name: Wait for replacement worker SSH to become reachable
-  ansible.builtin.wait_for:
-    host: "{{ hostvars[replacement_swarm_worker_name].ansible_host }}"
-    port: 22
-    timeout: 300
-    connect_timeout: 5
-    state: started
-  delegate_to: localhost
-
-- name: Ensure manager node leaves any existing swarm state
-  ansible.builtin.command: docker swarm leave --force
-  register: proxmox_node_replacement_manager_leave
-  failed_when: false
-  changed_when: proxmox_node_replacement_manager_leave.rc == 0
-  become: true
-  delegate_to: "{{ replacement_swarm_manager_name }}"
-
-- name: Ensure worker node leaves any existing swarm state
-  ansible.builtin.command: docker swarm leave --force
-  register: proxmox_node_replacement_worker_leave
-  failed_when: false
-  changed_when: proxmox_node_replacement_worker_leave.rc == 0
-  become: true
-  delegate_to: "{{ replacement_swarm_worker_name }}"
-
-- name: Join replacement manager to swarm as manager
-  ansible.builtin.command: >-
-    docker swarm join
-    --token {{ proxmox_node_replacement_manager_join_token.stdout }}
-    {{ proxmox_node_replacement_seed_advertise_ip.stdout | trim }}:2377
-  register: proxmox_node_replacement_join_manager
-  failed_when:
-    - proxmox_node_replacement_join_manager.rc != 0
-    - "'This node is already part of a swarm' not in proxmox_node_replacement_join_manager.stderr"
-  changed_when: proxmox_node_replacement_join_manager.rc == 0
-  become: true
-  delegate_to: "{{ replacement_swarm_manager_name }}"
-
-- name: Join replacement worker to swarm as worker
-  ansible.builtin.command: >-
-    docker swarm join
-    --token {{ proxmox_node_replacement_worker_join_token.stdout }}
-    {{ proxmox_node_replacement_seed_advertise_ip.stdout | trim }}:2377
-  register: proxmox_node_replacement_join_worker
-  failed_when:
-    - proxmox_node_replacement_join_worker.rc != 0
-    - "'This node is already part of a swarm' not in proxmox_node_replacement_join_worker.stderr"
-  changed_when: proxmox_node_replacement_join_worker.rc == 0
-  become: true
-  delegate_to: "{{ replacement_swarm_worker_name }}"
-
-- name: Activate replacement manager and worker scheduling
-  ansible.builtin.command: docker node update --availability active {{ item }}
-  register: proxmox_node_replacement_activate_nodes
-  changed_when: proxmox_node_replacement_activate_nodes.rc == 0
-  become: true
-  delegate_to: "{{ proxmox_node_replacement_swarm_seed_manager }}"
-  loop:
-    - "{{ replacement_swarm_manager_name }}"
-    - "{{ replacement_swarm_worker_name }}"
-
-- name: Capture post-phase2 swarm node table
-  ansible.builtin.command: docker node ls
-  register: proxmox_node_replacement_post_phase2_node_ls
-  changed_when: false
-  become: true
-  delegate_to: "{{ proxmox_node_replacement_swarm_seed_manager }}"
-
-- name: Write phase2 summary artifact
-  ansible.builtin.copy:
-    dest: "{{ proxmox_node_replacement_output_dir }}/phase2-rebuild-summary.txt"
-    mode: '0644'
-    content: |
-      Project: {{ replacement_project_name }}
-      Replacement host: {{ replacement_new_physical_host }}
-      Template VMID: {{ replacement_template_vmid }}
-      Manager VMID: {{ replacement_manager_vmid }}
-      Worker VMID: {{ replacement_worker_vmid }}
-      Seed manager: {{ proxmox_node_replacement_swarm_seed_manager }}
-
-      === docker node ls ===
-      {{ proxmox_node_replacement_post_phase2_node_ls.stdout | default('') }}
-  delegate_to: localhost
diff --git a/ansible/archive/roles/proxmox_node_replacement/tasks/phase3_identity_cutover.yml b/ansible/archive/roles/proxmox_node_replacement/tasks/phase3_identity_cutover.yml
deleted file mode 100644
index d8cd53a..0000000
--- a/ansible/archive/roles/proxmox_node_replacement/tasks/phase3_identity_cutover.yml
+++ /dev/null
@@ -1,114 +0,0 @@
----
-- name: Build phase 3 file paths and backup paths
-  ansible.builtin.set_fact:
-    proxmox_node_replacement_inventory_file: "{{ replacement_inventory_file_path }}"
-    proxmox_node_replacement_group_vars_file: "{{ replacement_group_vars_file_path }}"
-    proxmox_node_replacement_rollback_dir: "{{ proxmox_node_replacement_output_dir }}/rollback"
-    proxmox_node_replacement_inventory_backup: "{{ proxmox_node_replacement_output_dir }}/rollback/hosts.ini.pre-cutover"
-    proxmox_node_replacement_group_vars_backup: "{{ proxmox_node_replacement_output_dir }}/rollback/all.yml.pre-cutover"
-
-- name: Validate source-of-truth files exist
-  ansible.builtin.stat:
-    path: "{{ item }}"
-  register: proxmox_node_replacement_phase3_file_stats
-  delegate_to: localhost
-  loop:
-    - "{{ proxmox_node_replacement_inventory_file }}"
-    - "{{ proxmox_node_replacement_group_vars_file }}"
-
-- name: Assert source-of-truth files are present
-  ansible.builtin.assert:
-    that:
-      - (proxmox_node_replacement_phase3_file_stats.results | map(attribute='stat.exists') | list) | min
-    fail_msg: "Phase 3 cutover files are missing. Check inventory/group_vars paths."
-
-- name: Create rollback directory for source-of-truth backups
-  ansible.builtin.file:
-    path: "{{ proxmox_node_replacement_rollback_dir }}"
-    state: directory
-    mode: '0755'
-  delegate_to: localhost
-  when: not ansible_check_mode
-
-- name: Backup inventory before identity cutover
-  ansible.builtin.copy:
-    src: "{{ proxmox_node_replacement_inventory_file }}"
-    dest: "{{ proxmox_node_replacement_inventory_backup }}"
-    mode: '0644'
-    remote_src: true
-  delegate_to: localhost
-  when: not ansible_check_mode
-
-- name: Backup group vars before identity cutover
-  ansible.builtin.copy:
-    src: "{{ proxmox_node_replacement_group_vars_file }}"
-    dest: "{{ proxmox_node_replacement_group_vars_backup }}"
-    mode: '0644'
-    remote_src: true
-  delegate_to: localhost
-  when: not ansible_check_mode
-
-- name: Apply atomic source-of-truth cutover updates
-  block:
-    - name: Remove replacement physical host alias from proxmox cluster inventory
-      ansible.builtin.lineinfile:
-        path: "{{ proxmox_node_replacement_inventory_file }}"
-        regexp: "^{{ replacement_new_physical_host }}\\s+ansible_host=.*$"
-        state: absent
-      delegate_to: localhost
-      when: replacement_remove_new_physical_from_cluster | bool
-
-    - name: Record physical backing host for logical pve01 in group vars
-      ansible.builtin.lineinfile:
-        path: "{{ proxmox_node_replacement_group_vars_file }}"
-        regexp: '^ physical_backing_host:'
-        line: " physical_backing_host: \"{{ replacement_new_physical_host }}\""
-        insertafter: '^ {{ replacement_old_logical_host }}:$'
-      delegate_to: localhost
-
-    - name: Mark pve04 role as replaced in group vars metadata
-      ansible.builtin.lineinfile:
-        path: "{{
proxmox_node_replacement_group_vars_file }}" - regexp: '^ replacement_status:' - line: " replacement_status: \"retired-identity-now-backing-{{ replacement_old_logical_host }}\"" - insertafter: '^ {{ replacement_new_physical_host }}:$' - delegate_to: localhost - - rescue: - - name: Restore inventory from rollback backup after cutover failure - ansible.builtin.copy: - src: "{{ proxmox_node_replacement_inventory_backup }}" - dest: "{{ proxmox_node_replacement_inventory_file }}" - mode: '0644' - remote_src: true - delegate_to: localhost - - - name: Restore group vars from rollback backup after cutover failure - ansible.builtin.copy: - src: "{{ proxmox_node_replacement_group_vars_backup }}" - dest: "{{ proxmox_node_replacement_group_vars_file }}" - mode: '0644' - remote_src: true - delegate_to: localhost - - - name: Fail phase 3 after rollback restore - ansible.builtin.fail: - msg: "Phase 3 cutover failed. Source-of-truth files were restored from rollback snapshots." - -- name: Write phase 3 cutover summary artifact - ansible.builtin.copy: - dest: "{{ proxmox_node_replacement_output_dir }}/phase3-cutover-summary.txt" - mode: '0644' - content: | - Project: {{ replacement_project_name }} - Phase: identity cutover source-of-truth update - Inventory file: {{ proxmox_node_replacement_inventory_file }} - Group vars file: {{ proxmox_node_replacement_group_vars_file }} - Rollback inventory backup: {{ proxmox_node_replacement_inventory_backup }} - Rollback group vars backup: {{ proxmox_node_replacement_group_vars_backup }} - - Applied updates: - - Removed {{ replacement_new_physical_host }} from proxmox_cluster inventory: {{ replacement_remove_new_physical_from_cluster }} - - Set physical_backing_host for {{ replacement_old_logical_host }} to {{ replacement_new_physical_host }} - - Set replacement_status in {{ replacement_new_physical_host }} metadata - delegate_to: localhost diff --git a/ansible/archive/roles/proxmox_node_replacement/tasks/phase4_validate_cutover.yml 
b/ansible/archive/roles/proxmox_node_replacement/tasks/phase4_validate_cutover.yml deleted file mode 100644 index 076591b..0000000 --- a/ansible/archive/roles/proxmox_node_replacement/tasks/phase4_validate_cutover.yml +++ /dev/null @@ -1,101 +0,0 @@ ---- -- name: Select swarm validation manager - ansible.builtin.set_fact: - proxmox_node_replacement_validation_manager: >- - {{ - replacement_swarm_seed_manager - if replacement_swarm_seed_manager | trim | length > 0 - else (groups['swarm_managers'] | difference([replacement_swarm_manager_name]) | first) - }} - -- name: Assert validation manager exists in swarm_managers - ansible.builtin.assert: - that: - - proxmox_node_replacement_validation_manager is defined - - proxmox_node_replacement_validation_manager | trim | length > 0 - - proxmox_node_replacement_validation_manager in groups['swarm_managers'] - fail_msg: >- - Unable to determine validation manager. Set replacement_swarm_seed_manager - to a healthy swarm manager. - -- name: Verify logical pve01 SSH reachability after cutover - ansible.builtin.wait_for: - host: "{{ hostvars[replacement_old_logical_host].ansible_host | default(replacement_old_logical_host) }}" - port: 22 - timeout: 30 - connect_timeout: 3 - state: started - delegate_to: localhost - -- name: Verify replacement swarm manager SSH reachability - ansible.builtin.wait_for: - host: "{{ hostvars[replacement_swarm_manager_name].ansible_host }}" - port: 22 - timeout: 30 - connect_timeout: 3 - state: started - delegate_to: localhost - -- name: Verify replacement swarm worker SSH reachability - ansible.builtin.wait_for: - host: "{{ hostvars[replacement_swarm_worker_name].ansible_host }}" - port: 22 - timeout: 30 - connect_timeout: 3 - state: started - delegate_to: localhost - -- name: Capture swarm node table for validation - ansible.builtin.command: docker node ls - register: proxmox_node_replacement_phase4_swarm_node_ls - changed_when: false - become: true - delegate_to: "{{ 
proxmox_node_replacement_validation_manager }}" - -- name: Assert replacement identities are visible in swarm - ansible.builtin.assert: - that: - - proxmox_node_replacement_phase4_swarm_node_ls.rc == 0 - - proxmox_node_replacement_phase4_swarm_node_ls.stdout is search(replacement_swarm_manager_name) - - proxmox_node_replacement_phase4_swarm_node_ls.stdout is search(replacement_swarm_worker_name) - - proxmox_node_replacement_phase4_swarm_node_ls.stdout is search('Leader|Reachable') - fail_msg: >- - Swarm validation failed. Expected nodes were not visible or manager quorum - markers were missing. - success_msg: "Swarm validation checks passed." - -- name: Validate optional service endpoints after cutover - ansible.builtin.uri: - url: "{{ item }}" - method: GET - status_code: [200, 301, 302, 307, 308] - timeout: "{{ replacement_phase4_url_timeout_seconds }}" - validate_certs: false - register: proxmox_node_replacement_phase4_endpoint_checks - delegate_to: localhost - loop: "{{ replacement_phase4_validation_urls }}" - when: replacement_phase4_validation_urls | length > 0 - -- name: Write phase 4 validation artifact - ansible.builtin.copy: - dest: "{{ proxmox_node_replacement_output_dir }}/phase4-validation-summary.txt" - mode: '0644' - content: | - Project: {{ replacement_project_name }} - Validation manager: {{ proxmox_node_replacement_validation_manager }} - Logical pve01 host: {{ replacement_old_logical_host }} - Swarm manager identity: {{ replacement_swarm_manager_name }} - Swarm worker identity: {{ replacement_swarm_worker_name }} - - === docker node ls === - {{ proxmox_node_replacement_phase4_swarm_node_ls.stdout | default('') }} - - === endpoint checks === - {% if replacement_phase4_validation_urls | length == 0 %} - No endpoint checks configured. 
- {% else %} - {% for check in proxmox_node_replacement_phase4_endpoint_checks.results %} - - {{ check.item }} => status {{ check.status | default('n/a') }} - {% endfor %} - {% endif %} - delegate_to: localhost diff --git a/ansible/archive/roles/proxmox_post_install/defaults/main.yml b/ansible/archive/roles/proxmox_post_install/defaults/main.yml deleted file mode 100644 index 03aec6b..0000000 --- a/ansible/archive/roles/proxmox_post_install/defaults/main.yml +++ /dev/null @@ -1,32 +0,0 @@ ---- -# Default variables for proxmox_post_install role -# These defaults assume you "approve all risks" similar to the original script - -# General -proxmox_post_install_enabled: true - -# Behavior toggles roughly mirroring the whiptail prompts -proxmox_fix_sources: true -proxmox_disable_pve_enterprise: true -proxmox_enable_pve_no_subscription: true -proxmox_fix_ceph_repos: true -proxmox_add_pvetest_repo_disabled: true - -# Subscription nag removal -proxmox_disable_subscription_nag: true - -# HA behavior -proxmox_enable_ha: false # default: do not auto-enable HA on fresh node -proxmox_disable_ha_on_single_node: true -proxmox_disable_corosync_on_single_node: true - -# Update & reboot -proxmox_run_dist_upgrade: true -proxmox_reboot_after: true - -# PVE version restrictions (mirrors the script: 8.0-8.9.x and 9.0-9.1.x) -proxmox_supported_major_versions: [8, 9] -proxmox_8_min_minor: 0 -proxmox_8_max_minor: 9 -proxmox_9_min_minor: 0 -proxmox_9_max_minor: 1 diff --git a/ansible/archive/roles/proxmox_post_install/tasks/main.yml b/ansible/archive/roles/proxmox_post_install/tasks/main.yml deleted file mode 100644 index 1c7beaf..0000000 --- a/ansible/archive/roles/proxmox_post_install/tasks/main.yml +++ /dev/null @@ -1,55 +0,0 @@ ---- -# Main entrypoint for proxmox_post_install role - -- name: "Check that role is enabled" - ansible.builtin.meta: end_play - when: not proxmox_post_install_enabled - -- name: "Detect Proxmox VE version (pveversion)" - ansible.builtin.command: "pveversion" - 
- register: proxmox_pveversion_cmd - changed_when: false - -- name: "Parse Proxmox VE version" - ansible.builtin.set_fact: - proxmox_pve_version_full: "{{ proxmox_pveversion_cmd.stdout | trim }}" - # pveversion output: "pve-manager/9.1.1/42db4a6cf33dac83" - version is at index 1 - proxmox_pve_version: "{{ (proxmox_pveversion_cmd.stdout | trim).split('/')[1] }}" - proxmox_pve_major: "{{ (proxmox_pveversion_cmd.stdout | trim).split('/')[1].split('.')[0] | int }}" - proxmox_pve_minor: "{{ (proxmox_pveversion_cmd.stdout | trim).split('/')[1].split('.')[1] | int }}" - -- name: "Fail if Proxmox VE major version is unsupported" - ansible.builtin.fail: - msg: >- - Unsupported Proxmox VE major version: {{ proxmox_pve_major }}. - Supported: 8.0–8.9.x and 9.0–9.1.x (mirrors upstream post-pve-install.sh). - when: proxmox_pve_major not in proxmox_supported_major_versions - -- name: "Fail if Proxmox VE 8 minor version unsupported" - ansible.builtin.fail: - msg: >- - Unsupported Proxmox 8 version {{ proxmox_pve_version }}. - Supported minor range: {{ proxmox_8_min_minor }}–{{ proxmox_8_max_minor }}. - when: - - proxmox_pve_major == 8 - - proxmox_pve_minor < proxmox_8_min_minor or proxmox_pve_minor > proxmox_8_max_minor - -- name: "Fail if Proxmox VE 9 minor version unsupported" - ansible.builtin.fail: - msg: >- - Unsupported Proxmox 9 version {{ proxmox_pve_version }}. - Supported minor range: {{ proxmox_9_min_minor }}–{{ proxmox_9_max_minor }}.
- when: - - proxmox_pve_major == 9 - - proxmox_pve_minor < proxmox_9_min_minor or proxmox_pve_minor > proxmox_9_max_minor - -- name: "Include version-specific tasks for PVE 8" - ansible.builtin.import_tasks: pve8.yml - when: proxmox_pve_major == 8 - -- name: "Include version-specific tasks for PVE 9" - ansible.builtin.import_tasks: pve9.yml - when: proxmox_pve_major == 9 - -- name: "Common post-routines (nag, HA, update, reboot)" - ansible.builtin.import_tasks: post_common.yml diff --git a/ansible/archive/roles/proxmox_post_install/tasks/post_common.yml b/ansible/archive/roles/proxmox_post_install/tasks/post_common.yml deleted file mode 100644 index 6dde402..0000000 --- a/ansible/archive/roles/proxmox_post_install/tasks/post_common.yml +++ /dev/null @@ -1,76 +0,0 @@ ---- -# Common post-routines for PVE 8 and 9: subscription nag, HA, updates, reboot - -- name: "Deploy subscription nag removal script" - ansible.builtin.template: - src: pve-remove-nag.sh.j2 - dest: /usr/local/bin/pve-remove-nag.sh - owner: root - group: root - mode: '0755' - when: proxmox_disable_subscription_nag - -- name: "Configure dpkg Post-Invoke hook to run nag removal script" - ansible.builtin.copy: - dest: /etc/apt/apt.conf.d/no-nag-script - owner: root - group: root - mode: '0644' - content: | - DPkg::Post-Invoke { "/usr/local/bin/pve-remove-nag.sh"; }; - when: proxmox_disable_subscription_nag - -- name: "Remove subscription nag dpkg hook if disabled via vars" - ansible.builtin.file: - path: /etc/apt/apt.conf.d/no-nag-script - state: absent - when: not proxmox_disable_subscription_nag - -- name: "Ensure proxmox-widget-toolkit is reinstalled (like script)" - ansible.builtin.apt: - name: proxmox-widget-toolkit - state: latest - update_cache: false - force: true - register: proxmox_widget_reinstall - failed_when: false - -- name: "Optionally enable HA services on cluster nodes" - ansible.builtin.service: - name: "{{ item }}" - state: started - enabled: true - loop: - - pve-ha-lrm - - pve-ha-crm 
- - corosync - when: proxmox_enable_ha - -- name: "Optionally disable HA services on single-node setups" - ansible.builtin.service: - name: "{{ item }}" - state: stopped - enabled: false - loop: - - pve-ha-lrm - - pve-ha-crm - when: proxmox_disable_ha_on_single_node - -- name: "Optionally disable Corosync on single-node setups" - ansible.builtin.service: - name: corosync - state: stopped - enabled: false - when: proxmox_disable_corosync_on_single_node - -- name: "Run apt dist-upgrade (like original script)" - ansible.builtin.apt: - update_cache: true - upgrade: dist - when: proxmox_run_dist_upgrade - -- name: "Reboot Proxmox VE when requested" - ansible.builtin.reboot: - msg: "Rebooting after post-install routines (Ansible)." - reboot_timeout: 1800 - when: proxmox_reboot_after diff --git a/ansible/archive/roles/proxmox_post_install/tasks/pve8.yml b/ansible/archive/roles/proxmox_post_install/tasks/pve8.yml deleted file mode 100644 index 2ef890a..0000000 --- a/ansible/archive/roles/proxmox_post_install/tasks/pve8.yml +++ /dev/null @@ -1,57 +0,0 @@ ---- -# Proxmox VE 8.x (Debian 12 / bookworm) sources and repo configuration - -- name: "Configure Debian bookworm APT sources (if enabled)" - ansible.builtin.copy: - dest: /etc/apt/sources.list - owner: root - group: root - mode: '0644' - content: | - deb http://deb.debian.org/debian bookworm main contrib - deb http://deb.debian.org/debian bookworm-updates main contrib - deb http://security.debian.org/debian-security bookworm-security main contrib - when: proxmox_fix_sources - -- name: "Disable pve-enterprise repository (list file) on 8.x" - ansible.builtin.copy: - dest: /etc/apt/sources.list.d/pve-enterprise.list - owner: root - group: root - mode: '0644' - content: | - # deb https://enterprise.proxmox.com/debian/pve bookworm pve-enterprise - when: proxmox_disable_pve_enterprise - -- name: "Enable pve-no-subscription repository on 8.x" - ansible.builtin.copy: - dest: /etc/apt/sources.list.d/pve-install-repo.list - owner: 
root - group: root - mode: '0644' - content: | - deb http://download.proxmox.com/debian/pve bookworm pve-no-subscription - when: proxmox_enable_pve_no_subscription - -- name: "Configure Ceph repositories for Proxmox VE 8.x" - ansible.builtin.copy: - dest: /etc/apt/sources.list.d/ceph.list - owner: root - group: root - mode: '0644' - content: | - # deb https://enterprise.proxmox.com/debian/ceph-quincy bookworm enterprise - # deb http://download.proxmox.com/debian/ceph-quincy bookworm no-subscription - # deb https://enterprise.proxmox.com/debian/ceph-reef bookworm enterprise - # deb http://download.proxmox.com/debian/ceph-reef bookworm no-subscription - when: proxmox_fix_ceph_repos - -- name: "Add disabled pvetest repository for 8.x" - ansible.builtin.copy: - dest: /etc/apt/sources.list.d/pvetest-for-beta.list - owner: root - group: root - mode: '0644' - content: | - # deb http://download.proxmox.com/debian/pve bookworm pvetest - when: proxmox_add_pvetest_repo_disabled diff --git a/ansible/archive/roles/proxmox_post_install/tasks/pve9.yml b/ansible/archive/roles/proxmox_post_install/tasks/pve9.yml deleted file mode 100644 index 0849b01..0000000 --- a/ansible/archive/roles/proxmox_post_install/tasks/pve9.yml +++ /dev/null @@ -1,123 +0,0 @@ ---- -# Proxmox VE 9.x (Debian 13 / trixie) sources and repo configuration using deb822 - -- name: "Find legacy .list APT source files on 9.x" - ansible.builtin.find: - paths: - - /etc/apt/sources.list.d - patterns: "*.list" - file_type: file - register: proxmox_legacy_list_files - -- name: "Backup and disable entries in /etc/apt/sources.list (if any)" - ansible.builtin.copy: - src: /etc/apt/sources.list - dest: /etc/apt/sources.list.bak - owner: root - group: root - mode: '0644' - remote_src: true - when: - - proxmox_fix_sources - - ansible_facts['os_family'] is defined - ignore_errors: true - -- name: "Comment legacy deb lines in /etc/apt/sources.list (bookworm/proxmox)" - ansible.builtin.replace: - path: /etc/apt/sources.list - 
regexp: '^(\s*deb\s+.*(proxmox|bookworm).*)$' - replace: '# Disabled by Proxmox Helper Ansible role \1' - when: proxmox_fix_sources - ignore_errors: true - -- name: "Remove legacy .list files on 9.x when migrating to deb822" - ansible.builtin.file: - path: "{{ item.path }}" - state: absent - loop: "{{ proxmox_legacy_list_files.files | default([]) }}" - when: proxmox_fix_sources - -- name: "Configure Debian Trixie deb822 sources for 9.x" - ansible.builtin.copy: - dest: /etc/apt/sources.list.d/debian.sources - owner: root - group: root - mode: '0644' - content: | - Types: deb - URIs: http://deb.debian.org/debian - Suites: trixie - Components: main contrib - Signed-By: /usr/share/keyrings/debian-archive-keyring.gpg - - Types: deb - URIs: http://security.debian.org/debian-security - Suites: trixie-security - Components: main contrib - Signed-By: /usr/share/keyrings/debian-archive-keyring.gpg - - Types: deb - URIs: http://deb.debian.org/debian - Suites: trixie-updates - Components: main contrib - Signed-By: /usr/share/keyrings/debian-archive-keyring.gpg - when: proxmox_fix_sources - -- name: "Ensure pve-enterprise deb822 source is disabled on 9.x" - ansible.builtin.blockinfile: - path: /etc/apt/sources.list.d/pve-enterprise.sources - create: true - owner: root - group: root - mode: '0644' - block: | - Types: deb - URIs: https://enterprise.proxmox.com/debian/pve - Suites: trixie - Components: pve-enterprise - Signed-By: /usr/share/keyrings/proxmox-archive-keyring.gpg - Enabled: false - when: proxmox_disable_pve_enterprise - -- name: "Configure pve-no-subscription deb822 source on 9.x" - ansible.builtin.copy: - dest: /etc/apt/sources.list.d/proxmox.sources - owner: root - group: root - mode: '0644' - content: | - Types: deb - URIs: http://download.proxmox.com/debian/pve - Suites: trixie - Components: pve-no-subscription - Signed-By: /usr/share/keyrings/proxmox-archive-keyring.gpg - when: proxmox_enable_pve_no_subscription - -- name: "Configure Ceph deb822 source on 9.x 
(no-subscription)" - ansible.builtin.copy: - dest: /etc/apt/sources.list.d/ceph.sources - owner: root - group: root - mode: '0644' - content: | - Types: deb - URIs: http://download.proxmox.com/debian/ceph-squid - Suites: trixie - Components: no-subscription - Signed-By: /usr/share/keyrings/proxmox-archive-keyring.gpg - when: proxmox_fix_ceph_repos - -- name: "Add disabled pve-test deb822 source on 9.x" - ansible.builtin.copy: - dest: /etc/apt/sources.list.d/pve-test.sources - owner: root - group: root - mode: '0644' - content: | - Types: deb - URIs: http://download.proxmox.com/debian/pve - Suites: trixie - Components: pve-test - Signed-By: /usr/share/keyrings/proxmox-archive-keyring.gpg - Enabled: false - when: proxmox_add_pvetest_repo_disabled diff --git a/ansible/archive/roles/proxmox_post_install/templates/pve-remove-nag.sh.j2 b/ansible/archive/roles/proxmox_post_install/templates/pve-remove-nag.sh.j2 deleted file mode 100644 index de1271f..0000000 --- a/ansible/archive/roles/proxmox_post_install/templates/pve-remove-nag.sh.j2 +++ /dev/null @@ -1,45 +0,0 @@ -#!/bin/sh -WEB_JS=/usr/share/javascript/proxmox-widget-toolkit/proxmoxlib.js -if [ -s "$WEB_JS" ] && ! grep -q NoMoreNagging "$WEB_JS"; then - echo "Patching Web UI nag..." - sed -i -e "/data\.status/ s/!//" -e "/data\.status/ s/active/NoMoreNagging/" "$WEB_JS" -fi - -MOBILE_TPL=/usr/share/pve-yew-mobile-gui/index.html.tpl -MARKER="" -if [ -f "$MOBILE_TPL" ] && ! grep -q "$MARKER" "$MOBILE_TPL"; then - echo "Patching Mobile UI nag..." - printf "%s\n" \ - "$MARKER" \ - "" \ - "" >>"$MOBILE_TPL" -fi diff --git a/ansible/archive/roles/secrets_onboarding/defaults/main.yml b/ansible/archive/roles/secrets_onboarding/defaults/main.yml deleted file mode 100644 index 584fe10..0000000 --- a/ansible/archive/roles/secrets_onboarding/defaults/main.yml +++ /dev/null @@ -1,30 +0,0 @@ -# Secrets onboarding role defaults -# This role bootstraps Ansible Vault infrastructure for safe credential storage. 
-# All paths and file modes are defined as idempotent defaults for single-vault (all hosts) scoping. - ---- -# Vault infrastructure paths -vault_base_dir: "{{ lookup('env', 'HOME') }}/.ansible/vault" -vault_password_file: "{{ vault_base_dir }}/password" -vault_encrypted_file: "{{ playbook_dir }}/../group_vars/vault/all.yml" -vault_vars_dir: "{{ playbook_dir }}/../group_vars/vault" - -# File and directory security modes (octal strings for Jinja2) -vault_dir_mode: "0700" # Owner read/write/execute only -vault_password_file_mode: "0600" # Owner read/write only -vault_file_mode: "0600" # Owner read/write only (encrypted vars file) - -# Optional enforcement controls for production readiness -vault_require_encrypted_vars_file: false -vault_encrypted_vars_required_keys: [] - -# Onboarding behavior toggles -create_example_vault: false # Set to true to create example encrypted var during first run -vault_skip_validation: false # Set to true to skip assert checks (not recommended) - -# Example variable names (for documentation and learning) -# These are referenced in the vault validation task -example_vault_variables: - - grafana_admin_password - - authentik_outpost_dozzle_token - - docker_registry_password diff --git a/ansible/archive/roles/secrets_onboarding/tasks/main.yml b/ansible/archive/roles/secrets_onboarding/tasks/main.yml deleted file mode 100644 index 8ae0cdb..0000000 --- a/ansible/archive/roles/secrets_onboarding/tasks/main.yml +++ /dev/null @@ -1,114 +0,0 @@ ---- -# Phase: Setup and validation for Ansible Vault infrastructure on control node -# Concept: This role idempotently prepares the control node to encrypt/decrypt secrets -# without embedding plaintext credentials in playbooks or committed files. -# On first run, it creates directories and validates prerequisites. -# On repeat runs, it verifies infrastructure is still healthy. - -- name: Create vault infrastructure directory - tags: - - bootstrap - block: - # Why ansible.builtin.file instead of 'mkdir'? 
- # - file module is idempotent: runs 100 times, same result - # - shell 'mkdir' fails if dir already exists (unless -p flag in separate task) - # - file module handles ownership, permissions, and state atomically - - name: Ensure vault base directory exists with secure permissions - ansible.builtin.file: - path: "{{ vault_base_dir }}" - state: directory - mode: "{{ vault_dir_mode }}" - recurse: false - register: vault_base_created - - # Why ansible.builtin.file again (not just the above)? - # - Multiple tasks allow clear separation: one for base, one for vars subdir - # - Tags can be applied granularly (useful if debugging one phase) - - name: Ensure vault encrypted vars subdirectory exists - ansible.builtin.file: - path: "{{ vault_vars_dir }}" - state: directory - mode: "{{ vault_dir_mode }}" - recurse: false - register: vault_vars_created - -- name: Validate vault prerequisites - tags: - - validate - block: - # Include external task file for readability and reusability - # - Keeps main.yml focused on the happy path - # - Allows the same validation to run standalone for testing - - name: Run vault validation checks - ansible.builtin.include_tasks: validate.yml - vars: - skip_validation: "{{ vault_skip_validation }}" - -- name: Display setup status - tags: - - bootstrap - block: - # Why ansible.builtin.debug instead of 'echo' or shell? - # - debug module respects Ansible's verbosity levels (-v, -vv) - # - Output is properly formatted in Ansible logs and CI/CD systems - # - Can be silenced in automated runs, shown in verbose/interactive runs - - name: Report vault directory creation status - ansible.builtin.debug: - msg: | - Vault infrastructure ready. - Vault base dir: {{ vault_base_dir }} - Encrypted vars dir: {{ vault_vars_dir }} - Password file expected at: {{ vault_password_file }} - - Next steps: - 1. Create a vault password file (first run only): - echo 'your-strong-password' > {{ vault_password_file }} - chmod 0600 {{ vault_password_file }} - - 2. 
Create your first encrypted vars file: - ansible-vault create {{ vault_encrypted_file }} - - 3. Reference secrets in playbooks: - vars: - grafana_admin_password: "{{ vault_grafana_admin_password }}" - -- name: Optional example vault setup - tags: - - example - block: - # This block is skipped by default (create_example_vault: false) - # Set create_example_vault: true to auto-generate an example encrypted file for learning - # Why skip by default? - # - Beginners need to understand password generation and encryption manually - # - Automated example creation bypasses the learning moment - # - Example includes a password_hash() to show Jinja2 + vault integration - - - name: Create example vault content (for learning, runs with --tags example) - ansible.builtin.set_fact: - example_vault_content: | - --- - # Example encrypted vars for Grafana - # Reference in playbook with: {{ vault_grafana_admin_password }} - - vault_grafana_admin_password: "{{ 'change-me-to-strong-password' | password_hash('sha512') }}" - vault_authentik_outpost_dozzle_token: "your-authentik-token-here" - vault_docker_registry_password: "your-registry-password-here" - when: create_example_vault | bool - - - name: Create example encrypted vars file - ansible.builtin.copy: - content: "{{ example_vault_content }}" - dest: "{{ vault_encrypted_file }}" - mode: "{{ vault_file_mode }}" - when: create_example_vault | bool - register: example_created - - - name: Encrypt example file with ansible-vault (manual step) - ansible.builtin.debug: - msg: | - To encrypt the example vars file, run manually: - ansible-vault encrypt {{ vault_encrypted_file }} - - Or use the vault password file: - ansible-vault encrypt --vault-password-file {{ vault_password_file }} {{ vault_encrypted_file }} - when: create_example_vault | bool and example_created is changed diff --git a/ansible/archive/roles/secrets_onboarding/tasks/validate.yml b/ansible/archive/roles/secrets_onboarding/tasks/validate.yml deleted file mode 100644 index 
4f6dc73..0000000 --- a/ansible/archive/roles/secrets_onboarding/tasks/validate.yml +++ /dev/null @@ -1,184 +0,0 @@ ---- -# Validation tasks for vault infrastructure health -# Concept: These assertions fail fast if vault prerequisites are missing or misconfigured. -# Run as part of the main role or standalone to diagnose vault setup issues. -# Why assert instead of conditional blocks? -# - assert provides clear, fail-fast feedback in the Ansible log -# - conditional blocks silently skip, hiding problems -# - assert messages are easy to test and CI-friendly - -- name: Check vault password file exists - ansible.builtin.stat: - path: "{{ vault_password_file }}" - register: password_file_stat - tags: - - validate - -# Why ansible.builtin.assert instead of shell 'test -f'? -# - assert provides human-readable failure messages -# - Multiple conditions can be checked in one task -# - Integrates with Ansible's fail_msg and exception handling -# - Shell 'test' silently passes/fails; harder to debug in logs -- name: Assert vault password file exists and has secure permissions - ansible.builtin.assert: - that: - - password_file_stat.stat.exists - - password_file_stat.stat.mode == "0600" - fail_msg: | - Vault password file is missing or has wrong permissions. - Expected: {{ vault_password_file }} with mode 0600 - Actual: exists={{ password_file_stat.stat.exists }}, mode={{ password_file_stat.stat.mode | default('N/A') }} - - To fix: - 1. Create the file with a strong password: - echo 'YOUR-STRONG-PASSWORD' > {{ vault_password_file }} - 2. 
Secure it: - chmod 0600 {{ vault_password_file }} - - For interactive vault prompts (instead of password file), use: - ansible-playbook ansible/playbooks/onboarding/setup_ansible_secrets.yml --ask-vault-pass - when: not skip_validation | default(false) - tags: - - validate - -- name: Check vault directory permissions - ansible.builtin.stat: - path: "{{ vault_base_dir }}" - register: vault_dir_stat - tags: - - validate - -- name: Assert vault directory has secure permissions - ansible.builtin.assert: - that: - - vault_dir_stat.stat.exists - - vault_dir_stat.stat.isdir - - vault_dir_stat.stat.mode == "0700" - fail_msg: | - Vault directory has wrong permissions or does not exist. - Expected: {{ vault_base_dir }} as directory with mode 0700 - Actual: exists={{ vault_dir_stat.stat.exists }}, isdir={{ vault_dir_stat.stat.isdir | default(false) }}, mode={{ vault_dir_stat.stat.mode | default('N/A') }} - - To fix: - mkdir -p {{ vault_base_dir }} - chmod 0700 {{ vault_base_dir }} - when: not skip_validation | default(false) - tags: - - validate - -- name: Check encrypted vars directory - ansible.builtin.stat: - path: "{{ vault_vars_dir }}" - register: vault_vars_dir_stat - tags: - - validate - -- name: Assert encrypted vars directory exists - ansible.builtin.assert: - that: - - vault_vars_dir_stat.stat.exists - - vault_vars_dir_stat.stat.isdir - fail_msg: | - Vault encrypted vars directory does not exist. 
- Expected: {{ vault_vars_dir }} - - To fix (automatic on next role run): - ansible-playbook playbooks/onboarding/setup_ansible_secrets.yml --tags bootstrap - when: not skip_validation | default(false) - tags: - - validate - -- name: Check encrypted vars file state - ansible.builtin.stat: - path: "{{ vault_encrypted_file }}" - register: vault_encrypted_file_stat - tags: - - validate - -- name: Assert encrypted vars file exists when required - ansible.builtin.assert: - that: - - vault_encrypted_file_stat.stat.exists - - vault_encrypted_file_stat.stat.isreg - fail_msg: | - Vault encrypted vars file is required but missing. - Expected: {{ vault_encrypted_file }} - - To fix: - ansible-vault create {{ vault_encrypted_file }} - when: - - not skip_validation | default(false) - - vault_require_encrypted_vars_file | bool - tags: - - validate - -- name: Read encrypted vars file header - ansible.builtin.slurp: - src: "{{ vault_encrypted_file }}" - register: vault_encrypted_file_content - when: - - not skip_validation | default(false) - - vault_require_encrypted_vars_file | bool - - vault_encrypted_file_stat.stat.exists - no_log: true - tags: - - validate - -- name: Assert encrypted vars file is vault-encrypted - ansible.builtin.assert: - that: - - "(vault_encrypted_file_content.content | b64decode).startswith('$ANSIBLE_VAULT;')" - fail_msg: | - Vault vars file exists but is not encrypted with Ansible Vault. 
- File: {{ vault_encrypted_file }} - - To fix: - ansible-vault encrypt {{ vault_encrypted_file }} - when: - - not skip_validation | default(false) - - vault_require_encrypted_vars_file | bool - - vault_encrypted_file_stat.stat.exists - no_log: true - tags: - - validate - -- name: Load encrypted vars for required key checks - ansible.builtin.include_vars: - file: "{{ vault_encrypted_file }}" - name: loaded_vault_vars - when: - - not skip_validation | default(false) - - vault_require_encrypted_vars_file | bool - - vault_encrypted_file_stat.stat.exists - - (vault_encrypted_vars_required_keys | length) > 0 - no_log: true - tags: - - validate - -- name: Assert required encrypted keys exist - ansible.builtin.assert: - that: - - "item in loaded_vault_vars" - - "(loaded_vault_vars[item] | string | trim | length) > 0" - fail_msg: "Missing required vault key: {{ item }}" - loop: "{{ vault_encrypted_vars_required_keys }}" - when: - - not skip_validation | default(false) - - vault_require_encrypted_vars_file | bool - - vault_encrypted_file_stat.stat.exists - - (vault_encrypted_vars_required_keys | length) > 0 - no_log: true - tags: - - validate - -- name: Report validation success - ansible.builtin.debug: - msg: | - ✓ Vault infrastructure is healthy.
- ✓ Vault password file: {{ vault_password_file }} (mode {{ password_file_stat.stat.mode }}) - ✓ Vault base directory: {{ vault_base_dir }} (mode {{ vault_dir_stat.stat.mode }}) - ✓ Encrypted vars directory: {{ vault_vars_dir }} - {% if vault_require_encrypted_vars_file | bool %}✓ Encrypted vars file: {{ vault_encrypted_file }}{% endif %} - when: not skip_validation | default(false) - tags: - - validate diff --git a/ansible/archive/roles/storage_mounts/defaults/main.yml b/ansible/archive/roles/storage_mounts/defaults/main.yml deleted file mode 100644 index 9d6fb4d..0000000 --- a/ansible/archive/roles/storage_mounts/defaults/main.yml +++ /dev/null @@ -1,22 +0,0 @@ ---- -# roles/storage_mounts/defaults/main.yml -# -# NFS server and mount definitions for all lab VMs that require shared storage. -# Sourced from: ansible/group_vars/all.yml (lab_hosts.terramaster.current_ip) -# -# To override for a specific host, set storage_nfs_mounts in host_vars/.yml. -# To skip a specific mount on a node, remove it from the host_vars override list. - -storage_nfs_server: "10.0.0.250" - -storage_nfs_mounts: - - src: "/Volume1/appdata" - dest: "/mnt/homelab" - opts: "defaults,_netdev" - # WHY _netdev: marks this as a network-dependent mount so systemd waits - # for the network to be up before attempting to mount on boot. Prevents - # hung boot when the NAS is temporarily unreachable. - - - src: "/Volume2/media" - dest: "/mnt/media" - opts: "defaults,_netdev" diff --git a/ansible/archive/roles/storage_mounts/tasks/main.yml b/ansible/archive/roles/storage_mounts/tasks/main.yml deleted file mode 100644 index 5d2dbe8..0000000 --- a/ansible/archive/roles/storage_mounts/tasks/main.yml +++ /dev/null @@ -1,94 +0,0 @@ ---- -# roles/storage_mounts/tasks/main.yml -# -# Idempotent NFS mount configuration for lab VMs. -# Safe to run on already-mounted hosts: ansible.posix.mount checks current -# kernel mount table and fstab before making any change.
- -# -------------------------------------------------- -# STEP 1: Install NFS client packages -# WHY ansible.builtin.apt (not shell): idempotent; only installs when absent. -# -------------------------------------------------- - -- name: Install NFS client package - ansible.builtin.apt: - name: nfs-common - state: present - update_cache: true - tags: [storage, packages] - -# -------------------------------------------------- -# STEP 2: Create mount point directories -# WHY before fstab: mount points must pre-exist before systemd or mount(8) -# can act on them. file module is idempotent on existing dirs. -# -------------------------------------------------- - -- name: Ensure NFS mount point directories exist - ansible.builtin.file: - path: "{{ item.dest }}" - state: directory - mode: '0755' - owner: root - group: root - loop: "{{ storage_nfs_mounts }}" - tags: [storage, filesystem] - -# -------------------------------------------------- -# STEP 3: Write fstab entries -# WHY state: present (not mounted): separates "persist across reboots" -# from "mount now". The next task handles the live mount separately so -# each concern has a clear changed/ok signal. -# -------------------------------------------------- - -- name: Write NFS entries to /etc/fstab - ansible.posix.mount: - src: "{{ storage_nfs_server }}:{{ item.src }}" - path: "{{ item.dest }}" - fstype: nfs - opts: "{{ item.opts }}" - state: present - loop: "{{ storage_nfs_mounts }}" - tags: [storage, fstab] - -# -------------------------------------------------- -# STEP 4: Mount NFS shares in the live kernel -# WHY state: mounted: issues mount(8) if not already in the mount table. -# No-op when the share is already mounted; changed only on first run -# or after an unmount. 
-# -------------------------------------------------- - -- name: Mount NFS shares - ansible.posix.mount: - src: "{{ storage_nfs_server }}:{{ item.src }}" - path: "{{ item.dest }}" - fstype: nfs - opts: "{{ item.opts }}" - state: mounted - loop: "{{ storage_nfs_mounts }}" - tags: [storage, mount] - -# -------------------------------------------------- -# STEP 5: Verify mounts are reachable -# WHY stat (not ls): stat is a builtin module with no external dependency; -# it checks that the path exists AND is a directory (not an empty -# local dir masquerading as a successful mount point). -# -------------------------------------------------- - -- name: Verify NFS mount points are accessible - ansible.builtin.stat: - path: "{{ item.dest }}" - register: storage_mounts_stat - loop: "{{ storage_nfs_mounts }}" - tags: [storage, verify] - -- name: Assert NFS mount points are directories - ansible.builtin.assert: - that: - - item.stat.exists - - item.stat.isdir - fail_msg: >- - NFS mount point {{ item.item.dest }} is not accessible after mount attempt. - Check NFS server ({{ storage_nfs_server }}) connectivity and export paths. - loop: "{{ storage_mounts_stat.results }}" - when: not ansible_check_mode - tags: [storage, verify] diff --git a/ansible/archive/roles/swarm_bootstrap/defaults/main.yml b/ansible/archive/roles/swarm_bootstrap/defaults/main.yml deleted file mode 100644 index 8c26b0e..0000000 --- a/ansible/archive/roles/swarm_bootstrap/defaults/main.yml +++ /dev/null @@ -1,33 +0,0 @@ ---- -# Defaults for swarm bootstrap workflow. - -swarm_listen_addr: "0.0.0.0:2377" -swarm_api_port: 2377 -swarm_data_path_port: 4789 -swarm_gossip_port: 7946 - -# Swarm overlay networks (including ingress) are allocated from this pool. -# Keep this non-overlapping with physical LAN/VLAN ranges. -swarm_default_addr_pool: "172.31.0.0/16" -swarm_default_addr_pool_mask_length: 24 - -# Docker apt repo (Ubuntu/Debian). 
-# swarm_docker_apt_arch is intentionally removed; architecture is now detected -# per-node from hostvars[item].ansible_architecture in tasks/main.yml to -# correctly support mixed amd64/arm64 inventories. -swarm_docker_repo_url: "https://download.docker.com/linux/ubuntu" -swarm_docker_repo_gpg_url: "https://download.docker.com/linux/ubuntu/gpg" -swarm_docker_repo_filename: "docker" -swarm_docker_keyring_path: "/etc/apt/keyrings/docker.asc" -swarm_docker_legacy_keyring_path: "/usr/share/keyrings/docker-archive-keyring.gpg" - -swarm_docker_packages: - - ca-certificates - - curl - - gnupg - - python3-docker - - docker-ce - - docker-ce-cli - - containerd.io - - docker-buildx-plugin - - docker-compose-plugin diff --git a/ansible/archive/roles/swarm_bootstrap/tasks/main.yml b/ansible/archive/roles/swarm_bootstrap/tasks/main.yml deleted file mode 100644 index 9353a15..0000000 --- a/ansible/archive/roles/swarm_bootstrap/tasks/main.yml +++ /dev/null @@ -1,452 +0,0 @@ ---- -# End-to-end Docker Swarm bootstrap from the primary manager. - -- name: Validate required inventory groups exist - ansible.builtin.assert: - that: - - groups['swarm_managers'] is defined - - groups['swarm_managers'] | length > 0 - - groups['swarm_workers'] is defined - - groups['swarm_workers'] | length > 0 - fail_msg: "Inventory must define non-empty swarm_managers and swarm_workers groups." - success_msg: "Required Swarm inventory groups are present."
- tags: [always] - -- name: Build swarm host lists - ansible.builtin.set_fact: - swarm_primary_manager: "{{ groups['swarm_managers'][0] }}" - swarm_secondary_managers: "{{ groups['swarm_managers'][1:] }}" - swarm_all_nodes: "{{ groups['swarm_managers'] + groups['swarm_workers'] }}" - tags: [always] - -- name: Detect runtime IPv4 on primary manager - ansible.builtin.shell: ip -4 route get 1.1.1.1 | awk '{print $7; exit}' - args: - executable: /bin/bash - changed_when: false - check_mode: false - register: swarm_primary_runtime_ip - become: true - delegate_to: "{{ swarm_primary_manager }}" - tags: [always] - -- name: Validate runtime IPv4 detection on primary manager - ansible.builtin.assert: - that: - - swarm_primary_runtime_ip.stdout | trim | length > 0 - - swarm_primary_runtime_ip.stdout | trim != ansible_host | default('127.0.0.1') - - swarm_primary_runtime_ip.stdout | trim != '10.0.0.200' - fail_msg: >- - IP detected on {{ swarm_primary_manager }} was '{{ swarm_primary_runtime_ip.stdout | trim }}'. - This looks like the control node IP, which means delegate_to is not working. - Ensure 'connection: local' is NOT set at play level in bootstrap_swarm.yml. 
- success_msg: "Detected primary manager runtime IPv4: {{ swarm_primary_runtime_ip.stdout | trim }}" - tags: [always] - -- name: Set primary manager advertise address from runtime network state - ansible.builtin.set_fact: - swarm_primary_advertise_addr: "{{ swarm_primary_runtime_ip.stdout | trim }}" - tags: [always] - -- name: Set Docker CLI environment guardrails - ansible.builtin.set_fact: - swarm_docker_cli_env: - DOCKER_HOST: "unix:///var/run/docker.sock" - DOCKER_CONTEXT: "" - tags: [always] - -- name: Read Docker daemon hostname from primary manager context - ansible.builtin.command: docker info --format '{{"{{"}} .Name {{"}}"}}' - changed_when: false - check_mode: false - register: swarm_primary_docker_name - become: true - delegate_to: "{{ swarm_primary_manager }}" - environment: "{{ swarm_docker_cli_env }}" - tags: [always] - -- name: Assert Docker daemon context points to expected manager - ansible.builtin.assert: - that: - - swarm_primary_docker_name.stdout | trim != 'watchtower' - - swarm_primary_docker_name.stdout | trim | length > 0 - fail_msg: >- - Docker daemon reports '{{ swarm_primary_docker_name.stdout | trim }}'; this is the control node, not {{ swarm_primary_manager }}. - The 'connection: local' directive must NOT be set at play level in bootstrap_swarm.yml.
- success_msg: "Docker daemon context verified on primary manager: {{ swarm_primary_docker_name.stdout | trim }}" - tags: [always] - -- name: Preflight - check SSH port reachability from control node - ansible.builtin.wait_for: - host: "{{ hostvars[item].ansible_host | default(item) }}" - port: 22 - timeout: 3 - connect_timeout: 2 - state: started - register: swarm_ssh_preflight - failed_when: false - delegate_to: localhost - loop: "{{ swarm_all_nodes }}" - tags: [always] - -- name: Build list of unreachable swarm nodes - ansible.builtin.set_fact: - swarm_unreachable_nodes: >- - {{ - swarm_ssh_preflight.results - | selectattr('failed', 'defined') - | selectattr('failed') - | map(attribute='item') - | list - }} - tags: [always] - -- name: Fail fast when swarm nodes are unreachable over SSH - ansible.builtin.assert: - that: - - swarm_unreachable_nodes | length == 0 - fail_msg: >- - Cannot reach TCP/22 from control node for: {{ swarm_unreachable_nodes | join(', ') }}. - Confirm VM power state, IP assignment, routing, and firewall before bootstrap. - success_msg: "SSH reachability preflight passed for all swarm nodes." 
- tags: [always] - -- name: Ensure Docker keyring directory exists - ansible.builtin.file: - path: /etc/apt/keyrings - state: directory - mode: '0755' - become: true - delegate_to: "{{ item }}" - loop: "{{ swarm_all_nodes }}" - -- name: Discover apt source files on swarm nodes - ansible.builtin.find: - paths: /etc/apt/sources.list.d - patterns: - - "*.list" - - "*.sources" - file_type: file - register: swarm_bootstrap_apt_source_candidates - become: true - delegate_to: "{{ item }}" - loop: "{{ swarm_all_nodes }}" - -- name: Read apt source file contents on swarm nodes - ansible.builtin.slurp: - src: "{{ source_file.path }}" - register: swarm_bootstrap_apt_source_contents - become: true - delegate_to: "{{ item.0.item }}" - loop: >- - {{ - swarm_bootstrap_apt_source_candidates.results - | subelements('files', skip_missing=True) - }} - loop_control: - loop_var: item - label: "{{ item.0.item }} -> {{ item.1.path }}" - vars: - source_file: "{{ item.1 }}" - -- name: Remove discovered Docker apt source files on swarm nodes - ansible.builtin.file: - path: "{{ item.source }}" - state: absent - become: true - delegate_to: "{{ item.item.0.item }}" - loop: "{{ swarm_bootstrap_apt_source_contents.results }}" - when: - - item.content is defined - - "'download.docker.com' in (item.content | b64decode)" - loop_control: - label: "{{ item.item.0.item }} -> {{ item.source }}" - -- name: Remove Docker source entries from main apt sources list - ansible.builtin.lineinfile: - path: /etc/apt/sources.list - regexp: '^.*download\\.docker\\.com.*$' - state: absent - become: true - delegate_to: "{{ item }}" - loop: "{{ swarm_all_nodes }}" - -- name: Remove legacy Docker apt source list - ansible.builtin.file: - path: "/etc/apt/sources.list.d/{{ swarm_docker_repo_filename }}.list" - state: absent - become: true - delegate_to: "{{ item }}" - loop: "{{ swarm_all_nodes }}" - -- name: Remove legacy add-apt-repository Docker source list - ansible.builtin.file: - path: >- - 
/etc/apt/sources.list.d/archive_uri-https_download_docker_com_linux_ubuntu- - {{ hostvars[item].ansible_distribution_release | default('noble') }}.list - state: absent - become: true - delegate_to: "{{ item }}" - loop: "{{ swarm_all_nodes }}" - -- name: Remove legacy Docker deb822 source definition - ansible.builtin.file: - path: "/etc/apt/sources.list.d/{{ swarm_docker_repo_filename }}.sources" - state: absent - become: true - delegate_to: "{{ item }}" - loop: "{{ swarm_all_nodes }}" - -- name: Remove legacy Docker apt signing key path - ansible.builtin.file: - path: "{{ swarm_docker_legacy_keyring_path }}" - state: absent - become: true - delegate_to: "{{ item }}" - loop: "{{ swarm_all_nodes }}" - -- name: Install Docker apt signing key - ansible.builtin.get_url: - url: "{{ swarm_docker_repo_gpg_url }}" - dest: "{{ swarm_docker_keyring_path }}" - mode: '0644' - become: true - delegate_to: "{{ item }}" - loop: "{{ swarm_all_nodes }}" - -- name: Add Docker apt repository - vars: - # Derive the correct Debian architecture string per-node so ARM and x86 nodes - # in the same inventory both receive the right repo entry. 
- _node_apt_arch: "{{ 'arm64' if hostvars[item].ansible_architecture | default('x86_64') == 'aarch64' else 'amd64' }}" - ansible.builtin.apt_repository: - repo: "deb [arch={{ _node_apt_arch }} signed-by={{ swarm_docker_keyring_path }}] {{ swarm_docker_repo_url }} {{ hostvars[item].ansible_distribution_release | default('noble') }} stable" - filename: "{{ swarm_docker_repo_filename }}" - state: present - become: true - delegate_to: "{{ item }}" - loop: "{{ swarm_all_nodes }}" - -- name: Ensure Docker dependencies are installed - ansible.builtin.apt: - name: - - ca-certificates - - curl - - gnupg - - python3-docker - state: present - update_cache: true - become: true - delegate_to: "{{ item }}" - loop: "{{ swarm_all_nodes }}" - -- name: Install Docker engine packages - ansible.builtin.apt: - name: "{{ swarm_docker_packages }}" - state: present - update_cache: true - become: true - delegate_to: "{{ item }}" - loop: "{{ swarm_all_nodes }}" - -- name: Ensure Docker service is enabled and running - ansible.builtin.systemd: - name: docker - enabled: true - state: started - become: true - delegate_to: "{{ item }}" - loop: "{{ swarm_all_nodes }}" - -- name: Ensure ansible_user is in the docker group - ansible.builtin.user: - # Use per-host ansible_user from hostvars so mixed-user inventories work correctly. - # Falls back to the play-level ansible_user when not set on a specific host. 
- name: "{{ hostvars[item].ansible_user | default(ansible_user) }}" - groups: docker - append: true - become: true - delegate_to: "{{ item }}" - loop: "{{ swarm_all_nodes }}" - -- name: Read primary manager swarm state - ansible.builtin.command: docker info --format '{{"{{"}} .Swarm.LocalNodeState {{"}}"}}' - changed_when: false - check_mode: false - register: swarm_primary_state - become: true - delegate_to: "{{ swarm_primary_manager }}" - environment: "{{ swarm_docker_cli_env }}" - tags: [swarm-join] - -- name: Initialize swarm on primary manager when inactive - ansible.builtin.command: >- - docker swarm init - --advertise-addr {{ swarm_primary_advertise_addr }} - --listen-addr {{ swarm_primary_advertise_addr }}:{{ swarm_api_port }} - --default-addr-pool {{ swarm_default_addr_pool }} - --default-addr-pool-mask-length {{ swarm_default_addr_pool_mask_length }} - register: swarm_init_result - changed_when: swarm_init_result.rc == 0 - failed_when: - - swarm_init_result.rc != 0 - - "'This node is already part of a swarm' not in swarm_init_result.stderr" - when: swarm_primary_state.stdout != 'active' - become: true - delegate_to: "{{ swarm_primary_manager }}" - environment: "{{ swarm_docker_cli_env }}" - tags: [swarm-join] - -- name: Get manager join token - ansible.builtin.command: docker swarm join-token -q manager - changed_when: false - check_mode: false - register: swarm_manager_token - become: true - delegate_to: "{{ swarm_primary_manager }}" - environment: "{{ swarm_docker_cli_env }}" - tags: [swarm-join] - -- name: Get worker join token - ansible.builtin.command: docker swarm join-token -q worker - changed_when: false - check_mode: false - register: swarm_worker_token - become: true - delegate_to: "{{ swarm_primary_manager }}" - environment: "{{ swarm_docker_cli_env }}" - tags: [swarm-join] - -- name: Check secondary manager swarm state - ansible.builtin.command: docker info --format '{{"{{"}} .Swarm.LocalNodeState {{"}}"}}' - changed_when: false - check_mode: 
false - register: swarm_secondary_manager_states - become: true - delegate_to: "{{ item }}" - loop: "{{ swarm_secondary_managers }}" - environment: "{{ swarm_docker_cli_env }}" - tags: [swarm-join] - -- name: Join secondary managers to swarm when not active - ansible.builtin.command: >- - docker swarm join - --token {{ swarm_manager_token.stdout }} - {{ swarm_primary_advertise_addr }}:{{ swarm_api_port }} - register: swarm_join_manager_results - changed_when: swarm_join_manager_results.rc == 0 - failed_when: - - swarm_join_manager_results.rc != 0 - - "'This node is already part of a swarm' not in swarm_join_manager_results.stderr" - become: true - delegate_to: "{{ item.item }}" - loop: "{{ swarm_secondary_manager_states.results }}" - when: item.stdout != 'active' - environment: "{{ swarm_docker_cli_env }}" - tags: [swarm-join] - -- name: Check worker swarm state - ansible.builtin.command: docker info --format '{{"{{"}} .Swarm.LocalNodeState {{"}}"}}' - changed_when: false - check_mode: false - register: swarm_worker_states - become: true - delegate_to: "{{ item }}" - loop: "{{ groups['swarm_workers'] }}" - environment: "{{ swarm_docker_cli_env }}" - tags: [swarm-join] - -- name: Join workers to swarm when not active - ansible.builtin.command: >- - docker swarm join - --token {{ swarm_worker_token.stdout }} - {{ swarm_primary_advertise_addr }}:{{ swarm_api_port }} - register: swarm_join_worker_results - changed_when: swarm_join_worker_results.rc == 0 - failed_when: - - swarm_join_worker_results.rc != 0 - - "'This node is already part of a swarm' not in swarm_join_worker_results.stderr" - become: true - delegate_to: "{{ item.item }}" - loop: "{{ swarm_worker_states.results }}" - when: item.stdout != 'active' - environment: "{{ swarm_docker_cli_env }}" - tags: [swarm-join] - -- name: Collect swarm node hostnames from primary manager - ansible.builtin.command: docker node ls --format '{{"{{"}} .Hostname {{"}}"}}' - changed_when: false - check_mode: false - register: 
swarm_node_hostnames - become: true - delegate_to: "{{ swarm_primary_manager }}" - environment: "{{ swarm_docker_cli_env }}" - tags: [swarm-join] - -- name: Assert expected swarm node count is present on primary manager - ansible.builtin.assert: - that: - - swarm_node_hostnames.stdout_lines | length == swarm_all_nodes | length - fail_msg: >- - Primary manager sees {{ swarm_node_hostnames.stdout_lines | length }} node(s), - expected {{ swarm_all_nodes | length }}. Check Docker context and swarm membership. - success_msg: "Primary manager sees expected swarm node count." - tags: [swarm-join] - -- name: Read existing manager label values by hostname - ansible.builtin.command: >- - docker node inspect - --format '{{"{{"}} with index .Spec.Labels "node.role" {{"}}"}}{{"{{"}} . {{"}}"}}{{"{{"}} end {{"}}"}}' - {{ item }} - changed_when: false - failed_when: false - register: manager_node_labels - become: true - delegate_to: "{{ swarm_primary_manager }}" - loop: "{{ groups['swarm_managers'] }}" - environment: "{{ swarm_docker_cli_env }}" - -- name: Ensure manager role labels are present by hostname - ansible.builtin.command: docker node update --label-add node.role=manager {{ item.item }} - register: manager_label_update - changed_when: manager_label_update.rc == 0 - become: true - delegate_to: "{{ swarm_primary_manager }}" - loop: "{{ manager_node_labels.results }}" - when: item.stdout != 'manager' - environment: "{{ swarm_docker_cli_env }}" - -- name: Read existing worker label values by hostname - ansible.builtin.command: >- - docker node inspect - --format '{{"{{"}} with index .Spec.Labels "node.role" {{"}}"}}{{"{{"}} . 
{{"}}"}}{{"{{"}} end {{"}}"}}' - {{ item }} - changed_when: false - failed_when: false - register: worker_node_labels - become: true - delegate_to: "{{ swarm_primary_manager }}" - loop: "{{ groups['swarm_workers'] }}" - environment: "{{ swarm_docker_cli_env }}" - -- name: Ensure worker role labels are present by hostname - ansible.builtin.command: docker node update --label-add node.role=worker {{ item.item }} - register: worker_label_update - changed_when: worker_label_update.rc == 0 - become: true - delegate_to: "{{ swarm_primary_manager }}" - loop: "{{ worker_node_labels.results }}" - when: item.stdout != 'worker' - environment: "{{ swarm_docker_cli_env }}" - -- name: Show final swarm node table - ansible.builtin.command: docker node ls - changed_when: false - register: swarm_node_ls - become: true - delegate_to: "{{ swarm_primary_manager }}" - environment: "{{ swarm_docker_cli_env }}" - -- name: Print swarm verification output - ansible.builtin.debug: - var: swarm_node_ls.stdout_lines diff --git a/ansible/archive/roles/swarm_cadvisor/defaults/main.yml b/ansible/archive/roles/swarm_cadvisor/defaults/main.yml deleted file mode 100644 index 374885d..0000000 --- a/ansible/archive/roles/swarm_cadvisor/defaults/main.yml +++ /dev/null @@ -1,29 +0,0 @@ ---- -# roles/swarm_cadvisor/defaults/main.yml -# cAdvisor (Container Advisor) exposes container-level metrics - -# === CONCEPT: Container Metrics vs Host Metrics === -# node-exporter → Host CPU/RAM/Disk -# cAdvisor → Per-container CPU/RAM/Network/Disk I/O -# Combined, these give you full visibility into resource usage - -cadvisor_version: "latest" -cadvisor_port: 8080 -cadvisor_container_name: "cadvisor" - -# === SECURITY: Read-Only Docker Socket === -# cAdvisor needs access to Docker to inspect containers -# Mount the socket as READ-ONLY to prevent tampering -cadvisor_volumes: - - "/:/rootfs:ro" - - "/var/run:/var/run:ro" - - "/sys:/sys:ro" - - "/var/lib/docker/:/var/lib/docker:ro" - - "/dev/disk/:/dev/disk:ro" -
-cadvisor_restart_policy: "unless-stopped" - -# === PRO-TIP: Lighter Alternative === -# For Docker-only environments, you can enable Docker's built-in -# metrics endpoint instead: dockerd --metrics-addr=0.0.0.0:9323 -# But cAdvisor provides more detailed per-container breakdowns diff --git a/ansible/archive/roles/swarm_cadvisor/tasks/main.yml b/ansible/archive/roles/swarm_cadvisor/tasks/main.yml deleted file mode 100644 index 7282c74..0000000 --- a/ansible/archive/roles/swarm_cadvisor/tasks/main.yml +++ /dev/null @@ -1,36 +0,0 @@ ---- -# roles/swarm_cadvisor/tasks/main.yml -# Deploy cAdvisor for container-level resource monitoring - -- name: Ensure cAdvisor container is running - community.docker.docker_container: - name: "{{ cadvisor_container_name }}" - image: "gcr.io/cadvisor/cadvisor:{{ cadvisor_version }}" - state: started - restart_policy: "{{ cadvisor_restart_policy }}" - ports: - - "{{ cadvisor_port }}:8080" - volumes: "{{ cadvisor_volumes }}" - privileged: true - # === WHY PRIVILEGED? === - # cAdvisor needs to read cgroup metrics from /sys/fs/cgroup - # This requires elevated permissions. 
In production, consider - # using specific capabilities instead of full privileged mode: - # cap_add: ["SYS_ADMIN"] - devices: - - "/dev/kmsg:/dev/kmsg" - register: cadvisor_container - -- name: Verify cAdvisor is responding - ansible.builtin.uri: - url: "http://localhost:{{ cadvisor_port }}/metrics" - method: GET - status_code: 200 - retries: 3 - delay: 5 - register: cadvisor_health - failed_when: cadvisor_health.status != 200 - -- name: Display cAdvisor endpoint - ansible.builtin.debug: - msg: "✅ cAdvisor is running on {{ ansible_hostname }}:{{ cadvisor_port }}" diff --git a/ansible/archive/roles/swarm_dozzle_agent/defaults/main.yml b/ansible/archive/roles/swarm_dozzle_agent/defaults/main.yml deleted file mode 100644 index 3a0c3d1..0000000 --- a/ansible/archive/roles/swarm_dozzle_agent/defaults/main.yml +++ /dev/null @@ -1,7 +0,0 @@ ---- -# roles/swarm_dozzle_agent/defaults/main.yml -# Global swarm agent for remote Dozzle log collection - -dozzle_agent_service_name: "dozzle-agent" -dozzle_agent_image: "amir20/dozzle:v9.0.1" -dozzle_agent_port: 7007 diff --git a/ansible/archive/roles/swarm_dozzle_agent/tasks/main.yml b/ansible/archive/roles/swarm_dozzle_agent/tasks/main.yml deleted file mode 100644 index a50a808..0000000 --- a/ansible/archive/roles/swarm_dozzle_agent/tasks/main.yml +++ /dev/null @@ -1,41 +0,0 @@ ---- -# roles/swarm_dozzle_agent/tasks/main.yml -# Deploy a global Dozzle agent service so the Watchtower Dozzle UI can read logs from all swarm nodes.
- -- name: Check if dozzle-agent service exists - ansible.builtin.command: docker service ls --format '{{"{{"}}.Name{{"}}"}}' - register: swarm_services - changed_when: false - -- name: Create dozzle-agent swarm service - ansible.builtin.command: >- - docker service create - --name {{ dozzle_agent_service_name }} - --mode global - --restart-condition any - --publish mode=host,target={{ dozzle_agent_port }},published={{ dozzle_agent_port }} - --mount type=bind,src=/var/run/docker.sock,dst=/var/run/docker.sock,ro - {{ dozzle_agent_image }} - agent - when: dozzle_agent_service_name not in swarm_services.stdout_lines - -- name: Update dozzle-agent swarm service - ansible.builtin.command: >- - docker service update - --image {{ dozzle_agent_image }} - --force - {{ dozzle_agent_service_name }} - when: dozzle_agent_service_name in swarm_services.stdout_lines - changed_when: true - -- name: Verify dozzle-agent is running globally - ansible.builtin.command: docker service ps --filter desired-state=running {{ dozzle_agent_service_name }} --format '{{"{{"}}.Node{{"}}"}}' - register: dozzle_agent_ps - changed_when: false - -- name: Assert dozzle-agent replicas running on all nodes - ansible.builtin.assert: - that: - - dozzle_agent_ps.stdout_lines | length >= groups['swarm_hosts'] | length - fail_msg: "Dozzle agents are not running on all swarm nodes yet" - success_msg: "Dozzle agents running across swarm nodes" diff --git a/ansible/archive/roles/swarm_node_exporter/defaults/main.yml b/ansible/archive/roles/swarm_node_exporter/defaults/main.yml deleted file mode 100644 index f357101..0000000 --- a/ansible/archive/roles/swarm_node_exporter/defaults/main.yml +++ /dev/null @@ -1,29 +0,0 @@ ---- -# roles/swarm_node_exporter/defaults/main.yml -# Low-priority variables for node-exporter deployment - -# === CONCEPT: Exporter Configuration === -# node-exporter runs as a lightweight sidecar on each node -# It exposes system metrics on port 9100 for Prometheus to scrape - 
-node_exporter_version: "latest" -node_exporter_port: 9100 -node_exporter_container_name: "node-exporter" - -# === SECURITY: Read-Only Mounts === -# We mount host filesystems as READ-ONLY to prevent -# the exporter from modifying system files -node_exporter_volumes: - - "/proc:/host/proc:ro" - - "/sys:/host/sys:ro" - - "/:/rootfs:ro" - -# === HIGH AVAILABILITY: Restart Policy === -# "unless-stopped" ensures the exporter survives reboots -# but can be manually stopped if needed -node_exporter_restart_policy: "unless-stopped" - -# === BEST PRACTICE: Resource Limits === -# Prevent a single exporter from consuming excessive resources -node_exporter_memory_limit: "128M" -node_exporter_cpu_limit: "0.5" diff --git a/ansible/archive/roles/swarm_node_exporter/tasks/main.yml b/ansible/archive/roles/swarm_node_exporter/tasks/main.yml deleted file mode 100644 index 6a38fea..0000000 --- a/ansible/archive/roles/swarm_node_exporter/tasks/main.yml +++ /dev/null @@ -1,41 +0,0 @@ ---- -# roles/swarm_node_exporter/tasks/main.yml -# Deploy node-exporter on each swarm node for host metrics collection - -- name: Ensure node-exporter container is running - community.docker.docker_container: - name: "{{ node_exporter_container_name }}" - image: "prom/node-exporter:{{ node_exporter_version }}" - state: started - restart_policy: "{{ node_exporter_restart_policy }}" - volumes: "{{ node_exporter_volumes }}" - command: - - '--path.procfs=/host/proc' - - '--path.sysfs=/host/sys' - - '--path.rootfs=/rootfs' - - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)' - network_mode: "host" - # === SECURITY HARDENING === - read_only: true - security_opts: - - no-new-privileges:true - cap_drop: - - ALL - # === RESOURCE LIMITS === - memory: "{{ node_exporter_memory_limit }}" - cpus: "{{ node_exporter_cpu_limit }}" - register: node_exporter_container - -- name: Verify node-exporter is responding - ansible.builtin.uri: - url: "http://localhost:{{ node_exporter_port }}/metrics" - 
method: GET - status_code: 200 - retries: 3 - delay: 5 - register: exporter_health - failed_when: exporter_health.status != 200 - -- name: Display node-exporter endpoint - ansible.builtin.debug: - msg: "✅ node-exporter is running on {{ ansible_hostname }}:{{ node_exporter_port }}" diff --git a/ansible/archive/roles/swarm_overlay_network/defaults/main.yml b/ansible/archive/roles/swarm_overlay_network/defaults/main.yml deleted file mode 100644 index 8b052b9..0000000 --- a/ansible/archive/roles/swarm_overlay_network/defaults/main.yml +++ /dev/null @@ -1,10 +0,0 @@ ---- -# Default variables for swarm_overlay_network role -# Override these in inventory/group_vars when needed. - -swarm_overlay_network_name: "proxy-net" -swarm_overlay_network_subnet: "172.20.0.0/24" -swarm_overlay_network_gateway: "172.20.0.1" -swarm_overlay_network_attachable: true -swarm_overlay_network_internal: false -swarm_overlay_network_mtu: "1500" diff --git a/ansible/archive/roles/swarm_overlay_network/tasks/main.yml b/ansible/archive/roles/swarm_overlay_network/tasks/main.yml deleted file mode 100644 index e212e20..0000000 --- a/ansible/archive/roles/swarm_overlay_network/tasks/main.yml +++ /dev/null @@ -1,45 +0,0 @@ ---- -# Ensure a Swarm-wide overlay network exists for inter-service communication. - -- name: Validate Docker is available - ansible.builtin.command: docker --version - changed_when: false - check_mode: false - -- name: Collect Swarm state from current host - ansible.builtin.command: docker info --format '{{"{{"}} .Swarm.LocalNodeState {{"}}"}}|{{"{{"}} .Swarm.ControlAvailable {{"}}"}}' - register: swarm_state - changed_when: false - check_mode: false - -- name: Fail when host is not an active swarm manager - ansible.builtin.assert: - that: - - swarm_state.stdout is search('active|true') - fail_msg: >- - This role must run on an active Swarm manager. - Current state was: {{ swarm_state.stdout }} - success_msg: "Host is an active Swarm manager."
- -- name: Ensure overlay network exists for swarm services - community.docker.docker_network: - name: "{{ swarm_overlay_network_name }}" - driver: overlay - scope: swarm - attachable: "{{ swarm_overlay_network_attachable }}" - internal: "{{ swarm_overlay_network_internal }}" - ipam_config: - - subnet: "{{ swarm_overlay_network_subnet }}" - gateway: "{{ swarm_overlay_network_gateway }}" - driver_options: - com.docker.network.driver.mtu: "{{ swarm_overlay_network_mtu }}" - state: present - register: swarm_overlay_network_result - -- name: Show network reconciliation result - ansible.builtin.debug: - msg: - - "Overlay network ensured: {{ swarm_overlay_network_name }}" - - "Changed: {{ swarm_overlay_network_result.changed }}" - - "Subnet: {{ swarm_overlay_network_subnet }}" - - "Gateway: {{ swarm_overlay_network_gateway }}" diff --git a/ansible/archive/roles/swarm_stack_deploy/defaults/main.yml b/ansible/archive/roles/swarm_stack_deploy/defaults/main.yml deleted file mode 100644 index 7115c42..0000000 --- a/ansible/archive/roles/swarm_stack_deploy/defaults/main.yml +++ /dev/null @@ -1,25 +0,0 @@ ---- -# Generic, repeatable defaults for Swarm stack deployment. -# Required vars are intentionally empty and must be provided per service. 
- -stack_name: "" -stack_compose_src: "" - -# Optional stack deploy behavior -stack_state: present -stack_validate_only: false -stack_prune: true -stack_with_registry_auth: true - -# Target path on manager node where stack artifacts are stored -stack_deploy_root: "/opt/stacks" -stack_compose_filename: "stack.yml" -stack_env_src: "" -stack_env_filename: ".env" - -# Optional directories to create before deploy (for bind-mount paths) -stack_required_directories: [] - -# Optional external swarm networks that must already exist -stack_required_external_networks: - - "proxy-net" diff --git a/ansible/archive/roles/swarm_stack_deploy/tasks/main.yml b/ansible/archive/roles/swarm_stack_deploy/tasks/main.yml deleted file mode 100644 index da34cfc..0000000 --- a/ansible/archive/roles/swarm_stack_deploy/tasks/main.yml +++ /dev/null @@ -1,154 +0,0 @@ ---- -# Deploy a Swarm stack from a version-controlled compose file using a repeatable pipeline. - -- name: Build derived stack paths - ansible.builtin.set_fact: - stack_target_dir: "{{ stack_deploy_root }}/{{ stack_name }}" - stack_target_compose: "{{ stack_deploy_root }}/{{ stack_name }}/{{ stack_compose_filename }}" - stack_target_env: "{{ stack_deploy_root }}/{{ stack_name }}/{{ stack_env_filename }}" - -- name: Validate required role inputs - ansible.builtin.assert: - that: - - stack_name | length > 0 - - stack_compose_src | length > 0 - fail_msg: "Provide stack_name and stack_compose_src for swarm_stack_deploy role." 
- -- name: Verify source compose exists on control node - ansible.builtin.stat: - path: "{{ stack_compose_src }}" - delegate_to: localhost - register: stack_compose_src_stat - -- name: Fail when source compose is missing - ansible.builtin.assert: - that: - - stack_compose_src_stat.stat.exists - fail_msg: "Compose source file not found on control node: {{ stack_compose_src }}" - -- name: Validate Docker is available on target manager - ansible.builtin.command: docker --version - changed_when: false - -- name: Collect Swarm manager state - ansible.builtin.command: docker info --format '{{"{{"}} .Swarm.LocalNodeState {{"}}"}}|{{"{{"}} .Swarm.ControlAvailable {{"}}"}}' - register: swarm_state - changed_when: false - -- name: Ensure target host is an active manager - ansible.builtin.assert: - that: - # WHY exact equality (not search): 'active' is a substring of 'inactive', so - # search('active') passes on an inactive node. The format string always yields - # 'active|true' for a healthy manager; match that exact string. - - swarm_state.stdout == 'active|true' - fail_msg: >- - Target host must be an active Swarm manager. - Expected 'active|true', got '{{ swarm_state.stdout }}'.
- -- name: Ensure stack target directory exists - become: true - ansible.builtin.file: - path: "{{ stack_target_dir }}" - state: directory - mode: '0755' - -- name: Ensure required bind-mount directories exist - become: true - ansible.builtin.file: - path: "{{ item }}" - state: directory - mode: '0755' - loop: "{{ stack_required_directories if stack_required_directories is not string else stack_required_directories | from_yaml }}" - -- name: Verify required external networks exist - ansible.builtin.command: "docker network inspect {{ item }}" - changed_when: false - loop: "{{ stack_required_external_networks }}" - -- name: Render stack compose from Git source-of-truth to manager - ansible.builtin.template: - src: "{{ stack_compose_src }}" - dest: "{{ stack_target_compose }}" - mode: '0644' - -- name: Copy stack environment file when provided - ansible.builtin.copy: - src: "{{ stack_env_src }}" - dest: "{{ stack_target_env }}" - mode: '0600' - when: stack_env_src | length > 0 - -- name: Validate compose YAML syntax on control node - # WHY local Python parse instead of docker compose config / docker stack --dry-run: - # docker compose v2 CLI plugin is not installed on Swarm nodes. - # docker stack deploy --dry-run is not a valid flag in Docker Engine 29.x. - # A Python YAML parse on the control node is dependency-free, fast, and - # catches all syntax errors before the file is copied to any remote host. 
- ansible.builtin.command: > - python3 -c "import yaml, sys; - yaml.safe_load(open('{{ stack_compose_src }}')); - print('YAML syntax OK: {{ stack_compose_src }}')" - delegate_to: localhost - changed_when: false - register: stack_syntax_check - -- name: Report compose syntax check result - ansible.builtin.debug: - msg: "{{ stack_syntax_check.stdout }}" - -- name: Deploy or reconcile stack desired state - # WHY docker stack deploy instead of community.docker.docker_stack: - # community.docker.docker_stack requires 'jsondiff' pip package on the - # managed node, an unmanaged runtime dep we do not want on Swarm nodes. - # 'docker stack deploy' is idempotent (declarative desired-state), always - # available wherever Docker Engine is installed, and produces clear output - # showing which services were Created vs Updated. - ansible.builtin.command: >- - docker stack deploy - --compose-file {{ stack_target_compose }} - {{ '--with-registry-auth' if stack_with_registry_auth else '' }} - {{ '--prune' if stack_prune else '' }} - {{ stack_name }} - when: - - not stack_validate_only - - stack_state == 'present' - register: stack_deploy_result - changed_when: >- - 'Creating service' in (stack_deploy_result.stdout | default('')) or - 'Updating service' in (stack_deploy_result.stdout | default('')) - -- name: Collect running stack names before removal - # WHY: docker stack rm is not idempotent: it errors if the stack is already gone. - # Querying first lets us skip the task cleanly and report changed=false - # when the desired absent state is already satisfied.
- ansible.builtin.command: > - docker stack ls --format '{{ "{{" }}.Name{{ "}}" }}' - register: _stack_list - changed_when: false - when: - - not stack_validate_only - - stack_state == 'absent' - -- name: Remove stack (state=absent) - ansible.builtin.command: "docker stack rm {{ stack_name }}" - when: - - not stack_validate_only - - stack_state == 'absent' - - stack_name in (_stack_list.stdout_lines | default([])) - changed_when: stack_name in (_stack_list.stdout_lines | default([])) - -- name: Show stack service status - ansible.builtin.command: "docker stack services {{ stack_name }}" - register: stack_services - changed_when: false - failed_when: false - -- name: Report current stack status - ansible.builtin.debug: - msg: - - "Stack: {{ stack_name }}" - - "Validate-only mode: {{ stack_validate_only }}" - - "Compose source: {{ stack_compose_src }}" - - "Compose target: {{ stack_target_compose }}" - - "Service status output:\n{{ stack_services.stdout | default('No output yet') }}" diff --git a/ansible/archive/scripts/ansible_mcp_server.py b/ansible/archive/scripts/ansible_mcp_server.py deleted file mode 100644 index 0545524..0000000 --- a/ansible/archive/scripts/ansible_mcp_server.py +++ /dev/null @@ -1,749 +0,0 @@ -#!/usr/bin/env python3 -"""Ansible MCP server with path guardrails and auditable run records. - -This server is intentionally conservative: -- Playbook execution is restricted to allowlisted directories. -- Write operations require explicit confirmation. -- Background jobs are tracked in a local state directory. 
-""" - -from __future__ import annotations - -import argparse -import hmac -import json -import os -import shlex -import signal -import subprocess -import tempfile -import uuid -from dataclasses import dataclass -from datetime import datetime, timezone -from pathlib import Path -from typing import Any - -from mcp.server.fastmcp import FastMCP - - -def _utc_now() -> str: - return datetime.now(timezone.utc).isoformat() - - -@dataclass(frozen=True) -class ServerConfig: - repo_root: Path - inventory_file: Path - allowed_dirs: tuple[Path, ...] - allowed_playbooks: tuple[str, ...] - api_token: str | None - allow_write: bool - require_confirm_for_write: bool - default_timeout_seconds: int - max_timeout_seconds: int - max_extra_vars_bytes: int - blocked_extra_vars_keys: tuple[str, ...] - state_dir: Path - audit_log_file: Path - - -class JobStore: - def __init__(self, state_dir: Path) -> None: - self.state_dir = state_dir - self.jobs_dir = self.state_dir / "jobs" - self.logs_dir = self.state_dir / "logs" - self.wrap_dir = self.state_dir / "wrappers" - self.state_dir.mkdir(parents=True, exist_ok=True) - self.jobs_dir.mkdir(parents=True, exist_ok=True) - self.logs_dir.mkdir(parents=True, exist_ok=True) - self.wrap_dir.mkdir(parents=True, exist_ok=True) - - def _job_path(self, run_id: str) -> Path: - return self.jobs_dir / f"{run_id}.json" - - def save_job(self, run_id: str, payload: dict[str, Any]) -> None: - self._job_path(run_id).write_text(json.dumps(payload, indent=2), encoding="utf-8") - - def load_job(self, run_id: str) -> dict[str, Any] | None: - path = self._job_path(run_id) - if not path.exists(): - return None - return json.loads(path.read_text(encoding="utf-8")) - - -def _load_config() -> ServerConfig: - repo_root = Path(os.getenv("ANSIBLE_MCP_REPO_ROOT", "/home/chester/homelab/ansible")).resolve() - - inventory_env = os.getenv("ANSIBLE_MCP_INVENTORY", "inventory/hosts.ini") - inventory_file = (repo_root / inventory_env).resolve() - - allowed_raw = 
os.getenv("ANSIBLE_MCP_ALLOWED_PLAYBOOK_DIRS", "playbooks") - allowed_dirs: list[Path] = [] - for item in [p.strip() for p in allowed_raw.split(",") if p.strip()]: - allowed_dirs.append((repo_root / item).resolve()) - - allowlisted_playbooks_raw = os.getenv("ANSIBLE_MCP_ALLOWED_PLAYBOOKS", "") - allowed_playbooks = tuple( - p.strip() for p in allowlisted_playbooks_raw.split(",") if p.strip() - ) - - api_token_raw = os.getenv("ANSIBLE_MCP_API_TOKEN", "").strip() - api_token = api_token_raw if api_token_raw else None - - allow_write = os.getenv("ANSIBLE_MCP_ALLOW_WRITE", "false").lower() == "true" - require_confirm_for_write = os.getenv("ANSIBLE_MCP_REQUIRE_CONFIRM", "true").lower() == "true" - - default_timeout_seconds = int(os.getenv("ANSIBLE_MCP_DEFAULT_TIMEOUT", "900")) - max_timeout_seconds = int(os.getenv("ANSIBLE_MCP_MAX_TIMEOUT", "3600")) - max_extra_vars_bytes = int(os.getenv("ANSIBLE_MCP_MAX_EXTRA_VARS_BYTES", "16384")) - - blocked_extra_vars_raw = os.getenv("ANSIBLE_MCP_BLOCKED_EXTRA_VARS_KEYS", "") - blocked_extra_vars_keys = tuple( - p.strip().lower() for p in blocked_extra_vars_raw.split(",") if p.strip() - ) - - state_dir = Path(os.getenv("ANSIBLE_MCP_STATE_DIR", "/var/lib/ansible-mcp")).resolve() - audit_log_file = ( - Path(os.getenv("ANSIBLE_MCP_AUDIT_LOG_FILE", "") or state_dir / "audit" / "events.jsonl") - .resolve() - ) - - return ServerConfig( - repo_root=repo_root, - inventory_file=inventory_file, - allowed_dirs=tuple(allowed_dirs), - allowed_playbooks=allowed_playbooks, - api_token=api_token, - allow_write=allow_write, - require_confirm_for_write=require_confirm_for_write, - default_timeout_seconds=default_timeout_seconds, - max_timeout_seconds=max_timeout_seconds, - max_extra_vars_bytes=max_extra_vars_bytes, - blocked_extra_vars_keys=blocked_extra_vars_keys, - state_dir=state_dir, - audit_log_file=audit_log_file, - ) - - -def _is_relative_to(candidate: Path, parent: Path) -> bool: - try: - candidate.relative_to(parent) - return True - except 
ValueError: - return False - - -def _resolve_allowed_playbook(config: ServerConfig, playbook: str) -> Path: - candidate = (config.repo_root / playbook).resolve() - if not candidate.exists(): - raise ValueError(f"Playbook does not exist: {playbook}") - if not candidate.is_file(): - raise ValueError(f"Playbook path is not a file: {playbook}") - - relative_playbook = str(candidate.relative_to(config.repo_root).as_posix()) - - if config.allowed_playbooks: - if relative_playbook not in config.allowed_playbooks: - allow_text = ", ".join(config.allowed_playbooks) - raise ValueError( - f"Playbook is not in explicit allowlist: {relative_playbook}. " - f"Allowed playbooks: {allow_text}" - ) - return candidate - - if not any(_is_relative_to(candidate, allowed) for allowed in config.allowed_dirs): - allowed_text = ", ".join(str(p) for p in config.allowed_dirs) - raise ValueError( - f"Playbook path is outside allowed directories: {playbook}. " - f"Allowed roots: {allowed_text}" - ) - - return candidate - - -def _sanitize_timeout(config: ServerConfig, timeout_seconds: int | None) -> int: - value = timeout_seconds if timeout_seconds is not None else config.default_timeout_seconds - if value <= 0: - raise ValueError("timeout_seconds must be greater than 0") - if value > config.max_timeout_seconds: - raise ValueError( - f"timeout_seconds exceeds maximum allowed ({config.max_timeout_seconds})" - ) - return value - - -def _redact_payload(value: Any) -> Any: - if isinstance(value, dict): - redacted: dict[str, Any] = {} - for key, item in value.items(): - key_l = str(key).lower() - if any(marker in key_l for marker in ("token", "secret", "password", "key")): - redacted[key] = "[REDACTED]" - else: - redacted[key] = _redact_payload(item) - return redacted - if isinstance(value, list): - return [_redact_payload(v) for v in value] - return value - - -def _audit_event(event: str, payload: dict[str, Any]) -> None: - record = { - "timestamp": _utc_now(), - "event": event, - "payload": 
_redact_payload(payload), - } - CONFIG.audit_log_file.parent.mkdir(parents=True, exist_ok=True) - with CONFIG.audit_log_file.open("a", encoding="utf-8") as fh: - fh.write(json.dumps(record) + "\n") - - -def _require_auth(auth_token: str | None) -> None: - if not CONFIG.api_token: - return - provided = (auth_token or "").strip() - if not provided: - raise ValueError("Authentication required: provide auth_token") - if not hmac.compare_digest(provided, CONFIG.api_token): - raise ValueError("Authentication failed: invalid auth_token") - - -def _collect_keys(node: Any, sink: set[str]) -> None: - if isinstance(node, dict): - for key, value in node.items(): - sink.add(str(key).lower()) - _collect_keys(value, sink) - elif isinstance(node, list): - for value in node: - _collect_keys(value, sink) - - -def _validate_extra_vars(extra_vars: dict[str, Any] | None) -> None: - if not extra_vars: - return - - encoded = json.dumps(extra_vars) - if len(encoded.encode("utf-8")) > CONFIG.max_extra_vars_bytes: - raise ValueError( - f"extra_vars payload exceeds max size ({CONFIG.max_extra_vars_bytes} bytes)" - ) - - if CONFIG.blocked_extra_vars_keys: - keys: set[str] = set() - _collect_keys(extra_vars, keys) - blocked = sorted(k for k in keys if k in CONFIG.blocked_extra_vars_keys) - if blocked: - blocked_text = ", ".join(blocked) - raise ValueError(f"extra_vars contains blocked keys: {blocked_text}") - - -def _build_command( - config: ServerConfig, - playbook_path: Path, - limit: str | None, - tags: str | None, - skip_tags: str | None, - check_mode: bool, - extra_vars_file: Path | None, -) -> list[str]: - cmd = [ - "ansible-playbook", - "-i", - str(config.inventory_file), - str(playbook_path), - ] - - if limit: - cmd.extend(["--limit", limit]) - if tags: - cmd.extend(["--tags", tags]) - if skip_tags: - cmd.extend(["--skip-tags", skip_tags]) - if check_mode: - cmd.append("--check") - if extra_vars_file is not None: - cmd.extend(["--extra-vars", f"@{extra_vars_file}"]) - - return cmd - - 
-CONFIG = _load_config() -STORE = JobStore(CONFIG.state_dir) -mcp = FastMCP( - "ansible-mcp", - host=os.getenv("ANSIBLE_MCP_HOST", "127.0.0.1"), - port=int(os.getenv("ANSIBLE_MCP_PORT", "8449")), - streamable_http_path="/mcp", -) - - -@mcp.tool() -def health() -> dict[str, Any]: - """Return server health and effective runtime configuration.""" - return { - "ok": True, - "server": "ansible-mcp", - "timestamp": _utc_now(), - "repo_root": str(CONFIG.repo_root), - "inventory_file": str(CONFIG.inventory_file), - "allowed_dirs": [str(p) for p in CONFIG.allowed_dirs], - "allowed_playbooks": list(CONFIG.allowed_playbooks), - "allow_write": CONFIG.allow_write, - "require_confirm_for_write": CONFIG.require_confirm_for_write, - "auth_enabled": CONFIG.api_token is not None, - "max_extra_vars_bytes": CONFIG.max_extra_vars_bytes, - "blocked_extra_vars_keys": list(CONFIG.blocked_extra_vars_keys), - "state_dir": str(CONFIG.state_dir), - } - - -@mcp.tool() -def list_inventory(limit: str | None = None, auth_token: str | None = None) -> dict[str, Any]: - """Return inventory graph information from ansible-inventory --list.""" - _require_auth(auth_token) - cmd = ["ansible-inventory", "-i", str(CONFIG.inventory_file), "--list"] - - if limit: - cmd.extend(["--limit", limit]) - - result = subprocess.run( - cmd, - cwd=CONFIG.repo_root, - capture_output=True, - text=True, - timeout=60, - check=False, - ) - - payload: dict[str, Any] = { - "ok": result.returncode == 0, - "returncode": result.returncode, - "stderr": result.stderr, - "command": " ".join(shlex.quote(c) for c in cmd), - } - - if result.returncode == 0: - try: - payload["inventory"] = json.loads(result.stdout) - except json.JSONDecodeError: - payload["ok"] = False - payload["error"] = "ansible-inventory returned non-JSON output" - payload["raw_stdout"] = result.stdout - else: - payload["stdout"] = result.stdout - - _audit_event( - "list_inventory", - {"limit": limit, "returncode": result.returncode, "ok": payload["ok"]}, - ) - 
return payload - - -@mcp.tool() -def validate_syntax( - playbook: str, - limit: str | None = None, - extra_vars: dict[str, Any] | None = None, - auth_token: str | None = None, -) -> dict[str, Any]: - """Run ansible-playbook --syntax-check on an allowlisted playbook.""" - _require_auth(auth_token) - playbook_path = _resolve_allowed_playbook(CONFIG, playbook) - _validate_extra_vars(extra_vars) - - extra_vars_file: Path | None = None - with tempfile.NamedTemporaryFile(mode="w", suffix=".json", delete=False) as tf: - if extra_vars: - json.dump(extra_vars, tf) - tf.flush() - extra_vars_file = Path(tf.name) - - try: - cmd = _build_command( - config=CONFIG, - playbook_path=playbook_path, - limit=limit, - tags=None, - skip_tags=None, - check_mode=False, - extra_vars_file=extra_vars_file, - ) - cmd.append("--syntax-check") - - result = subprocess.run( - cmd, - cwd=CONFIG.repo_root, - capture_output=True, - text=True, - timeout=120, - check=False, - ) - - payload = { - "ok": result.returncode == 0, - "returncode": result.returncode, - "stdout": result.stdout, - "stderr": result.stderr, - "command": " ".join(shlex.quote(c) for c in cmd), - "playbook": str(playbook_path.relative_to(CONFIG.repo_root)), - } - _audit_event( - "validate_syntax", - { - "playbook": payload["playbook"], - "limit": limit, - "returncode": payload["returncode"], - "ok": payload["ok"], - }, - ) - return payload - finally: - if extra_vars_file and extra_vars_file.exists(): - extra_vars_file.unlink(missing_ok=True) - - -@mcp.tool() -def run_playbook( - playbook: str, - limit: str | None = None, - extra_vars: dict[str, Any] | None = None, - tags: str | None = None, - skip_tags: str | None = None, - check_mode: bool = True, - confirm: bool = False, - timeout_seconds: int | None = None, - background: bool = False, - auth_token: str | None = None, -) -> dict[str, Any]: - """Run an allowlisted playbook with guardrails and run tracking. - - Safety model: - - check_mode defaults to true. 
- - write operations require allow_write=true and confirm=true. - """ - _require_auth(auth_token) - playbook_path = _resolve_allowed_playbook(CONFIG, playbook) - _validate_extra_vars(extra_vars) - safe_timeout = _sanitize_timeout(CONFIG, timeout_seconds) - - is_write = not check_mode - if is_write and not CONFIG.allow_write: - payload = { - "ok": False, - "error": "Write operations are disabled (ANSIBLE_MCP_ALLOW_WRITE=false)", - "hint": "Set check_mode=true or enable ANSIBLE_MCP_ALLOW_WRITE", - } - _audit_event( - "run_playbook_denied", - { - "playbook": str(playbook_path.relative_to(CONFIG.repo_root)), - "reason": "write_disabled", - "check_mode": check_mode, - }, - ) - return payload - - if is_write and CONFIG.require_confirm_for_write and not confirm: - payload = { - "ok": False, - "error": "Write operation requires explicit confirm=true", - "hint": "Retry with confirm=true after review", - } - _audit_event( - "run_playbook_denied", - { - "playbook": str(playbook_path.relative_to(CONFIG.repo_root)), - "reason": "missing_confirm", - "check_mode": check_mode, - }, - ) - return payload - - run_id = str(uuid.uuid4()) - started_at = _utc_now() - - extra_vars_file: Path | None = None - with tempfile.NamedTemporaryFile(mode="w", suffix=".json", delete=False) as tf: - if extra_vars: - json.dump(extra_vars, tf) - tf.flush() - extra_vars_file = Path(tf.name) - - cmd = _build_command( - config=CONFIG, - playbook_path=playbook_path, - limit=limit, - tags=tags, - skip_tags=skip_tags, - check_mode=check_mode, - extra_vars_file=extra_vars_file, - ) - - base_job = { - "run_id": run_id, - "playbook": str(playbook_path.relative_to(CONFIG.repo_root)), - "check_mode": check_mode, - "confirm": confirm, - "limit": limit, - "tags": tags, - "skip_tags": skip_tags, - "timeout_seconds": safe_timeout, - "command": " ".join(shlex.quote(c) for c in cmd), - "started_at": started_at, - } - - try: - if background: - log_file = STORE.logs_dir / f"{run_id}.log" - done_file = STORE.jobs_dir / 
f"{run_id}.done.json" - wrapper = STORE.wrap_dir / f"{run_id}.sh" - - script = "\n".join( - [ - "#!/usr/bin/env bash", - "set -o pipefail", - f"cd {shlex.quote(str(CONFIG.repo_root))}", - "{ " + " ".join(shlex.quote(c) for c in cmd) + "; }", - "rc=$?", - "python3 - <<'PY'", - "import json", - "from datetime import datetime, timezone", - f"done_file = {str(done_file)!r}", - "payload = {", - " 'completed_at': datetime.now(timezone.utc).isoformat(),", - " 'returncode': rc,", - "}", - "with open(done_file, 'w', encoding='utf-8') as f:", - " json.dump(payload, f)", - "PY", - "exit $rc", - ] - ) - wrapper.write_text(script, encoding="utf-8") - wrapper.chmod(0o750) - - log_handle = log_file.open("w", encoding="utf-8") - proc = subprocess.Popen( - [str(wrapper)], - cwd=CONFIG.repo_root, - stdout=log_handle, - stderr=subprocess.STDOUT, - start_new_session=True, - ) - log_handle.close() - - payload = { - **base_job, - "background": True, - "status": "running", - "pid": proc.pid, - "log_file": str(log_file), - "done_file": str(done_file), - } - STORE.save_job(run_id, payload) - _audit_event( - "run_playbook_background_started", - { - "run_id": run_id, - "playbook": payload["playbook"], - "check_mode": check_mode, - "pid": proc.pid, - }, - ) - - return { - "ok": True, - "run_id": run_id, - "status": "running", - "pid": proc.pid, - "log_file": str(log_file), - "message": "Playbook started in background", - } - - result = subprocess.run( - cmd, - cwd=CONFIG.repo_root, - capture_output=True, - text=True, - timeout=safe_timeout, - check=False, - ) - - completed_payload = { - **base_job, - "background": False, - "status": "succeeded" if result.returncode == 0 else "failed", - "returncode": result.returncode, - "completed_at": _utc_now(), - "stdout": result.stdout, - "stderr": result.stderr, - } - STORE.save_job(run_id, completed_payload) - _audit_event( - "run_playbook_completed", - { - "run_id": run_id, - "playbook": completed_payload["playbook"], - "status": 
completed_payload["status"], - "returncode": result.returncode, - }, - ) - - return { - "ok": result.returncode == 0, - "run_id": run_id, - "status": completed_payload["status"], - "returncode": result.returncode, - "stdout": result.stdout, - "stderr": result.stderr, - "playbook": completed_payload["playbook"], - "command": completed_payload["command"], - } - except subprocess.TimeoutExpired as err: - timed_out_payload = { - **base_job, - "background": False, - "status": "timed_out", - "completed_at": _utc_now(), - "stdout": err.stdout, - "stderr": err.stderr, - } - STORE.save_job(run_id, timed_out_payload) - _audit_event( - "run_playbook_timed_out", - { - "run_id": run_id, - "playbook": timed_out_payload["playbook"], - "timeout_seconds": safe_timeout, - }, - ) - - return { - "ok": False, - "run_id": run_id, - "status": "timed_out", - "timeout_seconds": safe_timeout, - "stdout": err.stdout, - "stderr": err.stderr, - "message": "Playbook exceeded timeout", - } - finally: - if extra_vars_file and extra_vars_file.exists(): - extra_vars_file.unlink(missing_ok=True) - - -@mcp.tool() -def get_job_status( - run_id: str, - tail_lines: int = 80, - auth_token: str | None = None, -) -> dict[str, Any]: - """Get status and recent logs for a tracked run_id.""" - _require_auth(auth_token) - if tail_lines <= 0: - raise ValueError("tail_lines must be greater than 0") - - job = STORE.load_job(run_id) - if not job: - return {"ok": False, "error": f"Unknown run_id: {run_id}"} - - if job.get("background"): - done_file = Path(job["done_file"]) - if done_file.exists(): - done_payload = json.loads(done_file.read_text(encoding="utf-8")) - job["status"] = "succeeded" if done_payload["returncode"] == 0 else "failed" - job["returncode"] = done_payload["returncode"] - job["completed_at"] = done_payload["completed_at"] - STORE.save_job(run_id, job) - else: - pid = int(job.get("pid", 0)) - if pid > 0: - try: - os.kill(pid, 0) - job["status"] = "running" - except OSError: - job["status"] = 
"unknown" - else: - job["status"] = "unknown" - - response = {"ok": True, **job} - - log_file = job.get("log_file") - if log_file and Path(log_file).exists(): - lines = Path(log_file).read_text(encoding="utf-8", errors="replace").splitlines() - response["log_tail"] = lines[-tail_lines:] - - _audit_event( - "get_job_status", - {"run_id": run_id, "status": response.get("status")}, - ) - return response - - -@mcp.tool() -def cancel_job(run_id: str, auth_token: str | None = None) -> dict[str, Any]: - """Cancel a running background job.""" - _require_auth(auth_token) - job = STORE.load_job(run_id) - if not job: - return {"ok": False, "error": f"Unknown run_id: {run_id}"} - - if not job.get("background"): - return {"ok": False, "error": "cancel_job is only valid for background jobs"} - - if job.get("status") not in {"running", "unknown"}: - return {"ok": False, "error": f"Job is not running (status={job.get('status')})"} - - pid = int(job.get("pid", 0)) - if pid <= 0: - return {"ok": False, "error": "Job PID is invalid"} - - try: - os.killpg(pid, signal.SIGTERM) - except ProcessLookupError: - return {"ok": False, "error": "Process does not exist"} - except PermissionError as err: - return {"ok": False, "error": f"Permission denied terminating process group: {err}"} - - job["status"] = "cancelled" - job["completed_at"] = _utc_now() - STORE.save_job(run_id, job) - - payload = {"ok": True, "run_id": run_id, "status": "cancelled"} - _audit_event("cancel_job", payload) - return payload - - -def main() -> None: - parser = argparse.ArgumentParser(description="Run the Ansible MCP server") - parser.add_argument( - "--transport", - choices=["stdio", "streamable-http"], - default=os.getenv("ANSIBLE_MCP_TRANSPORT", "stdio"), - help="MCP transport to use", - ) - parser.add_argument( - "--host", - default=os.getenv("ANSIBLE_MCP_HOST", "0.0.0.0"), - help="Host for streamable-http transport", - ) - parser.add_argument( - "--port", - type=int, - default=int(os.getenv("ANSIBLE_MCP_PORT", 
"8449")), - help="Port for streamable-http transport", - ) - args = parser.parse_args() - - # FastMCP transport settings are configured on the server object in this SDK version. - mcp.settings.host = args.host - mcp.settings.port = args.port - - if args.transport == "stdio": - mcp.run(transport="stdio") - else: - mcp.run(transport="streamable-http") - - -if __name__ == "__main__": - main() diff --git a/ansible/archive/scripts/audit_prune_gitea_runners.py b/ansible/archive/scripts/audit_prune_gitea_runners.py deleted file mode 100644 index 59df959..0000000 --- a/ansible/archive/scripts/audit_prune_gitea_runners.py +++ /dev/null @@ -1,187 +0,0 @@ -#!/usr/bin/env python3 -""" -Audit and optionally prune Gitea Actions runners. - -This script is intentionally conservative: -- It can parse a simple text/CSV listing of runners (id,status,name,last_online) or - accept manual runner entries. -- It will by default run in dry-run mode and print the curl commands needed to - delete selected runners. To actually perform deletions supply --prune and - provide environment variables `GITEA_URL` and `GITEA_TOKEN`. - -USAGE (dry-run): - python3 audit_prune_gitea_runners.py --input runners.txt --threshold-hours 24 - -USAGE (execute): - GITEA_URL="https://gitea.example.com" GITEA_TOKEN="" \ - python3 audit_prune_gitea_runners.py --input runners.txt --threshold-hours 24 --prune - -Notes: -- Verify the generated DELETE endpoint before running. Gitea Action Runners API paths - vary by version; the script assumes a generic endpoint: /api/v1/actions/runners/{id} - If your Gitea uses a different path, run in dry-run and adjust commands accordingly. -- Token must have admin privileges to remove runners. -""" - -import argparse -import csv -import datetime -import os -import shlex -import subprocess -from typing import List, Dict - - -def parse_input(path: str) -> List[Dict]: - """Parse a simple CSV or whitespace-delimited file containing runner info. 
- - Expected columns (header optional): id,name,status,last_online - last_online should be ISO or human-friendly; we attempt flexible parse by - treating empty as unknown (safe: don't delete). - """ - rows = [] - with open(path, 'r') as f: - sample = f.read() - # Try CSV parse first - try: - with open(path, newline='') as csvfile: - reader = csv.DictReader(csvfile) - # If headerless, DictReader will treat first line as header; fallback - for r in reader: - rows.append({k.strip(): v.strip() for k, v in r.items()}) - if rows: - return rows - except Exception: - pass - - # Fallback: parse whitespace lines like "Status\tID\tName\tLastOnline" or - # entries separated by multiple spaces / tabs. We'll extract numeric id and status. - for line in sample.splitlines(): - line = line.strip() - if not line: - continue - parts = [p for p in line.split() if p] - # try to find numeric token for id - id_token = None - for p in parts: - if p.isdigit(): - id_token = p - break - status = 'unknown' - name = '' - last_online = '' - if 'Idle' in line or 'idle' in line: - status = 'idle' - if 'Offline' in line or 'offline' in line: - status = 'offline' - # best effort name - if len(parts) >= 4: - name = parts[-2] - last_online = parts[-1] - rows.append({'id': id_token or '', 'name': name, 'status': status, 'last_online': last_online}) - return rows - - -def candidates(rows: List[Dict], threshold_hours: int) -> List[Dict]: - out = [] - now = datetime.datetime.utcnow() - for r in rows: - status = r.get('status', '').lower() - if status != 'offline': - continue - lo = r.get('last_online') or r.get('Last Online Time') or '' - # Try parse common formats, fallback to include (safe) - parsed = None - for fmt in ('%Y-%m-%dT%H:%M:%S', '%Y-%m-%d %H:%M:%S', '%Y-%m-%d', '%b %d', '%Y-%m-%dT%H:%M:%SZ'): - try: - parsed = datetime.datetime.strptime(lo, fmt) - break - except Exception: - parsed = None - if parsed is None: - # If no parse, assume offline candidate (but include in output for manual 
review) - out.append(r) - continue - if (now - parsed).total_seconds() >= threshold_hours * 3600: - out.append(r) - return out - - -def gen_delete_command(gitea_url: str, token: str, runner_id: str) -> str: - # Default assumed endpoint -- verify for your Gitea version. - endpoint = f"{gitea_url.rstrip('/')}/api/v1/actions/runners/{runner_id}" - cmd = f"curl -sS -X DELETE -H 'Authorization: token {token}' '{endpoint}'" - return cmd - - -def main(): - p = argparse.ArgumentParser() - p.add_argument('--input', required=True, help='Path to runner list (CSV or text)') - p.add_argument('--threshold-hours', type=int, default=24, help='Consider offline > N hours') - p.add_argument('--prune', action='store_true', help='If set, execute deletions (requires GITEA_URL and GITEA_TOKEN)') - p.add_argument('--confirm-phrase', help='Required exact confirmation phrase to actually prune') - args = p.parse_args() - - rows = parse_input(args.input) - if not rows: - print('No runner rows parsed from', args.input) - return 1 - - print(f'Parsed {len(rows)} runner entries. Threshold: {args.threshold_hours}h') - cand = candidates(rows, args.threshold_hours) - if not cand: - print('No candidates found for pruning.') - return 0 - - print('\nCandidates for removal:') - for r in cand: - print('-', r) - - if not args.prune: - print('\nDRY RUN: To execute deletions, re-run with --prune and set GITEA_URL and GITEA_TOKEN environment variables.') - print('Example delete commands (verify endpoint before running):') - gitea = os.environ.get('GITEA_URL', 'https://gitea.example.com') - token = os.environ.get('GITEA_TOKEN', '') - for r in cand: - rid = r.get('id') or r.get('ID') - if not rid: - print('# missing id for entry:', r) - continue - print(gen_delete_command(gitea, token, rid)) - return 0 - - # prune requested - phrase = args.confirm_phrase or '' - expected = f'CONFIRM PRUNE RUNNERS: {datetime.date.today().isoformat()}' - if phrase != expected: - print('Refusing to prune. 
To prune, pass --confirm-phrase with exact phrase:') - print(expected) - return 2 - - gitea = os.environ.get('GITEA_URL') - token = os.environ.get('GITEA_TOKEN') - if not gitea or not token: - print('GITEA_URL and GITEA_TOKEN environment variables required to perform deletions.') - return 3 - - # execute deletions - for r in cand: - rid = r.get('id') or r.get('ID') - if not rid: - print('Skipping entry missing id:', r) - continue - cmd = gen_delete_command(gitea, token, rid) - print('Executing:', cmd) - try: - res = subprocess.run(cmd, shell=True, check=False, capture_output=True, text=True) - print('Return code:', res.returncode) - print('stdout:', res.stdout) - print('stderr:', res.stderr) - except Exception as e: - print('Error executing command for', rid, e) - - return 0 - - -if __name__ == '__main__': - raise SystemExit(main()) diff --git a/ansible/archive/scripts/day0bootstrap.sh b/ansible/archive/scripts/day0bootstrap.sh deleted file mode 100644 index f638294..0000000 --- a/ansible/archive/scripts/day0bootstrap.sh +++ /dev/null @@ -1,438 +0,0 @@ -#!/bin/bash - -# ============================================================================== -# LEAD ARCHITECT -- DAY 0 BOOTSTRAP -# Pre-Ansible Environment Setup Script (Debian/Ubuntu) -# Purpose: Prepare a fresh host for Ansible management -# ============================================================================== - -set -euo pipefail # Exit on error, undefined vars, pipe failures - -# ============================================================================== -# CONFIGURATION -# ============================================================================== -PROXMOX_IP="${1:-}" # pass as arg1 or prompt interactively -PROXMOX_USER="${2:-root}" # <--- CHANGE ME or pass as arg2 -PROXMOX_HOSTNAME="${3:-}" # <--- OPTIONAL: pass as arg3 -PROXMOX_PORT="${4:-22}" # <--- OPTIONAL: SSH port - -# Always resolve paths from script location for deterministic behavior. 
-SCRIPT_DIR="$(cd -- "$(dirname -- "${BASH_SOURCE[0]}")" && pwd)" -ANSIBLE_ROOT="$(cd -- "$SCRIPT_DIR/.." && pwd)" - -SSH_KEY_PATH="$HOME/.ssh/id_ed25519" -KNOWN_HOSTS="$HOME/.ssh/known_hosts" -INVENTORY_FILE="${INVENTORY_PATH:-$ANSIBLE_ROOT/inventory/hosts.ini}" -ACTIVE_INVENTORY="$INVENTORY_FILE" -TARGET_GROUP="proxmox_nodes" -INTEGRATION_REQUIRED="false" -INVENTORY_HAS_TARGET="false" -EFFECTIVE_VERIFY_USER="$PROXMOX_USER" -EXISTING_ALIAS="" - -# Prompt for IP when not provided (interactive shells only). -if [ -z "$PROXMOX_IP" ]; then - if [ -t 0 ]; then - read -r -p "Enter Proxmox host IP address: " PROXMOX_IP - else - log_error "PROXMOX_IP is required in non-interactive mode. Usage: ./day0bootstrap.sh [user] [hostname] [port]" - exit 1 - fi -fi - -# Validate IPv4 format to fail early on typos. -if ! [[ "$PROXMOX_IP" =~ ^([0-9]{1,3}\.){3}[0-9]{1,3}$ ]]; then - log_error "Invalid IPv4 address format: $PROXMOX_IP" - exit 1 -fi - -IFS='.' read -r -a _ip_octets <<< "$PROXMOX_IP" -for _octet in "${_ip_octets[@]}"; do - if [ "$_octet" -lt 0 ] || [ "$_octet" -gt 255 ]; then - log_error "Invalid IPv4 octet in address: $PROXMOX_IP" - exit 1 - fi -done - -# Logging functions -log_step() { - echo "" - echo "[⚙ STEP] $1" -} - -log_success() { - echo "[✓ OK] $1" -} - -log_error() { - echo "[✗ ERROR] $1" >&2 -} - -log_warning() { - echo "[⚠ WARNING] $1" -} - -# Error trap -trap 'log_error "Bootstrap failed at line $LINENO"; exit 1' ERR - -# ============================================================================== -# SECTION 1: PRE-FLIGHT VALIDATION -# ============================================================================== - -log_step "PRE-FLIGHT VALIDATION" - -# 1.1 Verify target is reachable -if ! ping -c 1 -W 2 "$PROXMOX_IP" &>/dev/null; then - log_error "Cannot reach $PROXMOX_IP. Check network and IP address." 
- exit 1 -fi -log_success "Host $PROXMOX_IP is reachable" - -# 1.2 Check if we can resolve target hostname -if [ -z "$PROXMOX_HOSTNAME" ]; then - if PROXMOX_HOSTNAME=$(getent hosts "$PROXMOX_IP" | awk '{print $2}'); then - log_success "Resolved hostname from /etc/hosts: $PROXMOX_HOSTNAME" - else - PROXMOX_HOSTNAME="proxmox-${PROXMOX_IP##*.}" - log_warning "No hostname resolved; using fallback alias: $PROXMOX_HOSTNAME" - fi -else - log_success "Using provided hostname: $PROXMOX_HOSTNAME" -fi - -# 1.3 Detect local IP (for context) -LOCAL_IP=$(hostname -I | awk '{print $1}') -log_success "Local IP detected: $LOCAL_IP" - -# ============================================================================== -# SECTION 2: SSH KEY MANAGEMENT -# ============================================================================== - -log_step "SSH KEY MANAGEMENT" - -# 2.1 Create .ssh directory if needed -mkdir -p "$HOME/.ssh" -chmod 700 "$HOME/.ssh" -log_success "SSH directory ready: $HOME/.ssh" - -# 2.2 Check for existing keys (ED25519 > RSA preference) -if [ -f "$SSH_KEY_PATH" ]; then - log_success "Found existing ED25519 key at $SSH_KEY_PATH" -elif [ -f "$HOME/.ssh/id_rsa" ]; then - SSH_KEY_PATH="$HOME/.ssh/id_rsa" - log_success "Found existing RSA key; using as fallback: $SSH_KEY_PATH" -else - log_warning "No SSH key found. Generating new ED25519 keypair..." - ssh-keygen -t ed25519 -f "$SSH_KEY_PATH" -N "" -C "ansible@$(hostname)" - chmod 600 "$SSH_KEY_PATH" - chmod 644 "${SSH_KEY_PATH}.pub" - log_success "Generated new key: $SSH_KEY_PATH" -fi - -# ============================================================================== -# SECTION 3: SSH TRUST (Option A: ssh-keyscan) -# ============================================================================== - -log_step "SSH TRUST ESTABLISHMENT (ssh-keyscan)" - -# 3.1 Create known_hosts if missing -if [ ! 
-f "$KNOWN_HOSTS" ]; then - touch "$KNOWN_HOSTS" - chmod 600 "$KNOWN_HOSTS" - log_success "Created new $KNOWN_HOSTS" -fi - -# 3.2 Remove old host key for this IP (if exists) to avoid conflicts -if grep -q "^$PROXMOX_IP " "$KNOWN_HOSTS" 2>/dev/null; then - log_warning "Removing outdated host key for $PROXMOX_IP from known_hosts..." - ssh-keygen -f "$KNOWN_HOSTS" -R "$PROXMOX_IP" >/dev/null 2>&1 || true -fi - -# 3.3 Scan and add new host key -log_warning "Scanning remote host key (this may take a few seconds)..." -if ssh-keyscan -p "$PROXMOX_PORT" -H "$PROXMOX_IP" >> "$KNOWN_HOSTS" 2>/dev/null; then - log_success "Host key added to known_hosts" -else - log_error "Failed to scan host key. Verify target is running SSH." - exit 1 -fi - -# 3.4 Transfer public key via ssh-copy-id -log_warning "Transferring SSH public key to $PROXMOX_USER@$PROXMOX_IP..." -if ssh-copy-id -i "${SSH_KEY_PATH}.pub" \ - -o StrictHostKeyChecking=accept-new \ - -o ConnectTimeout=5 \ - -p "$PROXMOX_PORT" \ - "${PROXMOX_USER}@${PROXMOX_IP}" 2>&1; then - log_success "Public key installed on remote host" -else - log_error "Failed to copy public key. Verify SSH credentials and connectivity." - exit 1 -fi - -# ============================================================================== -# SECTION 4: PACKAGE INSTALLATION -# ============================================================================== - -log_step "PACKAGE INSTALLATION (Debian/Ubuntu)" - -# 4.1 Update package lists -log_warning "Updating package lists..." -sudo apt-get update -qq -log_success "Package lists updated" - -# 4.2 Install Ansible (skip if already installed) -if command -v ansible &>/dev/null; then - ANSIBLE_VERSION=$(ansible --version | head -n1) - log_success "Ansible already installed: $ANSIBLE_VERSION" -else - log_warning "Installing Ansible..." 
- sudo apt-get install -y -qq ansible - log_success "Ansible installed" -fi - -# 4.3 Install Python3-pip (skip if present) -if command -v pip3 &>/dev/null; then - log_success "python3-pip already installed" -else - log_warning "Installing python3-pip..." - sudo apt-get install -y -qq python3-pip - log_success "python3-pip installed" -fi - -# 4.4 Install additional tools (git, curl, jq) -TOOLS=("git" "curl" "jq") -for tool in "${TOOLS[@]}"; do - if command -v "$tool" &>/dev/null; then - log_success "$tool already installed" - else - log_warning "Installing $tool..." - sudo apt-get install -y -qq "$tool" - log_success "$tool installed" - fi -done - -# 4.5 Install Proxmoxer (Python library) -log_warning "Installing Proxmoxer (Python library)..." -if pip3 install --quiet proxmoxer 2>/dev/null; then - log_success "Proxmoxer installed (user-level)" -else - log_warning "Proxmoxer install had warnings (may already exist)" -fi - -# ============================================================================== -# SECTION 5: NTP TIME SYNCHRONIZATION -# ============================================================================== - -log_step "NTP TIME SYNCHRONIZATION" - -# 5.1 Check timedatectl status -if command -v timedatectl &>/dev/null; then - if timedatectl status | grep -q "synchronized: yes"; then - log_success "NTP is synchronized" - else - log_warning "NTP not synchronized. Attempting sync..." 
- sudo timedatectl set-ntp true 2>/dev/null || true - sleep 2 - if timedatectl status | grep -q "synchronized: yes"; then - log_success "NTP synchronized" - else - log_warning "NTP sync pending; SSH key negotiation may fail if time drift is excessive" - fi - fi -else - log_warning "timedatectl not available; skipping NTP check" -fi - -# ============================================================================== -# SECTION 6: INVENTORY GENERATION -# ============================================================================== - -log_step "INVENTORY GENERATION" - -# 6.1 Use one canonical inventory path for consistent behavior -EXISTING_INVENTORY="" -if [ -f "$INVENTORY_FILE" ]; then - EXISTING_INVENTORY="$INVENTORY_FILE" - log_success "Found canonical inventory: $EXISTING_INVENTORY" -else - log_warning "Canonical inventory not found; will create: $INVENTORY_FILE" -fi - -# 6.2 Handle existing inventory -if [ -n "$EXISTING_INVENTORY" ]; then - ACTIVE_INVENTORY="$EXISTING_INVENTORY" - INTEGRATION_REQUIRED="true" - log_warning "Existing inventory detected at: $EXISTING_INVENTORY" - echo "" - echo "============================================" - echo "⚠ MANUAL CONFIGURATION REQUIRED" - echo "============================================" - echo "" - echo "To integrate this host into your Ansible inventory," - echo "ADD the following entry to: $EXISTING_INVENTORY" - echo "" - echo "--- SUGGESTED ADDITION ---" - echo "" - - # Detect group context from existing inventory - if grep -q "\[proxmox" "$EXISTING_INVENTORY"; then - GROUP="proxmox_cluster" - TARGET_GROUP="$GROUP" - - # If target IP already exists in inventory, don't suggest duplicate add. 
- if grep -Eq "^[[:space:]]*[^#[:space:]]+[[:space:]]+ansible_host=${PROXMOX_IP}([[:space:]]|$)" "$EXISTING_INVENTORY"; then - INVENTORY_HAS_TARGET="true" - EXISTING_ALIAS=$(grep -E "^[[:space:]]*[^#[:space:]]+[[:space:]]+ansible_host=${PROXMOX_IP}([[:space:]]|$)" "$EXISTING_INVENTORY" | awk '{print $1}' | head -n1) - log_success "Target IP already exists in inventory as host: ${EXISTING_ALIAS}" - - # Resolve effective SSH user from host line or group vars section. - HOST_LINE=$(grep -E "^[[:space:]]*${EXISTING_ALIAS}[[:space:]]+" "$EXISTING_INVENTORY" | head -n1 || true) - HOST_USER=$(echo "$HOST_LINE" | sed -nE 's/.*ansible_user=([^[:space:]]+).*/\1/p') - GROUP_USER=$(awk -v section="[$GROUP:vars]" ' - $0==section {in_section=1; next} - /^\[/ && in_section {exit} - in_section && $0 ~ /^ansible_user=/ { - gsub(/^ansible_user=/, "", $0) - print $0 - exit - } - ' "$EXISTING_INVENTORY") - - if [ -n "$HOST_USER" ]; then - EFFECTIVE_VERIFY_USER="$HOST_USER" - elif [ -n "$GROUP_USER" ]; then - EFFECTIVE_VERIFY_USER="$GROUP_USER" - fi - - if [ "$EFFECTIVE_VERIFY_USER" != "$PROXMOX_USER" ]; then - log_warning "Inventory user ($EFFECTIVE_VERIFY_USER) differs from bootstrap SSH user ($PROXMOX_USER)" - fi - echo "# No addition required: host already exists in [proxmox_cluster]" - echo "# Existing entry alias: ${EXISTING_ALIAS}" - echo "" - echo "COPY/PASTE BLOCK" - echo "(No inventory update needed for this host)" - else - # Infer naming style: pveNN sequence if present. 
- if grep -Eq "^[[:space:]]*pve[0-9]+[[:space:]]+ansible_host=" "$EXISTING_INVENTORY"; then - NEXT_NUM=$(grep -E "^[[:space:]]*pve[0-9]+[[:space:]]+ansible_host=" "$EXISTING_INVENTORY" | sed -E 's/^[[:space:]]*pve([0-9]+).*/\1/' | sort -n | tail -n1) - if [ -n "$NEXT_NUM" ]; then - PROXMOX_HOSTNAME="pve$(printf "%02d" $((10#$NEXT_NUM + 1)))" - fi - fi - - echo "# Add to the [proxmox_cluster] section:" - echo "" - echo "COPY/PASTE BLOCK" - echo "-----8<-----" - echo "$PROXMOX_HOSTNAME ansible_host=$PROXMOX_IP ansible_user=$PROXMOX_USER ansible_ssh_private_key_file=$SSH_KEY_PATH ansible_port=$PROXMOX_PORT" - echo "----->8-----" - fi - else - GROUP="proxmox_nodes" - TARGET_GROUP="$GROUP" - echo "# Add a new section for Proxmox nodes:" - echo "" - echo "COPY/PASTE BLOCK" - echo "-----8<-----" - echo "[$GROUP]" - echo "$PROXMOX_HOSTNAME ansible_host=$PROXMOX_IP ansible_user=$PROXMOX_USER ansible_ssh_private_key_file=$SSH_KEY_PATH ansible_port=$PROXMOX_PORT" - echo "" - echo "[$GROUP:vars]" - echo "ansible_python_interpreter=/usr/bin/python3" - echo "----->8-----" - fi - - echo "" - echo "--- END SUGGESTION ---" - echo "" - echo "Current inventory file excerpt:" - echo "-------------------------------" - grep -E "^\[|^[a-zA-Z0-9_-]+ " "$EXISTING_INVENTORY" | head -20 - echo "" - -else - # 6.3 Create new inventory file if none exists - log_warning "No existing inventory found. Creating $INVENTORY_FILE..." - mkdir -p "$(dirname "$INVENTORY_FILE")" - cat > "$INVENTORY_FILE" <&1); then - log_success "Ansible verifies connectivity to target host!" - echo "" - echo "============================================" - echo "✓ BOOTSTRAP COMPLETE" - echo "============================================" - echo "" - echo "Next steps:" - echo "1. Review $ACTIVE_INVENTORY" - echo "2. Run your Ansible playbook:" - echo " ansible-playbook -i $ACTIVE_INVENTORY .yml" - echo "" -else - log_warning "Inventory-based ping failed. Collecting diagnostics..." 
- echo "$ANSIBLE_OUT" - - # If existing inventory already has host, test with bootstrap credentials directly. - if [ "$INVENTORY_HAS_TARGET" = "true" ]; then - log_warning "Trying direct host ping with bootstrap credentials ($PROXMOX_USER)..." - if ansible all -i "${PROXMOX_IP}," -u "$PROXMOX_USER" --private-key "$SSH_KEY_PATH" -m ping >/dev/null 2>&1; then - log_success "Direct SSH ping with bootstrap user works" - echo "" - echo "============================================" - echo "✓ BOOTSTRAP COMPLETE (INVENTORY USER MISMATCH)" - echo "============================================" - echo "" - echo "Inventory authentication does not match bootstrap credentials." - echo "Current inventory likely uses: ansible_user=$EFFECTIVE_VERIFY_USER" - echo "Bootstrap validated with: ansible_user=$PROXMOX_USER" - echo "" - echo "Suggested fixes (pick one):" - echo "1. Update host entry '$EXISTING_ALIAS' in $ACTIVE_INVENTORY with ansible_user=$PROXMOX_USER" - echo "2. Add your SSH key for user '$EFFECTIVE_VERIFY_USER' on the target host" - echo "3. 
Temporarily run playbooks with override: -u $PROXMOX_USER" - echo "" - exit 0 - fi - fi - - log_error "Ansible ping test failed and direct bootstrap-user test also failed" - exit 1 -fi \ No newline at end of file diff --git a/ansible/archive/scripts/generate-quick-command.sh b/ansible/archive/scripts/generate-quick-command.sh deleted file mode 100644 index 2940adb..0000000 --- a/ansible/archive/scripts/generate-quick-command.sh +++ /dev/null @@ -1,14 +0,0 @@ -#!/bin/bash - -# One-liner command generator for Quick mode health check -# Usage: ./generate-quick-command.sh - -SERVICE_NAME="$1" - -if [ -z "$SERVICE_NAME" ]; then - echo "Usage: $0 " - echo "Example: $0 watchtower" - exit 1 -fi - -echo "docker ps -a --filter \"name=$SERVICE_NAME\" --format \"table {{.Names}}\t{{.Status}}\t{{.RunningFor}}\t{{.Ports}}\" && docker stats $SERVICE_NAME --no-stream" \ No newline at end of file diff --git a/ansible/archive/scripts/generate_inventory.py b/ansible/archive/scripts/generate_inventory.py deleted file mode 100644 index 822f7f4..0000000 --- a/ansible/archive/scripts/generate_inventory.py +++ /dev/null @@ -1,235 +0,0 @@ -#!/usr/bin/env python3 -import sys -import yaml -from pathlib import Path - - -class VaultSafeLoader(yaml.SafeLoader): - pass - - -def vault_constructor(loader, node): - return loader.construct_scalar(node) - - -VaultSafeLoader.add_constructor('!vault', vault_constructor) - - -def load_sot(path): - with open(path, 'r') as f: - return yaml.load(f, Loader=VaultSafeLoader) - - -def ip_for(info, prefer_desired=False): - # Default behavior for flat-network operation: - # prefer current_ip and only use desired_ip as fallback. - # For migration windows, set prefer_desired=True. 
- current_ip = info.get('current_ip') - desired_ip = info.get('desired_ip') - - if prefer_desired: - if isinstance(desired_ip, str) and desired_ip: - return desired_ip - if isinstance(current_ip, str) and current_ip: - return current_ip - return None - - if isinstance(current_ip, str) and current_ip: - return current_ip - if isinstance(desired_ip, str) and desired_ip: - return desired_ip - return None - - -def is_retired(info): - status = str(info.get('lifecycle_status', '')).strip().lower() - return status in {'retired', 'retired-shutdown', 'shutdown'} - - -def is_onboarding_tbd(info): - status = str(info.get('onboarding_status', '')).strip().lower() - return status.startswith('tbd') - - -def is_active(info): - if is_retired(info): - return False - if is_onboarding_tbd(info): - return False - if info.get('ansible_managed') is False: - return False - return True - - -def main(): - import argparse - p = argparse.ArgumentParser() - p.add_argument('--sot', required=True) - p.add_argument('--out', required=True) - p.add_argument( - '--prefer-desired', - action='store_true', - help='Prefer desired_ip over current_ip when generating inventory (use during migrations).', - ) - args = p.parse_args() - - sot = load_sot(args.sot) - # Prefer the non-reserved key; keep fallback for older SoT files. 
- hosts = sot.get('lab_hosts', sot.get('hosts', {})) or {} - edge_host = (sot.get('edge_routing') or {}).get('edge_host') or {} - edge_host_ip = edge_host.get('ip') - - out_lines = [] - out_lines.append('# Generated inventory from ../group_vars/all.yml') - out_lines.append('') - - # watchtower - out_lines.append('# --- Watchtower (local controller) ---') - out_lines.append('[watchtower]') - out_lines.append('localhost ansible_connection=local') - out_lines.append('') - - # proxmox - out_lines.append('# --- Proxmox Cluster (management) ---') - out_lines.append('[proxmox_cluster]') - for name, info in hosts.items(): - if info.get('role') == 'proxmox' and is_active(info): - ip = ip_for(info, prefer_desired=args.prefer_desired) - if ip: - out_lines.append( - f"{name} ansible_host={ip} ansible_user=root " - "ansible_ssh_private_key_file=/home/chester/.ssh/id_ed25519 ansible_port=22" - ) - out_lines.append('') - out_lines.append('[proxmox_cluster:vars]') - out_lines.append('ansible_user=root') - out_lines.append('ansible_become=true') - out_lines.append('ansible_python_interpreter=/usr/bin/python3') - out_lines.append('') - - # swarm managers - out_lines.append('# --- Swarm Managers ---') - out_lines.append('[swarm_managers]') - for name, info in hosts.items(): - if (info.get('role') == 'swarm_manager' or name.startswith('swarm-manager')) and is_active(info): - ip = ip_for(info, prefer_desired=args.prefer_desired) - if ip: - out_lines.append(f"{name} ansible_host={ip}") - out_lines.append('') - - # swarm workers - out_lines.append('# --- Swarm Workers ---') - out_lines.append('[swarm_workers]') - for name, info in hosts.items(): - if (info.get('role') == 'swarm_worker' or name.startswith('swarm-worker')) and is_active(info): - ip = ip_for(info, prefer_desired=args.prefer_desired) - if ip: - out_lines.append(f"{name} ansible_host={ip}") - out_lines.append('') - - out_lines.append('[swarm_hosts:children]') - out_lines.append('swarm_managers') - 
out_lines.append('swarm_workers') - out_lines.append('') - out_lines.append('[swarm_hosts:vars]') - out_lines.append('ansible_user=chester') - out_lines.append('ansible_ssh_private_key_file=/home/chester/.ssh/id_ed25519') - out_lines.append('') - - # standalone ubuntu VMs - out_lines.append('# --- Standalone Ubuntu VMs ---') - out_lines.append('[standalone_ubuntu]') - for name, info in hosts.items(): - if info.get('role') in {'standalone_vm', 'standalone_ubuntu'} and is_active(info): - ip = ip_for(info, prefer_desired=args.prefer_desired) - if ip: - out_lines.append(f"{name} ansible_host={ip}") - out_lines.append('') - - out_lines.append('[standalone_ubuntu:vars]') - out_lines.append('ansible_user=chester') - out_lines.append('ansible_ssh_private_key_file=/home/chester/.ssh/id_ed25519') - out_lines.append('') - - # heimdall edge host - out_lines.append('# --- Heimdall (Edge Router / Traefik host) ---') - out_lines.append('[heimdall_hosts]') - heimdall_info = hosts.get('heimdall', {}) - heimdall_ip = ip_for(heimdall_info, prefer_desired=args.prefer_desired) or edge_host_ip - if heimdall_info and is_active(heimdall_info) and heimdall_ip: - out_lines.append(f"heimdall ansible_host={heimdall_ip}") - out_lines.append('') - - out_lines.append('[heimdall_hosts:vars]') - out_lines.append('ansible_user=chester') - out_lines.append('ansible_ssh_private_key_file=/home/chester/.ssh/id_ed25519') - out_lines.append('') - - # ai_grid - out_lines.append('# --- AI Grid ---') - out_lines.append('[ai_grid]') - for name, info in hosts.items(): - if (info.get('role') == 'ai_node' or name.startswith('ai-')) and is_active(info): - ip = ip_for(info, prefer_desired=args.prefer_desired) - if ip: - out_lines.append(f"{name} ansible_host={ip}") - out_lines.append('') - - # docker hosts - out_lines.append('# --- Docker Hosts ---') - out_lines.append('[docker_hosts]') - for name, info in hosts.items(): - if info.get('role') in {'docker_host', 'standalone_vm', 'standalone_ubuntu'} and 
is_active(info): - ip = ip_for(info, prefer_desired=args.prefer_desired) - if ip: - out_lines.append(f"{name} ansible_host={ip}") - out_lines.append('') - - # storage - out_lines.append('# --- Storage ---') - out_lines.append('[storage]') - for name, info in hosts.items(): - if (info.get('role') == 'nas' or name in ('synology', 'terramaster')) and is_active(info): - ip = ip_for(info, prefer_desired=args.prefer_desired) - if ip: - out_lines.append(f"{name} ansible_host={ip} ansible_scp_if_ssh=True") - out_lines.append('') - - out_lines.append('# --- Lifecycle: Onboarding TBD ---') - out_lines.append('[onboarding_tbd]') - for name, info in hosts.items(): - if is_onboarding_tbd(info): - ip = ip_for(info, prefer_desired=args.prefer_desired) - if ip: - out_lines.append(f"{name} ansible_host={ip}") - out_lines.append('') - - out_lines.append('# --- Lifecycle: Retired / Shutdown ---') - out_lines.append('[retired_hosts]') - for name, info in hosts.items(): - if is_retired(info): - ip = ip_for(info, prefer_desired=args.prefer_desired) - if ip: - out_lines.append(f"{name} ansible_host={ip}") - out_lines.append('') - - out_lines.append('# --- Aggregate grouping ---') - out_lines.append('[ubuntu_lab:children]') - out_lines.append('swarm_managers') - out_lines.append('swarm_workers') - out_lines.append('standalone_ubuntu') - out_lines.append('ai_grid') - out_lines.append('docker_hosts') - out_lines.append('storage') - out_lines.append('') - out_lines.append('[ubuntu_lab:vars]') - out_lines.append('ansible_user=chester') - out_lines.append('ansible_ssh_private_key_file=/home/chester/.ssh/id_ed25519') - - out_path = Path(args.out) - out_path.parent.mkdir(parents=True, exist_ok=True) - out_path.write_text('\n'.join(out_lines) + '\n') - - -if __name__ == '__main__': - main() diff --git a/ansible/archive/scripts/health-check-quick.sh b/ansible/archive/scripts/health-check-quick.sh deleted file mode 100644 index ed05bbf..0000000 --- a/ansible/archive/scripts/health-check-quick.sh 
+++ /dev/null @@ -1,286 +0,0 @@ -#!/bin/bash - -# homelab-sentinel-health-quick.sh -# Quick terminal pulse check for Docker services in Nathan's homelab -# Validates uptime stability, resource pressure, and network exposure - -# Color codes for output -RED='\033[0;31m' -GREEN='\033[0;32m' -YELLOW='\033[1;33m' -BLUE='\033[0;34m' -NC='\033[0m' # No Color - -# Resource thresholds (from generic_host_conversational.yml) -MAX_SAFE_RAM_MB=16384 # 16GB -MAX_SAFE_CPU_CORES=8 - -# VLAN definitions from ansible/group_vars/all.yml -declare -A VLAN_CIDRS -VLAN_CIDRS["main"]="10.0.0.0/24" -VLAN_CIDRS["infra"]="10.0.10.0/24" -VLAN_CIDRS["iot"]="10.0.50.0/24" -VLAN_CIDRS["guest"]="10.0.30.0/24" -VLAN_CIDRS["compute"]="10.0.200.0/24" - -# Zone definitions -declare -A ZONES -ZONES["core"]="main" -ZONES["infrastructure"]="infra" -ZONES["iot"]="iot" -ZONES["guest"]="guest" -ZONES["compute"]="compute" - -# Help function -usage() { - echo "Usage: $0 " - echo " Performs a quick health check on a Docker service" - echo "" - echo "Arguments:" - echo " service_name Name of the Docker service to check" - echo "" - echo "Examples:" - echo " $0 watchtower" - echo " $0 prometheus" - exit 1 -} - -# Validate service name follows naming convention (--) -validate_service_naming() { - local service_name=$1 - - # Check if service name follows the pattern *_*_* (at least two underscores) - if [[ ! 
"$service_name" =~ .*-.*-.* ]]; then - echo -e "${YELLOW}⚠️ Naming Convention Warning:${NC} Service '$service_name' does not follow the -- naming convention" - return 1 - fi - - return 0 -} - -# Get zone assignment for a service based on naming convention -get_service_zone() { - local service_name=$1 - - # Extract role from service name (middle part) - local role=$(echo "$service_name" | cut -d'-' -f2) - - # Map common roles to zones - case "$role" in - "pve"|"proxmox"|"nas"|"heimdall"|"watchtower") - echo "infrastructure" - ;; - "swarm"|"ai"|"compute") - echo "compute" - ;; - "controller"|"omada") - echo "iot" - ;; - *) - # Default to infrastructure for unknown roles - echo "infrastructure" - ;; - esac -} - -# Check if IP is in correct VLAN range -is_ip_in_correct_vlan() { - local ip=$1 - local zone=$2 - - # Get expected CIDR for zone - local expected_cidr=${VLAN_CIDRS[$zone]} - - if [ -z "$expected_cidr" ]; then - echo "Unknown zone: $zone" - return 1 - fi - - # Simple check - in real implementation, would use ipcalc or similar - case "$zone" in - "infrastructure") - [[ $ip =~ ^10\.0\.10\. ]] && return 0 || return 1 - ;; - "compute") - [[ $ip =~ ^10\.0\.200\. ]] && return 0 || return 1 - ;; - "iot") - [[ $ip =~ ^10\.0\.50\. ]] && return 0 || return 1 - ;; - "guest") - [[ $ip =~ ^10\.0\.30\. ]] && return 0 || return 1 - ;; - "main") - [[ $ip =~ ^10\.0\.0\. 
]] && return 0 || return 1 - ;; - esac - - return 1 -} - -# Parse docker stats output -parse_docker_stats() { - local service_name=$1 - - # Get docker stats in JSON format - local stats_json=$(docker stats "$service_name" --no-stream --format json 2>/dev/null) - - if [ -z "$stats_json" ]; then - echo "{}" - return - fi - - echo "$stats_json" -} - -# Parse docker ps output -parse_docker_ps() { - local service_name=$1 - - # Get docker ps info in JSON format - local ps_info=$(docker ps -a --filter "name=$service_name" --format json 2>/dev/null) - - if [ -z "$ps_info" ]; then - echo "{}" - return - fi - - echo "$ps_info" -} - -# Main health check function -check_service_health() { - local service_name=$1 - - echo -e "${BLUE}🔍 Homelab Sentinel Quick Health Check${NC}" - echo -e "${BLUE}=====================================${NC}" - echo "Service: $service_name" - echo "" - - # Validate service naming - validate_service_naming "$service_name" - - # Determine expected zone - local expected_zone=$(get_service_zone "$service_name") - local expected_vlan=${ZONES[$expected_zone]} - echo -e "${BLUE}📍 Expected Zone:${NC} $expected_zone (${VLAN_CIDRS[$expected_vlan]})" - echo "" - - # Get docker ps info - echo -e "${BLUE}📊 Container Status:${NC}" - docker ps -a --filter "name=$service_name" --format "table {{.Names}}\t{{.Status}}\t{{.RunningFor}}\t{{.Ports}}" - echo "" - - # Check uptime stability - echo -e "${BLUE}⏱️ Uptime Stability:${NC}" - local ps_output=$(docker ps -a --filter "name=$service_name" --format "{{.Status}}\t{{.RunningFor}}" 2>/dev/null) - if [ -n "$ps_output" ]; then - local status=$(echo "$ps_output" | cut -f1) - local running_for=$(echo "$ps_output" | cut -f2) - - if [[ "$status" == *"Restarting"* ]]; then - echo -e "${RED}❌ Unstable:${NC} Container is restarting ($status)" - elif [[ "$status" == *"Exited"* ]]; then - echo -e "${YELLOW}⚠️ Stopped:${NC} Container is not running ($status)" - else - echo -e "${GREEN}✅ Stable:${NC} Container has been 
running for $running_for" - fi - else - echo -e "${RED}❌ Not Found:${NC} No container found with name '$service_name'" - return 1 - fi - echo "" - - # Check resource pressure - echo -e "${BLUE}⚑ Resource Pressure:${NC}" - local stats_json=$(parse_docker_stats "$service_name") - if [ -n "$stats_json" ] && [ "$stats_json" != "{}" ]; then - # Extract CPU and memory usage; docker may return units like B/KiB/MiB/GiB. - local cpu_percent=$(echo "$stats_json" | jq -r '.CPUPerc' 2>/dev/null | sed 's/%//' | cut -d'.' -f1) - local mem_usage_raw=$(echo "$stats_json" | jq -r '.MemUsage' 2>/dev/null | cut -d'/' -f1 | xargs) - local mem_mb="" - - # Parse values like 0B, 512KiB, 85.3MiB, 1.2GiB into MiB. - if [[ "$mem_usage_raw" =~ ^([0-9]+([.][0-9]+)?)([A-Za-z]+)$ ]]; then - local mem_val="${BASH_REMATCH[1]}" - local mem_unit="${BASH_REMATCH[3]}" - mem_mb=$(awk -v v="$mem_val" -v u="$mem_unit" 'BEGIN { - if (u == "B") printf "%.0f", v / 1048576; - else if (u == "KiB" || u == "KB" || u == "kB") printf "%.0f", v / 1024; - else if (u == "MiB" || u == "MB") printf "%.0f", v; - else if (u == "GiB" || u == "GB") printf "%.0f", v * 1024; - else if (u == "TiB" || u == "TB") printf "%.0f", v * 1048576; - }') - fi - - if [ -n "$cpu_percent" ] && [ "$cpu_percent" != "null" ]; then - if [ "$cpu_percent" -gt $((MAX_SAFE_CPU_CORES * 10)) ]; then # 10% per core threshold - echo -e "${YELLOW}⚠️ High CPU:${NC} ${cpu_percent}% (threshold: $((MAX_SAFE_CPU_CORES * 10))%)" - else - echo -e "${GREEN}✅ CPU OK:${NC} ${cpu_percent}%" - fi - fi - - if [ -n "$mem_mb" ] && [ "$mem_mb" != "null" ]; then - if [ "$mem_mb" -gt "$MAX_SAFE_RAM_MB" ]; then - echo -e "${YELLOW}⚠️ High Memory:${NC} ${mem_usage_raw} (threshold: ${MAX_SAFE_RAM_MB}MiB)" - else - echo -e "${GREEN}✅ Memory OK:${NC} ${mem_usage_raw}" - fi - fi - else - echo -e "${YELLOW}⚠️ No Stats:${NC} Unable to retrieve resource statistics" - fi - echo "" - - # Check network exposure - echo -e "${BLUE}🌐 Network Exposure:${NC}" - local 
port_info=$(docker ps --filter "name=$service_name" --format "{{.Ports}}" 2>/dev/null) - if [ -n "$port_info" ] && [ "$port_info" != "" ]; then - echo "Published Ports: $port_info" - - # Check if ports are exposed on unexpected interfaces - if [[ "$port_info" == *":"* ]]; then - echo -e "${YELLOW}⚠️ Port Exposure:${NC} Container exposes ports on host interfaces" - echo " Review port mappings to ensure they comply with network zoning" - else - echo -e "${GREEN}✅ Port Exposure:${NC} No host ports exposed" - fi - else - echo -e "${GREEN}✅ Port Exposure:${NC} No ports exposed" - fi - echo "" - - # Summary - echo -e "${BLUE}📋 Summary:${NC}" - echo "Expected Zone: $expected_zone" - echo "Naming Convention: $(validate_service_naming "$service_name" && echo -e "${GREEN}✅ Valid${NC}" || echo -e "${YELLOW}⚠️ Invalid${NC}")" - echo "" - echo -e "${BLUE}💡 Next Actions:${NC}" - echo "1. If uptime is unstable, check logs with: docker logs $service_name" - echo "2. If resources are high, consider optimization or scaling" - echo "3. If network exposure seems wrong, review docker-compose configuration" - echo "4. For detailed analysis, run Deep mode health check" -} - -# Main script execution -main() { - # Check if service name is provided - if [ $# -eq 0 ]; then - usage - fi - - local service_name=$1 - - # Check if docker is installed and accessible - if ! command -v docker &> /dev/null; then - echo -e "${RED}❌ Docker Error:${NC} Docker is not installed or not accessible" - exit 1 - fi - - # Perform health check - check_service_health "$service_name" -} - -# Run main function with all arguments -main "$@" \ No newline at end of file diff --git a/ansible/archive/scripts/pi_init.sh b/ansible/archive/scripts/pi_init.sh deleted file mode 100644 index 7f9b451..0000000 --- a/ansible/archive/scripts/pi_init.sh +++ /dev/null @@ -1,30 +0,0 @@ -# --- 4. 
VS CODE TUNNEL SETUP (Hardened v2) --- -ARCH=$(uname -m) -case $ARCH in - x86_64) PLAT="cli-linux-x64" ;; - aarch64) PLAT="cli-linux-arm64" ;; - *) echo "❌ Unsupported arch: $ARCH"; exit 1 ;; -esac - -echo "💻 Installing VS Code CLI ($ARCH)..." -URL="https://update.code.visualstudio.com/latest/$PLAT/stable" - -curl -Lk "$URL" --output vscode_cli.tar.gz - -# Check for a healthy file size (> 5MB) -FILESIZE=$(stat -c%s "vscode_cli.tar.gz") -if [ "$FILESIZE" -lt 5000000 ]; then - echo "❌ Error: VS Code download failed (Size: $FILESIZE bytes)." - exit 1 -fi - -tar -xf vscode_cli.tar.gz -sudo mv code /usr/local/bin/ -rm vscode_cli.tar.gz - -# Required for background persistence -sudo loginctl enable-linger $USER - -echo "🔧 Registering Tunnel Service..." -# Simplified command: Provider is handled during the 'code tunnel user login' step -code tunnel service install --accept-server-license-terms --name "$(hostname)" \ No newline at end of file diff --git a/ansible/archive/scripts/pi_pull_updates.sh b/ansible/archive/scripts/pi_pull_updates.sh deleted file mode 100644 index 65f8e34..0000000 --- a/ansible/archive/scripts/pi_pull_updates.sh +++ /dev/null @@ -1,29 +0,0 @@ -#!/usr/bin/env bash -set -euo pipefail - -# Self-heal runner for Watchtower. -# Prefer SSH-based git auth (deploy key) instead of embedding tokens. - -LOG_FILE="${LOG_FILE:-/home/chester/ansible-pull.log}" -WORKSPACE="${WORKSPACE:-/home/chester/.ansible_pull_workspace}" -REPO_URL="${REPO_URL:-git@git.castaldifamily.com:nathan/homelab.git}" -REPO_REF="${REPO_REF:-main}" -PLAYBOOK_PATH="${PLAYBOOK_PATH:-ansible/playbooks/self-heal/watchtower.yml}" -INVENTORY="${INVENTORY:-localhost,}" - -mkdir -p "$(dirname "$LOG_FILE")" "$WORKSPACE" - -echo "--- Starting Update: $(date -Is) ---" | tee -a "$LOG_FILE" - -if [[ "$REPO_URL" == https://*"@"* ]]; then - echo "WARNING: Credentialed HTTPS URL detected in REPO_URL. Use SSH deploy keys when possible." 
| tee -a "$LOG_FILE" -fi - -ansible-pull \ - -U "$REPO_URL" \ - -C "$REPO_REF" \ - -d "$WORKSPACE" \ - -i "$INVENTORY" \ - "$PLAYBOOK_PATH" 2>&1 | tee -a "$LOG_FILE" - -echo "--- Update Complete: $(date -Is) ---" | tee -a "$LOG_FILE" \ No newline at end of file diff --git a/ansible/archive/templates/hosts.ini.j2 b/ansible/archive/templates/hosts.ini.j2 deleted file mode 100644 index 7c33c3f..0000000 --- a/ansible/archive/templates/hosts.ini.j2 +++ /dev/null @@ -1,97 +0,0 @@ -# Generated inventory from ansible/group_vars/all.yml -;; Backup of previous file will be created when template runs. - -{% set hosts_dict = lab_sot.lab_hosts | default(lab_sot.hosts | default({})) %} - -# --- Watchtower (local controller) --- -[watchtower] -localhost ansible_connection=local - -# --- Proxmox Cluster (management) --- -[proxmox_cluster] -{% for name,info in hosts_dict.items() %} -{% if info.get('role') == 'proxmox' %} -{% if info.current_ip is defined and info.current_ip %} -{{ name }} ansible_host={{ info.current_ip }} -{% elif info.desired_ip is string %} -{{ name }} ansible_host={{ info.desired_ip }} -{% endif %} -{% endif %} -{% endfor %} - -[proxmox_cluster:vars] -ansible_user=chester -ansible_become=true -ansible_python_interpreter=/usr/bin/python3 - -# --- Swarm Managers --- -[swarm_managers] -{% for name,info in hosts_dict.items() %} -{% if 'swarm_manager' in (info.get('role')|default('')) %} -{% if info.current_ip is defined and info.current_ip %} -{{ name }} ansible_host={{ info.current_ip }} -{% elif info.desired_ip is string %} -{{ name }} ansible_host={{ info.desired_ip }} -{% endif %} -{% endif %} -{% endfor %} - -# --- Swarm Workers --- -[swarm_workers] -{% for name,info in hosts_dict.items() %} -{% if 'swarm_worker' in (info.get('role')|default('')) %} -{% if info.current_ip is defined and info.current_ip %} -{{ name }} ansible_host={{ info.current_ip }} -{% elif info.desired_ip is string %} -{{ name }} ansible_host={{ info.desired_ip }} -{% endif %} -{% endif %} 
-{% endfor %} - -# --- AI Grid --- -[ai_grid] -{% for name,info in hosts_dict.items() %} -{% if info.get('role') == 'ai_node' %} -{% if info.current_ip is defined and info.current_ip %} -{{ name }} ansible_host={{ info.current_ip }} -{% elif info.desired_ip is string %} -{{ name }} ansible_host={{ info.desired_ip }} -{% endif %} -{% endif %} -{% endfor %} - -# --- Docker Hosts --- -[docker_hosts] -{% for name,info in hosts_dict.items() %} -{% if info.get('role') == 'docker_host' %} -{% if info.current_ip is defined and info.current_ip %} -{{ name }} ansible_host={{ info.current_ip }} -{% elif info.desired_ip is string %} -{{ name }} ansible_host={{ info.desired_ip }} -{% endif %} -{% endif %} -{% endfor %} - -# --- Storage --- -[storage] -{% for name,info in hosts_dict.items() %} -{% if info.get('role') == 'nas' %} -{% if info.current_ip is defined and info.current_ip %} -{{ name }} ansible_host={{ info.current_ip }} ansible_scp_if_ssh=True -{% elif info.desired_ip is string %} -{{ name }} ansible_host={{ info.desired_ip }} ansible_scp_if_ssh=True -{% endif %} -{% endif %} -{% endfor %} - -# --- Aggregate grouping --- -[ubuntu_lab:children] -swarm_managers -swarm_workers -ai_grid -docker_hosts -storage - -[ubuntu_lab:vars] -ansible_user=chester -ansible_ssh_private_key_file=/home/chester/.ssh/id_ed25519 diff --git a/ansible/archive/templates/stacks/authentik.stack.yml b/ansible/archive/templates/stacks/authentik.stack.yml deleted file mode 100644 index 251663f..0000000 --- a/ansible/archive/templates/stacks/authentik.stack.yml +++ /dev/null @@ -1,196 +0,0 @@ -x-info: - github: https://github.com/goauthentik/authentik - docs: https://goauthentik.io/docs - changelog: https://github.com/goauthentik/authentik/releases - homelab_status: stable - last_updated: 2026-03-15 - -# Managed by Ansible — manual edits will be overwritten on next deploy. 
-# Source: ansible/templates/stacks/authentik.stack.yml -# Deploy: ansible-playbook -i inventory/hosts.ini playbooks/docker/deploy_authentik.yml -# -# DATA SAFETY NOTE: -# This stack uses absolute bind mounts under /mnt/homelab/apps/authentik. -# Deploy playbook preflight checks require these paths to exist before deploy, -# which protects pre-existing Authentik data from accidental fresh bootstraps. - -version: "3.9" - -services: - authentik-postgres: - image: docker.io/library/postgres:16-alpine - environment: - - TZ=America/New_York - - POSTGRES_DB={{ authentik_postgres_db | default('authentik') }} - - POSTGRES_PASSWORD={{ vault_authentik_postgres_password }} - - POSTGRES_USER={{ authentik_postgres_user | default('authentik') }} - healthcheck: - test: ["CMD-SHELL", "pg_isready -d $$POSTGRES_DB -U $$POSTGRES_USER"] - interval: 30s - timeout: 5s - retries: 5 - start_period: 20s - volumes: - - /mnt/homelab/apps/authentik/data/database:/var/lib/postgresql/data - networks: - - proxy-net - deploy: - replicas: 1 - placement: - constraints: - - node.hostname == {{ authentik_placement_node | default('swarm-manager-1') }} - resources: - limits: - memory: 1G - cpus: "0.75" - restart_policy: - condition: on-failure - delay: 10s - max_attempts: 3 - window: 60s - update_config: - parallelism: 1 - order: stop-first - failure_action: rollback - delay: 10s - monitor: 30s - rollback_config: - parallelism: 1 - order: stop-first - - authentik-redis: - image: redis:7-alpine - command: ["--save", "60", "1", "--loglevel", "warning"] - healthcheck: - test: ["CMD", "redis-cli", "ping"] - interval: 10s - timeout: 5s - retries: 5 - volumes: - - /mnt/homelab/apps/authentik/data/redis:/data - networks: - - proxy-net - deploy: - replicas: 1 - placement: - constraints: - - node.hostname == {{ authentik_placement_node | default('swarm-manager-1') }} - resources: - limits: - memory: 512M - cpus: "0.50" - restart_policy: - condition: on-failure - delay: 10s - max_attempts: 3 - window: 60s - 
update_config: - parallelism: 1 - order: start-first - failure_action: rollback - delay: 10s - monitor: 30s - rollback_config: - parallelism: 1 - order: stop-first - - authentik-server: - image: ghcr.io/goauthentik/server:{{ authentik_version | default('2025.10.1') }} - command: ["server"] - environment: - - TZ=America/New_York - - AUTHENTIK_POSTGRESQL__HOST=authentik-postgres - - AUTHENTIK_POSTGRESQL__NAME={{ authentik_postgres_db | default('authentik') }} - - AUTHENTIK_POSTGRESQL__PASSWORD={{ vault_authentik_postgres_password }} - - AUTHENTIK_POSTGRESQL__USER={{ authentik_postgres_user | default('authentik') }} - - AUTHENTIK_SECRET_KEY={{ vault_authentik_secret_key }} - - AUTHENTIK_REDIS__HOST=authentik-redis - ports: - - "9000:9000" - volumes: - - /mnt/homelab/apps/authentik/data/media:/media - - /mnt/homelab/apps/authentik/data/config:/config - - /mnt/homelab/apps/authentik/data/blueprints:/blueprints/custom:ro - networks: - - proxy-net - labels: - - "homepage.name=Authentik" - - "homepage.icon=si:authentik" - - "homepage.url=https://sso.castaldifamily.com" - - "homepage.description=Identity provider" - deploy: - labels: - - "traefik.enable=true" - - "traefik.http.routers.authentik.rule=Host(`sso.castaldifamily.com`)" - - "traefik.http.routers.authentik.entrypoints=websecure" - - "traefik.http.routers.authentik.tls=true" - - "traefik.http.routers.authentik.tls.certresolver=cloudflare" - - "traefik.http.routers.authentik.middlewares=security-headers@file,ratelimit-basic@file" - - "traefik.http.services.authentik.loadbalancer.server.url=http://{{ edge_routing.swarm.bind_ip }}:9000" - replicas: 1 - placement: - constraints: - - node.hostname == {{ authentik_placement_node | default('swarm-manager-1') }} - resources: - limits: - memory: 2G - cpus: "1.0" - restart_policy: - condition: on-failure - delay: 10s - max_attempts: 3 - window: 60s - update_config: - parallelism: 1 - order: start-first - failure_action: rollback - delay: 10s - monitor: 30s - rollback_config: 
- parallelism: 1 - order: stop-first - - authentik-worker: - image: ghcr.io/goauthentik/server:{{ authentik_version | default('2025.10.1') }} - command: ["worker"] - environment: - - TZ=America/New_York - - AUTHENTIK_POSTGRESQL__HOST=authentik-postgres - - AUTHENTIK_POSTGRESQL__NAME={{ authentik_postgres_db | default('authentik') }} - - AUTHENTIK_POSTGRESQL__PASSWORD={{ vault_authentik_postgres_password }} - - AUTHENTIK_POSTGRESQL__USER={{ authentik_postgres_user | default('authentik') }} - - AUTHENTIK_SECRET_KEY={{ vault_authentik_secret_key }} - - AUTHENTIK_REDIS__HOST=authentik-redis - volumes: - - /mnt/homelab/apps/authentik/data/media:/media - - /mnt/homelab/apps/authentik/data/config:/config - networks: - - proxy-net - deploy: - replicas: 1 - placement: - constraints: - - node.hostname == {{ authentik_placement_node | default('swarm-manager-1') }} - resources: - limits: - memory: 1G - cpus: "0.75" - restart_policy: - condition: on-failure - delay: 10s - max_attempts: 3 - window: 60s - update_config: - parallelism: 1 - order: start-first - failure_action: rollback - delay: 10s - monitor: 30s - rollback_config: - parallelism: 1 - order: stop-first - -networks: - proxy-net: - external: true - name: proxy-net \ No newline at end of file diff --git a/ansible/archive/templates/stacks/example.service.stack.yml b/ansible/archive/templates/stacks/example.service.stack.yml deleted file mode 100644 index 09e9def..0000000 --- a/ansible/archive/templates/stacks/example.service.stack.yml +++ /dev/null @@ -1,120 +0,0 @@ -# ============================================================================= -# FUTURE-STACK BLUEPRINT — copy, rename, and fill in the TODO items. -# This file is the minimum viable Swarm stack template for this homelab. -# Every field section is either REQUIRED or RECOMMENDED. Delete nothing; -# fill TODOs or leave the defaults for sections that do not apply. -# -# Naming: rename this file to .stack.yml in kebab-case. 
-# Deploy: copy playbooks/docker/deploy_example_stack.yml -> deploy_.yml -# and fill in the TODO items there too. -# ============================================================================= -x-info: - # TODO: fill in canonical upstream URLs. - github: https://github.com/TODO/TODO - docs: https://TODO.example.com/docs - changelog: https://github.com/TODO/TODO/releases - # lifecycle: planned | active | stable | deprecated - homelab_status: planned - last_updated: 2026-03-14 - -# Managed by Ansible — manual edits will be overwritten on next deploy. -# Source: ansible/templates/stacks/.stack.yml -# Deploy: ansible-playbook -i inventory/hosts.ini playbooks/docker/deploy_.yml - -version: "3.9" - -services: - example-app: - # REQUIRED: Pin to a specific digest or semver tag. Never use :latest. - # WHY: Floating tags break idempotency — redeploying a service with an - # unpinned image may silently change the running version. - image: nginx:1.27.4-alpine - - environment: - - TZ=America/New_York - # REQUIRED for secrets: reference vault variables injected by the deploy - # playbook. Never hardcode passwords here. - # Example: - EXAMPLE_DB_PASSWORD={{ vault_example_db_password }} - # TODO: add service-specific environment variables. - - # REQUIRED for services with persistent data: use absolute bind-mount paths. - # WHY: Swarm services have no well-defined working directory; relative paths - # (e.g. ./data) are unsafe and non-portable in Swarm stacks. - # WHY pre-existence check: the deploy playbook MUST assert these paths exist - # before deploy to protect existing data from accidental fresh bootstrap. - volumes: - - /mnt/homelab/apps/example/data:/data # TODO: adjust per service - - ports: - # TODO: choose a port in the 8200–8299 range to avoid collisions. - - "8299:80" # host:container - - # REQUIRED: Always declare healthcheck so Swarm scheduler can detect failure. 
- healthcheck: - test: ["CMD", "curl", "-sf", "http://localhost:80/"] - interval: 30s - timeout: 5s - retries: 5 - start_period: 30s - - networks: - - proxy-net - - # Top-level labels: for non-Swarm consumers (homepage, glance). - # Do NOT put traefik labels here — traefik-kop reads deploy.labels only. - labels: - - "homepage.name=Example App" - - "homepage.icon=si:nginx" - - "homepage.url=https://example.castaldifamily.com" - - "homepage.description=TODO: one-line description" - - deploy: - replicas: 1 - - placement: - constraints: - # REQUIRED: pin to a specific node when using bind-mount volumes or - # hardware (GPU, USB). Otherwise, remove this block for free placement. - - node.hostname == swarm-manager-1 # TODO: adjust or remove - - # REQUIRED: declare deploy labels for traefik-kop route publication. - # WHY deploy.labels (not top-level): traefik-kop reads Swarm *service* - # labels via the Docker API. Top-level labels are on the container image. - labels: - - "traefik.enable=true" - - "traefik.http.routers.example.rule=Host(`example.castaldifamily.com`)" - - "traefik.http.routers.example.entrypoints=websecure" - - "traefik.http.routers.example.tls=true" - - "traefik.http.routers.example.tls.certresolver=cloudflare" - # WHY server.url (not server.port): routes Traefik to the Swarm routing - # mesh IP rather than a container IP, which is ephemeral in Swarm. 
- - "traefik.http.services.example.loadbalancer.server.url=http://{{ edge_routing.swarm.bind_ip }}:8299" - - resources: - limits: - memory: 512M # TODO: set appropriate limit for this service - cpus: "0.5" # TODO: set appropriate cpu limit - - restart_policy: - condition: on-failure - delay: 10s - max_attempts: 3 - window: 60s - - update_config: - parallelism: 1 - order: start-first # prefer start-first for stateless; stop-first for stateful/DBs - failure_action: rollback - delay: 10s - monitor: 30s - - rollback_config: - parallelism: 1 - order: stop-first - -# REQUIRED: declare all external networks the stack consumes. -# Networks NOT listed here will fail silently at deploy time. -networks: - proxy-net: - external: true - name: proxy-net diff --git a/ansible/archive/templates/stacks/gitea.stack.yml b/ansible/archive/templates/stacks/gitea.stack.yml deleted file mode 100644 index e8815c9..0000000 --- a/ansible/archive/templates/stacks/gitea.stack.yml +++ /dev/null @@ -1,142 +0,0 @@ -x-info: - github: https://github.com/go-gitea/gitea - docs: https://docs.gitea.com/ - changelog: https://github.com/go-gitea/gitea/releases - homelab_status: stable - last_updated: 2026-03-09 - -version: "3.9" - -services: - server: - image: docker.gitea.com/gitea:1.25.1 - environment: - - TZ=America/New_York - - GITEA__database__DB_TYPE=postgres - - GITEA__database__HOST=gitea-db:5432 - - GITEA__database__NAME=gitea - - GITEA__database__PASSWD={{ vault_gitea_db_password }} - - GITEA__database__USER=gitea - - GITEA__server__ROOT_URL=https://git.castaldifamily.com/ - volumes: - - /mnt/homelab/apps/gitea/data:/data - ports: - - "8251:3000" - # Keep container labels for compatibility with external route publishers - # that inspect task/container metadata instead of Swarm service metadata. 
- labels: - - "traefik.enable=true" - - "traefik.http.routers.gitea.rule=Host(`git.castaldifamily.com`)" - - "traefik.http.routers.gitea.entrypoints=websecure" - - "traefik.http.routers.gitea.tls=true" - - "traefik.http.routers.gitea.tls.certresolver=cloudflare" - - "traefik.http.services.gitea.loadbalancer.server.url=http://{{ edge_routing.swarm.bind_ip }}:8251" - networks: - - proxy-net - deploy: - labels: - - "traefik.enable=true" - - "traefik.http.routers.gitea.rule=Host(`git.castaldifamily.com`)" - - "traefik.http.routers.gitea.entrypoints=websecure" - - "traefik.http.routers.gitea.tls=true" - - "traefik.http.routers.gitea.tls.certresolver=cloudflare" - - "traefik.http.services.gitea.loadbalancer.server.url=http://{{ edge_routing.swarm.bind_ip }}:8251" - # - "glance.name=Gitea" - # - "glance.icon=si:gitea" - # - "glance.url=https://git.castaldifamily.com" - # - "glance.category=dev-tools" - # - "glance.hide=false" - replicas: 1 - placement: - constraints: - - node.hostname == {{ gitea_placement_node | default('swarm-manager-1') }} - resources: - limits: - memory: 1G - cpus: "0.5" - restart_policy: - # WHY any (not on-failure): Gitea is a persistent web service. Swarm's - # on-failure policy does NOT restart a container that exits cleanly - # (SIGTERM → code 0). A graceful shutdown during a rolling update - # would leave the service at 0/1 permanently until manually forced. - # 'any' ensures the service is always rescheduled regardless of exit code. 
- condition: any - delay: 10s - update_config: - delay: 10s - monitor: 30s - rollback_config: - parallelism: 1 - order: stop-first - - gitea-db: - image: postgres:17.4 - environment: - - TZ=America/New_York - - POSTGRES_DB=gitea - - POSTGRES_PASSWORD={{ vault_gitea_db_password }} - - POSTGRES_USER=gitea - volumes: - - /mnt/homelab/apps/gitea/data/db:/var/lib/postgresql/data - networks: - - proxy-net - healthcheck: - test: ["CMD-SHELL", "pg_isready -U gitea -d gitea"] - interval: 10s - timeout: 5s - retries: 5 - start_period: 30s - deploy: - replicas: 1 - placement: - constraints: - - node.hostname == {{ gitea_placement_node | default('swarm-manager-1') }} - resources: - limits: - memory: 1G - cpus: "0.5" - restart_policy: - condition: on-failure - delay: 10s - max_attempts: 3 - window: 60s - update_config: - parallelism: 1 - order: stop-first - failure_action: rollback - delay: 10s - monitor: 30s - rollback_config: - parallelism: 1 - order: stop-first - -networks: - proxy-net: - external: true - name: proxy-net - - # runner: - # image: gitea/act_runner:0.2.13 - # deploy: - # replicas: 1 - # placement: - # constraints: - # - node.hostname == {{ gitea_placement_node | default('swarm-manager-1') }} - # resources: - # limits: - # memory: 512M - # cpus: "0.3" - # restart_policy: - # condition: on-failure - # volumes: - # - /mnt/homelab/apps/gitea/data/config.yaml:/config.yaml - # - /mnt/homelab/apps/gitea/data/runner/data:/data - # - /var/run/docker.sock:/var/run/docker.sock - # environment: - # - TZ=America/New_York - # - CONFIG_FILE=/config.yaml - # - GITEA_INSTANCE_URL=https://git.castaldifamily.com - # - GITEA_RUNNER_REGISTRATION_TOKEN=SET_VAULT_TOKEN_IF_ENABLING_RUNNER - # - GITEA_RUNNER_NAME=homelab - # networks: - # - proxy-net diff --git a/ansible/archive/templates/stacks/plex.stack.yml b/ansible/archive/templates/stacks/plex.stack.yml deleted file mode 100644 index 256062a..0000000 --- a/ansible/archive/templates/stacks/plex.stack.yml +++ /dev/null @@ -1,112 
+0,0 @@ -x-info: - github: https://github.com/linuxserver/docker-plex - docs: https://docs.linuxserver.io/images/docker-plex/ - changelog: https://github.com/plexinc/pms-docker/releases - homelab_status: active - last_updated: 2026-03-13 - -# Managed by Ansible — manual edits will be overwritten on next deploy. -# Source: ansible/templates/stacks/plex.stack.yml -# Deploy: ansible-playbook -i inventory/hosts.ini playbooks/docker/deploy_plex.yml -# -# SECRETS REQUIRED: -# vault_plex_claim must be defined in group_vars/vault/all.yml. -# Encrypt with: -# ansible-vault encrypt_string 'claim-XXXX' --name 'vault_plex_claim' -# PLEX_CLAIM is a bootstrap token used only on first server claim; it is -# ignored by Plex on subsequent starts. -# -# DEVICE PASSTHROUGH NOTE: -# Docker Swarm has limited support for 'devices:' in service specs (requires -# Docker Engine 20.10+ and a single-node placement constraint). Hardware -# transcoding (/dev/dri, /dev/dvb) is pinned to swarm-manager-1. If your -# Swarm Engine does not support device passthrough, remove the 'devices:' -# block and rely on CPU transcoding only. - -version: "3.9" - -services: - plex: - image: lscr.io/linuxserver/plex:1.42.2.10156-f737b826c-ls283 - environment: - - PUID=1000 - - PGID=1000 - - TZ=America/New_York - - VERSION=docker - - PLEX_CLAIM={{ vault_plex_claim }} - # ADVERTISE_IP: comma-separated list; Plex clients select the best path. - # External clients → Traefik HTTPS on port 443 (websecure entrypoint). - # LAN clients → direct Swarm routing mesh on port 32400 (fast path, - # no Traefik hop; required for SSDP/GDM discovery). - - ADVERTISE_IP=https://plex.castaldifamily.com:443/,http://{{ edge_routing.swarm.bind_ip }}:32400/ - # ACCESS POLICY (Option B — split-access): - # LAN (10.0.0.0/24, 10.0.200.0/24): direct on port 32400 via Swarm routing mesh. - # External: Traefik HTTPS only (see deploy.labels block below). - # FIREWALL REQUIREMENT: block port 32400 from VLAN 30 (Guest) and VLAN 50 (IoT). 
- ports: - - "32400:32400" - volumes: - - /mnt/homelab/apps/plex/data:/config - - /mnt/media/tvshows:/tv - - /mnt/media/movies:/movies - # WHY absolute paths: Swarm services have no well-defined working directory. - # Relative paths (e.g. ./data) are unsafe in Swarm stacks. - # - # Device passthrough — requires Docker Engine >= 20.10 and single-node placement. - # If a device is absent: Docker ignores it and Plex falls back to CPU transcoding. - # Run deploy_plex.yml to see preflight warnings if GPU devices are absent. - devices: - - /dev/renderD128:/dev/renderD128 - - /dev/dri:/dev/dri - - /dev/dvb:/dev/dvb - networks: - - proxy-net - # Top-level labels: used by homepage widget discovery (non-Swarm consumers). - labels: - - "homepage.name=Plex" - - "homepage.icon=si:plex" - - "homepage.url=https://plex.castaldifamily.com" - - "homepage.description=Movies & shows" - deploy: - replicas: 1 - placement: - constraints: - # WHY pinned to swarm-manager-1: media volumes and hardware device - # nodes are local to this host. Update if media/GPU lives elsewhere. - - node.hostname == swarm-manager-1 - labels: - # WHY deploy.labels (not top-level): traefik-kop reads Swarm *service* - # labels via the Docker API. Top-level labels are on the container image, - # not the Swarm service — traefik-kop will ignore them. - - "traefik.enable=true" - - "traefik.http.routers.plex.rule=Host(`plex.castaldifamily.com`)" - - "traefik.http.routers.plex.entrypoints=websecure" - - "traefik.http.routers.plex.tls=true" - - "traefik.http.routers.plex.tls.certresolver=cloudflare" - # WHY server.url (not server.port): routes external Traefik to the Swarm - # routing mesh IP rather than guessing a container IP. Consistent with - # gitea.stack.yml pattern. 
- - "traefik.http.services.plex.loadbalancer.server.url=http://{{ edge_routing.swarm.bind_ip }}:32400" - resources: - limits: - memory: 2G - cpus: "2.0" - restart_policy: - condition: on-failure - delay: 10s - max_attempts: 3 - window: 60s - update_config: - parallelism: 1 - order: start-first - failure_action: rollback - delay: 10s - monitor: 30s - rollback_config: - parallelism: 1 - order: stop-first - -networks: - proxy-net: - external: true - name: proxy-net diff --git a/ansible/archive/templates/stacks/portainer-agent.stack.yml b/ansible/archive/templates/stacks/portainer-agent.stack.yml deleted file mode 100644 index fac61b9..0000000 --- a/ansible/archive/templates/stacks/portainer-agent.stack.yml +++ /dev/null @@ -1,78 +0,0 @@ -x-info: - github: https://github.com/portainer/agent - docs: https://docs.portainer.io/admin/environments/add/swarm/agent - homelab_status: stable - last_updated: 2026-03-13 - -# portainer-agent Swarm stack -# Managed by Ansible — manual edits will be overwritten on next deploy. -# Deploy via: -# ansible-playbook -i inventory/hosts.ini playbooks/docker/deploy_swarm_stack.yml \ -# -e "stack_name=portainer-agent" \ -# -e "stack_compose_src=/home/chester/homelab/ansible/templates/stacks/portainer-agent.stack.yml" -# -# WHAT THIS DOES: -# Deploys the Portainer Agent as a global Swarm service — one instance on every -# node in the cluster. Portainer on Watchtower connects to any manager IP on -# port 9001 (AGENT_PORT) to discover and manage the full Swarm. 
-# -# HOW TO ADD TO PORTAINER UI: -# Environments → Add Environment → Docker Swarm → Agent -# Name: homelab-swarm -# Agent: 10.0.0.211:9001 (any Swarm manager IP) - -version: "3.9" - -services: - portainer-agent: - image: portainer/agent:latest - volumes: - - /var/run/docker.sock:/var/run/docker.sock:ro - - /var/lib/docker/volumes:/var/lib/docker/volumes - environment: - - AGENT_CLUSTER_ADDR=tasks.portainer-agent - # WHY tasks.portainer-agent: Swarm DNS resolves the service task IPs, - # allowing agents on each node to discover each other for cluster mode. - networks: - - portainer-agent-net - ports: - - target: 9001 - published: 9001 - protocol: tcp - mode: host - # WHY mode: host (not ingress): Portainer connects to a *specific* agent - # instance on each node to gather that node's local container data. - # Ingress mode would load-balance across all nodes, breaking per-node views. - deploy: - mode: global - # WHY global: one agent per node. Portainer needs an agent on every node - # to show per-node container stats, logs, and volume state. - placement: - constraints: - - node.platform.os == linux - resources: - limits: - memory: 128M - cpus: "0.1" - restart_policy: - condition: on-failure - delay: 5s - max_attempts: 3 - window: 30s - update_config: - parallelism: 1 - order: start-first - failure_action: rollback - delay: 10s - monitor: 30s - rollback_config: - parallelism: 1 - order: stop-first - -networks: - portainer-agent-net: - driver: overlay - attachable: true - # WHY attachable overlay: agents communicate with each other over this - # dedicated network for cluster-aware discovery (AGENT_CLUSTER_ADDR). - # Separate from proxy-net to isolate management traffic from app traffic. 
diff --git a/ansible/archive/templates/stacks/traefik-kop.stack.yml b/ansible/archive/templates/stacks/traefik-kop.stack.yml deleted file mode 100644 index b645ef4..0000000 --- a/ansible/archive/templates/stacks/traefik-kop.stack.yml +++ /dev/null @@ -1,65 +0,0 @@ -# traefik-kop Swarm stack -# Managed by Ansible — manual edits will be overwritten on next deploy. -# Source vars: group_vars/all.yml (edge_routing.swarm.*) -# Deploy via: ansible-playbook playbooks/docker/deploy_traefik_kop.yml -# -# WHAT THIS DOES: -# Runs as a Swarm service on a manager node. Reads Docker service labels -# (traefik.enable=true etc.) from Swarm services and publishes routing -# rules into the Redis instance on Heimdall ({{ edge_routing.integration.redis_addr }}). -# Traefik then picks up these routes from Redis automatically. -# -# NETWORK NOTE: -# proxy-net here is a Swarm overlay network — distinct from the bridge -# network of the same name on Heimdall. The overlay allows future Swarm -# services to declare `networks: [proxy-net]` and be discoverable by kop. -version: "3.9" - -services: - traefik-kop: - image: "{{ edge_routing.integration.agent_image }}" - volumes: - - /var/run/docker.sock:/var/run/docker.sock:ro - # WHY :ro — kop only reads Swarm service state, never modifies Docker. - # Read-only mount is defence-in-depth against container escape. - environment: - - REDIS_ADDR={{ edge_routing.integration.redis_addr }} - - BIND_IP={{ edge_routing.swarm.bind_ip }} - # WHY BIND_IP is a Swarm node IP (not Heimdall): - # kop writes "route traffic for to BIND_IP:". - # The Swarm routing mesh makes published ports available on ALL nodes, - # so Traefik sends the request here and the mesh handles the rest. - networks: - - proxy-net - deploy: - replicas: 1 - placement: - constraints: - - node.role == manager - # WHY manager only: only manager nodes hold full Swarm Raft state. - # A worker node has an incomplete view of all services and their labels. 
- restart_policy: - condition: on-failure - delay: 5s - max_attempts: 3 - window: 30s - # WHY on-failure (not always): avoids rapid reconnect storms - # against Redis during a network partition. - update_config: - parallelism: 1 - order: start-first - failure_action: rollback - delay: 10s - monitor: 30s - # WHY start-first: new task starts before old one stops, giving - # zero downtime. Rollback triggers if monitoring detects failure. - rollback_config: - parallelism: 1 - order: stop-first - -networks: - proxy-net: - external: true - name: "{{ edge_routing.swarm.proxy_network }}" - # WHY external: this overlay network is pre-created in the deploy playbook - # so future Swarm service stacks can also join it without stack coupling. diff --git a/ansible/group_vars/all.yml b/ansible/group_vars/all.yml deleted file mode 100644 index 45f8da5..0000000 --- a/ansible/group_vars/all.yml +++ /dev/null @@ -1,37 +0,0 @@ ---- -# Global variables for all hosts -# These apply to every host in the inventory unless overridden - -# Network Configuration -network: - gateway: 10.0.0.2 - dns_servers: - - 10.0.0.2 - - 8.8.8.8 - subnet: 10.0.0.0/24 - -# Time and Locale -timezone: America/New_York -locale: en_US.UTF-8 - -# SSH Configuration -ssh_port: 22 -ssh_key_type: ed25519 - -# Docker Configuration -docker: - version: latest - compose_version: latest - registry_mirrors: [] - -# Security Defaults -security: - ufw_enabled: false - fail2ban_enabled: false - automatic_updates: true - -# Maintenance Windows -maintenance: - reboot_allowed: true - reboot_time: "03:00" - update_cache_valid_time: 3600 diff --git a/ansible/group_vars/all/gitvana_bun.yml b/ansible/group_vars/all/gitvana_bun.yml deleted file mode 100644 index f7b2794..0000000 --- a/ansible/group_vars/all/gitvana_bun.yml +++ /dev/null @@ -1,11 +0,0 @@ ---- -# Gitvana deployment defaults. Override these in host_vars or via -e as needed. 
-gitvana_repo_url: https://git.castaldifamily.com/nathan/gitvana -gitvana_repo_version: main - -# Runtime mode can be dev or preview. Preview is recommended for long-running service. -gitvana_run_mode: preview -gitvana_service_port: 3000 - -# Optional: If repository access requires a dedicated SSH key on target hosts, set this path. -# gitvana_git_key_file: /home/chester/.ssh/id_ed25519 diff --git a/ansible/group_vars/all/vault.yml b/ansible/group_vars/all/vault.yml deleted file mode 100644 index cb43ba7..0000000 --- a/ansible/group_vars/all/vault.yml +++ /dev/null @@ -1,78 +0,0 @@ -$ANSIBLE_VAULT;1.1;AES256 -33613665626662643762613134343430336530326237613537383438383532613662306165643738 -3831663166653966376330616361663831303231326266380a613537353561343839653332613966 -64326632333130313233636166363666623564623430623039393765626631616233356664366436 -6537643565626561340a336232653134306662623166316337303862363134363066383834633937 -33316265343031313738303235383266336331633230363963383032396262373831613234646130 -30396663376635666266626663323364623536613238383265336465633432313931356432346439 -65636637333237376532383732316139636235323432616133336633613330356562646531363465 -37646434656432663062613461383736643262373066316632366262636539353638313665616137 -39376339636235333138633665393336363133396231373438383036646565323838663936303935 -33306434306333363238393361393235633339633137653430643635356534343838663565376461 -32373930613633393864636630303039383635303437343938383263646232623866653565623238 -63363864326163663232336233316132633030643536353165383966613630663139346137663738 -34623465633661346662656533626561386665343162303963633135326232373733326633333764 -65383664623063316438346235336137323933386563623230666562393638373832373266373265 -35336666643435643463323639343137363663643934313830306430626663316530633035323234 -66383763393036373533393933663832336537613566353233353830343435386139633433646434 
-31306663363166336432383261363331383864376161656430383739383236643964323134393966 -32376538353435613363313034346564303231353633333764373932346335613930656361346338 -30626334353066303861323135366166653330303265666537396134306530333336363932346338 -34653634613335613435366165393261616634613261336539656333333736626133643435313965 -62656430333737646361336132383436616561633166383261356331346231313964393335663864 -39626161633033363438643831333431316262373832373131346437343962636634663432366531 -64313362366531326537643237346437366161636566393739616565633439343733363164623337 -35383137376632656538373835346666613938396539353934633239623364666665656132326335 -34646232373966303834366439303161343433333432613831336465373234336234363731393266 -36633638373532333063373837613839303431663330306661663361626362643939303362326533 -37623030373163613734393432623432666331666161643832333661313936326532363834343563 -36623962616239336362626466396639653937363661376533346638306161653435343365613336 -30666638393132663539373434376138613530643331346261333132663938363433613366646561 -32633861306134373463623163383838666135343635623465343865373365653038336462613864 -31616535653830303262643338623864366239313038323734663663333334336161393332366330 -39333938636531646336626332643838303434653732353032306664656631313234316466333764 -62646530643639613834316632666563363830396438343533353633633836373533323563383334 -35646533623334393866333463303439396662333931316639346564343738613539643037336564 -64346264633434643539313863656364613262343037386531353735333832343963366464623332 -39386430626464643565333761303864646339623939623664306265373137366261323539326331 -66316365616236383237336533333765666363343163646531343932386430346662343762316565 -62343335316633636364333963383038613331323238353838396339626666376430346466633565 -37363666633636386165346364313433343637343932313763316562613032303764316639646433 -38393038383638613737636362333661663562326361363037626234306333326336656363633138 
-63326262663136643639373366616134326564396561656666336635343639346132656237346264 -39666563663330393235336236646166616137626538613934303063663061333538656530373830 -61336663333136373662633862326238666361303538306561336162303330626166353438366335 -37336561343462303238366630356630303530333839373038323337343065383366633437323136 -39306364336265626538313833336561383132383737336333366332616632623930376630633136 -37373263653534323936643731386430396332663963613232376232313733633861323739353562 -33653536663232326464333937313463343930363462653430663961613335306439616131653962 -66333263373737376637366334313139643030663036663932633231336533396639323864383265 -62666662376533393834636661393862616435376538363334626135396236386533353463343938 -36636531396362356561303338643833386337363761633530303333373063626135356535323138 -32626338396438623065313063343037303130353136653466323537666631353237653761653938 -31326233343963373639346136626562333432386166336632333163383363663732656633366430 -62326330316637393639323733653866623532343732313366363366366139376130303464643961 -66323166356533383763343064366466623638343564656130656432313233316330613939326532 -35353530653562646131363862373966613363653133343039323336623131386238303635353662 -36616262613264636236376230333862653737323933663735616334623330313039643630343832 -64663531613061366236373730333934643932333966303835336261386633646466623765643133 -37653036303964616130646637383763656665353866383538643439666132613261633066643234 -63376130333264653863613139303161326633643861666236326363316364613033303631623537 -37393765616438653436336536303837333339336634316230336432623966633863643937613838 -38376565613039346564356265303338653538613566373837643165336161343031306631396434 -66336336623438376563336265613133613430346461373733383439373731383862316237393030 -65393837333461653432313063373034356235653837386436313564653963383232646533303337 -64356136383965393437313961343132303663373861646437616632333261396334343336646237 
-66373266633531653932656638646630393165373334343564653765343136353437303338386132 -35316333653362633632663736323565663061366630303931666538303138303365663265346131 -31633762383832626539363266316532393738636637323961303930643666383633633963376164 -33613366326161303764643366366265396539376631646435356332613935343931633436643664 -64636637393637376230363937373539363337616636623565323432656235306461383136626262 -31356662316132613634663062613934633831313634663634663634633666633864353936396561 -63643133356530326261353234643864363761646264343566353764343237373361313436303732 -63393137613931326262373263633066316534353931636139326164396137313435343635663839 -36306333343361343866393134313735356564303562393934393130316461613763343337623632 -37363739626331393564303438383163626433356534303066643662396561323061643137363339 -30623734663638396335613063376364366264613862643333316264353663613537333166316232 -62333731623163656232666533666139376463666166376631316231356664343236616533313562 -30666662323566363962616461386339626439343131633031613234636261383363 diff --git a/ansible/group_vars/proxmox_cluster.yml b/ansible/group_vars/proxmox_cluster.yml deleted file mode 100644 index 1ef4277..0000000 --- a/ansible/group_vars/proxmox_cluster.yml +++ /dev/null @@ -1,5 +0,0 @@ ---- -# Cluster-scoped overrides for Proxmox-hosted workloads -# Override these values here when multiple Proxmox nodes diverge. 
- -openapply_proxmox_validate_certs: false diff --git a/ansible/inventory/host_vars/ai-p410.yml b/ansible/inventory/host_vars/ai-p410.yml deleted file mode 100644 index 2578d36..0000000 --- a/ansible/inventory/host_vars/ai-p410.yml +++ /dev/null @@ -1,41 +0,0 @@ ---- -# Host-specific variables for ai-p410 -# IP: 10.0.0.202 -# Auto-generated: 2026-04-21T15:40:11Z - -# Hardware Details -hardware: - platform: physical_server - cpu: Intel(R) Xeon(R) CPU E5-1630 v4 @ 3.70GHz (8 cores) - memory_gb: 126 - storage_gb: 100 - architecture: x86_64 - -# Operating System -os: - distribution: Ubuntu - version: "24.04" - codename: "Noble" - kernel: 6.8.0-110-generic - -# GPU Configuration -gpu: - enabled: true - device: /dev/dri - info: "/usr/bin/lspci -01:00.0 VGA compatible controller: NVIDIA Corporation GM206GL [Quadro M2000] (rev a1)" - -# Docker Status -docker: - installed: False - -# NFS Configuration -nfs: - mounts_configured: False - -# Network Configuration -network: - primary_ip: 10.0.0.202 - primary_interface: eno1 - hostname: ai-p410 - fqdn: ai-p410 diff --git a/ansible/inventory/host_vars/heimdall.yml b/ansible/inventory/host_vars/heimdall.yml deleted file mode 100644 index f19351d..0000000 --- a/ansible/inventory/host_vars/heimdall.yml +++ /dev/null @@ -1,73 +0,0 @@ ---- -# Host-specific variables for heimdall -# IP: 10.0.0.151 -# Auto-generated: 2026-04-21T16:25:55Z - -# Hardware Details -hardware: - platform: physical_server - cpu: Intel(R) N100 (4 cores) - memory_gb: 15 - storage_gb: 100 - architecture: x86_64 - -# Operating System -os: - distribution: Ubuntu - version: "24.04" - codename: "Noble" - kernel: 6.8.0-107-generic - -# GPU Configuration -gpu: - enabled: true - device: /dev/dri - info: "/usr/bin/lspci -00:02.0 VGA compatible controller: Intel Corporation Alder Lake-N [UHD Graphics]" - -# Docker Status -docker: - installed: True - version: "Docker version 29.4.0, build 9d7ad9f" - running_containers: - - gitea-server - - gitea-db - - goaccess - - 
goaccess-cron - - prowlarr - - wizarr - - profilarr - - tracearr - - tracearr-redis - - tracearr-db - - docker_registry - - radarr - - sonarr - - sabnzbd - - seerr - - trek - - authentik_worker - - authentik_server - - authentik_postgres - - authentik_redis - - vaultwarden - - komodo-core - - komodo-periphery-heimdall - - traefik - - docker-socket-proxy - - redis - - komodo-db - -# NFS Configuration -nfs: - mounts_configured: True - mount_details: | - 10.0.0.250:/Volume2/media on /mnt/media type nfs4 (rw,relatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=10.0.0.151,local_lock=none,addr=10.0.0.250) - 10.0.0.250:/Volume1/appdata on /mnt/appdata type nfs4 (rw,relatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=10.0.0.151,local_lock=none,addr=10.0.0.250) - -# Network Configuration -network: - primary_ip: 10.0.0.151 - primary_interface: bond0 - hostname: heimdall - fqdn: heimdall diff --git a/ansible/inventory/host_vars/pve01.yml b/ansible/inventory/host_vars/pve01.yml deleted file mode 100644 index d485a2f..0000000 --- a/ansible/inventory/host_vars/pve01.yml +++ /dev/null @@ -1,41 +0,0 @@ ---- -# Host-specific variables for pve01 -# IP: 10.0.0.201 -# Auto-generated: 2026-04-21T16:25:55Z - -# Hardware Details -hardware: - platform: physical_server - cpu: 13th Gen Intel(R) Core(TM) i5-13500T (14 cores) - memory_gb: 15 - storage_gb: 8 - architecture: x86_64 - -# Operating System -os: - distribution: Debian - version: "13.4" - codename: "Trixie" - kernel: 6.17.2-1-pve - -# GPU Configuration -gpu: - enabled: true - device: /dev/dri - info: "/usr/bin/lspci -0000:00:02.0 VGA compatible controller: Intel Corporation AlderLake-S GT1 (rev 0c)" - -# Docker Status -docker: - installed: False - -# NFS Configuration -nfs: - mounts_configured: False - -# Network Configuration -network: - primary_ip: 10.0.0.201 - primary_interface: vmbr0 - hostname: pve01 - fqdn: 
pve01.local diff --git a/ansible/inventory/host_vars/waldorf.yml b/ansible/inventory/host_vars/waldorf.yml deleted file mode 100644 index 5524adb..0000000 --- a/ansible/inventory/host_vars/waldorf.yml +++ /dev/null @@ -1,52 +0,0 @@ ---- -# Host-specific variables for waldorf -# IP: 10.0.0.251 -# Auto-generated: 2026-04-21T16:25:56Z - -# Hardware Details -hardware: - platform: physical_server - cpu: Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz (8 cores) - memory_gb: 16 - storage_gb: 100 - architecture: x86_64 - -# Operating System -os: - distribution: Ubuntu - version: "24.04" - codename: "Noble" - kernel: 6.8.0-107-generic - -# GPU Configuration -gpu: - enabled: true - device: /dev/dri - info: "/usr/bin/lspci -01:00.0 VGA compatible controller: NVIDIA Corporation GP106M [GeForce GTX 1060 Mobile Rev. 2] (rev a1)" - -# Docker Status -docker: - installed: True - version: "Docker version 29.4.0, build 9d7ad9f" - running_containers: - - plex - - pinchflat - - komodo-periphery-waldorf - - docker-socket-proxy - - buildx_buildkit_default - - tunarr - -# NFS Configuration -nfs: - mounts_configured: True - mount_details: | - 10.0.0.250:/Volume1/appdata on /mnt/appdata type nfs4 (rw,relatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=10.0.0.251,local_lock=none,addr=10.0.0.250) - 10.0.0.250:/Volume2/media on /mnt/media type nfs4 (rw,relatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=10.0.0.251,local_lock=none,addr=10.0.0.250) - -# Network Configuration -network: - primary_ip: 10.0.0.110 - primary_interface: enp0s31f6 - hostname: waldorf - fqdn: waldorf diff --git a/ansible/inventory/host_vars/watchtower.yml b/ansible/inventory/host_vars/watchtower.yml deleted file mode 100644 index ec540c0..0000000 --- a/ansible/inventory/host_vars/watchtower.yml +++ /dev/null @@ -1,48 +0,0 @@ ---- -# Host-specific variables for watchtower -# IP: 10.0.0.200 -# Auto-generated: 
2026-04-21T16:25:55Z - -# Hardware Details -hardware: - platform: physical_server - cpu: 2 (4 cores) - memory_gb: 16 - storage_gb: 2 - architecture: aarch64 - -# Operating System -os: - distribution: Debian - version: "13.4" - codename: "Trixie" - kernel: 6.12.75+rpt-rpi-2712 - -# GPU Configuration -gpu: - enabled: true - device: /dev/dri - info: "/usr/bin/lspci" - -# Docker Status -docker: - installed: True - version: "Docker version 29.4.0, build 9d7ad9f" - running_containers: - - vscode - - komodo-perihery-watchtower - - traefik-kop - - docker-socket-proxy - -# NFS Configuration -nfs: - mounts_configured: True - mount_details: | - 10.0.0.250:/Volume1/appdata on /mnt/appdata type nfs (rw,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,nolock,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=10.0.0.250,mountvers=3,mountport=39187,mountproto=udp,local_lock=all,addr=10.0.0.250,x-systemd.automount) - -# Network Configuration -network: - primary_ip: 10.0.0.80 - primary_interface: eth0 - hostname: watchtower - fqdn: watchtower diff --git a/ansible/inventory/hosts.ini b/ansible/inventory/hosts.ini deleted file mode 100644 index 211d750..0000000 --- a/ansible/inventory/hosts.ini +++ /dev/null @@ -1,58 +0,0 @@ -# Ansible Inventory for Homelab Infrastructure -# This is the active inventory - do NOT use archive/inventory/hosts.ini - -# ============================================================================= -# Control Plane -# ============================================================================= -[control_plane] -watchtower ansible_host=10.0.0.200 ansible_user=chester - -# ============================================================================= -# Docker Nodes -# ============================================================================= -[docker_nodes] -heimdall ansible_host=10.0.0.151 ansible_user=chester -waldorf ansible_host=10.0.0.251 ansible_user=chester - -# Core infrastructure services (Komodo, Gitea, Traefik) -[core_services] -heimdall 
ansible_host=10.0.0.151 ansible_user=chester - -# Media services (Plex, Tunarr) -[media_services] -waldorf ansible_host=10.0.0.251 ansible_user=chester - -# ============================================================================= -# Platform Groups -# ============================================================================= -[proxmox_cluster] -pve01 ansible_host=10.0.0.201 ansible_user=root - -[physical_servers] -heimdall ansible_host=10.0.0.151 ansible_user=chester -waldorf ansible_host=10.0.0.251 ansible_user=chester -ai-p410 ansible_host=10.0.0.202 ansible_user=chester - -[ai_nodes] -ai-p410 ansible_host=10.0.0.202 ansible_user=chester - -[raspberry_pi] -watchtower ansible_host=10.0.0.200 ansible_user=chester - -# ============================================================================= -# NFS Clients (nodes with /mnt/appdata) -# ============================================================================= -[nfs_clients] -heimdall ansible_host=10.0.0.151 ansible_user=chester -waldorf ansible_host=10.0.0.251 ansible_user=chester -ai-p410 ansible_host=10.0.0.202 ansible_user=chester - -# ============================================================================= -# Group Variables -# ============================================================================= -[all:vars] -ansible_python_interpreter=/usr/bin/python3 -ansible_ssh_private_key_file=~/.ssh/id_ed25519 - -[nfs_clients:vars] -nfs_mount_point=/mnt/appdata diff --git a/ansible/playbooks/ONBOARDING.md b/ansible/playbooks/ONBOARDING.md deleted file mode 100644 index 675fa27..0000000 --- a/ansible/playbooks/ONBOARDING.md +++ /dev/null @@ -1,130 +0,0 @@ -# Node Onboarding Guide - -This guide covers onboarding new nodes into Ansible management from the watchtower control node. 
- ---- - -## Prerequisites - -**On Control Node (watchtower):** -- ✅ Ansible installed -- ✅ SSH key generated (`~/.ssh/id_ed25519`) -- ✅ Inventory configured - -**On Target Nodes:** -- SSH access with password authentication enabled -- User account with sudo privileges -- Python 3 installed - ---- - -## Quick Onboarding - -### Step 1: Update Inventory - -Edit [inventory/hosts.ini](../inventory/hosts.ini) to add the new node: - -```ini -[docker_nodes] -newnode ansible_host=10.0.0.X ansible_user=chester -``` - -### Step 2: Run Onboarding Playbook - -```bash -cd /home/chester/homelab/ansible - -# Onboard specific nodes (will prompt for passwords) -ansible-playbook playbooks/onboard-nodes.yml -k -K --limit newnode - -# Onboard all unonboarded nodes -ansible-playbook playbooks/onboard-nodes.yml -k -K --limit heimdall,waldorf -``` - -**Flags:** -- `-k` = Prompt for SSH password (initial connection) -- `-K` = Prompt for sudo password (if passwordless sudo not configured) - -### Step 3: Test Connectivity - -```bash -# Test basic connectivity -ansible newnode -m ping - -# Test with privilege escalation -ansible newnode -b -m command -a 'whoami' - -# Gather facts about the node -ansible newnode -m setup -``` - ---- - -## Current Node Status - -| Node | IP | Platform | Services | Status | -|------|-------|----------|----------|--------| -| **watchtower** | 10.0.0.200 | Raspberry Pi 5 | Control Plane, Komodo Periphery | ✅ Control Node | -| **heimdall** | 10.0.0.151 | Proxmox VM | Komodo Core, Gitea, Traefik | ⏳ Pending onboarding | -| **waldorf** | 10.0.0.251 | Physical Server | Plex, Tunarr | ⏳ Pending onboarding | - ---- - -## What the Onboarding Playbook Does - -1. ✅ Deploys watchtower's SSH public key to target node -2. ✅ Verifies passwordless sudo configuration -3. ✅ Checks Python 3 availability -4. ✅ Validates Docker installation -5. 
✅ Verifies NFS mount points - ---- - -## Post-Onboarding - -After successful onboarding, you can: - -- Use all Ansible modules without password prompts -- Run playbooks against the node -- Automate deployments and configuration management - ---- - -## Troubleshooting - -### SSH Connection Fails - -```bash -# Test manual SSH connection first -ssh chester@10.0.0.151 - -# If that works but Ansible fails, check inventory syntax -ansible-inventory --list -``` - -### Passwordless Sudo Required - -Edit `/etc/sudoers.d/90-cloud-init-users` on target node: - -```bash -# Allow user to run sudo without password -chester ALL=(ALL) NOPASSWD:ALL -``` - -### Python Not Found - -```bash -# Install Python 3 on target node -sudo apt update && sudo apt install -y python3 -``` - ---- - -## Next Steps - -After onboarding, consider: - -1. Configure automated deployments for Docker stacks -2. Set up monitoring and health checks -3. Implement backup automation -4. Create maintenance playbooks (updates, reboots, etc.) diff --git a/ansible/playbooks/RUN-GITVANA-BUN.md b/ansible/playbooks/RUN-GITVANA-BUN.md deleted file mode 100644 index 1537e7d..0000000 --- a/ansible/playbooks/RUN-GITVANA-BUN.md +++ /dev/null @@ -1,87 +0,0 @@ -# Run Gitvana Bun Deployment - -This runbook deploys Gitvana directly to Linux VM/LXC hosts using the Ansible role `gitvana_bun_host`. - -## Files added - -- `roles/gitvana_bun_host/*` -- `playbooks/deploy-gitvana-bun.yml` -- `group_vars/all/gitvana_bun.yml` - -## 1) Prepare control node - -From the repository root: - -```bash -cd ansible -ansible --version -ansible-galaxy collection install -r requirements.yml -``` - -## 2) Confirm target host access - -Use your existing inventory and test connectivity: - -```bash -ansible -i inventory/hosts.ini docker_nodes -m ping -``` - -If you want a single host, replace `docker_nodes` with a host like `heimdall`. 
- -## 3) Review or override deployment variables - -Default values are in `group_vars/all/gitvana_bun.yml` and role defaults. - -Common overrides: - -```bash --e gitvana_target_hosts=heimdall --e gitvana_repo_version=main --e gitvana_service_port=3000 --e gitvana_run_mode=preview -``` - -## 4) Run deployment - -Deploy to all `docker_nodes`: - -```bash -ansible-playbook -i inventory/hosts.ini playbooks/deploy-gitvana-bun.yml -``` - -Deploy to one host: - -```bash -ansible-playbook -i inventory/hosts.ini playbooks/deploy-gitvana-bun.yml -e gitvana_target_hosts=heimdall -``` - -## 5) Verify service - -Check status on target host: - -```bash -sudo systemctl status gitvana --no-pager -sudo journalctl -u gitvana -n 100 --no-pager -curl -I http://127.0.0.1:3000/ -``` - -## 6) Day-2 operations - -Redeploy after code updates: - -```bash -ansible-playbook -i inventory/hosts.ini playbooks/deploy-gitvana-bun.yml -e gitvana_target_hosts=heimdall -``` - -Restart service only: - -```bash -ansible -i inventory/hosts.ini heimdall -b -m ansible.builtin.systemd -a "name=gitvana state=restarted" -``` - -## Troubleshooting quick checks - -- Ensure Bun is present: `which bun && bun --version` -- Ensure app directory is owned by runtime user: `ls -la /opt/gitvana` -- Ensure service unit exists: `cat /etc/systemd/system/gitvana.service` -- Ensure selected host can access the git repository URL over network diff --git a/ansible/playbooks/deploy-aitutor-vm.yml b/ansible/playbooks/deploy-aitutor-vm.yml deleted file mode 100644 index 6b9abf0..0000000 --- a/ansible/playbooks/deploy-aitutor-vm.yml +++ /dev/null @@ -1,16 +0,0 @@ ---- -- name: Provision VM on Proxmox for AI Tutor - hosts: localhost - gather_facts: false - connection: local - - roles: - - role: proxmox_vm_deploy - -- name: Install AI Tutor on provisioned VM - hosts: aitutor_vm - gather_facts: true - become: true - - roles: - - role: aitutor_install diff --git a/ansible/playbooks/deploy-gitvana-bun.yml 
b/ansible/playbooks/deploy-gitvana-bun.yml deleted file mode 100644 index d3e8120..0000000 --- a/ansible/playbooks/deploy-gitvana-bun.yml +++ /dev/null @@ -1,15 +0,0 @@ ---- -- name: Deploy Gitvana on Linux VM/LXC hosts with Bun - hosts: "{{ gitvana_target_hosts | default('docker_nodes') }}" - gather_facts: true - become: true - - pre_tasks: - - name: Validate target host pattern - ansible.builtin.assert: - that: - - (gitvana_target_hosts | default('docker_nodes')) | length > 0 - fail_msg: "gitvana_target_hosts must not be empty." - - roles: - - role: gitvana_bun_host diff --git a/ansible/playbooks/gather-node-facts.yml b/ansible/playbooks/gather-node-facts.yml deleted file mode 100644 index d4ab135..0000000 --- a/ansible/playbooks/gather-node-facts.yml +++ /dev/null @@ -1,140 +0,0 @@ ---- -# Gather Node Facts Playbook -# Purpose: Collect accurate system information from nodes for inventory -# Usage: ansible-playbook playbooks/gather-node-facts.yml -# Add --limit to target specific nodes -# Use -k flag only if nodes aren't onboarded yet - -- name: Gather facts from managed nodes - hosts: all - gather_facts: true - become: false - vars: - output_dir: "{{ playbook_dir }}/../inventory/host_vars" - tasks: - - name: Display discovered facts summary - ansible.builtin.debug: - msg: - - "======================================" - - "Host: {{ inventory_hostname }}" - - "======================================" - - "FQDN: {{ ansible_fqdn }}" - - "Distribution: {{ ansible_distribution }} {{ ansible_distribution_version }}" - - "Kernel: {{ ansible_kernel }}" - - "Architecture: {{ ansible_architecture }}" - - "CPU Model: {{ ansible_processor[2] | default('N/A') }}" - - "CPU Cores: {{ ansible_processor_vcpus }}" - - "Memory: {{ (ansible_memtotal_mb / 1024) | round(0) }} GB" - - "Primary IP: {{ ansible_default_ipv4.address }}" - - "Hostname: {{ ansible_hostname }}" - - - name: Check for GPU devices - ansible.builtin.stat: - path: /dev/dri - register: gpu_check - - - name: Detect 
GPU information (if available) - ansible.builtin.shell: | - if command -v lspci &> /dev/null; then - lspci | grep -i vga | head -1 - else - echo "lspci not available" - fi - register: gpu_info - changed_when: false - failed_when: false - when: gpu_check.stat.exists - - - name: Check Docker installation - ansible.builtin.command: docker --version - register: docker_version - changed_when: false - failed_when: false - - - name: Check NFS mounts - ansible.builtin.shell: mount | grep nfs || echo "No NFS mounts" - register: nfs_mounts - changed_when: false - failed_when: false - - - name: Detect running Docker containers - ansible.builtin.command: docker ps --format "{{ '{{' }}.Names{{ '}}' }}" - register: docker_containers - changed_when: false - failed_when: false - when: docker_version.rc == 0 - - - name: Generate host_vars content - ansible.builtin.set_fact: - host_vars_content: | - --- - # Host-specific variables for {{ inventory_hostname }} - # IP: {{ ansible_host }} - # Auto-generated: {{ ansible_date_time.iso8601 }} - - # Hardware Details - hardware: - platform: {{ 'proxmox_vm' if 'pve' in ansible_system_vendor | lower else 'physical_server' }} - cpu: {{ ansible_processor[2] if ansible_processor | length > 2 else ansible_processor[0] }} ({{ ansible_processor_vcpus }} cores) - memory_gb: {{ (ansible_memtotal_mb / 1024) | round(0) | int }} - storage_gb: {{ (ansible_devices[ansible_devices.keys() | list | first].size | replace('GB', '') | float) | round(0) | int if ansible_devices else 'unknown' }} - architecture: {{ ansible_architecture }} - - # Operating System - os: - distribution: {{ ansible_distribution }} - version: "{{ ansible_distribution_version }}" - codename: "{{ ansible_distribution_release | title }}" - kernel: {{ ansible_kernel }} - - {% if gpu_check.stat.exists and gpu_info.stdout != "lspci not available" %} - # GPU Configuration - gpu: - enabled: true - device: /dev/dri - info: "{{ gpu_info.stdout }}" - {% endif %} - - # Docker Status - docker: - 
installed: {{ docker_version.rc == 0 }} - {% if docker_version.rc == 0 %} - version: "{{ docker_version.stdout }}" - {% endif %} - {% if docker_containers.stdout_lines | default([]) | length > 0 %} - running_containers: - {% for container in docker_containers.stdout_lines %} - - {{ container }} - {% endfor %} - {% endif %} - - # NFS Configuration - nfs: - mounts_configured: {{ 'nfs' in nfs_mounts.stdout }} - {% if 'nfs' in nfs_mounts.stdout %} - mount_details: | - {{ nfs_mounts.stdout | indent(6) }} - {% endif %} - - # Network Configuration - network: - primary_ip: {{ ansible_default_ipv4.address }} - primary_interface: {{ ansible_default_ipv4.interface }} - hostname: {{ ansible_hostname }} - fqdn: {{ ansible_fqdn }} - - - name: Display generated host_vars - ansible.builtin.debug: - msg: "{{ host_vars_content }}" - - - name: Save host_vars to file (local action) - delegate_to: localhost - ansible.builtin.copy: - content: "{{ host_vars_content }}" - dest: "{{ output_dir }}/{{ inventory_hostname }}.yml" - mode: "0644" - become: false - - - name: Summary - ansible.builtin.debug: - msg: "✅ Generated {{ output_dir }}/{{ inventory_hostname }}.yml" diff --git a/ansible/playbooks/onboard-ai-node.yml b/ansible/playbooks/onboard-ai-node.yml deleted file mode 100644 index deafaad..0000000 --- a/ansible/playbooks/onboard-ai-node.yml +++ /dev/null @@ -1,11 +0,0 @@ ---- -# Dedicated onboarding workflow for AI-focused nodes. 
-# Usage: ansible-playbook playbooks/onboard-ai-node.yml -K --limit ai-p410 - -- name: Onboard and baseline AI nodes - hosts: ai_nodes - gather_facts: true - become: true - - roles: - - role: ai_node_onboarding diff --git a/ansible/playbooks/onboard-nodes.yml b/ansible/playbooks/onboard-nodes.yml deleted file mode 100644 index d86a02f..0000000 --- a/ansible/playbooks/onboard-nodes.yml +++ /dev/null @@ -1,106 +0,0 @@ ---- -# Node Onboarding Playbook -# Purpose: Bootstrap new nodes for Ansible management -# Usage: ansible-playbook playbooks/onboard-nodes.yml -k -K -# (-k prompts for SSH password, -K prompts for sudo password) - -- name: Onboard new nodes to Ansible control - hosts: physical_servers - gather_facts: true - become: false - tasks: - - name: Gather OS facts - ansible.builtin.setup: - gather_subset: - - "!all" - - "!min" - - "network" - - "distribution" - - - name: Display target host information - ansible.builtin.debug: - msg: | - Onboarding {{ inventory_hostname }} - IP: {{ ansible_host }} - Distribution: {{ ansible_distribution }} {{ ansible_distribution_version }} - Architecture: {{ ansible_architecture }} - - - name: Ensure .ssh directory exists - ansible.builtin.file: - path: "/home/{{ ansible_user }}/.ssh" - state: directory - mode: "0700" - owner: "{{ ansible_user }}" - group: "{{ ansible_user }}" - - - name: Deploy watchtower SSH public key - ansible.builtin.authorized_key: - user: "{{ ansible_user }}" - state: present - key: "ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIJ9ryXcRsMITcIW+Rc0t3Qou7XGfyIeihLR2PInySogp ansible@watchtower" - comment: "ansible@watchtower" - - - name: Test passwordless sudo access - ansible.builtin.command: sudo -n true - register: sudo_check - changed_when: false - failed_when: false - - - name: Display sudo access status - ansible.builtin.debug: - msg: >- - {% if sudo_check.rc == 0 %} - ✅ Passwordless sudo is configured - {% else %} - ⚠️ Passwordless sudo is NOT configured - some playbooks may require -K flag - {% endif %} - 
- - name: Verify Python 3 is available - ansible.builtin.command: python3 --version - register: python_version - changed_when: false - - - name: Display Python version - ansible.builtin.debug: - msg: "Python: {{ python_version.stdout }}" - - - name: Check if Docker is installed - ansible.builtin.command: docker --version - register: docker_check - changed_when: false - failed_when: false - - - name: Display Docker status - ansible.builtin.debug: - msg: >- - {% if docker_check.rc == 0 %} - ✅ Docker installed: {{ docker_check.stdout }} - {% else %} - ⚠️ Docker is NOT installed - {% endif %} - - - name: Check NFS mount point - ansible.builtin.stat: - path: /mnt/appdata - register: nfs_mount - - - name: Display NFS mount status - ansible.builtin.debug: - msg: >- - {% if nfs_mount.stat.exists %} - ✅ /mnt/appdata exists - {% else %} - ⚠️ /mnt/appdata does NOT exist - {% endif %} - - - name: Create onboarding summary - ansible.builtin.debug: - msg: - - "==========================================" - - "Onboarding Complete for {{ inventory_hostname }}" - - "==========================================" - - "✅ SSH key deployed" - - "✅ Host is reachable" - - "Next steps:" - - " • Test connectivity: ansible {{ inventory_hostname }} -m ping" - - " • Verify sudo: ansible {{ inventory_hostname }} -b -m command -a 'whoami'" diff --git a/ansible/playbooks/onboard-proxmox.yml b/ansible/playbooks/onboard-proxmox.yml deleted file mode 100644 index 2011de0..0000000 --- a/ansible/playbooks/onboard-proxmox.yml +++ /dev/null @@ -1,86 +0,0 @@ ---- -# Proxmox Node Onboarding Playbook -# Purpose: Onboard Proxmox VE hosts with post-install configuration -# Usage: ansible-playbook playbooks/onboard-proxmox.yml -k --limit pve01 -# (-k prompts for root SSH password on first run) - -- name: Onboard Proxmox VE node - hosts: proxmox_cluster - gather_facts: true - become: false # Already connecting as root - - tasks: - - name: Display target host information - ansible.builtin.debug: - msg: 
| - Onboarding {{ inventory_hostname }} - IP: {{ ansible_host }} - User: {{ ansible_user }} - - - name: Ensure .ssh directory exists for root - ansible.builtin.file: - path: /root/.ssh - state: directory - mode: "0700" - owner: root - group: root - - - name: Deploy watchtower SSH public key to root - ansible.builtin.authorized_key: - user: root - state: present - key: "ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIJ9ryXcRsMITcIW+Rc0t3Qou7XGfyIeihLR2PInySogp ansible@watchtower" - comment: "ansible@watchtower" - - - name: Detect Proxmox VE version - ansible.builtin.command: pveversion - register: pve_version_check - changed_when: false - failed_when: false - - - name: Display Proxmox version - ansible.builtin.debug: - msg: | - {% if pve_version_check.rc == 0 %} - ✅ Proxmox VE detected: {{ pve_version_check.stdout }} - {% else %} - ⚠️ Could not detect Proxmox VE (pveversion command failed) - {% endif %} - - - name: Verify Python 3 is available - ansible.builtin.command: python3 --version - register: python_version - changed_when: false - - - name: Display Python version - ansible.builtin.debug: - msg: "Python: {{ python_version.stdout }}" - - - name: Run Proxmox post-install configuration - ansible.builtin.include_role: - name: proxmox_post_install - vars: - proxmox_post_install_enabled: true - proxmox_disable_subscription_nag: true - proxmox_disable_pve_enterprise: true - proxmox_enable_pve_no_subscription: true - proxmox_fix_sources: true - proxmox_fix_ceph_repos: true - proxmox_run_dist_upgrade: false # Skip for initial onboarding - proxmox_reboot_after: false # Manual control - when: pve_version_check.rc == 0 - - - name: Display onboarding summary - ansible.builtin.debug: - msg: - - "==========================================" - - "Proxmox Onboarding Complete: {{ inventory_hostname }}" - - "==========================================" - - "✅ SSH key deployed to root" - - "✅ Subscription nag removed" - - "✅ Repositories configured" - - "" - - "Next steps:" - - " • 
Test connectivity: ansible pve01 -m ping" - - " • Update system: ansible pve01 -m apt -a 'upgrade=dist update_cache=yes'" - - " • Review logs and reboot if kernel/system updates applied" diff --git a/ansible/playbooks/quick-facts.yml b/ansible/playbooks/quick-facts.yml deleted file mode 100644 index 20f0cdb..0000000 --- a/ansible/playbooks/quick-facts.yml +++ /dev/null @@ -1,39 +0,0 @@ ---- -# Quick Facts Display -# Purpose: Show key system information without saving -# Usage: ansible-playbook playbooks/quick-facts.yml -k --limit hostname - -- name: Quick system facts check - hosts: all - gather_facts: true - tasks: - - name: Display system summary - ansible.builtin.debug: - msg: - - "==========================================" - - "{{ inventory_hostname | upper }}" - - "==========================================" - - "IP Address: {{ ansible_host }}" - - "OS: {{ ansible_distribution }} {{ ansible_distribution_version }}" - - "Kernel: {{ ansible_kernel }}" - - "Arch: {{ ansible_architecture }}" - - "CPU: {{ ansible_processor[2] | default(ansible_processor[0]) }}" - - "Cores: {{ ansible_processor_vcpus }}" - - "Memory: {{ (ansible_memtotal_mb / 1024) | round(1) }} GB" - - "Disk: {{ ansible_devices.keys() | list }}" - - "Hostname: {{ ansible_hostname }}" - - "FQDN: {{ ansible_fqdn }}" - - - name: Check for key paths - ansible.builtin.stat: - path: "{{ item }}" - loop: - - /mnt/appdata - - /dev/dri - - /usr/bin/docker - register: path_checks - - - name: Display path status - ansible.builtin.debug: - msg: "{{ item.stat.path }}: {{ '✅ exists' if item.stat.exists else '❌ missing' }}" - loop: "{{ path_checks.results }}" diff --git a/ansible/playbooks/setup-ai-gpu-runtime.yml b/ansible/playbooks/setup-ai-gpu-runtime.yml deleted file mode 100644 index 8ce1ce5..0000000 --- a/ansible/playbooks/setup-ai-gpu-runtime.yml +++ /dev/null @@ -1,11 +0,0 @@ ---- -# Configure NVIDIA driver/runtime for AI nodes. 
-# Usage: ansible-playbook playbooks/setup-ai-gpu-runtime.yml -K --limit ai-p410 - -- name: Configure NVIDIA runtime on AI nodes - hosts: ai_nodes - gather_facts: true - become: true - - roles: - - role: nvidia_runtime_setup diff --git a/ansible/playbooks/test-connection.yml b/ansible/playbooks/test-connection.yml deleted file mode 100644 index ee7f83d..0000000 --- a/ansible/playbooks/test-connection.yml +++ /dev/null @@ -1,14 +0,0 @@ ---- -# Test playbook to verify Ansible control node setup -# Usage: ansible-playbook playbooks/test-connection.yml - -- name: Test connection to all hosts - hosts: all - gather_facts: true - tasks: - - name: Ping all hosts - ansible.builtin.ping: - - - name: Display host information - ansible.builtin.debug: - msg: "Connected to {{ inventory_hostname }} ({{ ansible_host }})" diff --git a/ansible/playbooks/validate-connectivity.yml b/ansible/playbooks/validate-connectivity.yml deleted file mode 100644 index 80310a2..0000000 --- a/ansible/playbooks/validate-connectivity.yml +++ /dev/null @@ -1,121 +0,0 @@ ---- -# Comprehensive Ansible Environment Validation -# Purpose: Deep health check of all managed nodes -# Usage: ansible-playbook playbooks/validate-connectivity.yml - -- name: Ansible Environment Validation - hosts: all - gather_facts: true - - tasks: - - name: Test ping module - ansible.builtin.ping: - - - name: Display node facts - ansible.builtin.debug: - msg: | - Hostname: {{ ansible_hostname }} - OS: {{ ansible_distribution }} {{ ansible_distribution_version }} - Architecture: {{ ansible_architecture }} - Python: {{ ansible_python_version }} - Total Memory: {{ (ansible_memory_mb.real.total / 1024) | round(1) }}GB - CPU Cores: {{ ansible_processor_vcpus }} - - - name: Test privilege escalation - ansible.builtin.command: - cmd: whoami - become: true - register: sudo_test - changed_when: false - - - name: Verify sudo worked - ansible.builtin.assert: - that: - - sudo_test.stdout == "root" - success_msg: "Privilege escalation: PASS" 
- fail_msg: "Privilege escalation: FAIL" - - - name: Check Docker installation - ansible.builtin.command: - cmd: docker --version - register: docker_version - changed_when: false - failed_when: false - when: inventory_hostname in groups['docker_nodes'] - - - name: Display Docker status - ansible.builtin.debug: - msg: "Docker {{ 'installed: ' + docker_version.stdout if docker_version.rc == 0 else 'NOT installed' }}" - when: inventory_hostname in groups['docker_nodes'] - - - name: Check NFS mount (infrastructure nodes only) - ansible.builtin.stat: - path: /mnt/appdata - register: nfs_mount - when: inventory_hostname in groups.get('nfs_clients', []) - - - name: Display NFS status - ansible.builtin.debug: - msg: "NFS mount /mnt/appdata: {{ 'EXISTS' if nfs_mount.stat.exists else 'MISSING' }}" - when: - - inventory_hostname in groups.get('nfs_clients', []) - - nfs_mount is defined - - - name: Check available disk space - ansible.builtin.shell: - cmd: set -o pipefail && df -h / | tail -1 | awk '{print $5}' | sed 's/%//' - executable: /bin/bash - register: disk_usage - changed_when: false - - - name: Warn if disk usage high - ansible.builtin.debug: - msg: "WARNING: Root filesystem {{ disk_usage.stdout }}% full" - when: disk_usage.stdout | int > 80 - - - name: Check system uptime - ansible.builtin.command: - cmd: uptime -p - register: uptime_output - changed_when: false - - - name: Display uptime - ansible.builtin.debug: - msg: "System uptime: {{ uptime_output.stdout }}" - -- name: Proxmox-specific validation - hosts: proxmox_cluster - gather_facts: false - - tasks: - - name: Check Proxmox version - ansible.builtin.command: - cmd: pveversion - register: pve_version - changed_when: false - - - name: Display Proxmox version - ansible.builtin.debug: - msg: "{{ pve_version.stdout_lines }}" - - - name: Check cluster status - ansible.builtin.command: - cmd: pvecm status - register: cluster_status - changed_when: false - failed_when: false - - - name: Display cluster info - 
ansible.builtin.debug: - msg: "{{ 'Cluster configured' if cluster_status.rc == 0 else 'Standalone node (no cluster)' }}" - -- name: Final summary - hosts: all - gather_facts: false - - tasks: - - name: Environment validation complete - ansible.builtin.debug: - msg: | - ✅ Validation complete for {{ inventory_hostname }} - All critical checks passed successfully. diff --git a/ansible/requirements.yml b/ansible/requirements.yml deleted file mode 100644 index 333d4bd..0000000 --- a/ansible/requirements.yml +++ /dev/null @@ -1,39 +0,0 @@ ---- -# Ansible Galaxy requirements -# Install with: ansible-galaxy install -r requirements.yml -# -# This file tracks all external collections and roles required by this repository. -# Version pinning ensures reproducible deployments. -# -# Last updated: 2026-01-10 - -collections: - # Community Proxmox Collection - # Used for: proxmox lifecycle, kvm, and nic management modules - # Docs: https://docs.ansible.com/ansible/latest/collections/community/proxmox/ - - name: community.proxmox - version: ">=1.6.0" - - # Community General Collection - # Used for: docker modules and general utilities - # Docs: https://docs.ansible.com/ansible/latest/collections/community/general/ - - name: community.general - version: ">=8.0.0" - - # Community Docker Collection - # Used for: docker_swarm, docker_container, docker_network modules - # Docs: https://docs.ansible.com/ansible/latest/collections/community/docker/ - - name: community.docker - version: ">=3.0.0" - - # Ansible POSIX Collection - # Used for: authorized_key, synchronize, sysctl modules - # Docs: https://docs.ansible.com/ansible/latest/collections/ansible/posix/ - - name: ansible.posix - version: ">=1.5.0" - -# roles: - # Add external roles here as needed - # Example: - # - name: geerlingguy.docker - # version: "6.1.0" diff --git a/ansible/roles/ai_node_onboarding/defaults/main.yml b/ansible/roles/ai_node_onboarding/defaults/main.yml deleted file mode 100644 index a538766..0000000 --- 
a/ansible/roles/ai_node_onboarding/defaults/main.yml +++ /dev/null @@ -1,36 +0,0 @@ ---- -# Toggle to true only if you intentionally want to hard-fail when NVIDIA tooling is missing. -ai_node_require_nvidia_tooling: false - -# OS packages useful for AI-node observability and build workloads. -ai_node_base_packages: - - ca-certificates - - curl - - git - - htop - - nvtop - - pciutils - - python3 - - python3-pip - - python3-venv - - tmux - -# Conservative kernel tuning for mixed service + AI workloads. -ai_node_sysctl: - vm.swappiness: "10" - vm.max_map_count: "262144" - -# AI workload directories. Keep models/data on persistent storage. -ai_node_directories: - - path: /srv/ai - owner: root - group: root - mode: "0755" - - path: /srv/ai/models - owner: root - group: root - mode: "0755" - - path: /srv/ai/workspaces - owner: root - group: root - mode: "0775" diff --git a/ansible/roles/ai_node_onboarding/tasks/gpu_checks.yml b/ansible/roles/ai_node_onboarding/tasks/gpu_checks.yml deleted file mode 100644 index f9757a2..0000000 --- a/ansible/roles/ai_node_onboarding/tasks/gpu_checks.yml +++ /dev/null @@ -1,27 +0,0 @@ ---- -- name: Check nvidia-smi availability - ansible.builtin.command: nvidia-smi --query-gpu=name,memory.total,driver_version --format=csv,noheader - register: ai_node_nvidia_smi - changed_when: false - failed_when: false - -- name: Optionally fail when NVIDIA tooling is required but unavailable - ansible.builtin.fail: - msg: >- - NVIDIA GPU tooling is unavailable. Install a compatible NVIDIA driver and - nvidia-utils package, then re-run onboarding. - when: - - ai_node_require_nvidia_tooling | bool - - ai_node_nvidia_smi.rc != 0 - -- name: Warn when nvidia-smi is unavailable - ansible.builtin.debug: - msg: >- - nvidia-smi is not available yet. This is common on fresh hosts before driver install. - Continue onboarding now, then install validated drivers separately. 
- when: ai_node_nvidia_smi.rc != 0 - -- name: Capture GPU info lines - ansible.builtin.set_fact: - ai_node_gpu_lines: "{{ ai_node_nvidia_smi.stdout_lines | default([]) }}" - when: ai_node_nvidia_smi.rc == 0 diff --git a/ansible/roles/ai_node_onboarding/tasks/main.yml b/ansible/roles/ai_node_onboarding/tasks/main.yml deleted file mode 100644 index 567508d..0000000 --- a/ansible/roles/ai_node_onboarding/tasks/main.yml +++ /dev/null @@ -1,15 +0,0 @@ ---- -- name: Validate AI node prerequisites - ansible.builtin.import_tasks: validate.yml - -- name: Install baseline packages - ansible.builtin.import_tasks: prereqs.yml - -- name: Apply kernel tuning - ansible.builtin.import_tasks: tuning.yml - -- name: Run GPU readiness checks - ansible.builtin.import_tasks: gpu_checks.yml - -- name: Show onboarding summary - ansible.builtin.import_tasks: summary.yml diff --git a/ansible/roles/ai_node_onboarding/tasks/prereqs.yml b/ansible/roles/ai_node_onboarding/tasks/prereqs.yml deleted file mode 100644 index bcddb45..0000000 --- a/ansible/roles/ai_node_onboarding/tasks/prereqs.yml +++ /dev/null @@ -1,15 +0,0 @@ ---- -- name: Install AI-node baseline packages - ansible.builtin.apt: - name: "{{ ai_node_base_packages }}" - state: present - update_cache: true - -- name: Ensure AI workload directories exist - ansible.builtin.file: - path: "{{ item.path }}" - state: directory - owner: "{{ item.owner }}" - group: "{{ item.group }}" - mode: "{{ item.mode }}" - loop: "{{ ai_node_directories }}" diff --git a/ansible/roles/ai_node_onboarding/tasks/summary.yml b/ansible/roles/ai_node_onboarding/tasks/summary.yml deleted file mode 100644 index ff4d38a..0000000 --- a/ansible/roles/ai_node_onboarding/tasks/summary.yml +++ /dev/null @@ -1,12 +0,0 @@ ---- -- name: Summarize AI-node onboarding results - ansible.builtin.debug: - msg: - - "==========================================" - - "AI node onboarding complete for {{ inventory_hostname }}" - - "==========================================" - - "RAM 
(MB): {{ ansible_memtotal_mb }}" - - "NVIDIA detected via lspci: {{ ai_node_has_nvidia_gpu | default(false) }}" - - "nvidia-smi ready: {{ (ai_node_nvidia_smi.rc | default(1)) == 0 }}" - - "GPU details: {{ ai_node_gpu_lines | default(['not available']) | join('; ') }}" - - "AI directories: {{ ai_node_directories | map(attribute='path') | list | join(', ') }}" diff --git a/ansible/roles/ai_node_onboarding/tasks/tuning.yml b/ansible/roles/ai_node_onboarding/tasks/tuning.yml deleted file mode 100644 index 5e200e5..0000000 --- a/ansible/roles/ai_node_onboarding/tasks/tuning.yml +++ /dev/null @@ -1,8 +0,0 @@ ---- -- name: Apply sysctl settings for AI workloads - ansible.posix.sysctl: - name: "{{ item.key }}" - value: "{{ item.value }}" - state: present - reload: true - loop: "{{ ai_node_sysctl | dict2items }}" diff --git a/ansible/roles/ai_node_onboarding/tasks/validate.yml b/ansible/roles/ai_node_onboarding/tasks/validate.yml deleted file mode 100644 index 0b283d0..0000000 --- a/ansible/roles/ai_node_onboarding/tasks/validate.yml +++ /dev/null @@ -1,26 +0,0 @@ ---- -- name: Assert supported operating system family - ansible.builtin.assert: - that: - - ansible_os_family == "Debian" - fail_msg: "ai_node_onboarding currently supports Debian/Ubuntu only." - -- name: Assert minimum RAM for AI node profile - ansible.builtin.assert: - that: - - ansible_memtotal_mb | int >= 16384 - fail_msg: "AI node profile expects at least 16 GB RAM." - -- name: Detect NVIDIA GPU via lspci - ansible.builtin.command: lspci - register: ai_node_lspci - changed_when: false - -- name: Derive GPU detection flag - ansible.builtin.set_fact: - ai_node_has_nvidia_gpu: "{{ 'NVIDIA' in ai_node_lspci.stdout }}" - -- name: Warn when no NVIDIA GPU is detected - ansible.builtin.debug: - msg: "No NVIDIA GPU was detected via lspci; continuing because this check is advisory." 
- when: not ai_node_has_nvidia_gpu | bool diff --git a/ansible/roles/aitutor_install/defaults/main.yml b/ansible/roles/aitutor_install/defaults/main.yml deleted file mode 100644 index 3186696..0000000 --- a/ansible/roles/aitutor_install/defaults/main.yml +++ /dev/null @@ -1,7 +0,0 @@ ---- -aitutor_npm_package: "@aitutor/cli" -aitutor_npm_version: "latest" -aitutor_extra_packages: - - nodejs - - npm - - ca-certificates diff --git a/ansible/roles/aitutor_install/tasks/main.yml b/ansible/roles/aitutor_install/tasks/main.yml deleted file mode 100644 index 7474a46..0000000 --- a/ansible/roles/aitutor_install/tasks/main.yml +++ /dev/null @@ -1,28 +0,0 @@ ---- -- name: Ensure supported OS family - ansible.builtin.assert: - that: - - ansible_os_family == 'Debian' - fail_msg: "This role currently supports Debian-family distributions only." - -- name: Install runtime packages for AI Tutor - ansible.builtin.apt: - name: "{{ aitutor_extra_packages }}" - state: present - update_cache: true - -- name: Install or update AI Tutor CLI globally via npm - community.general.npm: - name: "{{ aitutor_npm_package }}" - version: "{{ aitutor_npm_version }}" - global: true - state: present - -- name: Verify aitutor command is available - ansible.builtin.command: which aitutor - register: aitutor_bin - changed_when: false - -- name: Show installed aitutor path - ansible.builtin.debug: - msg: "AITutor installed at {{ aitutor_bin.stdout }}" diff --git a/ansible/roles/nvidia_runtime_setup/README.md b/ansible/roles/nvidia_runtime_setup/README.md deleted file mode 100644 index f3f695b..0000000 --- a/ansible/roles/nvidia_runtime_setup/README.md +++ /dev/null @@ -1,31 +0,0 @@ -# nvidia_runtime_setup - -Ansible role to configure NVIDIA driver/runtime readiness on Debian-family hosts. 
- -## What it does - -- Detects NVIDIA GPU hardware via `lspci` -- Auto-selects a recommended driver on Ubuntu (or uses an explicit package pin) -- Installs the NVIDIA driver package -- Optionally installs CUDA toolkit and NVIDIA container toolkit -- Handles optional reboot logic -- Verifies readiness with `nvidia-smi` - -## Safe defaults - -- Reboot is disabled by default (`nvidia_runtime_reboot_if_needed: false`) -- CUDA and container toolkit installs are disabled by default -- Validation is enabled by default and fails if `nvidia-smi` is unavailable - -## Example - -```yaml ---- -- name: Configure NVIDIA runtime for AI nodes - hosts: ai_nodes - become: true - roles: - - role: nvidia_runtime_setup - vars: - nvidia_runtime_reboot_if_needed: true -``` diff --git a/ansible/roles/nvidia_runtime_setup/defaults/main.yml b/ansible/roles/nvidia_runtime_setup/defaults/main.yml deleted file mode 100644 index 721d68b..0000000 --- a/ansible/roles/nvidia_runtime_setup/defaults/main.yml +++ /dev/null @@ -1,25 +0,0 @@ ---- -# Fail if no NVIDIA hardware is detected. -nvidia_runtime_require_gpu: true - -# Install/repair NVIDIA driver packages. -nvidia_runtime_install_driver: true - -# Optional explicit driver package pin (for example: nvidia-driver-550). -# When empty on Ubuntu, the role will auto-detect the recommended package. -nvidia_runtime_driver_package: "" - -# Install CUDA toolkit from distro repository. -nvidia_runtime_install_cuda_toolkit: false -nvidia_runtime_cuda_package: nvidia-cuda-toolkit - -# Install NVIDIA container runtime package if available in configured repos. 
-nvidia_runtime_install_container_toolkit: false -nvidia_runtime_container_toolkit_package: nvidia-container-toolkit - -# Reboot handling -nvidia_runtime_reboot_if_needed: false -nvidia_runtime_reboot_timeout: 900 - -# Post-install validation behavior -nvidia_runtime_validate_after_install: true diff --git a/ansible/roles/nvidia_runtime_setup/tasks/detect.yml b/ansible/roles/nvidia_runtime_setup/tasks/detect.yml deleted file mode 100644 index 0562f74..0000000 --- a/ansible/roles/nvidia_runtime_setup/tasks/detect.yml +++ /dev/null @@ -1,64 +0,0 @@ ---- -- name: Ensure hardware detection utilities are present - ansible.builtin.apt: - name: - - pciutils - - ubuntu-drivers-common - state: present - update_cache: true - when: ansible_distribution == "Ubuntu" - -- name: Ensure hardware detection utilities are present (non-Ubuntu) - ansible.builtin.apt: - name: - - pciutils - state: present - update_cache: true - when: ansible_distribution != "Ubuntu" - -- name: Detect PCI devices - ansible.builtin.command: lspci - register: nvidia_runtime_lspci - changed_when: false - -- name: Set hardware detection fact - ansible.builtin.set_fact: - nvidia_runtime_has_gpu: "{{ 'NVIDIA' in nvidia_runtime_lspci.stdout }}" - -- name: Stop when GPU is required but missing - ansible.builtin.fail: - msg: "No NVIDIA GPU detected on this host." 
- when: - - nvidia_runtime_require_gpu | bool - - not nvidia_runtime_has_gpu | bool - -- name: Detect recommended Ubuntu NVIDIA driver - ansible.builtin.command: ubuntu-drivers devices - register: nvidia_runtime_ubuntu_drivers - changed_when: false - failed_when: false - when: - - ansible_distribution == "Ubuntu" - - nvidia_runtime_driver_package | length == 0 - -- name: Derive auto-selected driver package - ansible.builtin.set_fact: - nvidia_runtime_selected_driver: >- - {{ - nvidia_runtime_driver_package - if (nvidia_runtime_driver_package | length > 0) - else ( - (nvidia_runtime_ubuntu_drivers.stdout | default('')) - | regex_search('nvidia-driver-[0-9]+') - | default('') - ) - }} - -- name: Validate selected driver package - ansible.builtin.fail: - msg: >- - Could not determine an NVIDIA driver package automatically. - Set nvidia_runtime_driver_package explicitly. - when: - - nvidia_runtime_install_driver | bool - - nvidia_runtime_selected_driver | length == 0 diff --git a/ansible/roles/nvidia_runtime_setup/tasks/install.yml b/ansible/roles/nvidia_runtime_setup/tasks/install.yml deleted file mode 100644 index d710852..0000000 --- a/ansible/roles/nvidia_runtime_setup/tasks/install.yml +++ /dev/null @@ -1,19 +0,0 @@ ---- -- name: Install NVIDIA driver package - ansible.builtin.apt: - name: "{{ nvidia_runtime_selected_driver }}" - state: present - update_cache: true - when: nvidia_runtime_install_driver | bool - -- name: Install CUDA toolkit package - ansible.builtin.apt: - name: "{{ nvidia_runtime_cuda_package }}" - state: present - when: nvidia_runtime_install_cuda_toolkit | bool - -- name: Install NVIDIA container toolkit package - ansible.builtin.apt: - name: "{{ nvidia_runtime_container_toolkit_package }}" - state: present - when: nvidia_runtime_install_container_toolkit | bool diff --git a/ansible/roles/nvidia_runtime_setup/tasks/main.yml b/ansible/roles/nvidia_runtime_setup/tasks/main.yml deleted file mode 100644 index 45e319b..0000000 --- 
a/ansible/roles/nvidia_runtime_setup/tasks/main.yml +++ /dev/null @@ -1,18 +0,0 @@ ---- -- name: Validate role inputs - ansible.builtin.import_tasks: validate.yml - -- name: Detect NVIDIA hardware and tooling - ansible.builtin.import_tasks: detect.yml - -- name: Install driver and optional runtime packages - ansible.builtin.import_tasks: install.yml - -- name: Handle reboot requirements - ansible.builtin.import_tasks: reboot.yml - -- name: Validate NVIDIA runtime state - ansible.builtin.import_tasks: verify.yml - -- name: Print runtime summary - ansible.builtin.import_tasks: summary.yml diff --git a/ansible/roles/nvidia_runtime_setup/tasks/reboot.yml b/ansible/roles/nvidia_runtime_setup/tasks/reboot.yml deleted file mode 100644 index 073584a..0000000 --- a/ansible/roles/nvidia_runtime_setup/tasks/reboot.yml +++ /dev/null @@ -1,22 +0,0 @@ ---- -- name: Check whether reboot is required - ansible.builtin.stat: - path: /var/run/reboot-required - register: nvidia_runtime_reboot_required - -- name: Warn when reboot is required but disabled - ansible.builtin.debug: - msg: >- - NVIDIA packages were installed but reboot is required. - Set nvidia_runtime_reboot_if_needed=true to allow automatic reboot. 
- when: - - nvidia_runtime_reboot_required.stat.exists - - not nvidia_runtime_reboot_if_needed | bool - -- name: Reboot host when required and enabled - ansible.builtin.reboot: - msg: "Reboot triggered by nvidia_runtime_setup role" - reboot_timeout: "{{ nvidia_runtime_reboot_timeout }}" - when: - - nvidia_runtime_reboot_required.stat.exists - - nvidia_runtime_reboot_if_needed | bool diff --git a/ansible/roles/nvidia_runtime_setup/tasks/summary.yml b/ansible/roles/nvidia_runtime_setup/tasks/summary.yml deleted file mode 100644 index a59e658..0000000 --- a/ansible/roles/nvidia_runtime_setup/tasks/summary.yml +++ /dev/null @@ -1,12 +0,0 @@ ---- -- name: Print NVIDIA runtime summary - ansible.builtin.debug: - msg: - - "==========================================" - - "NVIDIA runtime setup complete for {{ inventory_hostname }}" - - "==========================================" - - "GPU detected via lspci: {{ nvidia_runtime_has_gpu | default(false) }}" - - "Driver package selected: {{ nvidia_runtime_selected_driver | default('not set') }}" - - "Reboot required: {{ nvidia_runtime_reboot_required.stat.exists | default(false) }}" - - "nvidia-smi ready: {{ (nvidia_runtime_smi.rc | default(1)) == 0 }}" - - "GPU details: {{ nvidia_runtime_gpu_lines | default(['not available']) | join('; ') }}" diff --git a/ansible/roles/nvidia_runtime_setup/tasks/validate.yml b/ansible/roles/nvidia_runtime_setup/tasks/validate.yml deleted file mode 100644 index cd73325..0000000 --- a/ansible/roles/nvidia_runtime_setup/tasks/validate.yml +++ /dev/null @@ -1,20 +0,0 @@ ---- -- name: Assert supported OS family - ansible.builtin.assert: - that: - - ansible_os_family == "Debian" - fail_msg: "nvidia_runtime_setup currently supports Debian-family distributions only." 
- -- name: Assert explicit driver package on non-Ubuntu systems - ansible.builtin.assert: - that: - - ansible_distribution == "Ubuntu" or nvidia_runtime_driver_package | length > 0 - fail_msg: >- - On non-Ubuntu systems set nvidia_runtime_driver_package explicitly to a valid package name. - -- name: Assert optional package names are not empty when enabled - ansible.builtin.assert: - that: - - not nvidia_runtime_install_cuda_toolkit or nvidia_runtime_cuda_package | length > 0 - - not nvidia_runtime_install_container_toolkit or nvidia_runtime_container_toolkit_package | length > 0 - fail_msg: "Optional package toggles are enabled but package names are missing." diff --git a/ansible/roles/nvidia_runtime_setup/tasks/verify.yml b/ansible/roles/nvidia_runtime_setup/tasks/verify.yml deleted file mode 100644 index 072f7a6..0000000 --- a/ansible/roles/nvidia_runtime_setup/tasks/verify.yml +++ /dev/null @@ -1,20 +0,0 @@ ---- -- name: Check nvidia-smi status - ansible.builtin.command: nvidia-smi --query-gpu=name,memory.total,driver_version --format=csv,noheader - register: nvidia_runtime_smi - changed_when: false - failed_when: false - -- name: Fail when post-install validation is required and nvidia-smi is unavailable - ansible.builtin.fail: - msg: >- - nvidia-smi is unavailable after installation. - This usually means a reboot is still required or the selected driver is incompatible. 
- when: - - nvidia_runtime_validate_after_install | bool - - nvidia_runtime_smi.rc != 0 - -- name: Capture GPU info lines - ansible.builtin.set_fact: - nvidia_runtime_gpu_lines: "{{ nvidia_runtime_smi.stdout_lines | default([]) }}" - when: nvidia_runtime_smi.rc == 0 diff --git a/ansible/roles/proxmox_post_install/defaults/main.yml b/ansible/roles/proxmox_post_install/defaults/main.yml deleted file mode 100644 index 03aec6b..0000000 --- a/ansible/roles/proxmox_post_install/defaults/main.yml +++ /dev/null @@ -1,32 +0,0 @@ ---- -# Default variables for proxmox_post_install role -# These defaults assume you "approve all risks" similar to the original script - -# General -proxmox_post_install_enabled: true - -# Behavior toggles roughly mirroring the whiptail prompts -proxmox_fix_sources: true -proxmox_disable_pve_enterprise: true -proxmox_enable_pve_no_subscription: true -proxmox_fix_ceph_repos: true -proxmox_add_pvetest_repo_disabled: true - -# Subscription nag removal -proxmox_disable_subscription_nag: true - -# HA behavior -proxmox_enable_ha: false # default: do not auto-enable HA on fresh node -proxmox_disable_ha_on_single_node: true -proxmox_disable_corosync_on_single_node: true - -# Update & reboot -proxmox_run_dist_upgrade: true -proxmox_reboot_after: true - -# PVE version restrictions (mirrors the script: 8.0-8.9.x and 9.0-9.1.x) -proxmox_supported_major_versions: [8, 9] -proxmox_8_min_minor: 0 -proxmox_8_max_minor: 9 -proxmox_9_min_minor: 0 -proxmox_9_max_minor: 1 diff --git a/ansible/roles/proxmox_post_install/tasks/main.yml b/ansible/roles/proxmox_post_install/tasks/main.yml deleted file mode 100644 index 1c7beaf..0000000 --- a/ansible/roles/proxmox_post_install/tasks/main.yml +++ /dev/null @@ -1,55 +0,0 @@ ---- -# Main entrypoint for proxmox_post_install role - -- name: "Check that role is enabled" - ansible.builtin.meta: end_play - when: not proxmox_post_install_enabled - -- name: "Detect Proxmox VE version (pveversion)" - ansible.builtin.command: 
"pveversion" - register: proxmox_pveversion_cmd - changed_when: false - -- name: "Parse Proxmox VE version" - ansible.builtin.set_fact: - proxmox_pve_version_full: "{{ proxmox_pveversion_cmd.stdout | trim }}" - # pveversion output: "pve-manager/9.1.1/42db4a6cf33dac83" - version is at index 1 - proxmox_pve_version: "{{ (proxmox_pveversion_cmd.stdout | trim).split('/')[1] }}" - proxmox_pve_major: "{{ (proxmox_pveversion_cmd.stdout | trim).split('/')[1].split('.')[0] | int }}" - proxmox_pve_minor: "{{ (proxmox_pveversion_cmd.stdout | trim).split('/')[1].split('.')[1] | int }}" - -- name: "Fail if Proxmox VE major version is unsupported" - ansible.builtin.fail: - msg: >- - Unsupported Proxmox VE major version: {{ proxmox_pve_major }}. - Supported: 8.0–8.9.x and 9.0–9.1.x (mirrors upstream post-pve-install.sh). - when: proxmox_pve_major not in proxmox_supported_major_versions - -- name: "Fail if Proxmox VE 8 minor version unsupported" - ansible.builtin.fail: - msg: >- - Unsupported Proxmox 8 version {{ proxmox_pve_version }}. - Supported minor range: {{ proxmox_8_min_minor }}–{{ proxmox_8_max_minor }}. - when: - - proxmox_pve_major == 8 - - proxmox_pve_minor < proxmox_8_min_minor or proxmox_pve_minor > proxmox_8_max_minor - -- name: "Fail if Proxmox VE 9 minor version unsupported" - ansible.builtin.fail: - msg: >- - Unsupported Proxmox 9 version {{ proxmox_pve_version }}. - Supported minor range: {{ proxmox_9_min_minor }}–{{ proxmox_9_max_minor }}. 
- when: - - proxmox_pve_major == 9 - - proxmox_pve_minor < proxmox_9_min_minor or proxmox_pve_minor > proxmox_9_max_minor - -- name: "Include version-specific tasks for PVE 8" - ansible.builtin.import_tasks: pve8.yml - when: proxmox_pve_major == 8 - -- name: "Include version-specific tasks for PVE 9" - ansible.builtin.import_tasks: pve9.yml - when: proxmox_pve_major == 9 - -- name: "Common post-routines (nag, HA, update, reboot)" - ansible.builtin.import_tasks: post_common.yml diff --git a/ansible/roles/proxmox_post_install/tasks/post_common.yml b/ansible/roles/proxmox_post_install/tasks/post_common.yml deleted file mode 100644 index 6dde402..0000000 --- a/ansible/roles/proxmox_post_install/tasks/post_common.yml +++ /dev/null @@ -1,76 +0,0 @@ ---- -# Common post-routines for PVE 8 and 9: subscription nag, HA, updates, reboot - -- name: "Deploy subscription nag removal script" - ansible.builtin.template: - src: pve-remove-nag.sh.j2 - dest: /usr/local/bin/pve-remove-nag.sh - owner: root - group: root - mode: '0755' - when: proxmox_disable_subscription_nag - -- name: "Configure dpkg Post-Invoke hook to run nag removal script" - ansible.builtin.copy: - dest: /etc/apt/apt.conf.d/no-nag-script - owner: root - group: root - mode: '0644' - content: | - DPkg::Post-Invoke { "/usr/local/bin/pve-remove-nag.sh"; }; - when: proxmox_disable_subscription_nag - -- name: "Remove subscription nag dpkg hook if disabled via vars" - ansible.builtin.file: - path: /etc/apt/apt.conf.d/no-nag-script - state: absent - when: not proxmox_disable_subscription_nag - -- name: "Ensure proxmox-widget-toolkit is reinstalled (like script)" - ansible.builtin.apt: - name: proxmox-widget-toolkit - state: latest - update_cache: false - force: true - register: proxmox_widget_reinstall - failed_when: false - -- name: "Optionally enable HA services on cluster nodes" - ansible.builtin.service: - name: "{{ item }}" - state: started - enabled: true - loop: - - pve-ha-lrm - - pve-ha-crm - - corosync - when: 
proxmox_enable_ha - -- name: "Optionally disable HA services on single-node setups" - ansible.builtin.service: - name: "{{ item }}" - state: stopped - enabled: false - loop: - - pve-ha-lrm - - pve-ha-crm - when: proxmox_disable_ha_on_single_node - -- name: "Optionally disable Corosync on single-node setups" - ansible.builtin.service: - name: corosync - state: stopped - enabled: false - when: proxmox_disable_corosync_on_single_node - -- name: "Run apt dist-upgrade (like original script)" - ansible.builtin.apt: - update_cache: true - upgrade: dist - when: proxmox_run_dist_upgrade - -- name: "Reboot Proxmox VE when requested" - ansible.builtin.reboot: - msg: "Rebooting after post-install routines (Ansible)." - reboot_timeout: 1800 - when: proxmox_reboot_after diff --git a/ansible/roles/proxmox_post_install/tasks/pve8.yml b/ansible/roles/proxmox_post_install/tasks/pve8.yml deleted file mode 100644 index 2ef890a..0000000 --- a/ansible/roles/proxmox_post_install/tasks/pve8.yml +++ /dev/null @@ -1,57 +0,0 @@ ---- -# Proxmox VE 8.x (Debian 12 / bookworm) sources and repo configuration - -- name: "Configure Debian bookworm APT sources (if enabled)" - ansible.builtin.copy: - dest: /etc/apt/sources.list - owner: root - group: root - mode: '0644' - content: | - deb http://deb.debian.org/debian bookworm main contrib - deb http://deb.debian.org/debian bookworm-updates main contrib - deb http://security.debian.org/debian-security bookworm-security main contrib - when: proxmox_fix_sources - -- name: "Disable pve-enterprise repository (list file) on 8.x" - ansible.builtin.copy: - dest: /etc/apt/sources.list.d/pve-enterprise.list - owner: root - group: root - mode: '0644' - content: | - # deb https://enterprise.proxmox.com/debian/pve bookworm pve-enterprise - when: proxmox_disable_pve_enterprise - -- name: "Enable pve-no-subscription repository on 8.x" - ansible.builtin.copy: - dest: /etc/apt/sources.list.d/pve-install-repo.list - owner: root - group: root - mode: '0644' - content: 
| - deb http://download.proxmox.com/debian/pve bookworm pve-no-subscription - when: proxmox_enable_pve_no_subscription - -- name: "Configure Ceph repositories for Proxmox VE 8.x" - ansible.builtin.copy: - dest: /etc/apt/sources.list.d/ceph.list - owner: root - group: root - mode: '0644' - content: | - # deb https://enterprise.proxmox.com/debian/ceph-quincy bookworm enterprise - # deb http://download.proxmox.com/debian/ceph-quincy bookworm no-subscription - # deb https://enterprise.proxmox.com/debian/ceph-reef bookworm enterprise - # deb http://download.proxmox.com/debian/ceph-reef bookworm no-subscription - when: proxmox_fix_ceph_repos - -- name: "Add disabled pvetest repository for 8.x" - ansible.builtin.copy: - dest: /etc/apt/sources.list.d/pvetest-for-beta.list - owner: root - group: root - mode: '0644' - content: | - # deb http://download.proxmox.com/debian/pve bookworm pvetest - when: proxmox_add_pvetest_repo_disabled diff --git a/ansible/roles/proxmox_post_install/tasks/pve9.yml b/ansible/roles/proxmox_post_install/tasks/pve9.yml deleted file mode 100644 index 0849b01..0000000 --- a/ansible/roles/proxmox_post_install/tasks/pve9.yml +++ /dev/null @@ -1,123 +0,0 @@ ---- -# Proxmox VE 9.x (Debian 13 / trixie) sources and repo configuration using deb822 - -- name: "Find legacy .list APT source files on 9.x" - ansible.builtin.find: - paths: - - /etc/apt/sources.list.d - patterns: "*.list" - file_type: file - register: proxmox_legacy_list_files - -- name: "Backup and disable entries in /etc/apt/sources.list (if any)" - ansible.builtin.copy: - src: /etc/apt/sources.list - dest: /etc/apt/sources.list.bak - owner: root - group: root - mode: '0644' - remote_src: true - when: - - proxmox_fix_sources - - ansible_facts['os_family'] is defined - ignore_errors: true - -- name: "Comment legacy deb lines in /etc/apt/sources.list (bookworm/proxmox)" - ansible.builtin.replace: - path: /etc/apt/sources.list - regexp: '^(\s*deb\s+.*(proxmox|bookworm).*)$' - replace: '# Disabled 
by Proxmox Helper Ansible role \1' - when: proxmox_fix_sources - ignore_errors: true - -- name: "Remove legacy .list files on 9.x when migrating to deb822" - ansible.builtin.file: - path: "{{ item.path }}" - state: absent - loop: "{{ proxmox_legacy_list_files.files | default([]) }}" - when: proxmox_fix_sources - -- name: "Configure Debian Trixie deb822 sources for 9.x" - ansible.builtin.copy: - dest: /etc/apt/sources.list.d/debian.sources - owner: root - group: root - mode: '0644' - content: | - Types: deb - URIs: http://deb.debian.org/debian - Suites: trixie - Components: main contrib - Signed-By: /usr/share/keyrings/debian-archive-keyring.gpg - - Types: deb - URIs: http://security.debian.org/debian-security - Suites: trixie-security - Components: main contrib - Signed-By: /usr/share/keyrings/debian-archive-keyring.gpg - - Types: deb - URIs: http://deb.debian.org/debian - Suites: trixie-updates - Components: main contrib - Signed-By: /usr/share/keyrings/debian-archive-keyring.gpg - when: proxmox_fix_sources - -- name: "Ensure pve-enterprise deb822 source is disabled on 9.x" - ansible.builtin.blockinfile: - path: /etc/apt/sources.list.d/pve-enterprise.sources - create: true - owner: root - group: root - mode: '0644' - block: | - Types: deb - URIs: https://enterprise.proxmox.com/debian/pve - Suites: trixie - Components: pve-enterprise - Signed-By: /usr/share/keyrings/proxmox-archive-keyring.gpg - Enabled: false - when: proxmox_disable_pve_enterprise - -- name: "Configure pve-no-subscription deb822 source on 9.x" - ansible.builtin.copy: - dest: /etc/apt/sources.list.d/proxmox.sources - owner: root - group: root - mode: '0644' - content: | - Types: deb - URIs: http://download.proxmox.com/debian/pve - Suites: trixie - Components: pve-no-subscription - Signed-By: /usr/share/keyrings/proxmox-archive-keyring.gpg - when: proxmox_enable_pve_no_subscription - -- name: "Configure Ceph deb822 source on 9.x (no-subscription)" - ansible.builtin.copy: - dest: 
/etc/apt/sources.list.d/ceph.sources - owner: root - group: root - mode: '0644' - content: | - Types: deb - URIs: http://download.proxmox.com/debian/ceph-squid - Suites: trixie - Components: no-subscription - Signed-By: /usr/share/keyrings/proxmox-archive-keyring.gpg - when: proxmox_fix_ceph_repos - -- name: "Add disabled pve-test deb822 source on 9.x" - ansible.builtin.copy: - dest: /etc/apt/sources.list.d/pve-test.sources - owner: root - group: root - mode: '0644' - content: | - Types: deb - URIs: http://download.proxmox.com/debian/pve - Suites: trixie - Components: pve-test - Signed-By: /usr/share/keyrings/proxmox-archive-keyring.gpg - Enabled: false - when: proxmox_add_pvetest_repo_disabled diff --git a/ansible/roles/proxmox_post_install/templates/pve-remove-nag.sh.j2 b/ansible/roles/proxmox_post_install/templates/pve-remove-nag.sh.j2 deleted file mode 100644 index de1271f..0000000 --- a/ansible/roles/proxmox_post_install/templates/pve-remove-nag.sh.j2 +++ /dev/null @@ -1,45 +0,0 @@ -#!/bin/sh -WEB_JS=/usr/share/javascript/proxmox-widget-toolkit/proxmoxlib.js -if [ -s "$WEB_JS" ] && ! grep -q NoMoreNagging "$WEB_JS"; then - echo "Patching Web UI nag..." - sed -i -e "/data\.status/ s/!//" -e "/data\.status/ s/active/NoMoreNagging/" "$WEB_JS" -fi - -MOBILE_TPL=/usr/share/pve-yew-mobile-gui/index.html.tpl -MARKER="" -if [ -f "$MOBILE_TPL" ] && ! grep -q "$MARKER" "$MOBILE_TPL"; then - echo "Patching Mobile UI nag..." 
- printf "%s\n" \ - "$MARKER" \ - "" \ - "" >>"$MOBILE_TPL" -fi diff --git a/ansible/roles/proxmox_vm_deploy/defaults/main.yml b/ansible/roles/proxmox_vm_deploy/defaults/main.yml deleted file mode 100644 index 3c1213f..0000000 --- a/ansible/roles/proxmox_vm_deploy/defaults/main.yml +++ /dev/null @@ -1,33 +0,0 @@ ---- -# Proxmox API endpoint and auth -proxmox_api_host: "10.0.0.201" -proxmox_api_user: "ansible@pve" -proxmox_api_token_id: "ansible" -proxmox_api_token_secret: "SET_IN_VAULT" -proxmox_api_password: "" -proxmox_validate_certs: false - -# VM placement -proxmox_node: "pve01" -proxmox_vmid: 9210 -proxmox_vm_name: "aitutor" -proxmox_template: "ubuntu-2404-cloudinit-template" -proxmox_storage: "local-lvm" - -# VM sizing -proxmox_cores: 2 -proxmox_memory_mb: 4096 -proxmox_disk_gb: 32 - -# Network and cloud-init -proxmox_bridge: "vmbr0" -vm_ipconfig0: "ip=10.0.0.210/24,gw=10.0.0.2" -vm_nameserver: "10.0.0.2" -vm_searchdomain: "lan" -vm_ci_user: "chester" -vm_ci_password: "SET_IN_VAULT" -vm_ssh_public_key: "" -vm_ssh_private_key_file: "~/.ssh/id_ed25519" - -# Timing -vm_boot_timeout_seconds: 300 diff --git a/ansible/roles/proxmox_vm_deploy/tasks/main.yml b/ansible/roles/proxmox_vm_deploy/tasks/main.yml deleted file mode 100644 index 0d40b5d..0000000 --- a/ansible/roles/proxmox_vm_deploy/tasks/main.yml +++ /dev/null @@ -1,123 +0,0 @@ ---- -- name: Validate required Proxmox variables - ansible.builtin.assert: - that: - - proxmox_api_host | length > 0 - - proxmox_api_user | length > 0 - - >- - (proxmox_api_password | default('') | length > 0) - or - ( - proxmox_api_token_id | length > 0 - and proxmox_api_token_secret | length > 0 - ) - - proxmox_node | length > 0 - - proxmox_template | length > 0 - - proxmox_vmid | int > 99 - - vm_ci_user | length > 0 - - vm_ipconfig0 is match('^ip=.+') - fail_msg: "Missing required VM provisioning variables or Proxmox credentials." 
- -- name: Gather current VMs on Proxmox node - community.proxmox.proxmox_vm_info: - api_host: "{{ proxmox_api_host }}" - api_user: "{{ proxmox_api_user }}" - api_password: "{{ proxmox_api_password if (proxmox_api_password | default('') | length > 0) else omit }}" - api_token_id: "{{ proxmox_api_token_id if (proxmox_api_password | default('') | length == 0) else omit }}" - api_token_secret: "{{ proxmox_api_token_secret if (proxmox_api_password | default('') | length == 0) else omit }}" - validate_certs: "{{ proxmox_validate_certs }}" - node: "{{ proxmox_node }}" - register: proxmox_vms - -- name: Detect whether target VM already exists - ansible.builtin.set_fact: - vm_exists: >- - {{ - (proxmox_vms.proxmox_vms | default([]) - | selectattr('vmid', 'equalto', proxmox_vmid | int) - | list - | length) > 0 - }} - -- name: Clone VM from cloud-init template when missing - community.proxmox.proxmox_kvm: - api_host: "{{ proxmox_api_host }}" - api_user: "{{ proxmox_api_user }}" - api_password: "{{ proxmox_api_password if (proxmox_api_password | default('') | length > 0) else omit }}" - api_token_id: "{{ proxmox_api_token_id if (proxmox_api_password | default('') | length == 0) else omit }}" - api_token_secret: "{{ proxmox_api_token_secret if (proxmox_api_password | default('') | length == 0) else omit }}" - validate_certs: "{{ proxmox_validate_certs }}" - node: "{{ proxmox_node }}" - clone: "{{ proxmox_template }}" - newid: "{{ proxmox_vmid }}" - name: "{{ proxmox_vm_name }}" - storage: "{{ proxmox_storage }}" - full: true - timeout: 600 - state: present - when: not vm_exists - -- name: Apply VM hardware, network, and cloud-init settings - community.proxmox.proxmox_kvm: - api_host: "{{ proxmox_api_host }}" - api_user: "{{ proxmox_api_user }}" - api_password: "{{ proxmox_api_password if (proxmox_api_password | default('') | length > 0) else omit }}" - api_token_id: "{{ proxmox_api_token_id if (proxmox_api_password | default('') | length == 0) else omit }}" - api_token_secret: 
"{{ proxmox_api_token_secret if (proxmox_api_password | default('') | length == 0) else omit }}" - validate_certs: "{{ proxmox_validate_certs }}" - node: "{{ proxmox_node }}" - vmid: "{{ proxmox_vmid }}" - name: "{{ proxmox_vm_name }}" - cores: "{{ proxmox_cores }}" - memory: "{{ proxmox_memory_mb }}" - scsihw: virtio-scsi-pci - scsi: - scsi0: "{{ proxmox_storage }}:{{ proxmox_disk_gb }}" - net: - net0: "virtio,bridge={{ proxmox_bridge }}" - ciuser: "{{ vm_ci_user }}" - cipassword: "{{ vm_ci_password }}" - ipconfig: - ipconfig0: "{{ vm_ipconfig0 }}" - nameservers: - - "{{ vm_nameserver }}" - searchdomains: - - "{{ vm_searchdomain }}" - sshkeys: "{{ vm_ssh_public_key | default(omit) }}" - onboot: true - agent: true - update: true - state: present - -- name: Ensure VM is running - community.proxmox.proxmox_kvm: - api_host: "{{ proxmox_api_host }}" - api_user: "{{ proxmox_api_user }}" - api_password: "{{ proxmox_api_password if (proxmox_api_password | default('') | length > 0) else omit }}" - api_token_id: "{{ proxmox_api_token_id if (proxmox_api_password | default('') | length == 0) else omit }}" - api_token_secret: "{{ proxmox_api_token_secret if (proxmox_api_password | default('') | length == 0) else omit }}" - validate_certs: "{{ proxmox_validate_certs }}" - node: "{{ proxmox_node }}" - vmid: "{{ proxmox_vmid }}" - state: started - -- name: Derive VM IPv4 address from cloud-init ipconfig - ansible.builtin.set_fact: - vm_ipv4: "{{ (vm_ipconfig0.split('ip=')[1].split(',')[0]).split('/')[0] }}" - -- name: Wait for SSH on provisioned VM - ansible.builtin.wait_for: - host: "{{ vm_ipv4 }}" - port: 22 - timeout: "{{ vm_boot_timeout_seconds }}" - delay: 5 - delegate_to: localhost - -- name: Add new VM to in-memory inventory - ansible.builtin.add_host: - name: "{{ proxmox_vm_name }}" - groups: aitutor_vm - ansible_host: "{{ vm_ipv4 }}" - ansible_user: "{{ vm_ci_user }}" - ansible_ssh_private_key_file: "{{ vm_ssh_private_key_file }}" - ansible_python_interpreter: 
/usr/bin/python3 diff --git a/ansible/validate-environment.sh b/ansible/validate-environment.sh deleted file mode 100755 index 6e2f4eb..0000000 --- a/ansible/validate-environment.sh +++ /dev/null @@ -1,165 +0,0 @@ -#!/bin/bash -# Ansible Control Node Environment Validation Script -# Purpose: Quick health check for Watchtower Ansible setup -# Usage: ./validate-environment.sh - -set -e - -echo "================================================" -echo "Ansible Control Node Health Check" -echo "================================================" -echo "" - -# Color codes for output -GREEN='\033[0;32m' -RED='\033[0;31m' -YELLOW='\033[1;33m' -NC='\033[0m' # No Color - -# Function to print status -check_status() { - if [ $1 -eq 0 ]; then - echo -e "${GREEN}✅ PASS${NC}: $2" - else - echo -e "${RED}❌ FAIL${NC}: $2" - fi -} - -# Function to print info -print_info() { - echo -e "${YELLOW}ℹ️ INFO${NC}: $1" -} - -# Check 1: Ansible installed -echo "1. Checking Ansible installation..." -if command -v ansible &> /dev/null; then - ANSIBLE_VERSION=$(ansible --version | head -1) - check_status 0 "Ansible installed: $ANSIBLE_VERSION" -else - check_status 1 "Ansible not found" - exit 1 -fi -echo "" - -# Check 2: ansible-lint installed -echo "2. Checking ansible-lint..." -if command -v ansible-lint &> /dev/null; then - LINT_VERSION=$(ansible-lint --version | head -1) - check_status 0 "ansible-lint installed: $LINT_VERSION" -else - check_status 1 "ansible-lint not found" -fi -echo "" - -# Check 3: SSH keys exist -echo "3. Checking SSH keys..." -if [ -f ~/.ssh/id_ed25519 ] && [ -f ~/.ssh/id_ed25519.pub ]; then - check_status 0 "ED25519 SSH keys present" - print_info "Public key fingerprint:" - ssh-keygen -l -f ~/.ssh/id_ed25519.pub | awk '{print " " $2 " " $4}' -else - check_status 1 "ED25519 keys missing" -fi -echo "" - -# Check 4: ansible.cfg exists -echo "4. Checking ansible.cfg..." 
-if [ -f ./ansible.cfg ]; then - check_status 0 "ansible.cfg found" - print_info "Inventory: $(grep '^inventory' ansible.cfg | awk '{print $3}')" - print_info "Vault password file: $(grep '^vault_password_file' ansible.cfg | awk '{print $3}')" -else - check_status 1 "ansible.cfg not found" -fi -echo "" - -# Check 5: Inventory exists -echo "5. Checking inventory..." -if [ -f ./inventory/hosts.ini ]; then - check_status 0 "Inventory file found" - NODE_COUNT=$(ansible-inventory --list 2>/dev/null | grep -c '"ansible_host":' || echo "0") - print_info "Managed nodes: $NODE_COUNT" -else - check_status 1 "Inventory file missing" -fi -echo "" - -# Check 6: Vault password file -echo "6. Checking Ansible Vault setup..." -if [ -f ./vault/.vault_pass ]; then - check_status 0 "Vault password file exists" - PERMS=$(stat -c '%a' ./vault/.vault_pass) - if [ "$PERMS" = "600" ]; then - check_status 0 "Vault password file permissions secure (600)" - else - check_status 1 "Vault password file permissions insecure ($PERMS, should be 600)" - fi -else - check_status 1 "Vault password file missing" -fi -echo "" - -# Check 7: Node connectivity -echo "7. Testing node connectivity..." -if ansible all -m ping &> /dev/null; then - check_status 0 "All nodes reachable" - REACHABLE=$(ansible all -m ping 2>/dev/null | grep -c 'SUCCESS' || echo "0") - print_info "Responding nodes: $REACHABLE" - echo "" - ansible all -m ping -o 2>/dev/null | sed 's/^/ /' -else - check_status 1 "Node connectivity issues detected" -fi -echo "" - -# Check 8: Playbooks exist -echo "8. Checking playbooks..." -PLAYBOOK_COUNT=$(find ./playbooks -name "*.yml" 2>/dev/null | wc -l) -if [ "$PLAYBOOK_COUNT" -gt 0 ]; then - check_status 0 "Found $PLAYBOOK_COUNT playbook(s)" - echo " Available playbooks:" - find ./playbooks -name "*.yml" -exec basename {} \; | sed 's/^/ - /' -else - check_status 1 "No playbooks found" -fi -echo "" - -# Check 9: Roles directory -echo "9. Checking roles..." 
-ROLE_COUNT=$(find ./roles -maxdepth 1 -type d ! -path ./roles | wc -l) -if [ "$ROLE_COUNT" -gt 0 ]; then - check_status 0 "Found $ROLE_COUNT role(s)" - find ./roles -maxdepth 1 -type d ! -path ./roles -exec basename {} \; | sed 's/^/ - /' -else - print_info "No custom roles created yet" -fi -echo "" - -# Check 10: Python dependencies -echo "10. Checking Python dependencies..." -MISSING_DEPS=0 -for pkg in proxmoxer requests; do - if python3 -c "import $pkg" &> /dev/null; then - check_status 0 "Python package '$pkg' installed" - else - check_status 1 "Python package '$pkg' missing" - ((MISSING_DEPS++)) - fi -done -echo "" - -# Final summary -echo "================================================" -echo "Environment Status Summary" -echo "================================================" -if [ $MISSING_DEPS -eq 0 ]; then - echo -e "${GREEN}🟢 ENVIRONMENT READY${NC}" - echo "All critical components are operational." - echo "" - echo "Quick test command:" - echo " ansible all -m ping" -else - echo -e "${YELLOW}🟡 MINOR ISSUES DETECTED${NC}" - echo "Some optional components are missing but core functionality works." -fi -echo "" diff --git a/ansible/vault/.gitignore b/ansible/vault/.gitignore deleted file mode 100644 index f7aa267..0000000 --- a/ansible/vault/.gitignore +++ /dev/null @@ -1,6 +0,0 @@ -# Vault password files should NEVER be committed -.vault_pass -*.vault_pass - -# Encrypted variables can be committed -# vault.yml