homelab/plan-ansibleArchiveRecovery.prompt.md at homelab-mcp

nathan 0bc82cfbe0 feat(prompts): add plan for Ansible Archive Recovery and standalone Docker adaptation

2026-04-12 17:24:07 -04:00

Plan: Ansible Archive Recovery & Standalone Docker Adaptation

TL;DR: Extract and adapt reusable infrastructure code from the Swarm-based archive to support standalone Docker hosts. Focus on host onboarding, NFS storage, Proxmox management (if needed), and quality standards. Strip Swarm-specific orchestration while preserving idempotent patterns.

Context:

Post-data-loss rebuild with standalone Docker (no Swarm)
Archive contains production-grade framework for Proxmox + 6-node Swarm
Current main ansible folder is empty (just linting/standards)
Immediate need: Bare metal host onboarding

Phase 1: Foundation Setup

Restore Ansible controller configuration and inventory structure

1. Restore ansible.cfg from archive → main folder (adapt if needed for new control node)
2. Inventory scaffolding: Create minimal inventory/hosts.ini with current standalone hosts (depends on step 1)
3. Group variables: Copy group_vars/all.yml structure, update for new topology (remove Swarm VM definitions) (depends on step 2)
4. Vault infrastructure: Copy secrets_onboarding role to restore encrypted credential management (parallel with step 3)
5. Validation: Verify ansible all -m ping succeeds

Relevant files:

ansible/archive/ansible.cfg — Connection settings, inventory path, vault password file location
ansible/archive/group_vars/all.yml — VLAN schema, host registry (SOT), lab metadata
ansible/archive/roles/secrets_onboarding/ — Creates vault directories, password file infrastructure
ansible/archive/inventory/hosts.ini — Reference template for host grouping

Verification:

Run ansible-config dump --only-changed to confirm configuration loaded
Execute ansible all -m ping to verify connectivity to all hosts
Check vault password file exists and contains valid credential

Decisions:

Assumption: User has a control node (likely Watchtower or local workstation) — will need SSH keys distributed
Excluded: Swarm-specific group_vars (swarm_managers, swarm_workers groups)
Included: VLAN definitions, NAS paths, lab-wide standards from group_vars
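
The inventory scaffolding step and the Swarm-group exclusion above can be sketched in Python, the language of the archive's generate_inventory.py. This is a minimal sketch under stated assumptions, not the archive script: the registry shape (`{group: {host: address}}`), the `docker_hosts` group, and the host names and addresses are hypothetical. Only `ansible_host` is a standard Ansible inventory variable.

```python
# Groups the plan excludes from the rebuilt inventory.
SWARM_GROUPS = frozenset({"swarm_managers", "swarm_workers"})


def render_inventory(registry, excluded_groups=SWARM_GROUPS):
    """Render an INI-style inventory from {group: {host: address}}.

    The registry shape is an assumption; the real source of truth lives in
    group_vars/all.yml and may be structured differently.
    """
    lines = []
    for group in sorted(registry):
        if group in excluded_groups:
            continue  # drop Swarm-only groups
        lines.append(f"[{group}]")
        for host, address in sorted(registry[group].items()):
            lines.append(f"{host} ansible_host={address}")
        lines.append("")  # blank line between groups
    return "\n".join(lines).rstrip() + "\n"


# Hypothetical registry used only for illustration.
EXAMPLE_REGISTRY = {
    "docker_hosts": {"docker01": "10.0.10.11", "docker02": "10.0.10.12"},
    "swarm_managers": {"swarm01": "10.0.10.21"},
}
```

Writing the result of `render_inventory(EXAMPLE_REGISTRY)` to inventory/hosts.ini would give a single `[docker_hosts]` group, which is enough for the `ansible all -m ping` validation step.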

Phase 2: Core Infrastructure Roles

Migrate Swarm-agnostic utility roles

6. Copy storage_mounts role: General-purpose NFS mounting (fstab + runtime) (parallel with step 7)
7. Copy control_node_sanity role: Pre-flight validation for Ansible controller (parallel with step 6)
8. Copy disk_grow role (if using Proxmox VMs): Non-disruptive disk enlargement (parallel with steps 6-7)
9. Validation: Create test playbook that mounts a single NFS share on one host

Relevant files:

ansible/archive/roles/storage_mounts/ — Idempotent NFS mount logic; reusable 100% as-is
ansible/archive/roles/control_node_sanity/ — Validates Ansible version, Python, OS constraints
ansible/archive/roles/disk_grow/ — Proxmox-specific; skip if not using VMs

Verification:

Run test playbook with --syntax-check to validate syntax, then with --check for a dry run that applies no changes
Execute against one test host to verify NFS mount appears in df -h
Re-run playbook to confirm idempotency (no changes on second run)
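
The idempotency check above (no changes on the second run) could be automated by reading the PLAY RECAP that ansible-playbook prints at the end of a run. The sketch below assumes the default stdout callback and uncolored output (for example, output captured through a pipe); the function names are hypothetical.

```python
import re

# Matches recap lines such as:
#   docker01 : ok=5 changed=0 unreachable=0 failed=0 skipped=1 rescued=0 ignored=0
RECAP_LINE = re.compile(r"^(?P<host>\S+)\s*:\s*(?P<counters>(?:\w+=\d+\s*)+)$")


def parse_recap(output):
    """Return {host: {counter: int}} from the PLAY RECAP section of a run."""
    results = {}
    in_recap = False
    for line in output.splitlines():
        if line.startswith("PLAY RECAP"):
            in_recap = True
            continue
        if not in_recap:
            continue
        match = RECAP_LINE.match(line.strip())
        if match:
            pairs = (item.split("=") for item in match.group("counters").split())
            results[match.group("host")] = {key: int(value) for key, value in pairs}
    return results


def is_idempotent(output):
    """True when every host reports no changes, failures, or unreachable state."""
    recap = parse_recap(output)
    return bool(recap) and all(
        counters.get("changed", 0) == 0
        and counters.get("failed", 0) == 0
        and counters.get("unreachable", 0) == 0
        for counters in recap.values()
    )
```

Feeding the captured output of the second playbook run to `is_idempotent` gives a pass/fail answer; a run with no recap at all is treated as a failure rather than a pass.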

Phase 3: Host Onboarding Adaptation

Convert Swarm-oriented onboarding to standalone Docker workflow

10. Strip Swarm tasks: Copy playbooks/onboarding/generic_host.yml, remove swarm_bootstrap, swarm_node_exporter, swarm_cadvisor tasks
11. Preserve Docker install: Retain Docker Engine installation logic (use archive/get-docker.sh or equivalent)
12. Adapt monitoring: Replace Swarm collectors with standalone Prometheus node-exporter (systemd, not Swarm service)
13. Create standalone onboarding playbook: Name: onboard_docker_host.yml (depends on steps 10-12)

Relevant files:

ansible/archive/playbooks/onboarding/generic_host.yml — Base onboarding (SSH, NFS, monitoring); remove Swarm join logic
ansible/archive/scripts/day0bootstrap.sh — Pre-Ansible setup; reusable pattern for Python, SSH, DHCP
ansible/archive/roles/swarm_node_exporter/ — Reference only for metrics collection pattern; deploy as systemd not Swarm

Verification:

Run onboard_docker_host.yml against a fresh bare metal host in check mode
Execute full run, verify Docker installed (docker --version), NFS mounts present, SSH hardened
Check Prometheus can scrape node-exporter on port 9100 (if monitoring deployed)

Decisions:

Include: SSH key distribution, NFS mounting, Docker Engine install, firewall rules, hostname configuration
Exclude: Swarm join tokens, overlay networks, Swarm-specific labels, manager election logic
Monitoring approach: Standalone systemd services OR wait for Phase 5 to decide on monitoring stack deployment
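
The "strip Swarm tasks" step can be sketched as a filter over an already-parsed task list. This is an illustration under assumptions: it supposes generic_host.yml pulls in the Swarm pieces through `include_role` or `import_role` tasks, which the plan does not confirm, and it works on plain Python dicts so no YAML library is needed. Roles listed under a play-level `roles:` key would need separate handling.

```python
# Role names taken from the plan; everything else here is hypothetical.
SWARM_ROLES = {"swarm_bootstrap", "swarm_node_exporter", "swarm_cadvisor"}
ROLE_KEYS = (
    "include_role",
    "import_role",
    "ansible.builtin.include_role",
    "ansible.builtin.import_role",
)


def referenced_role(task):
    """Return the role name a task includes or imports, or None."""
    for key in ROLE_KEYS:
        value = task.get(key)
        if isinstance(value, dict) and "name" in value:
            return value["name"]
    return None


def strip_swarm_tasks(tasks):
    """Return a new task list without Swarm role tasks, recursing into blocks."""
    kept = []
    for task in tasks:
        if referenced_role(task) in SWARM_ROLES:
            continue
        if "block" in task:
            # Copy the task so the caller's data is left untouched.
            task = {**task, "block": strip_swarm_tasks(task["block"])}
        kept.append(task)
    return kept
```

The filter leaves SSH, NFS, and Docker install tasks in place, so its output is a starting point for onboard_docker_host.yml rather than a finished playbook.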

Phase 4: Proxmox Management (Conditional)

Migrate hypervisor tooling if Proxmox is in use

14. Assess Proxmox usage: Clarify if user is running Proxmox VE or just bare metal Docker
15. Copy proxmox_post_install role: If yes, migrate post-install configuration (depends on step 14)
16. Copy proxmox_cluster_reconcile_v2 role: If clustered Proxmox, migrate reconciliation logic (depends on step 14)
17. Skip or defer: If no Proxmox, mark this phase complete

Relevant files:

ansible/archive/roles/proxmox_post_install/ — Post-install config for Proxmox VE 8.0–9.1.x
ansible/archive/roles/proxmox_cluster_reconcile_v2/ — Cluster state reconciliation
ansible/archive/playbooks/proxmox/ — VM provisioning, node replacement workflows

Verification:

If applicable: Run proxmox_post_install role against test Proxmox node
Validate cluster quorum and networking via Ansible facts collection

Phase 5: Docker Stack Deployment Adaptation

Convert Swarm stack orchestration to docker-compose on standalone hosts

18. Analyze swarm_stack_deploy role: Understand orchestration pattern (validate manager, dry-run, idempotency checks)
19. Create docker_compose_deploy role: New role using community.docker.docker_compose_v2 module for standalone deployment
20. Port stack templates: Copy useful stacks from archive/templates/stacks/ (Authentik, Gitea, Plex), remove Swarm-specific directives (deploy.replicas, placement constraints)
21. Test deployment: Deploy one adapted stack (e.g., Portainer standalone agent)

Relevant files:

ansible/archive/roles/swarm_stack_deploy/ — Template pattern for idempotent stack deployment; adapt logic for docker-compose
ansible/archive/templates/stacks/*.yml: Compose file format 3.9 definitions; remove deploy: sections and adapt for the Compose v2 plugin (Compose Specification, where the top-level version key is obsolete)
ansible/archive/playbooks/docker/deploy_*.yml — Stack-specific orchestrators; create standalone equivalents

Verification:

Run docker_compose_deploy with dry-run/validate mode
Deploy test stack, verify containers running (docker ps)
Re-run playbook to confirm idempotency (no container recreation)
Test service accessibility (e.g., Portainer UI if deployed)

Decisions:

Docker Compose version: Use v2 (plugin-based docker compose, not standalone docker-compose)
Stack storage: Keep templates in templates/stacks/ or move to role defaults
Network strategy: Bridge networks (not overlay) OR host networking for simplicity

Phase 6: Network Automation (Omada VLAN Management)

Phased VLAN deployment with family-safe rollback strategy

22. API credential setup: Generate new Omada Open API credentials from ER7212PC controller UI
23. Copy Omada integration: Migrate omada_api_smoke_test.yml, omada_health_inventory.yml from archive (depends on step 22)
24. Baseline capture: Run read-only playbook to snapshot current flat network config (pre-VLAN state) (depends on step 23)
25. VLAN design approval: Define 2-3 starter VLANs (Main + Management, optionally IoT), document device placement (depends on step 24)
26. VLAN config playbook: Create configure_omada_vlans.yml with dry-run/validate mode and rollback capability (depends on step 25)
27. Staged execution: Apply VLANs in test window with family notification, verify connectivity, roll back if issues arise

Relevant files:

omada_api_smoke_test.yml — OAuth2 auth validation against Omada Open API
omada_health_inventory.yml — Reads sites, devices, clients from controller
group_vars/all.yml lines 20-50 — Reference VLAN schema (5 VLANs: main/infra/iot/guest/compute)
Omada API documentation — TP-Link Open API docs for VLAN/network config endpoints

Verification:

Omada API authentication succeeds, credential vault decrypts properly
Baseline capture shows current flat network state (all devices VLAN 1/untagged)
Dry-run VLAN config playbook validates without applying changes
After approval: VLAN deployment completes, all family devices maintain connectivity
Rollback test: Can restore flat network config from baseline snapshot

Equipment Details:

Controller: ER7212PC (built-in Omada SDN Controller)
Switch: 1x SG2210MP v4.20 (10-port managed PoE+)
APs: 2x EAP655-Wall(US) v1.0 (WiFi 6 in-wall)
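
The baseline-and-rollback idea in this phase can be illustrated without touching the Omada API. The sketch below treats the captured configuration as a flat dict of setting names to values. That shape and the key names are hypothetical, since the real Open API payloads are not described in this plan.

```python
import json


def dump_snapshot(config):
    """Serialize a captured configuration deterministically for storage."""
    return json.dumps(config, indent=2, sort_keys=True)


def load_snapshot(text):
    """Load a configuration previously written by dump_snapshot."""
    return json.loads(text)


def rollback_plan(baseline, current):
    """Return {setting: (current_value, baseline_value)} for every drifted setting.

    A baseline value of None means the setting did not exist before the change
    and would have to be removed during rollback.
    """
    plan = {}
    for key in sorted(set(baseline) | set(current)):
        before, now = baseline.get(key), current.get(key)
        if before != now:
            plan[key] = (now, before)
    return plan
```

An empty plan after a rollback is one way to confirm the flat network state has been restored from the snapshot.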

Phase 7: Documentation & Standards

Consolidate governance and operational docs

28. Copy standards: Migrate archive/documentation/standards/ to ansible/ or root documentation/ (parallel with step 29)
29. Adapt playbook guides: Update archive/documentation/playbooks/ guides for standalone Docker context (parallel with step 28)
30. Create new runbook: Write operator guide for standalone Docker host lifecycle (onboard, deploy stack, decommission)
31. Network change runbook: Document VLAN deployment process, rollback procedure, family communication template
32. Archive obsolete: Document what was intentionally left behind (Swarm bootstrap, Swarm monitoring, etc.) in lessons-learned

Relevant files:

ansible/archive/documentation/standards/naming-conventions.md — Universal; copy directly
ansible/archive/documentation/standards/ansible-quality-gates.md — Idempotency rules; copy directly
ansible/archive/documentation/contracts/ — Architecture specs; update compute plane for standalone Docker
ansible/archive/documentation/playbooks/ — Operator runbooks; adapt for new workflows

Verification:

Review documentation index to ensure all active playbooks have runbook guides
Validate naming conventions applied consistently across new roles/playbooks
Run linting checks using existing .ansible-lint config
Network change runbook includes family communication template and rollback SOP

Further Considerations

Monitoring Strategy — Deploy standalone Prometheus stack on one host OR reuse Watchtower monitoring pattern from archive?
- Option A: Standalone Prometheus + Grafana on dedicated monitoring host (simpler, no cluster awareness)
- Option B: Adapt monitoring_stack role to scrape standalone Docker hosts (more sophisticated)
- Recommendation: Option A if <5 hosts; Option B if scaling
Inventory Management — Keep generate_inventory.py script or hand-maintain hosts.ini?
- Option A: Manually edit hosts.ini (simpler for small static infrastructure)
- Option B: Adapt script to generate from updated SOT in group_vars/all.yml
- Recommendation: Option B if >3 hosts or frequent topology changes
Secrets Management — Ansible Vault vs external secret store (Authentik, Bitwarden, etc.)?
- Current: Archive uses Ansible Vault for DB passwords, API keys, SSH keys
- Alternative: Integrate with external secret manager if deploying Authentik/Vaultwarden
- Recommendation: Continue Vault for Ansible-managed secrets; use external for application credentials
VLAN Segmentation Strategy — Design before Phase 6 execution
- Proposed starter design:
  - VLAN 1 (Main/Family): 10.0.0.0/24 — Default, existing devices, family computers/phones/tablets (zero disruption)
  - VLAN 10 (Management/Infra): 10.0.10.0/24 — Proxmox, NAS, Docker hosts, network gear management interfaces
  - VLAN 50 (IoT) [Optional future]: 10.0.50.0/24 — Smart home devices, isolated from main network
- Migration order: Deploy VLANs → Move infrastructure devices → Validate → Optionally add IoT VLAN later
- Safety: Family devices stay on VLAN 1 untouched, only infrastructure moves (ER7212PC, switch, NAS, servers)
- Rollback: Ansible captures pre-change baseline, can restore flat network config in <5 minutes
Omada API Write Capabilities — Research before Phase 6 implementation
- Known endpoints (from archive): Sites, devices, clients (GET operations confirmed working)
- Required for automation: Network/VLAN create/update, switch port VLAN assignment, SSID-to-VLAN binding
- Action: Generate API credentials from ER7212PC UI, check Omada controller version for API compatibility
- Alternative: If Open API lacks write support for network config, create hybrid approach: Ansible generates change scripts + manual verification steps with guided UI workflow
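
The starter VLAN design proposed under VLAN Segmentation Strategy can be sanity-checked before any controller change using Python's standard ipaddress module. The VLAN IDs and subnets below come from that proposal; the function names and the sample host addresses are hypothetical.

```python
import ipaddress

# VLAN ID -> (name, subnet), as proposed in the starter design.
STARTER_VLANS = {
    1: ("Main/Family", "10.0.0.0/24"),
    10: ("Management/Infra", "10.0.10.0/24"),
    50: ("IoT", "10.0.50.0/24"),
}


def overlapping_vlans(vlans):
    """Return pairs of VLAN IDs whose subnets overlap (empty list if none)."""
    nets = {vid: ipaddress.ip_network(cidr) for vid, (_, cidr) in vlans.items()}
    ids = sorted(nets)
    return [
        (a, b)
        for index, a in enumerate(ids)
        for b in ids[index + 1:]
        if nets[a].overlaps(nets[b])
    ]


def vlan_for_address(address, vlans):
    """Return the VLAN ID whose subnet contains the address, or None."""
    ip = ipaddress.ip_address(address)
    for vid, (_, cidr) in sorted(vlans.items()):
        if ip in ipaddress.ip_network(cidr):
            return vid
    return None
```

Running `vlan_for_address` over the planned static addresses of the infrastructure devices is a cheap way to confirm each one lands in VLAN 10 before the migration window.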
