## Plan: Ansible Archive Recovery & Standalone Docker Adaptation
**TL;DR:** Extract and adapt reusable infrastructure code from the Swarm-based archive to support standalone Docker hosts. Focus on host onboarding, NFS storage, Proxmox management (if needed), and quality standards. Strip Swarm-specific orchestration while preserving idempotent patterns.
**Context:**
- Post-data-loss rebuild with standalone Docker (no Swarm)
- Archive contains production-grade framework for Proxmox + 6-node Swarm
- The main `ansible/` folder currently holds only linting and standards files (no playbooks, roles, or inventory)
- Immediate need: Bare metal host onboarding
---
### **Phase 1: Foundation Setup**
*Restore Ansible controller configuration and inventory structure*
1. **Restore `ansible.cfg`** from archive → main folder *(adapt if needed for new control node)*
2. **Inventory scaffolding** — Create minimal `inventory/hosts.ini` with current standalone hosts *(depends on step 1)*
3. **Group variables** — Copy `group_vars/all.yml` structure, update for new topology *(remove Swarm VM definitions)* *(depends on step 2)*
4. **Vault infrastructure** — Copy `secrets_onboarding` role to restore encrypted credential management *(parallel with step 3)*
5. **Validation** — Verify `ansible all -m ping` succeeds
**Relevant files:**
- `ansible/archive/ansible.cfg` — Connection settings, inventory path, vault password file location
- `ansible/archive/group_vars/all.yml` — VLAN schema, host registry (SOT), lab metadata
- `ansible/archive/roles/secrets_onboarding/` — Creates vault directories, password file infrastructure
- `ansible/archive/inventory/hosts.ini` — Reference template for host grouping
**Verification:**
1. Run `ansible-config dump --only-changed` to confirm configuration loaded
2. Execute `ansible all -m ping` to verify connectivity to all hosts
3. Check vault password file exists and contains valid credential
**Decisions:**
- **Assumption:** User has a control node (likely Watchtower or local workstation) — will need SSH keys distributed
- **Excluded:** Swarm-specific group_vars (swarm_managers, swarm_workers groups)
- **Included:** VLAN definitions, NAS paths, lab-wide standards from group_vars
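
The Phase 1 verification steps could be captured in a small playbook so they are repeatable. This is a minimal sketch, not a file from the archive: the playbook name and the `vault_password_file_path` value are assumptions and should be replaced with whatever the restored `ansible.cfg` actually points to.

```yaml
---
# verify_foundation.yml (hypothetical file name)
- name: Verify the controller can reach every inventory host
  hosts: all
  gather_facts: false
  tasks:
    - name: Confirm SSH connectivity and a usable Python interpreter
      ansible.builtin.ping:

- name: Verify the vault password file on the control node
  hosts: localhost
  connection: local
  gather_facts: false
  vars:
    # Assumption: illustrative path; use the location set in ansible.cfg
    vault_password_file_path: "~/.ansible/vault_pass.txt"
  tasks:
    - name: Inspect the vault password file
      ansible.builtin.stat:
        path: "{{ vault_password_file_path | expanduser }}"
      register: vault_file

    - name: Fail early if the file is missing or empty
      ansible.builtin.assert:
        that:
          - vault_file.stat.exists
          - (vault_file.stat.size | default(0)) > 0
        fail_msg: "Vault password file is missing or empty"
```
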
---
### **Phase 2: Core Infrastructure Roles**
*Migrate Swarm-agnostic utility roles*
6. **Copy `storage_mounts` role** — General-purpose NFS mounting (fstab + runtime) *(parallel with step 7)*
7. **Copy `control_node_sanity` role** — Pre-flight validation for Ansible controller *(parallel with step 6)*
8. **Copy `disk_grow` role** (if using Proxmox VMs) — Non-disruptive disk enlargement *(parallel with steps 6-7)*
9. **Validation** — Create test playbook that mounts a single NFS share on one host
**Relevant files:**
- `ansible/archive/roles/storage_mounts/` — Idempotent NFS mount logic; reusable 100% as-is
- `ansible/archive/roles/control_node_sanity/` — Validates Ansible version, Python, OS constraints
- `ansible/archive/roles/disk_grow/` — Proxmox-specific; skip if not using VMs
**Verification:**
1. Run the test playbook with `--syntax-check` to validate syntax, then with `--check` to preview changes without applying them
2. Execute against one test host to verify NFS mount appears in `df -h`
3. Re-run playbook to confirm idempotency (no changes on second run)
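
The single-share test playbook from step 9 might look like the sketch below. It calls `ansible.posix.mount` directly instead of the archived `storage_mounts` role, because that role's variable names are not listed here. The host name, NAS address, export path, and mount point are all placeholders, and it assumes the NFS client package is already installed on the target.

```yaml
---
# test_nfs_mount.yml (hypothetical file name)
- name: Mount one NFS share on a single test host
  hosts: test_host                     # placeholder inventory name
  become: true
  vars:
    nfs_server: "nas.example.lan"      # placeholder NAS address
    nfs_export: "/export/test"         # placeholder export path
    nfs_mount_point: "/mnt/test"       # placeholder mount point
  tasks:
    - name: Mount the share now and persist it in /etc/fstab
      # state=mounted creates the mount point if needed, writes the fstab
      # entry, and mounts at runtime; a second run should report no change.
      ansible.posix.mount:
        src: "{{ nfs_server }}:{{ nfs_export }}"
        path: "{{ nfs_mount_point }}"
        fstype: nfs
        opts: "rw,_netdev"
        state: mounted
```
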
---
### **Phase 3: Host Onboarding Adaptation**
*Convert Swarm-oriented onboarding to standalone Docker workflow*
10. **Strip Swarm tasks** — Copy `playbooks/onboarding/generic_host.yml`, remove `swarm_bootstrap`, `swarm_node_exporter`, `swarm_cadvisor` tasks
11. **Preserve Docker install** — Retain Docker Engine installation logic (use `archive/get-docker.sh` or equivalent)
12. **Adapt monitoring** — Replace Swarm collectors with standalone Prometheus node-exporter (systemd, not Swarm service)
13. **Create standalone onboarding playbook** — Name: `onboard_docker_host.yml` *(depends on steps 10-12)*
**Relevant files:**
- `ansible/archive/playbooks/onboarding/generic_host.yml` — Base onboarding (SSH, NFS, monitoring); **remove** Swarm join logic
- `ansible/archive/scripts/day0bootstrap.sh` — Pre-Ansible setup; reusable pattern for Python, SSH, DHCP
- `ansible/archive/roles/swarm_node_exporter/` — **Reference only** for metrics collection pattern; deploy as a systemd service, not a Swarm service
**Verification:**
1. Run `onboard_docker_host.yml` against a fresh bare metal host in check mode
2. Execute full run, verify Docker installed (`docker --version`), NFS mounts present, SSH hardened
3. Check Prometheus can scrape node-exporter on port 9100 (if monitoring deployed)
**Decisions:**
- **Include:** SSH key distribution, NFS mounting, Docker Engine install, firewall rules, hostname configuration
- **Exclude:** Swarm join tokens, overlay networks, Swarm-specific labels, manager election logic
- **Monitoring approach:** Standalone systemd services OR wait for Phase 5 to decide on monitoring stack deployment
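
A skeleton for `onboard_docker_host.yml` could start from something like the following. It is a sketch under several assumptions: Debian or Ubuntu targets, a `docker_hosts` inventory group (hypothetical name), and distro packages (`docker.io`, `prometheus-node-exporter`) swapped in for `get-docker.sh` so that reruns stay idempotent. SSH hardening, firewall rules, and NFS mounts from the archived playbook are left out for brevity.

```yaml
---
# onboard_docker_host.yml (sketch; group name and package choices are assumptions)
- name: Onboard a standalone Docker host
  hosts: docker_hosts
  become: true
  tasks:
    - name: Set the hostname from the inventory name
      ansible.builtin.hostname:
        name: "{{ inventory_hostname }}"

    - name: Install Docker Engine and node exporter from distro packages
      ansible.builtin.apt:
        name:
          - docker.io
          - prometheus-node-exporter
        state: present
        update_cache: true

    - name: Enable and start both services under systemd
      ansible.builtin.service:
        name: "{{ item }}"
        enabled: true
        state: started
      loop:
        - docker
        - prometheus-node-exporter

    - name: Confirm the metrics endpoint answers on port 9100
      ansible.builtin.uri:
        url: "http://127.0.0.1:9100/metrics"
        status_code: 200
      register: node_exporter_probe
      retries: 5
      delay: 2
      until: node_exporter_probe.status == 200
```
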
---
### **Phase 4: Proxmox Management (Conditional)**
*Migrate hypervisor tooling if Proxmox is in use*
14. **Assess Proxmox usage** — Clarify if user is running Proxmox VE or just bare metal Docker
15. **Copy `proxmox_post_install` role** — If yes, migrate post-install configuration *(depends on step 14)*
16. **Copy `proxmox_cluster_reconcile_v2` role** — If clustered Proxmox, migrate reconciliation logic *(depends on step 14)*
17. **Skip or defer** — If no Proxmox, mark this phase complete
**Relevant files:**
- `ansible/archive/roles/proxmox_post_install/` — Post-install config for Proxmox VE 8.0-9.1.x
- `ansible/archive/roles/proxmox_cluster_reconcile_v2/` — Cluster state reconciliation
- `ansible/archive/playbooks/proxmox/` — VM provisioning, node replacement workflows
**Verification:**
1. If applicable: Run `proxmox_post_install` role against test Proxmox node
2. Validate cluster quorum and networking via Ansible facts collection
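
If Proxmox turns out to be clustered, the quorum check could be a read-only play along these lines. This is a sketch: the `proxmox_nodes` group name is hypothetical, and the `Quorate: Yes` pattern is an assumption about the `pvecm status` output that should be confirmed on a real node before relying on it.

```yaml
---
# check_proxmox_quorum.yml (hypothetical file name)
- name: Check Proxmox cluster quorum
  hosts: proxmox_nodes                 # hypothetical inventory group
  become: true
  gather_facts: false
  tasks:
    - name: Read cluster status without changing anything
      ansible.builtin.command: pvecm status
      register: pvecm_status
      changed_when: false

    - name: Assert that the cluster reports quorum
      ansible.builtin.assert:
        that:
          # Assumption: the status output contains a line like "Quorate: Yes"
          - pvecm_status.stdout is search('Quorate:[ ]+Yes')
        fail_msg: "Cluster is not quorate according to {{ inventory_hostname }}"
```
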
---
### **Phase 5: Docker Stack Deployment Adaptation**
*Convert Swarm stack orchestration to docker-compose on standalone hosts*
18. **Analyze `swarm_stack_deploy` role** — Understand orchestration pattern (validate manager, dry-run, idempotency checks)
19. **Create `docker_compose_deploy` role** — New role using `community.docker.docker_compose_v2` module for standalone deployment
20. **Port stack templates** — Copy useful stacks from `archive/templates/stacks/` (Authentik, Gitea, Plex), remove Swarm-specific directives (deploy.replicas, placement constraints)
21. **Test deployment** — Deploy one adapted stack (e.g., Portainer standalone agent)
**Relevant files:**
- `ansible/archive/roles/swarm_stack_deploy/` — **Template pattern** for idempotent stack deployment; adapt logic for docker-compose
- `ansible/archive/templates/stacks/*.yml` — Compose file format 3.9 definitions written for Swarm; remove `deploy:` sections and the obsolete top-level `version:` key so they run under the Compose v2 plugin (`docker compose`)
- `ansible/archive/playbooks/docker/deploy_*.yml` — Stack-specific orchestrators; create standalone equivalents
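
To illustrate the template cleanup, here is roughly what a stack file could look like once the Swarm-specific parts are gone. The service, image, port, and network names are illustrative only and do not come from the archive.

```yaml
# compose.yml: standalone form of a hypothetical stack
# Removed relative to a Swarm version of the same file:
#   - the top-level `version: "3.9"` key
#   - the per-service `deploy:` block (replicas, placement constraints)
#   - overlay networks
services:
  whoami:
    image: traefik/whoami:latest
    restart: unless-stopped        # takes over the role of deploy.restart_policy
    ports:
      - "8080:80"
    networks:
      - app_net

networks:
  app_net:
    driver: bridge                 # bridge network instead of overlay
```
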
**Verification:**
1. Run `docker_compose_deploy` with dry-run/validate mode
2. Deploy test stack, verify containers running (`docker ps`)
3. Re-run playbook to confirm idempotency (no container recreation)
4. Test service accessibility (e.g., Portainer UI if deployed)
**Decisions:**
- **Docker Compose version:** Use v2 (plugin-based `docker compose`, not standalone `docker-compose`)
- **Stack storage:** Keep templates in `templates/stacks/` or move to role defaults
- **Network strategy:** Bridge networks (not overlay) OR host networking for simplicity
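
The core of a `docker_compose_deploy` role might reduce to three tasks, shown here as a standalone play. This is a sketch, not the archived orchestration logic: the `docker_hosts` group, the `stack_name` value, and the `/opt/stacks` location are assumptions, and the template path assumes the stack files sit under `templates/stacks/` next to the playbook.

```yaml
---
# deploy_compose_stack.yml (hypothetical file name)
- name: Deploy one Compose project to standalone Docker hosts
  hosts: docker_hosts                  # hypothetical inventory group
  become: true
  vars:
    stack_name: "portainer_agent"      # illustrative stack name
    stack_dir: "/opt/stacks/{{ stack_name }}"   # assumed project location
  tasks:
    - name: Create the project directory
      ansible.builtin.file:
        path: "{{ stack_dir }}"
        state: directory
        mode: "0750"

    - name: Render the compose file from the stack template
      ansible.builtin.template:
        src: "stacks/{{ stack_name }}.yml"
        dest: "{{ stack_dir }}/compose.yml"
        mode: "0640"

    - name: Bring the project up (no change expected on a second run)
      community.docker.docker_compose_v2:
        project_src: "{{ stack_dir }}"
        state: present
```
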
---
### **Phase 6: Network Automation (Omada VLAN Management)**
*Phased VLAN deployment with family-safe rollback strategy*
22. **API credential setup** — Generate new Omada Open API credentials from ER7212PC controller UI
23. **Copy Omada integration** — Migrate `omada_api_smoke_test.yml`, `omada_health_inventory.yml` from archive *(depends on 22)*
24. **Baseline capture** — Run read-only playbook to snapshot current flat network config (pre-VLAN state) *(depends on 23)*
25. **VLAN design approval** — Define 2-3 starter VLANs (Main + Management, optionally IoT), document device placement *(depends on 24)*
26. **VLAN config playbook** — Create `configure_omada_vlans.yml` with dry-run/validate mode and rollback capability *(depends on 25)*
27. **Staged execution** — Apply VLANs in test window with family notification, verify connectivity, rollback if issues
**Relevant files:**
- [omada_api_smoke_test.yml](ansible/archive/playbooks/network/omada_api_smoke_test.yml) — OAuth2 auth validation against Omada Open API
- [omada_health_inventory.yml](ansible/archive/playbooks/network/omada_health_inventory.yml) — Reads sites, devices, clients from controller
- [group_vars/all.yml](ansible/archive/group_vars/all.yml) lines 20-50 — **Reference VLAN schema** (5 VLANs: main/infra/iot/guest/compute)
- Omada API documentation — TP-Link Open API docs for VLAN/network config endpoints
**Verification:**
1. Omada API authentication succeeds, credential vault decrypts properly
2. Baseline capture shows current flat network state (all devices VLAN 1/untagged)
3. Dry-run VLAN config playbook validates without applying changes
4. After approval: VLAN deployment completes, all family devices maintain connectivity
5. Rollback test: Can restore flat network config from baseline snapshot
**Equipment Details:**
- **Controller:** ER7212PC (built-in Omada SDN Controller)
- **Switch:** 1x SG2210MP v4.20 (10-port managed PoE+)
- **APs:** 2x EAP655-Wall(US) v1.0 (WiFi 6 in-wall)
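
The baseline capture in step 24 could follow the pattern below: one read-only request, written to a timestamped file that a later rollback can refer to. This is a sketch only. The base URL, the read path, and the vaulted `vault_omada_auth_header` variable are placeholders; the real endpoint and authorization header format need to come from the TP-Link Open API documentation and the archived smoke-test playbook.

```yaml
---
# omada_baseline_capture.yml (hypothetical file name)
- name: Capture the pre-VLAN network baseline (read-only)
  hosts: localhost
  connection: local
  gather_facts: false
  vars:
    omada_base_url: "https://omada.example.lan"        # placeholder
    omada_read_path: "/REPLACE/with/documented/path"   # placeholder, not a real endpoint
    omada_auth_header: "{{ vault_omada_auth_header }}" # hypothetical vaulted variable
    baseline_dir: "./baselines"
  tasks:
    - name: Create a private directory for snapshots
      ansible.builtin.file:
        path: "{{ baseline_dir }}"
        state: directory
        mode: "0700"

    - name: Read the current configuration from the controller
      ansible.builtin.uri:
        url: "{{ omada_base_url }}{{ omada_read_path }}"
        method: GET
        headers:
          Authorization: "{{ omada_auth_header }}"
        return_content: true
      register: omada_response

    - name: Write the response to a timestamped snapshot file
      ansible.builtin.copy:
        content: "{{ omada_response.json | to_nice_json }}"
        dest: "{{ baseline_dir }}/omada-baseline-{{ now(utc=true, fmt='%Y%m%dT%H%M%SZ') }}.json"
        mode: "0600"
```
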
---
### **Phase 7: Documentation & Standards**
*Consolidate governance and operational docs*
28. **Copy standards** — Migrate `archive/documentation/standards/` to `ansible/` or root `documentation/` *(parallel with step 29)*
29. **Adapt playbook guides** — Update `archive/documentation/playbooks/` guides for standalone Docker context *(parallel with step 28)*
30. **Create new runbook** — Write operator guide for standalone Docker host lifecycle (onboard, deploy stack, decommission)
31. **Network change runbook** — Document VLAN deployment process, rollback procedure, family communication template
32. **Archive obsolete** — Document what was intentionally left behind (Swarm bootstrap, Swarm monitoring, etc.) in lessons-learned
**Relevant files:**
- `ansible/archive/documentation/standards/naming-conventions.md` — Universal; copy directly
- `ansible/archive/documentation/standards/ansible-quality-gates.md` — Idempotency rules; copy directly
- `ansible/archive/documentation/contracts/` — Architecture specs; update compute plane for standalone Docker
- `ansible/archive/documentation/playbooks/` — Operator runbooks; adapt for new workflows
**Verification:**
1. Review documentation index to ensure all active playbooks have runbook guides
2. Validate naming conventions applied consistently across new roles/playbooks
3. Run linting checks using existing `.ansible-lint` config
4. Network change runbook includes family communication template and rollback SOP
---
### **Further Considerations**
1. **Monitoring Strategy** — Deploy standalone Prometheus stack on one host OR reuse Watchtower monitoring pattern from archive?
   - **Option A:** Standalone Prometheus + Grafana on dedicated monitoring host (simpler, no cluster awareness)
   - **Option B:** Adapt `monitoring_stack` role to scrape standalone Docker hosts (more sophisticated)
   - **Recommendation:** Option A if <5 hosts; Option B if scaling
2. **Inventory Management** — Keep `generate_inventory.py` script or hand-maintain `hosts.ini`?
   - **Option A:** Manually edit `hosts.ini` (simpler for small static infrastructure)
   - **Option B:** Adapt script to generate from updated SOT in `group_vars/all.yml`
   - **Recommendation:** Option B if >3 hosts or frequent topology changes
3. **Secrets Management** — Ansible Vault vs external secret store (Authentik, Bitwarden, etc.)?
   - **Current:** Archive uses Ansible Vault for DB passwords, API keys, SSH keys
   - **Alternative:** Integrate with external secret manager if deploying Authentik/Vaultwarden
   - **Recommendation:** Continue Vault for Ansible-managed secrets; use external for application credentials
4. **VLAN Segmentation Strategy** — Design before Phase 6 execution
   - **Proposed starter design:**
     - **VLAN 1 (Main/Family):** 10.0.0.0/24 — Default, existing devices, family computers/phones/tablets (zero disruption)
     - **VLAN 10 (Management/Infra):** 10.0.10.0/24 — Proxmox, NAS, Docker hosts, network gear management interfaces
     - **VLAN 50 (IoT) [Optional future]:** 10.0.50.0/24 — Smart home devices, isolated from main network
   - **Migration order:** Deploy VLANs → Move infrastructure devices → Validate → Optionally add IoT VLAN later
   - **Safety:** Family devices stay on VLAN 1 untouched, only infrastructure moves (ER7212PC, switch, NAS, servers)
   - **Rollback:** Ansible captures pre-change baseline, can restore flat network config in <5 minutes
5. **Omada API Write Capabilities** — Research before Phase 6 implementation
   - **Known endpoints (from archive):** Sites, devices, clients (GET operations confirmed working)
   - **Required for automation:** Network/VLAN create/update, switch port VLAN assignment, SSID-to-VLAN binding
   - **Action:** Generate API credentials from ER7212PC UI, check Omada controller version for API compatibility
   - **Alternative:** If Open API lacks write support for network config, create hybrid approach: Ansible generates change scripts + manual verification steps with guided UI workflow
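
The starter VLAN design from consideration 4 could be recorded as data in `group_vars/all.yml`, so the Phase 6 playbook and the documentation read from one source. The fragment below is a sketch: the key names (`vlans`, `id`, `name`, `cidr`, `purpose`, `enabled`) are hypothetical and would need to match whatever schema the archived file already uses on lines 20-50.

```yaml
# group_vars/all.yml fragment (key names are hypothetical)
vlans:
  - id: 1
    name: main
    cidr: 10.0.0.0/24
    purpose: "Default; existing family computers, phones, and tablets stay here"
    enabled: true
  - id: 10
    name: management
    cidr: 10.0.10.0/24
    purpose: "Proxmox, NAS, Docker hosts, network gear management interfaces"
    enabled: true
  - id: 50
    name: iot
    cidr: 10.0.50.0/24
    purpose: "Smart home devices, isolated from the main network"
    enabled: false                 # optional future VLAN, not part of the first rollout
```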