feat(prompts): add plan for Ansible Archive Recovery and standalone Docker adaptation
parent
016d38d5ab
commit
0bc82cfbe0
213
.github/prompts/plan-ansibleArchiveRecovery.prompt.md
vendored
Normal file
@@ -0,0 +1,213 @@
## Plan: Ansible Archive Recovery & Standalone Docker Adaptation

**TL;DR:** Extract and adapt reusable infrastructure code from the Swarm-based archive to support standalone Docker hosts. Focus on host onboarding, NFS storage, Proxmox management (if needed), and quality standards. Strip Swarm-specific orchestration while preserving idempotent patterns.

**Context:**
- Post-data-loss rebuild with standalone Docker (no Swarm)
- Archive contains production-grade framework for Proxmox + 6-node Swarm
- Current main ansible folder is empty (just linting/standards)
- Immediate need: Bare metal host onboarding

---

### **Phase 1: Foundation Setup**
*Restore Ansible controller configuration and inventory structure*

1. **Restore `ansible.cfg`** from archive → main folder *(adapt if needed for new control node)*
2. **Inventory scaffolding** — Create minimal `inventory/hosts.ini` with current standalone hosts *(depends on step 1)*
3. **Group variables** — Copy `group_vars/all.yml` structure, update for new topology *(remove Swarm VM definitions)* *(depends on step 2)*
4. **Vault infrastructure** — Copy `secrets_onboarding` role to restore encrypted credential management *(parallel with step 3)*
5. **Validation** — Verify `ansible all -m ping` succeeds

**Relevant files:**
- `ansible/archive/ansible.cfg` — Connection settings, inventory path, vault password file location
- `ansible/archive/group_vars/all.yml` — VLAN schema, host registry (SOT), lab metadata
- `ansible/archive/roles/secrets_onboarding/` — Creates vault directories, password file infrastructure
- `ansible/archive/inventory/hosts.ini` — Reference template for host grouping

**Verification:**
1. Run `ansible-config dump --only-changed` to confirm the configuration is loaded
2. Execute `ansible all -m ping` to verify connectivity to all hosts
3. Check that the vault password file exists and contains a valid credential

**Decisions:**
- **Assumption:** User has a control node (likely Watchtower or local workstation) — will need SSH keys distributed
- **Excluded:** Swarm-specific group_vars (swarm_managers, swarm_workers groups)
- **Included:** VLAN definitions, NAS paths, lab-wide standards from group_vars
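
The Phase 1 verification steps could also be captured as a small playbook so they are repeatable. This is a minimal sketch, not something taken from the archive: the file name `verify_foundation.yml`, the variable `vault_password_file`, and its default path are hypothetical, and the real path should match whatever `ansible.cfg` points at.

```yaml
# verify_foundation.yml (hypothetical name): Phase 1 smoke test
- name: Confirm every inventory host is reachable
  hosts: all
  gather_facts: false
  tasks:
    - name: Ping each host over the configured connection
      ansible.builtin.ping:

- name: Confirm the vault password file exists on the control node
  hosts: localhost
  connection: local
  gather_facts: false
  vars:
    # Assumed location; use the path configured in ansible.cfg
    vault_password_file: "~/.ansible/vault_pass.txt"
  tasks:
    - name: Stat the vault password file
      ansible.builtin.stat:
        path: "{{ vault_password_file | expanduser }}"
      register: vault_file

    - name: Fail if the file is missing or empty
      ansible.builtin.assert:
        that:
          - vault_file.stat.exists
          - vault_file.stat.size > 0
        fail_msg: "Vault password file is missing or empty"
```

This only checks that the file is present and non-empty; whether the credential actually decrypts the vault still has to be confirmed by opening an encrypted file.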

---

### **Phase 2: Core Infrastructure Roles**
*Migrate Swarm-agnostic utility roles*

6. **Copy `storage_mounts` role** — General-purpose NFS mounting (fstab + runtime) *(parallel with step 7)*
7. **Copy `control_node_sanity` role** — Pre-flight validation for Ansible controller *(parallel with step 6)*
8. **Copy `disk_grow` role** (if using Proxmox VMs) — Non-disruptive disk enlargement *(parallel with steps 6-7)*
9. **Validation** — Create test playbook that mounts a single NFS share on one host

**Relevant files:**
- `ansible/archive/roles/storage_mounts/` — Idempotent NFS mount logic; reusable 100% as-is
- `ansible/archive/roles/control_node_sanity/` — Validates Ansible version, Python, OS constraints
- `ansible/archive/roles/disk_grow/` — Proxmox-specific; skip if not using VMs

**Verification:**
1. Run the test playbook with `--syntax-check` to validate syntax, then with `--check` to preview changes
2. Execute against one test host to verify the NFS mount appears in `df -h`
3. Re-run the playbook to confirm idempotency (no changes on second run)
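
The validation playbook from step 9 might look roughly like the following. It is a sketch that uses `ansible.posix.mount` directly rather than the archive's `storage_mounts` role, whose variables are not shown here. The host name `test_host`, the NFS server, export, and mount point are placeholders, and the `nfs-common` package name assumes a Debian or Ubuntu host.

```yaml
# test_nfs_mount.yml (hypothetical name): mount one NFS share on one host
- name: Mount a single NFS share for validation
  hosts: test_host                    # assumed inventory host or group
  become: true
  vars:
    nfs_server: "nas.example.lan"     # placeholder
    nfs_export: "/volume1/docker"     # placeholder
    nfs_mount_point: "/mnt/docker"    # placeholder
  tasks:
    - name: Install the NFS client (Debian/Ubuntu package name assumed)
      ansible.builtin.apt:
        name: nfs-common
        state: present
        cache_valid_time: 3600

    - name: Mount the share and persist it in /etc/fstab
      ansible.posix.mount:
        path: "{{ nfs_mount_point }}"
        src: "{{ nfs_server }}:{{ nfs_export }}"
        fstype: nfs
        opts: "rw,hard,_netdev"
        state: mounted
```

With `state: mounted` the module creates the mount point, writes the fstab entry, and mounts the share, so a second run should report no changes if the first one succeeded.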

---

### **Phase 3: Host Onboarding Adaptation**
*Convert Swarm-oriented onboarding to standalone Docker workflow*

10. **Strip Swarm tasks** — Copy `playbooks/onboarding/generic_host.yml`, remove `swarm_bootstrap`, `swarm_node_exporter`, `swarm_cadvisor` tasks
11. **Preserve Docker install** — Retain Docker Engine installation logic (use `archive/get-docker.sh` or equivalent)
12. **Adapt monitoring** — Replace Swarm collectors with standalone Prometheus node-exporter (systemd, not Swarm service)
13. **Create standalone onboarding playbook** — Name: `onboard_docker_host.yml` *(depends on steps 10-12)*

**Relevant files:**
- `ansible/archive/playbooks/onboarding/generic_host.yml` — Base onboarding (SSH, NFS, monitoring); **remove** Swarm join logic
- `ansible/archive/scripts/day0bootstrap.sh` — Pre-Ansible setup; reusable pattern for Python, SSH, DHCP
- `ansible/archive/roles/swarm_node_exporter/` — **Reference only** for metrics collection pattern; deploy as a systemd service, not a Swarm service

**Verification:**
1. Run `onboard_docker_host.yml` against a fresh bare metal host in check mode
2. Execute full run, verify Docker installed (`docker --version`), NFS mounts present, SSH hardened
3. Check Prometheus can scrape node-exporter on port 9100 (if monitoring deployed)

**Decisions:**
- **Include:** SSH key distribution, NFS mounting, Docker Engine install, firewall rules, hostname configuration
- **Exclude:** Swarm join tokens, overlay networks, Swarm-specific labels, manager election logic
- **Monitoring approach:** Standalone systemd services OR wait for Phase 5 to decide on monitoring stack deployment
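
As an illustration of where steps 10-13 could end up, here is a hedged sketch of `onboard_docker_host.yml`. It swaps in distro packages (`docker.io`, `prometheus-node-exporter`) for the archive's `get-docker.sh` and Swarm exporter, which is an assumption that only holds on Debian or Ubuntu hosts. The group name `new_docker_hosts` is hypothetical, and SSH hardening, firewall, and hostname tasks are left out for brevity.

```yaml
# onboard_docker_host.yml: sketch of the standalone workflow
- name: Onboard a standalone Docker host
  hosts: new_docker_hosts             # assumed inventory group
  become: true
  roles:
    - role: storage_mounts            # archive role from Phase 2; runs before tasks
  tasks:
    - name: Install Docker Engine from distro packages (assumption)
      ansible.builtin.apt:
        name: docker.io
        state: present
        cache_valid_time: 3600

    - name: Ensure the Docker daemon is enabled and running
      ansible.builtin.systemd:
        name: docker
        enabled: true
        state: started

    - name: Install node-exporter as a host package, not a Swarm service
      ansible.builtin.apt:
        name: prometheus-node-exporter
        state: present

    - name: Ensure node-exporter is enabled and running (port 9100 by default)
      ansible.builtin.systemd:
        name: prometheus-node-exporter
        enabled: true
        state: started
```

If the monitoring decision is deferred to Phase 5, the last two tasks can simply be dropped without affecting the rest of the play.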

---

### **Phase 4: Proxmox Management (Conditional)**
*Migrate hypervisor tooling if Proxmox is in use*

14. **Assess Proxmox usage** — Clarify if user is running Proxmox VE or just bare metal Docker
15. **Copy `proxmox_post_install` role** — If yes, migrate post-install configuration *(depends on step 14)*
16. **Copy `proxmox_cluster_reconcile_v2` role** — If clustered Proxmox, migrate reconciliation logic *(depends on step 14)*
17. **Skip or defer** — If no Proxmox, mark this phase complete

**Relevant files:**
- `ansible/archive/roles/proxmox_post_install/` — Post-install config for Proxmox VE 8.0–9.1.x
- `ansible/archive/roles/proxmox_cluster_reconcile_v2/` — Cluster state reconciliation
- `ansible/archive/playbooks/proxmox/` — VM provisioning, node replacement workflows

**Verification:**
1. If applicable: Run `proxmox_post_install` role against test Proxmox node
2. Validate cluster quorum and networking via Ansible facts collection
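
One way to keep this phase conditional without deleting it is to gate the roles behind toggles. The sketch below assumes two hypothetical variables, `proxmox_in_use` and `proxmox_clustered`, and a hypothetical `proxmox_nodes` inventory group; none of these names come from the archive.

```yaml
# proxmox.yml (hypothetical name): run Proxmox roles only when applicable
- name: Apply Proxmox roles when the hypervisor is in use
  hosts: proxmox_nodes                # assumed inventory group
  become: true
  vars:
    proxmox_in_use: false             # hypothetical toggle, set after step 14
    proxmox_clustered: false          # hypothetical toggle
  tasks:
    - name: Run post-install configuration
      ansible.builtin.include_role:
        name: proxmox_post_install
      when: proxmox_in_use | bool

    - name: Run cluster reconciliation
      ansible.builtin.include_role:
        name: proxmox_cluster_reconcile_v2
      when:
        - proxmox_in_use | bool
        - proxmox_clustered | bool
```

With both toggles left at `false`, the play skips every task, which matches step 17 (mark the phase complete when there is no Proxmox).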

---

### **Phase 5: Docker Stack Deployment Adaptation**
*Convert Swarm stack orchestration to docker-compose on standalone hosts*

18. **Analyze `swarm_stack_deploy` role** — Understand orchestration pattern (validate manager, dry-run, idempotency checks)
19. **Create `docker_compose_deploy` role** — New role using `community.docker.docker_compose_v2` module for standalone deployment
20. **Port stack templates** — Copy useful stacks from `archive/templates/stacks/` (Authentik, Gitea, Plex), remove Swarm-specific directives (deploy.replicas, placement constraints)
21. **Test deployment** — Deploy one adapted stack (e.g., Portainer standalone agent)

**Relevant files:**
- `ansible/archive/roles/swarm_stack_deploy/` — **Template pattern** for idempotent stack deployment; adapt logic for docker-compose
- `ansible/archive/templates/stacks/*.yml` — Compose file format 3.9 definitions; remove Swarm-only `deploy:` settings and adapt the files for the Compose v2 plugin
- `ansible/archive/playbooks/docker/deploy_*.yml` — Stack-specific orchestrators; create standalone equivalents

**Verification:**
1. Run `docker_compose_deploy` with dry-run/validate mode
2. Deploy test stack, verify containers running (`docker ps`)
3. Re-run playbook to confirm idempotency (no container recreation)
4. Test service accessibility (e.g., Portainer UI if deployed)

**Decisions:**
- **Docker Compose version:** Use v2 (plugin-based `docker compose`, not standalone `docker-compose`)
- **Stack storage:** Keep templates in `templates/stacks/` or move to role defaults
- **Network strategy:** Bridge networks (not overlay) OR host networking for simplicity
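
The core of the new `docker_compose_deploy` role from step 19 could be as small as the tasks below. This is a sketch under stated assumptions: `compose_project_dir` and `compose_template` are hypothetical role variables, and the archive's `swarm_stack_deploy` checks (manager validation and so on) are not reproduced.

```yaml
# roles/docker_compose_deploy/tasks/main.yml (sketch)
- name: Create the project directory on the host
  ansible.builtin.file:
    path: "{{ compose_project_dir }}"
    state: directory
    mode: "0750"

- name: Render the Compose file from a stack template
  ansible.builtin.template:
    src: "{{ compose_template }}"
    dest: "{{ compose_project_dir }}/compose.yml"
    mode: "0640"

- name: Bring the project up with the Compose v2 plugin
  community.docker.docker_compose_v2:
    project_src: "{{ compose_project_dir }}"
    state: present
    remove_orphans: true
  register: compose_result
```

Running the role with `--check` gives the dry-run from the verification list, and a second real run should leave `compose_result.changed` false when nothing in the template has changed.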

---

### **Phase 6: Network Automation (Omada VLAN Management)**
*Phased VLAN deployment with family-safe rollback strategy*

22. **API credential setup** — Generate new Omada Open API credentials from ER7212PC controller UI
23. **Copy Omada integration** — Migrate `omada_api_smoke_test.yml`, `omada_health_inventory.yml` from archive *(depends on 22)*
24. **Baseline capture** — Run read-only playbook to snapshot current flat network config (pre-VLAN state) *(depends on 23)*
25. **VLAN design approval** — Define 2-3 starter VLANs (Main + Management, optionally IoT), document device placement *(depends on 24)*
26. **VLAN config playbook** — Create `configure_omada_vlans.yml` with dry-run/validate mode and rollback capability *(depends on 25)*
27. **Staged execution** — Apply VLANs in test window with family notification, verify connectivity, roll back if issues arise

**Relevant files:**
- [omada_api_smoke_test.yml](ansible/archive/playbooks/network/omada_api_smoke_test.yml) — OAuth2 auth validation against Omada Open API
- [omada_health_inventory.yml](ansible/archive/playbooks/network/omada_health_inventory.yml) — Reads sites, devices, clients from controller
- [group_vars/all.yml](ansible/archive/group_vars/all.yml) lines 20-50 — **Reference VLAN schema** (5 VLANs: main/infra/iot/guest/compute)
- Omada API documentation — TP-Link Open API docs for VLAN/network config endpoints

**Verification:**
1. Omada API authentication succeeds, credential vault decrypts properly
2. Baseline capture shows current flat network state (all devices VLAN 1/untagged)
3. Dry-run VLAN config playbook validates without applying changes
4. After approval: VLAN deployment completes, all family devices maintain connectivity
5. Rollback test: Can restore flat network config from baseline snapshot

**Equipment Details:**
- **Controller:** ER7212PC (built-in Omada SDN Controller)
- **Switch:** 1x SG2210MP v4.20 (10-port managed PoE+)
- **APs:** 2x EAP655-Wall(US) v1.0 (WiFi 6 in-wall)
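
The read-only baseline capture from step 24 might be sketched as follows. Everything specific to Omada here is an assumption to be checked against the controller's Open API documentation and the archived smoke-test playbook: the token and sites URLs, the `AccessToken=` header format, and the `result.accessToken` response field. The variables `omada_id`, `omada_client_id`, and `omada_client_secret` are hypothetical names expected to come from the vault, and the file name is likewise made up.

```yaml
# omada_baseline_capture.yml (hypothetical name): read-only snapshot
- name: Capture the pre-VLAN Omada configuration
  hosts: localhost
  connection: local
  gather_facts: false
  vars:
    # Placeholders and assumptions; take real URLs from the Open API docs
    omada_token_url: "https://omada.example.lan/openapi/authorize/token?grant_type=client_credentials"
    omada_sites_url: "https://omada.example.lan/openapi/v1/{{ omada_id }}/sites?page=1&pageSize=100"
    baseline_file: "./baseline/omada_sites.json"
  tasks:
    - name: Request an access token
      ansible.builtin.uri:
        url: "{{ omada_token_url }}"
        method: POST
        body_format: json
        body:
          omadacId: "{{ omada_id }}"
          client_id: "{{ omada_client_id }}"
          client_secret: "{{ omada_client_secret }}"
        validate_certs: false         # assumes a self-signed controller cert
        return_content: true
      register: token_response
      no_log: true

    - name: Read the site list (GET only, no changes are made)
      ansible.builtin.uri:
        url: "{{ omada_sites_url }}"
        method: GET
        headers:
          Authorization: "AccessToken={{ token_response.json.result.accessToken }}"
        validate_certs: false
        return_content: true
      register: sites_response

    - name: Ensure the baseline directory exists
      ansible.builtin.file:
        path: "{{ baseline_file | dirname }}"
        state: directory
        mode: "0700"

    - name: Save the response as the rollback baseline
      ansible.builtin.copy:
        content: "{{ sites_response.json | to_nice_json }}"
        dest: "{{ baseline_file }}"
        mode: "0600"
```

A complete baseline would repeat the GET-and-save pair for devices, clients, and network settings; only sites are shown to keep the pattern visible.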

---

### **Phase 7: Documentation & Standards**
*Consolidate governance and operational docs*

28. **Copy standards** — Migrate `archive/documentation/standards/` to `ansible/` or root `documentation/` *(parallel with step 29)*
29. **Adapt playbook guides** — Update `archive/documentation/playbooks/` guides for standalone Docker context *(parallel with step 28)*
30. **Create new runbook** — Write operator guide for standalone Docker host lifecycle (onboard, deploy stack, decommission)
31. **Network change runbook** — Document VLAN deployment process, rollback procedure, family communication template
32. **Archive obsolete** — Document what was intentionally left behind (Swarm bootstrap, Swarm monitoring, etc.) in lessons-learned

**Relevant files:**
- `ansible/archive/documentation/standards/naming-conventions.md` — Universal; copy directly
- `ansible/archive/documentation/standards/ansible-quality-gates.md` — Idempotency rules; copy directly
- `ansible/archive/documentation/contracts/` — Architecture specs; update compute plane for standalone Docker
- `ansible/archive/documentation/playbooks/` — Operator runbooks; adapt for new workflows

**Verification:**
1. Review documentation index to ensure all active playbooks have runbook guides
2. Validate naming conventions applied consistently across new roles/playbooks
3. Run linting checks using existing `.ansible-lint` config
4. Network change runbook includes family communication template and rollback SOP

---

### **Further Considerations**

1. **Monitoring Strategy** — Deploy standalone Prometheus stack on one host OR reuse Watchtower monitoring pattern from archive?
   - **Option A:** Standalone Prometheus + Grafana on dedicated monitoring host (simpler, no cluster awareness)
   - **Option B:** Adapt `monitoring_stack` role to scrape standalone Docker hosts (more sophisticated)
   - **Recommendation:** Option A if <5 hosts; Option B if scaling

2. **Inventory Management** — Keep `generate_inventory.py` script or hand-maintain `hosts.ini`?
   - **Option A:** Manually edit `hosts.ini` (simpler for small static infrastructure)
   - **Option B:** Adapt script to generate from updated SOT in `group_vars/all.yml`
   - **Recommendation:** Option B if >3 hosts or frequent topology changes

3. **Secrets Management** — Ansible Vault vs external secret store (Authentik, Bitwarden, etc.)?
   - **Current:** Archive uses Ansible Vault for DB passwords, API keys, SSH keys
   - **Alternative:** Integrate with external secret manager if deploying Authentik/Vaultwarden
   - **Recommendation:** Continue Vault for Ansible-managed secrets; use external for application credentials

4. **VLAN Segmentation Strategy** — Design before Phase 6 execution
   - **Proposed starter design:**
     - **VLAN 1 (Main/Family):** 10.0.0.0/24 — Default, existing devices, family computers/phones/tablets (zero disruption)
     - **VLAN 10 (Management/Infra):** 10.0.10.0/24 — Proxmox, NAS, Docker hosts, network gear management interfaces
     - **VLAN 50 (IoT) [Optional future]:** 10.0.50.0/24 — Smart home devices, isolated from main network
   - **Migration order:** Deploy VLANs → Move infrastructure devices → Validate → Optionally add IoT VLAN later
   - **Safety:** Family devices stay on VLAN 1 untouched, only infrastructure moves (ER7212PC, switch, NAS, servers)
   - **Rollback:** Ansible captures pre-change baseline, can restore flat network config in <5 minutes

5. **Omada API Write Capabilities** — Research before Phase 6 implementation
   - **Known endpoints (from archive):** Sites, devices, clients (GET operations confirmed working)
   - **Required for automation:** Network/VLAN create/update, switch port VLAN assignment, SSID-to-VLAN binding
   - **Action:** Generate API credentials from ER7212PC UI, check Omada controller version for API compatibility
   - **Alternative:** If Open API lacks write support for network config, create hybrid approach: Ansible generates change scripts + manual verification steps with guided UI workflow
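
The starter VLAN design from item 4 could be recorded in `group_vars/all.yml` so that both the Omada playbooks and host onboarding read from one source of truth. The key names below (`vlans`, `id`, `subnet`, `purpose`, `enabled`) are hypothetical and are not taken from the archive's existing five-VLAN schema; only the IDs and subnets come from the proposal above.

```yaml
# group_vars/all.yml excerpt (sketch): starter VLAN schema
vlans:
  main:
    id: 1
    subnet: "10.0.0.0/24"
    purpose: "Family devices, default network"
    enabled: true
  management:
    id: 10
    subnet: "10.0.10.0/24"
    purpose: "Proxmox, NAS, Docker hosts, network gear management"
    enabled: true
  iot:
    id: 50
    subnet: "10.0.50.0/24"
    purpose: "Smart home devices (optional, future)"
    enabled: false
```

Keeping the IoT entry present but disabled lets the later rollout be a one-line change instead of a schema edit.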