Plan: Ansible Archive Recovery & Standalone Docker Adaptation
TL;DR: Extract and adapt reusable infrastructure code from the Swarm-based archive to support standalone Docker hosts. Focus on host onboarding, NFS storage, Proxmox management (if needed), and quality standards. Strip Swarm-specific orchestration while preserving idempotent patterns.
Context:
- Post-data-loss rebuild with standalone Docker (no Swarm)
- Archive contains production-grade framework for Proxmox + 6-node Swarm
- Current main ansible folder is effectively empty (only linting/standards files)
- Immediate need: Bare metal host onboarding
Phase 1: Foundation Setup
Restore Ansible controller configuration and inventory structure
1. Restore `ansible.cfg` from archive → main folder (adapt if needed for new control node)
2. Inventory scaffolding — Create minimal `inventory/hosts.ini` with current standalone hosts (depends on step 1)
3. Group variables — Copy `group_vars/all.yml` structure, update for new topology (remove Swarm VM definitions) (depends on step 2)
4. Vault infrastructure — Copy `secrets_onboarding` role to restore encrypted credential management (parallel with step 3)
5. Validation — Verify `ansible all -m ping` succeeds
Relevant files:
- `ansible/archive/ansible.cfg` — Connection settings, inventory path, vault password file location
- `ansible/archive/group_vars/all.yml` — VLAN schema, host registry (SOT), lab metadata
- `ansible/archive/roles/secrets_onboarding/` — Creates vault directories, password file infrastructure
- `ansible/archive/inventory/hosts.ini` — Reference template for host grouping
Verification:
- Run `ansible-config dump --only-changed` to confirm configuration loaded
- Execute `ansible all -m ping` to verify connectivity to all hosts
- Check vault password file exists and contains valid credential
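The vault password file check in the list above could be scripted along the following lines. This is a minimal sketch, not part of the archive: the function name is hypothetical, the file location is passed in because the plan does not state it, and "valid" is reduced to "exists, non-empty, not readable by group or others", since a script cannot tell whether the credential actually decrypts the vault.

```python
import stat
from pathlib import Path


def vault_password_file_problems(path: str) -> list:
    """Return a list of problems; an empty list means the basic checks passed."""
    file = Path(path).expanduser()
    if not file.is_file():
        return [f"{file} does not exist or is not a regular file"]
    problems = []
    if not file.read_text().strip():
        problems.append(f"{file} is empty")
    # Any permission bit set for group or others exposes the secret locally.
    if stat.S_IMODE(file.stat().st_mode) & 0o077:
        problems.append(f"{file} is accessible by group or others")
    return problems
```

Calling it with the path configured as the vault password file in `ansible.cfg` and printing the returned list gives a quick pre-flight result before running any playbook that needs the vault.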
Decisions:
- Assumption: User has a control node (likely Watchtower or local workstation) — will need SSH keys distributed
- Excluded: Swarm-specific group_vars (swarm_managers, swarm_workers groups)
- Included: VLAN definitions, NAS paths, lab-wide standards from group_vars
Phase 2: Core Infrastructure Roles
Migrate Swarm-agnostic utility roles
6. Copy `storage_mounts` role — General-purpose NFS mounting (fstab + runtime) (parallel with step 7)
7. Copy `control_node_sanity` role — Pre-flight validation for Ansible controller (parallel with step 6)
8. Copy `disk_grow` role (if using Proxmox VMs) — Non-disruptive disk enlargement (parallel with steps 6-7)
9. Validation — Create test playbook that mounts a single NFS share on one host
Relevant files:
- `ansible/archive/roles/storage_mounts/` — Idempotent NFS mount logic; reusable 100% as-is
- `ansible/archive/roles/control_node_sanity/` — Validates Ansible version, Python, OS constraints
- `ansible/archive/roles/disk_grow/` — Proxmox-specific; skip if not using VMs
Verification:
- Run test playbook in `--check` mode as a dry run (`--syntax-check` covers syntax-only validation)
- Execute against one test host to verify NFS mount appears in `df -h`
- Re-run playbook to confirm idempotency (no changes on second run)
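The idempotency check (no changes on the second run) can be automated by reading the `PLAY RECAP` block that `ansible-playbook` prints at the end of a run. The sketch below assumes the standard recap line format (`host : ok=N changed=N unreachable=N failed=N ...`) and uncoloured output; the function names are illustrative and not taken from the archive.

```python
import re

# Matches one host line of the recap, e.g.
# "dock01 : ok=5 changed=0 unreachable=0 failed=0 skipped=1 rescued=0 ignored=0"
RECAP_LINE = re.compile(r"^(?P<host>\S+)\s*:\s*(?P<counters>(?:\w+=\d+\s*)+)$")


def parse_recap(output: str) -> dict:
    """Return {host: {counter: value}} for every line after 'PLAY RECAP'."""
    results = {}
    in_recap = False
    for line in output.splitlines():
        if line.startswith("PLAY RECAP"):
            in_recap = True
            continue
        if not in_recap:
            continue
        match = RECAP_LINE.match(line.strip())
        if match:
            pairs = (item.split("=") for item in match.group("counters").split())
            results[match.group("host")] = {key: int(value) for key, value in pairs}
    return results


def is_idempotent(output: str) -> bool:
    """True when a recap exists and no host reports changes or failures."""
    recap = parse_recap(output)
    return bool(recap) and all(
        counts.get("changed", 0) == 0
        and counts.get("failed", 0) == 0
        and counts.get("unreachable", 0) == 0
        for counts in recap.values()
    )
```

Feeding the captured output of the second playbook run to `is_idempotent` turns the manual "re-run and look" step into a pass/fail check that can also be reused in later phases.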
Phase 3: Host Onboarding Adaptation
Convert Swarm-oriented onboarding to standalone Docker workflow
10. Strip Swarm tasks — Copy `playbooks/onboarding/generic_host.yml`, remove `swarm_bootstrap`, `swarm_node_exporter`, `swarm_cadvisor` tasks
11. Preserve Docker install — Retain Docker Engine installation logic (use `archive/get-docker.sh` or equivalent)
12. Adapt monitoring — Replace Swarm collectors with standalone Prometheus node-exporter (systemd, not Swarm service)
13. Create standalone onboarding playbook — Name: `onboard_docker_host.yml` (depends on steps 10-12)
Relevant files:
- `ansible/archive/playbooks/onboarding/generic_host.yml` — Base onboarding (SSH, NFS, monitoring); remove Swarm join logic
- `ansible/archive/scripts/day0bootstrap.sh` — Pre-Ansible setup; reusable pattern for Python, SSH, DHCP
- `ansible/archive/roles/swarm_node_exporter/` — Reference only for metrics collection pattern; deploy as systemd not Swarm
Verification:
- Run `onboard_docker_host.yml` against a fresh bare metal host in check mode
- Execute full run, verify Docker installed (`docker --version`), NFS mounts present, SSH hardened
- Check Prometheus can scrape node-exporter on port 9100 (if monitoring deployed)
Decisions:
- Include: SSH key distribution, NFS mounting, Docker Engine install, firewall rules, hostname configuration
- Exclude: Swarm join tokens, overlay networks, Swarm-specific labels, manager election logic
- Monitoring approach: Standalone systemd services OR wait for Phase 5 to decide on monitoring stack deployment
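The Swarm-stripping step of this phase amounts to filtering the task list of the copied playbook. As a rough sketch, assuming the three Swarm items are pulled in through `include_role` or `import_role` (the actual structure of `generic_host.yml` is not shown in this plan) and that the task list has already been parsed into Python dictionaries, the filter could look like this. The function names are hypothetical.

```python
from typing import Optional

# Role names listed in the plan as Swarm-specific.
SWARM_ROLES = {"swarm_bootstrap", "swarm_node_exporter", "swarm_cadvisor"}

# Task keys that reference a role, in short and fully qualified form.
ROLE_KEYS = (
    "include_role",
    "import_role",
    "ansible.builtin.include_role",
    "ansible.builtin.import_role",
)


def referenced_role(task: dict) -> Optional[str]:
    """Return the role a task pulls in, or None for ordinary module tasks."""
    for key in ROLE_KEYS:
        value = task.get(key)
        if isinstance(value, dict):
            return value.get("name")
    return None


def strip_swarm_tasks(tasks: list) -> list:
    """Keep every task that does not reference a Swarm-specific role."""
    return [task for task in tasks if referenced_role(task) not in SWARM_ROLES]
```

If the archive references these items differently (for example under a play-level `roles:` list), the same set of names can be applied to that list instead; a manual review of the result is still advisable before the first run.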
Phase 4: Proxmox Management (Conditional)
Migrate hypervisor tooling if Proxmox is in use
14. Assess Proxmox usage — Clarify if user is running Proxmox VE or just bare metal Docker
15. Copy `proxmox_post_install` role — If yes, migrate post-install configuration (depends on step 14)
16. Copy `proxmox_cluster_reconcile_v2` role — If clustered Proxmox, migrate reconciliation logic (depends on step 14)
17. Skip or defer — If no Proxmox, mark this phase complete
Relevant files:
- `ansible/archive/roles/proxmox_post_install/` — Post-install config for Proxmox VE 8.0–9.1.x
- `ansible/archive/roles/proxmox_cluster_reconcile_v2/` — Cluster state reconciliation
- `ansible/archive/playbooks/proxmox/` — VM provisioning, node replacement workflows
Verification:
- If applicable: Run `proxmox_post_install` role against test Proxmox node
- Validate cluster quorum and networking via Ansible facts collection
Phase 5: Docker Stack Deployment Adaptation
Convert Swarm stack orchestration to docker-compose on standalone hosts
18. Analyze `swarm_stack_deploy` role — Understand orchestration pattern (validate manager, dry-run, idempotency checks)
19. Create `docker_compose_deploy` role — New role using `community.docker.docker_compose_v2` module for standalone deployment
20. Port stack templates — Copy useful stacks from `archive/templates/stacks/` (Authentik, Gitea, Plex), remove Swarm-specific directives (deploy.replicas, placement constraints)
21. Test deployment — Deploy one adapted stack (e.g., Portainer standalone agent)
Relevant files:
- `ansible/archive/roles/swarm_stack_deploy/` — Template pattern for idempotent stack deployment; adapt logic for docker-compose
- `ansible/archive/templates/stacks/*.yml` — Compose file format 3.9 definitions; remove `deploy:` sections and adapt for the Compose v2 plugin (`docker compose`)
- `ansible/archive/playbooks/docker/deploy_*.yml` — Stack-specific orchestrators; create standalone equivalents
Verification:
- Run `docker_compose_deploy` with dry-run/validate mode
- Deploy test stack, verify containers running (`docker ps`)
- Re-run playbook to confirm idempotency (no container recreation)
- Test service accessibility (e.g., Portainer UI if deployed)
Decisions:
- Docker Compose version: Use v2 (plugin-based `docker compose`, not standalone `docker-compose`)
- Stack storage: Keep templates in `templates/stacks/` or move to role defaults
- Network strategy: Bridge networks (not overlay) OR host networking for simplicity
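Porting the stack templates in this phase comes down to a few mechanical edits per file: drop each service's `deploy:` section and switch overlay networks to bridge. The sketch below shows that transformation on an already-parsed Compose document (a plain dictionary). The function name is hypothetical, and dropping the top-level `version` key is an added assumption, based on recent releases of the `docker compose` plugin treating it as obsolete.

```python
import copy


def to_standalone(compose: dict) -> dict:
    """Return a copy of a parsed Compose file without Swarm-only settings."""
    result = copy.deepcopy(compose)
    # Assumption: the top-level `version` key is no longer needed by the plugin.
    result.pop("version", None)
    for service in (result.get("services") or {}).values():
        # Removes replicas, placement constraints and other Swarm settings.
        service.pop("deploy", None)
    for network in (result.get("networks") or {}).values():
        # Networks declared without options are parsed as None; skip those.
        if isinstance(network, dict) and network.get("driver") == "overlay":
            network["driver"] = "bridge"
    return result
```

Settings that only existed under `deploy:` (restart policy, resource limits) are lost by this approach, so each ported stack still needs a review to decide whether they should be re-expressed with service-level keys.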
Phase 6: Network Automation (Omada VLAN Management)
Phased VLAN deployment with family-safe rollback strategy
22. API credential setup — Generate new Omada Open API credentials from ER7212PC controller UI
23. Copy Omada integration — Migrate `omada_api_smoke_test.yml`, `omada_health_inventory.yml` from archive (depends on step 22)
24. Baseline capture — Run read-only playbook to snapshot current flat network config (pre-VLAN state) (depends on step 23)
25. VLAN design approval — Define 2-3 starter VLANs (Main + Management, optionally IoT), document device placement (depends on step 24)
26. VLAN config playbook — Create `configure_omada_vlans.yml` with dry-run/validate mode and rollback capability (depends on step 25)
27. Staged execution — Apply VLANs in test window with family notification, verify connectivity, rollback if issues
Relevant files:
- omada_api_smoke_test.yml — OAuth2 auth validation against Omada Open API
- omada_health_inventory.yml — Reads sites, devices, clients from controller
- group_vars/all.yml lines 20-50 — Reference VLAN schema (5 VLANs: main/infra/iot/guest/compute)
- Omada API documentation — TP-Link Open API docs for VLAN/network config endpoints
Verification:
- Omada API authentication succeeds, credential vault decrypts properly
- Baseline capture shows current flat network state (all devices VLAN 1/untagged)
- Dry-run VLAN config playbook validates without applying changes
- After approval: VLAN deployment completes, all family devices maintain connectivity
- Rollback test: Can restore flat network config from baseline snapshot
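The rollback test depends on being able to compare the baseline snapshot with the controller's current state. A minimal sketch of that comparison follows. It assumes both snapshots have been reduced to a dictionary keyed by network name; the real shape of the Omada Open API responses is not documented in this plan, so the structure and the function name are purely illustrative.

```python
def rollback_plan(baseline: dict, current: dict) -> dict:
    """Compare two snapshots keyed by network name and list the rollback actions."""
    return {
        # Networks created since the baseline would have to be deleted.
        "delete": sorted(set(current) - set(baseline)),
        # Networks missing since the baseline would have to be recreated.
        "create": sorted(set(baseline) - set(current)),
        # Networks present in both but with different settings get reset.
        "restore": sorted(
            name
            for name in set(baseline) & set(current)
            if baseline[name] != current[name]
        ),
    }
```

An empty plan (all three lists empty) after a rollback is a simple way to confirm that the flat network configuration was actually restored.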
Equipment Details:
- Controller: ER7212PC (built-in Omada SDN Controller)
- Switch: 1x SG2210MP v4.20 (10-port managed PoE+)
- APs: 2x EAP655-Wall(US) v1.0 (WiFi 6 in-wall)
Phase 7: Documentation & Standards
Consolidate governance and operational docs
28. Copy standards — Migrate `archive/documentation/standards/` to `ansible/` or root `documentation/` (parallel with step 29)
29. Adapt playbook guides — Update `archive/documentation/playbooks/` guides for standalone Docker context (parallel with step 28)
30. Create new runbook — Write operator guide for standalone Docker host lifecycle (onboard, deploy stack, decommission)
31. Network change runbook — Document VLAN deployment process, rollback procedure, family communication template
32. Archive obsolete — Document what was intentionally left behind (Swarm bootstrap, Swarm monitoring, etc.) in lessons-learned
Relevant files:
- `ansible/archive/documentation/standards/naming-conventions.md` — Universal; copy directly
- `ansible/archive/documentation/standards/ansible-quality-gates.md` — Idempotency rules; copy directly
- `ansible/archive/documentation/contracts/` — Architecture specs; update compute plane for standalone Docker
- `ansible/archive/documentation/playbooks/` — Operator runbooks; adapt for new workflows
Verification:
- Review documentation index to ensure all active playbooks have runbook guides
- Validate naming conventions applied consistently across new roles/playbooks
- Run linting checks using existing `.ansible-lint` config
- Network change runbook includes family communication template and rollback SOP
Further Considerations
- Monitoring Strategy — Deploy standalone Prometheus stack on one host OR reuse Watchtower monitoring pattern from archive?
  - Option A: Standalone Prometheus + Grafana on dedicated monitoring host (simpler, no cluster awareness)
  - Option B: Adapt `monitoring_stack` role to scrape standalone Docker hosts (more sophisticated)
  - Recommendation: Option A if <5 hosts; Option B if scaling
- Inventory Management — Keep `generate_inventory.py` script or hand-maintain `hosts.ini`?
  - Option A: Manually edit `hosts.ini` (simpler for small static infrastructure)
  - Option B: Adapt script to generate from updated SOT in `group_vars/all.yml`
  - Recommendation: Option B if >3 hosts or frequent topology changes
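For Option B, the core of an adapted generator is small. The sketch below renders an INI inventory from a host registry. The registry layout (`ip` and `groups` keys) and the default `docker_hosts` group are assumptions made for illustration; the real schema in `group_vars/all.yml` and the existing `generate_inventory.py` may differ.

```python
def render_hosts_ini(hosts: dict, default_group: str = "docker_hosts") -> str:
    """Render an Ansible INI inventory from {hostname: {"ip": ..., "groups": [...]}}."""
    groups = {}
    for name, attrs in sorted(hosts.items()):
        line = f"{name} ansible_host={attrs['ip']}"
        # Hosts without an explicit group list fall into the default group.
        for group in attrs.get("groups") or [default_group]:
            groups.setdefault(group, []).append(line)
    sections = [
        f"[{group}]\n" + "\n".join(groups[group]) for group in sorted(groups)
    ]
    return "\n\n".join(sections) + "\n"
```

Because hosts and groups are sorted, repeated runs over an unchanged registry produce byte-identical output, which keeps the generated `hosts.ini` quiet in version control.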
- Secrets Management — Ansible Vault vs external secret store (Authentik, Bitwarden, etc.)?
  - Current: Archive uses Ansible Vault for DB passwords, API keys, SSH keys
  - Alternative: Integrate with external secret manager if deploying Authentik/Vaultwarden
  - Recommendation: Continue Vault for Ansible-managed secrets; use external for application credentials
- VLAN Segmentation Strategy — Design before Phase 6 execution
  - Proposed starter design:
    - VLAN 1 (Main/Family): 10.0.0.0/24 — Default, existing devices, family computers/phones/tablets (zero disruption)
    - VLAN 10 (Management/Infra): 10.0.10.0/24 — Proxmox, NAS, Docker hosts, network gear management interfaces
    - VLAN 50 (IoT) [Optional future]: 10.0.50.0/24 — Smart home devices, isolated from main network
  - Migration order: Deploy VLANs → Move infrastructure devices → Validate → Optionally add IoT VLAN later
  - Safety: Family devices stay on VLAN 1 untouched, only infrastructure moves (ER7212PC, switch, NAS, servers)
  - Rollback: Ansible captures pre-change baseline, can restore flat network config in <5 minutes
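Before the starter design goes to approval, it can be checked mechanically for invalid VLAN IDs, malformed subnets and overlapping address ranges. The sketch below uses only the Python standard library; the `VLAN_PLAN` dictionary mirrors the three proposed VLANs, while its layout and the function name are illustrative choices.

```python
import ipaddress
from itertools import combinations

# VLAN ID -> (name, subnet), taken from the proposed starter design.
VLAN_PLAN = {
    1: ("Main/Family", "10.0.0.0/24"),
    10: ("Management/Infra", "10.0.10.0/24"),
    50: ("IoT", "10.0.50.0/24"),
}


def validate_vlan_plan(plan: dict) -> list:
    """Return human-readable problems; an empty list means the plan is consistent."""
    problems = []
    networks = {}
    for vlan_id, (name, cidr) in plan.items():
        # 802.1Q reserves 0 and 4095, leaving 1-4094 as usable IDs.
        if not 1 <= vlan_id <= 4094:
            problems.append(f"VLAN {vlan_id} ({name}): ID outside 1-4094")
        try:
            networks[vlan_id] = ipaddress.ip_network(cidr)
        except ValueError as error:
            problems.append(f"VLAN {vlan_id} ({name}): {error}")
    for (id_a, net_a), (id_b, net_b) in combinations(networks.items(), 2):
        if net_a.overlaps(net_b):
            problems.append(f"VLAN {id_a} and VLAN {id_b} overlap: {net_a} / {net_b}")
    return problems
```

Running `validate_vlan_plan(VLAN_PLAN)` on the proposed design returns an empty list; the same check can be repeated whenever a VLAN is added or a subnet is resized.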
- Omada API Write Capabilities — Research before Phase 6 implementation
  - Known endpoints (from archive): Sites, devices, clients (GET operations confirmed working)
  - Required for automation: Network/VLAN create/update, switch port VLAN assignment, SSID-to-VLAN binding
  - Action: Generate API credentials from ER7212PC UI, check Omada controller version for API compatibility
  - Alternative: If Open API lacks write support for network config, create hybrid approach: Ansible generates change scripts + manual verification steps with guided UI workflow