homelab/ansible/archive/outputs/SWARM_TOPOLOGY_ANALYSIS_20260312.md

EXECUTIVE SUMMARY

Finding: pve03 and pve04 are NOT identical, with meaningful differences:

  • pve03: 10 cores, 23.6 GB RAM, unknown storage capacity (already clustered, running 3 VMs)
  • pve04: 14 cores, 15 GB RAM, 238.5 GB NVMe SSD (fresh, not yet clustered)

Recommendation for "3 identically-spec'd devices":

  • Option A (Recommended): Use pve04 as the template model. Procurement should source 3× Intel Core i5-13500T machines with 16 GB RAM (pve04 reports 15 GB usable) and 240+ GB NVMe storage. pve04 is the better baseline (higher peak clock, dedicated NVMe, fresh OS).
  • Option B: Keep pve03 as template. Run a deeper audit of pve03's actual storage (it has 21 loop/dm devices, and it is unclear whether additional storage is attached). Backfill pve04 and a 3rd host to match pve03's full config.

Verdict: pve04 > pve03 for Swarm baseline. The i5-13500T offers superior CPU performance (4600 MHz boost vs 2885 MHz), dedicated fast storage, and is freshly provisioned. Use pve04 as the reference architecture for the 3rd node.


DETAILED HARDWARE COMPARISON

CPU Specifications

| Dimension | pve03 | pve04 | Status |
|---|---|---|---|
| Model | Unknown / unrecognized | Intel Core i5-13500T | pve04 superior |
| Architecture | x86_64 | x86_64 | Match |
| Socket Count | 1 | 1 | Match |
| Cores per Socket | 10 | 14 | ⚠️ MISMATCH |
| Logical CPUs (with HT) | 10 | 20 | ⚠️ MISMATCH |
| Max Frequency | 2,885 MHz | 4,600 MHz | ⚠️ pve04 peak clock ~59% higher |
| Min Frequency | Unknown | 800 MHz | |
| Microcode Level | 0x437 | 0x3a | |

Interpretation:

  • pve04's i5-13500T is a 13th-gen Intel desktop CPU (2023) with a much higher peak clock than pve03's processor
  • pve03's CPU model was not identified by the scan. It could be a power-limited part or a different i5/i7 SKU; this needs clarification before the two generations can be compared
  • For Docker Swarm workloads: pve04's higher peak clock (4600 MHz) favors latency-sensitive tasks; pve03's 10 cores are still adequate for the planned 2 VMs (manager + worker) per node
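
The gap between the two reported maximum clocks can be checked with a small helper. This is a minimal sketch using integer arithmetic only; `pct_faster` is a hypothetical name, and peak clock is only a rough proxy for single-thread performance, since pve03's CPU model is unknown.

```shell
# Percentage by which clock "a" exceeds clock "b", rounded to the nearest integer.
# Integer arithmetic only, so it runs in any POSIX shell.
pct_faster() {
    a=$1
    b=$2
    echo $(( ((a - b) * 100 + b / 2) / b ))
}

# Max frequencies from the CPU comparison (MHz): pve04 vs pve03.
pct_faster 4600 2885
```

With the reported values this prints 59, i.e. pve04's peak clock is roughly 59% higher than pve03's.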

Recommendation: If strict "identical" is the mandate, pve04 is the better model to replicate. Purchasing 3× i5-13500T machines ensures:

  1. Consistent single-threaded performance
  2. Known thermal/power envelope
  3. Support (retail CPUs, widely available)

Memory (RAM) Specifications

| Dimension | pve03 | pve04 | Status |
|---|---|---|---|
| Total RAM | 23.6 GB | 15.0 GB | ⚠️ MISMATCH |
| Free RAM | 12.4 GB | 13.0 GB | ⚠️ pve03 has extra, currently used |
| Used by OS + Proxmox | ~11.2 GB | ~1.7 GB | ⚠️ pve03 heavier |

Interpretation:

  • pve03: 23.6 GB total (likely 24 GB installed, e.g. 2× 12 GB or 3× 8 GB SODIMM/UDIMM sticks)
  • pve04: 15 GB total (likely 1× 16 GB, with about 1 GB reserved by firmware or integrated graphics)
  • pve03 is using ~11 GB for the OS, the Proxmox daemons, and 3 running VMs
  • pve04 is minimal (fresh install, no VMs)

Validation Against Swarm Requirements:

  • Each node will host 2 VMs: 1 manager (2 cores, 2 GB RAM) + 1 worker (2 cores, 2 GB RAM)
  • Proxmox overhead: ~2-4 GB per node
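  • Minimum needed: 8+ GB RAM per node. Both qualify. (The target design later in this analysis specifies 4 GB per VM, which raises the minimum to 10-12 GB; both nodes still qualify.)
  • Optimal: 16 GB. pve04 meets this; pve03 exceeds it.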

Recommendation: Use 16 GB as the standard for 3-node cluster (matches pve04). This is cost-effective and provides ample headroom.


Storage Specifications

| Dimension | pve03 | pve04 | Status |
|---|---|---|---|
| Primary Disk(s) | Unknown (21 loop/dm devices detected) | 1× 238.5 GB NVMe SSD | ⚠️ pve04 transparent |
| Root FS Capacity | 68 GB | 238.5 GB | ⚠️ MISMATCH |
| Root FS Available | 59 GB free | ~230 GB available | ⚠️ pve04 has more room |
| Storage Type | Unknown (likely SATA SSD or array) | NVMe SSD (grade not verified) | |

Interpretation:

  • pve03's storage is opaque: 21 loop and device-mapper devices suggest:
    • Possible RAID configuration (dm-* = device mapper)
    • LVM (Logical Volume Manager) setup
    • Possibly shared storage mounted
    • Current state: ~68 GB LVM volume, 9 GB used
  • pve04's storage is straightforward: Single 238.5 GB NVMe SSD, clean LVM setup, minimal OS footprint

VM Storage Requirements (per node):

  • 1 Manager VM: 32 GB disk (from provisionspec in your playbook)
  • 1 Worker VM: 32 GB disk
  • Total per node: 64 GB guest storage (+ Proxmox root FS)
  • Total available after OS: pve03 ≈ 59 GB, pve04 ≈ 230 GB

⚠️ CRITICAL FINDING: pve03 has insufficient disk capacity for the planned topology (needs 64 GB for VMs + OS buffer = ~80 GB, only has ~59 GB free). Unless pve03 has additional storage mounted (not visible in the scan), it cannot host 2 full 32 GB VMs.
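
The capacity check behind this finding can be sketched as a small shell function. `disk_fits` is a hypothetical helper, and the 16 GB buffer is an assumption inferred from the ~80 GB total quoted above (80 GB minus 64 GB of VM disks).

```shell
# Succeeds (exit status 0) if the free space covers all VM disks plus a buffer.
# All sizes are whole GB.
disk_fits() {
    free_gb=$1
    vm_count=$2
    vm_disk_gb=$3
    buffer_gb=$4
    needed_gb=$(( vm_count * vm_disk_gb + buffer_gb ))
    [ "$free_gb" -ge "$needed_gb" ]
}

# 2 VMs x 32 GB plus an assumed 16 GB buffer, against the free space per node.
disk_fits 59 2 32 16  && echo "pve03: fits" || echo "pve03: insufficient"
disk_fits 230 2 32 16 && echo "pve04: fits" || echo "pve04: insufficient"
```

With the figures from the scan, pve03 fails the check and pve04 passes it.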

Recommendation:

  1. Immediate: Verify pve03's storage architecture. Why 21 dm/loop devices? Is there additional NAS/SAN attached?
  2. For 3rd node procurement: Use pve04 as baseline:
    • 240+ GB NVMe SSD (minimum)
    • Clean, single-drive configuration (KISS principle)
    • Sufficient headroom for VMs + snapshots + log growth

Network Specifications

| Dimension | pve03 | pve04 | Status |
|---|---|---|---|
| Interface Count | 6 interfaces | 4 interfaces | |
| Bridge | vmbr0 + tap devices | vmbr0 visible | Both standard |
| Primary Network | wlp0s20f3 + nic0 | wlp0s20f3 + nic0 | Match (suggest renaming nic0) |

Interpretation:

  • Both nodes expose the same interface names (wlp0s20f3 = wireless, nic0 = Ethernet), which suggests matching network hardware
  • pve03 has 2 tap devices (tap301i0, tap302i0) = VM network interfaces from running VMs
  • pve04 has no tap devices = freshly imaged, no VMs yet
  • Corosync / Proxmox Cluster: Both will use vmbr0 for inter-node communication

Recommendation: Both nodes are network-compatible. No issues for Docker Swarm overlay networking.


Proxmox & Cluster Status

| Dimension | pve03 | pve04 | Status |
|---|---|---|---|
| Proxmox Version | 9.1.6 | 9.1.1 | ⚠️ 5 patch releases apart |
| Kernel | 6.17.2-1-pve | 6.17.2-1-pve | Match |
| OS Distro | Debian trixie | Debian trixie | Match |
| Cluster Status | Clustered (homelab) | Not clustered | |
| Cluster Members | pve01, pve02, pve03 | None yet | |
| VMs Running | 3 VMs/containers | 0 VMs | |
| Uptime | 4 days | ~0 days (fresh) | |

Interpretation:

  • pve03 is an active, production node in the homelab cluster
  • pve04 is a fresh candidate ready for integration
  • Minor version difference (9.1.6 vs 9.1.1) is not a blocker—routine updates will align them

Recommendation: Update both to the latest Proxmox 9.x patch level before final cluster formation.
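
A version check of this kind is easy to script before forming the cluster. The sketch below is a hypothetical helper (`patch_gap`) that compares two version strings passed as arguments. On a real node the version string would need to be extracted first, for example from the output of `pveversion`; that parsing is not shown here.

```shell
# Prints the patch-level difference between two versions that share major.minor,
# or "differs" (and returns 1) when major.minor does not match.
patch_gap() {
    v1=$1
    v2=$2
    if [ "${v1%.*}" != "${v2%.*}" ]; then
        echo "differs"
        return 1
    fi
    p1=${v1##*.}
    p2=${v2##*.}
    if [ "$p1" -ge "$p2" ]; then
        echo $(( p1 - p2 ))
    else
        echo $(( p2 - p1 ))
    fi
}

# Versions reported for pve03 and pve04.
patch_gap 9.1.6 9.1.1
```

For the two nodes in this report the helper prints 5: same 9.1 series, five patch releases apart.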


DOCKER SWARM TOPOLOGY ANALYSIS

Target Design (from documentation/architecture/compute-plane.md)

  • 3× identically-spec'd physical Proxmox nodes
  • 3× Swarm Managers (1 per node, IPs: 10.0.0.211-213)
  • 3× Swarm Workers (1 per node, IPs: 10.0.0.221-223)
  • Each VM: 2 vCPU, 4 GB RAM, 32 GB disk
  • Proxmox cluster with Corosync for HA
  • No overcommit
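
The address plan in this design can be enumerated with a short loop. This sketch assumes the manager and worker addresses are the consecutive ranges 10.0.0.211 to 213 and 10.0.0.221 to 223. `ip_plan` and the `swarm-worker-N` names are hypothetical; only `swarm-manager-1` at 10.0.0.211 appears elsewhere in this analysis.

```shell
# Prints "hostname ip" pairs for COUNT nodes of one role.
# Usage: ip_plan ROLE PREFIX FIRST_HOST_OCTET COUNT
ip_plan() {
    role=$1
    prefix=$2
    first=$3
    count=$4
    i=0
    while [ "$i" -lt "$count" ]; do
        echo "swarm-$role-$(( i + 1 )) $prefix.$(( first + i ))"
        i=$(( i + 1 ))
    done
}

ip_plan manager 10.0.0 211 3
ip_plan worker 10.0.0 221 3
```

The output can be pasted into an inventory draft or compared against the addresses the provisioning playbook actually assigns.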

Capacity Analysis: pve04 as Reference Model

CPU

  • pve04 Spec: 14 cores, 1 socket, 4600 MHz peak
  • Planned Usage: 4 vCPU (2 for manager, 2 for worker) = 28.6% of the 14 physical cores (20% of the 20 logical CPUs)
  • Proxmox/Corosync Overhead: ~1 vCPU
  • Available Headroom: 14 - 4 - 1 = 9 vCPU spare
  • Verdict: EXCELLENT. Can sustain workload + spikes + 2x VM migration
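
The utilization figure depends on which denominator is used. The sketch below (hypothetical helper `pct_used`, integer arithmetic only) computes the share of capacity consumed by the 4 planned vCPUs against both the 14 physical cores and the 20 logical CPUs that pve04 reports.

```shell
# Percentage of capacity used, one decimal place, rounded half up.
pct_used() {
    used=$1
    total=$2
    tenths=$(( (used * 1000 + total / 2) / total ))
    echo "$(( tenths / 10 )).$(( tenths % 10 ))"
}

pct_used 4 14   # against 14 physical cores
pct_used 4 20   # against 20 logical CPUs
```

This prints 28.6 and 20.0. Since vCPUs are scheduled onto logical CPUs, the second figure is arguably the more relevant one; either way the node is lightly loaded.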

Memory (15 GB)

  • Planned Usage: 4 GB (manager) + 4 GB (worker) = 8 GB
  • Proxmox OS + daemons: ~2-3 GB
  • Available Headroom: 15 - 8 - 2.5 = 4.5 GB spare
  • Verdict: ADEQUATE. No aggressive swapping. Supports scheduled workload growth.

Storage (240 GB)

  • Planned Usage: 32 GB (manager) + 32 GB (worker) = 64 GB
  • Proxmox OS: ~8 GB
  • Snapshots/Logs Buffer: ~20 GB
  • Total Planned: ~92 GB
  • Available Headroom: 240 - 92 = 148 GB spare
  • Verdict: EXCELLENT. Ample room for workload scaling, backups, experiments.
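
The storage budget is a simple subtraction, sketched here so it can be re-run when the allocations change. `headroom` is a hypothetical helper. The second call uses 238 GB, the drive's reported capacity rounded down, instead of the nominal 240 GB.

```shell
# Prints the capacity left after subtracting every listed allocation.
# Usage: headroom TOTAL ALLOC...
headroom() {
    remaining=$1
    shift
    for alloc in "$@"; do
        remaining=$(( remaining - alloc ))
    done
    echo "$remaining"
}

# Manager disk, worker disk, Proxmox OS, snapshot/log buffer (GB).
headroom 240 32 32 8 20   # nominal capacity
headroom 238 32 32 8 20   # reported capacity, rounded down
```

The results are 148 GB and 146 GB, so the verdict does not change when the reported capacity is used.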

Network

  • Swarm Overlay: vmbr0 at 1 Gbps
  • Expected inter-node throughput: <100 Mbps for modest swarm (10-20 containers)
  • Verdict: ADEQUATE for Docker Swarm in homelab. Upgrade to 10 Gbps if production-scale or data-intensive AI workloads planned.

High-Availability & Resilience

Quorum Analysis

  • 3 Proxmox Nodes: Corosync quorum = 2/3 nodes required
    • Can tolerate 1 node failure (good)
    • If node1 fails: quorum = nodes 2+3 (still ≥2) → cluster remains operational
  • 3 Swarm Managers: Raft consensus quorum = 2/3 nodes required
    • Can tolerate 1 manager failure (good)
    • If manager1 fails: quorum = managers 2+3 (still ≥2) → swarm remains operational
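
Both Corosync (with one vote per node) and Swarm's Raft managers use a simple majority, so the same arithmetic applies to each layer. A minimal sketch, with hypothetical helper names:

```shell
# Majority quorum for n voting members: floor(n/2) + 1.
quorum_size() {
    echo $(( $1 / 2 + 1 ))
}

# Number of members that can fail while a majority is still reachable.
fault_tolerance() {
    echo $(( $1 - ($1 / 2 + 1) ))
}

for n in 1 2 3 4 5; do
    echo "members=$n quorum=$(quorum_size "$n") tolerates=$(fault_tolerance "$n")"
done
```

For 3 members this gives a quorum of 2 and a tolerance of 1 failure. A 4th member would not raise the tolerance, which is why odd member counts are the usual choice.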

Failure Scenarios

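| Scenario | Outcome | Swarm Impact |
|---|---|---|
| 1 node power fails | Surviving nodes keep quorum; Proxmox HA can restart the VMs on a peer only if their disks are on shared or replicated storage | Swarm reschedules service containers on nodes 2 & 3 |
| 1 node storage corrupt | Proxmox HA can restart VMs on a peer (requires shared or replicated VM storage) | Brief service interruption (~30s) |
| 1 node network partition | Corosync detects; quorum = 2 survivors | Cluster continues; the isolated node is fenced and reboots if HA is enabled |
| 2 nodes fail simultaneously | Quorum lost; cluster non-functional | Workload on the failed nodes is lost; Swarm cannot schedule or recover services until manager quorum is restored |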

Verdict: Design supports N-1 failure tolerance. Very good for homelab.


SPECIAL CONSIDERATIONS FOR pve03

Storage Mystery: 21 Loop/Device-Mapper Devices

Questions to Investigate:

  1. Does pve03 mount external NAS/SAN storage (e.g., Synology 10.0.0.249)?
  2. Is there a RAID or LVM snapshot setup?
  3. Were multiple physical drives originally present, some of which have since failed?

Action Items:

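# From watchtower or pve03:
pvesh get /storage --output-format json   # List all Proxmox storage targets
zfs list                                  # If ZFS in use
lvs                                       # LVM logical volumes
pvdisplay                                 # LVM physical volumes
lsblk -o NAME,TYPE,SIZE,MOUNTPOINT        # Block device tree, including loop and dm devices
losetup --list                            # Backing file for each loop device
df -i                                     # Inode usage per mounted filesystem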

Implication: Until pve03's storage is clarified, it cannot be used as a template for the 3rd identical host.
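
To see what the 21 devices actually are, the device types can be tallied from `lsblk` output. The sketch below defines a hypothetical helper, `count_block_types`, that reads `NAME TYPE` pairs on stdin. The sample input is illustrative only and is not real pve03 output.

```shell
# Reads "NAME TYPE" lines (the format of `lsblk -rno NAME,TYPE`) on stdin
# and prints one "type count" line per device type, sorted by type name.
count_block_types() {
    awk '{ n[$2]++ } END { for (t in n) print t, n[t] }' | sort
}

# On the node itself:
#   lsblk -rno NAME,TYPE | count_block_types
# Illustrative sample input (not taken from pve03):
printf '%s\n' 'nvme0n1 disk' 'nvme0n1p1 part' 'pve-root lvm' 'pve-swap lvm' 'loop0 loop' \
    | count_block_types
```

A high `loop` count with few `disk` entries would point to file-backed devices and away from extra physical drives, which is what the procurement decision needs to know.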


FINAL RECOMMENDATIONS

1. Short-Term (Immediate)

Action: Clarify pve03's storage architecture.

# SSH into pve03 via watchtower relay or direct if SSH key added
ssh root@10.0.0.203 "pvesh get /storage --output-format json"
ssh root@10.0.0.203 "lvs && pvs"
ssh root@10.0.0.203 "zfs list 2>/dev/null || echo 'ZFS not in use'"

If pve03 has external storage:

  • Note the configuration (NAS IP, mount method, capacity)
  • Plan to replicate in 3rd node

If pve03 is just a single drive:

  • Proceed with pve04 as template

2. Medium-Term (Before Final 3-Node Deployment)

Option A: Adopt pve04 as Template (RECOMMENDED)

  • Procurement: 3× machines with Intel i5-13500T, 16 GB RAM, 256 GB NVMe
  • Cost: ~$200-300 per node (retail Core i5 desktop equivalent)
  • Timeline: 1-2 weeks (sourcing)
  • Next step: Install Proxmox 9.x on 3rd node; cluster join

Option B: Backfill pve03 Config to pve04 & 3rd Node

  • Upgrade pve04 RAM from 15 GB → 24 GB (add 1× 8 GB SODIMM)
  • Verify pve03's external storage is documented
  • Replicate in pve04 and 3rd node
  • Cost: ~$30-50 per node (additional RAM)
  • Timeline: 1 week
  • Risk: Depends on clarifying pve03 fully

Recommendation Pick: Option A is cleaner. pve04 is fresher, faster, and has clear config.

3. Long-Term (Post-3-Node Commissioning)

Cluster Formation:

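# On pve04 (assuming it is the first node of the new cluster).
# Note: "homelab" is already the name of the existing pve01-pve03 cluster;
# choose a different name if this is meant to be a separate cluster.
pvecm create homelab

# On the 3rd node (run on the joining node, pointing at an existing member):
pvecm add <pve04_ip_or_hostname>

# Verify:
pvecm status
pvesh get /cluster/status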

VM Provisioning:

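# Use your existing playbook, one run per host. Passing -e target_host=... twice
# in a single run would make the later value override the earlier one.
ansible-playbook -i inventory/hosts.ini \
  playbooks/proxmox/provision_swarm_vms.yml \
  -e target_host=pve04

ansible-playbook -i inventory/hosts.ini \
  playbooks/proxmox/provision_swarm_vms.yml \
  -e target_host=pve0N  # For 3rd node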

Docker Swarm Init:

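# On swarm-manager-1 (e.g., 10.0.0.211):
docker swarm init --advertise-addr 10.0.0.211
docker swarm join-token manager   # Prints the join command with the manager token

# On manager-2 & manager-3:
docker swarm join --token <manager-token> 10.0.0.211:2377

# On each worker, use the token printed by `docker swarm join-token worker` instead.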

APPENDIX: Hardware Specs Collected

pve03 (10.0.0.203) Full Details

CPU:             10 cores, 1 socket, max 2885 MHz
Memory:          23.6 GB total, 12.4 GB free
Storage:         68 GB root LVM (59 GB free) + 21 dm/loop devices (TBD)
OS:              Debian trixie, kernel 6.17.2-1-pve
Proxmox:         9.1.6
Network:         6 interfaces (vmbr0, nic0, wlp0s20f3, tap301i0, tap302i0, lo)
Cluster Status:  Clustered (homelab), 3 VMs running
Uptime:          4 days

pve04 (10.0.0.204) Full Details

CPU:             Intel Core i5-13500T, 14 cores, 1 socket, 20 vCPUs (HT), max 4600 MHz
Memory:          15.0 GB total, ~13.0 GB available, 8.0 GB swap
Storage:         238.5 GB NVMe SSD (nvme0n1), single drive
OS:              Debian trixie, kernel 6.17.2-1-pve
Proxmox:         9.1.1
Network:         4 interfaces (vmbr0, nic0, wlp0s20f3, lo)
Cluster Status:  Not clustered yet, 0 VMs
Uptime:          Fresh (just rebooted)

CONCLUSION

pve04 is the superior choice for replication to a 3-node cluster because of:

  1. CPU performance: 4600 MHz vs 2885 MHz (about 59% higher peak clock)
  2. Storage clarity: Single 240 GB NVMe (vs pve03's mysterious setup)
  3. Ballpark specifications: 15 GB RAM + 240 GB SSD = excellent value for Swarm workloads
  4. Freshness: No legacy config debt

Immediate action: Clarify pve03's storage. Then either adopt pve04 as template or provide additional pve03 context to backfill.

Expected outcome: 3-node Proxmox cluster running 6 Docker Swarm nodes (3 managers, 3 workers) with excellent resilience, performance, and headroom for future growth.