homelab/SWARM_TOPOLOGY_ANALYSIS_20260312.md at d76ec8c9cc2f5a03437e52a56dc7fcb375f6b7a3

nathan/homelab

nathan bcd4688523 renamed folder to make contents clearer

2026-04-12 16:24:56 -04:00

EXECUTIVE SUMMARY

Finding: pve03 and pve04 are NOT identical, with meaningful differences:

pve03: 10 cores, 23.6 GB RAM, unknown storage capacity (already clustered, running 3 VMs)
pve04: 14 cores, 15 GB RAM, 238.5 GB NVMe SSD (fresh, not yet clustered)

Recommendation for "3 identically-spec'd devices":

Option A (Recommended): Use pve04 as the template model. Procurement should source 3× Intel Core i5-13500T machines with 15+ GB RAM and 240+ GB NVMe storage. pve04 is the better baseline (better single-thread performance, dedicated NVMe, fresh OS).
Option B: Keep pve03 as template. Run a deeper audit on pve03's actual storage (it has 21 loop/dm devices—unclear if additional storage is attached). Backfill pve04 and a 3rd host to match pve03's full config.

Verdict: pve04 > pve03 for Swarm baseline. The i5-13500T offers superior CPU performance (4600 MHz boost vs 2885 MHz), dedicated fast storage, and is freshly provisioned. Use pve04 as the reference architecture for the 3rd node.

DETAILED HARDWARE COMPARISON

CPU Specifications

Dimension	pve03	pve04	Status
Model	Unknown / unrecognized	Intel Core i5-13500T	✅ pve04 superior
Architecture	x86_64	x86_64	✅ Match
Socket Count	1	1	✅ Match
Cores per Socket	10	14	⚠️ MISMATCH
Logical CPUs (with HT)	10	20	⚠️ MISMATCH
Max Frequency	2,885 MHz	4,600 MHz	⚠️ pve04 ~59% higher
Min Frequency	Unknown	800 MHz	—
Microcode Level	0x437	0x3a	—

Interpretation:

pve04's i5-13500T is a 13th-gen Intel desktop CPU (2023), significantly newer and faster than pve03
pve03's CPU model was not detected; it could be a degraded/limited processor or a different i5/i7 SKU, so it needs to be identified (e.g., from lscpu or /proc/cpuinfo)
For Docker Swarm workloads: pve04's higher peak clock (4600 MHz) should help latency-sensitive tasks; pve03's 10 cores are still adequate for the planned 2 VMs (manager + worker) per node

Recommendation: If strict "identical" is the mandate, pve04 is the better model to replicate. Purchasing 3× i5-13500T machines ensures:

Consistent single-threaded performance
Known thermal/power envelope
Support (retail CPUs, widely available)

Memory (RAM) Specifications

Dimension	pve03	pve04	Status
Total RAM	23.6 GB	15.0 GB	⚠️ MISMATCH
Free RAM	12.4 GB	13.0 GB	⚠️ Similar free RAM; pve03's extra capacity is already in use
Used (OS + Proxmox + running guests)	~11.2 GB	~2.0 GB	⚠️ pve03 heavier (runs 3 VMs)

Interpretation:

pve03: 23.6 GB total (likely 24 GB installed, e.g. 2× 12 GB or 16 GB + 8 GB SODIMM/UDIMM sticks)
pve04: 15 GB total (likely 1× 16 GB, with about 1 GB reserved by firmware (BIOS/SMM) or integrated graphics)
pve03 is using ~11 GB for the OS and Proxmox daemon + 3 running VMs
pve04 is minimal (fresh install, no VMs)

Validation Against Swarm Requirements:

Each node will host 2 VMs: 1 manager (2 cores, 4 GB RAM) + 1 worker (2 cores, 4 GB RAM), matching the target design
Proxmox overhead: ~2-4 GB per node
Minimum needed: 8 GB for guests + 2-4 GB overhead = 10-12 GB RAM per node ✅ Both qualify
Optimal: 16 GB ✅ pve04 meets this (15 GB usable of a likely 16 GB installed); pve03 exceeds it

Recommendation: Use 16 GB as the standard for 3-node cluster (matches pve04). This is cost-effective and provides ample headroom.

Storage Specifications

Dimension	pve03	pve04	Status
Primary Disk(s)	Unknown (21 loop/dm devices detected)	1× 238.5 GB NVMe SSD	⚠️ pve04 transparent
Root FS Capacity	68 GB	238.5 GB	⚠️ MISMATCH
Root FS Available	59 GB free	~230 GB available	⚠️ pve04 has more room
Storage Type	Unknown (likely SATA SSD or array)	NVMe SSD (nvme0n1)	—

Interpretation:

pve03's storage is opaque: 21 loop and device-mapper devices suggest:
- LVM (Logical Volume Manager) setup: every logical volume, including each guest disk on LVM storage, appears as a dm-* device
- Possible RAID configuration (dm-* = device mapper)
- Possibly shared storage mounted
- Current state: ~68 GB LVM volume, 9 GB used
pve04's storage is straightforward: Single 238.5 GB NVMe SSD, clean LVM setup, minimal OS footprint

VM Storage Requirements (per node):

1 Manager VM: 32 GB disk (from provisionspec in your playbook)
1 Worker VM: 32 GB disk
Total per node: 64 GB guest storage (+ Proxmox root FS)
Total available after OS: pve03 ≈ 59 GB, pve04 ≈ 230 GB
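
The per-node requirement above can be checked against each node's free space with a few lines of shell. This is only a sketch: the function name `fits_on_disk` and the 16 GB buffer for OS growth are illustrative assumptions, while the free-space figures and the 2× 32 GB guest disks come from the scan and the plan.

```shell
# Sketch: does a node's free space cover the planned guest disks plus a buffer?
# Usage: fits_on_disk FREE_GB VM_COUNT VM_DISK_GB BUFFER_GB  (whole gigabytes)
# Prints "ok" or "short by N GB".
fits_on_disk() {
  free_gb=$1; vm_count=$2; vm_disk_gb=$3; buffer_gb=$4
  needed_gb=$(( vm_count * vm_disk_gb + buffer_gb ))
  if [ "$free_gb" -ge "$needed_gb" ]; then
    echo "ok"
  else
    echo "short by $(( needed_gb - free_gb )) GB"
  fi
}

fits_on_disk 59 2 32 16    # pve03: 59 GB free on the root volume
fits_on_disk 230 2 32 16   # pve04: ~230 GB available
```

The buffer is passed as a parameter so the check can be repeated with a different snapshot or log allowance.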

⚠️ CRITICAL FINDING: pve03 has insufficient disk capacity for the planned topology (needs 64 GB for VMs + OS buffer = ~80 GB, only has ~59 GB free). Unless pve03 has additional storage mounted (not visible in the scan), it cannot host 2 full 32 GB VMs.

Recommendation:

Immediate: Verify pve03's storage architecture. Why 21 dm/loop devices? Is there additional NAS/SAN attached?
For 3rd node procurement: Use pve04 as baseline:
- 240+ GB NVMe SSD (minimum)
- Clean, single-drive configuration (KISS principle)
- Sufficient headroom for VMs + snapshots + log growth

Network Specifications

Dimension	pve03	pve04	Status
Interface Count	6 interfaces	4 interfaces	—
Bridge	vmbr0 + tap devices	vmbr0 visible	✅ Both standard
Primary Network	wlp0s20f3 + nic0	wlp0s20f3 + nic0	✅ Match (suggest renaming nic0)

Interpretation:

Both nodes expose the same interface names (wlp0s20f3 = wireless, nic0 = Ethernet), which suggests, but does not confirm, identical network hardware
pve03 has 2 tap devices (tap301i0, tap302i0) = VM network interfaces from running VMs
pve04 has no tap devices = freshly imaged, no VMs yet
Corosync / Proxmox Cluster: Both will use vmbr0 for inter-node communication

Recommendation: Both nodes are network-compatible. No issues for Docker Swarm overlay networking.

Proxmox & Cluster Status

Dimension	pve03	pve04	Status
Proxmox Version	9.1.6	9.1.1	⚠️ pve04 is 5 patch releases behind
Kernel	6.17.2-1-pve	6.17.2-1-pve	✅ Match
OS Distro	Debian trixie	Debian trixie	✅ Match
Cluster Status	✅ Clustered (homelab)	❌ Not clustered	—
Cluster Members	pve01, pve02, pve03	None yet	—
VMs Running	3 VMs/containers	0 VMs	—
Uptime	4 days	~0 days (fresh)	—

Interpretation:

pve03 is an active, production node in the homelab cluster
pve04 is a fresh candidate ready for integration
Minor version difference (9.1.6 vs 9.1.1) is not a blocker—routine updates will align them

Recommendation: Update both to the latest Proxmox 9.x patch level before final cluster formation.
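
A version comparison like the one in the table can be scripted before a node joins the cluster. The sketch below assumes `sort -V` (version sort from GNU coreutils, present on Debian) and uses a hypothetical helper name, `version_lt`; the version strings are the ones recorded above and would on a real node be read from the installed packages.

```shell
# version_lt A B: succeeds (exit 0) when version A is strictly older than B.
# Relies on `sort -V`, which orders 9.1.9 before 9.1.10 as expected.
version_lt() {
  [ "$1" != "$2" ] &&
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

if version_lt 9.1.1 9.1.6; then
  echo "pve04 (9.1.1) is behind pve03 (9.1.6): update before joining"
fi
```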

DOCKER SWARM TOPOLOGY ANALYSIS

Target Design (from documentation/architecture/compute-plane.md)

3× identically-spec'd physical Proxmox nodes
3× Swarm Managers (1 per node, IPs: 10.0.0.211–213)
3× Swarm Workers (1 per node, IPs: 10.0.0.221–223)
Each VM: 2 vCPU, 4 GB RAM, 32 GB disk
Proxmox cluster with Corosync for HA
No overcommit
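
For reference, the six planned VMs and their addresses can be enumerated with a short loop. This is a sketch only: the `swarm-manager-N` / `swarm-worker-N` names follow the naming used in the Swarm init step of this report and are otherwise an assumption, as is the function name `swarm_inventory`.

```shell
# Sketch: list one manager and one worker per physical node,
# using the address ranges 10.0.0.211-213 and 10.0.0.221-223.
swarm_inventory() {
  nodes=${1:-3}
  i=1
  while [ "$i" -le "$nodes" ]; do
    echo "swarm-manager-$i 10.0.0.$(( 210 + i ))"
    echo "swarm-worker-$i 10.0.0.$(( 220 + i ))"
    i=$(( i + 1 ))
  done
}

swarm_inventory 3
```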

Capacity Analysis: pve04 as Reference Model

CPU

pve04 Spec: 14 cores, 1 socket, 4600 MHz peak
Planned Usage: 4 vCPU (2 for manager, 2 for worker) = 28.6% of the 14 physical cores allocated
Proxmox/Corosync Overhead: ~1 vCPU
Available Headroom: 14 - 4 - 1 = 9 vCPU spare
Verdict: ✅ EXCELLENT. Can sustain workload + spikes + 2x VM migration

Memory (15 GB)

Planned Usage: 4 GB (manager) + 4 GB (worker) = 8 GB
Proxmox OS + daemons: ~2–3 GB
Available Headroom: 15 - 8 - 2.5 = 4.5 GB spare
Verdict: ✅ ADEQUATE. No aggressive swapping. Supports scheduled workload growth.

Storage (240 GB)

Planned Usage: 32 GB (manager) + 32 GB (worker) = 64 GB
Proxmox OS: ~8 GB
Snapshots/Logs Buffer: ~20 GB
Total Planned: ~92 GB
Available Headroom: 240 - 92 = 148 GB spare
Verdict: ✅ EXCELLENT. Ample room for workload scaling, backups, experiments.
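
The headroom figures for CPU, memory, and storage all follow the same subtraction, which can be reproduced as below. The helper name `headroom` is hypothetical; `awk` is used only because the memory overhead (2.5 GB) is fractional.

```shell
# headroom TOTAL USED...: prints TOTAL minus the sum of the remaining arguments.
headroom() {
  total=$1; shift
  printf '%s\n' "$@" |
    awk -v total="$total" '{ used += $1 } END { print total - used }'
}

echo "CPU:     $(headroom 14 4 1) cores spare"      # 14 - 4 - 1
echo "Memory:  $(headroom 15 8 2.5) GB spare"       # 15 - 8 - 2.5
echo "Storage: $(headroom 240 64 8 20) GB spare"    # 240 - (64 + 8 + 20)
```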

Network

Swarm Overlay: vmbr0 at 1 Gbps
Expected inter-node throughput: <100 Mbps for modest swarm (10–20 containers)
Verdict: ✅ ADEQUATE for Docker Swarm in homelab. Upgrade to 10 Gbps if production-scale or data-intensive AI workloads planned.

High-Availability & Resilience

Quorum Analysis

3 Proxmox Nodes: Corosync quorum = 2/3 nodes required
- Can tolerate 1 node failure ✅ Good
- If node1 fails: quorum = nodes 2+3 (still ≥2) → cluster remains operational
3 Swarm Managers: Raft consensus quorum = 2/3 nodes required
- Can tolerate 1 manager failure ✅ Good
- If manager1 fails: quorum = managers 2+3 (still ≥2) → swarm remains operational
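
Both Corosync (with one vote per node) and Swarm's Raft managers need a strict majority, so the same arithmetic applies to both layers. A minimal sketch, with hypothetical function names:

```shell
# quorum N: smallest majority of N voting members, floor(N/2) + 1.
quorum() { echo $(( $1 / 2 + 1 )); }

# tolerated_failures N: members that can fail while a majority remains.
tolerated_failures() { echo $(( $1 - ($1 / 2 + 1) )); }

for n in 1 2 3 5; do
  echo "$n members: quorum $(quorum "$n"), tolerates $(tolerated_failures "$n") failure(s)"
done
```

Note that two members tolerate no failures at all, which is why the design uses three nodes and three managers.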

Failure Scenarios

Scenario	Outcome	Swarm Impact
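1 node power fails	VMs on that node go down; peers can restart them only if HA and shared or replicated storage are configured	Swarm reschedules service tasks onto nodes 2 and 3
1 node storage corrupt	Same as above: unless guest disks are replicated or on shared storage, there is no peer copy to restart from	Service tasks are rescheduled on surviving workers; the failed VMs must be rebuilt
1 node network partition	Corosync detects; quorum = 2 survivors	Cluster continues; the isolated node fences itself (reboots) only if HA is active on it
2 nodes fail simultaneously	Quorum lost in both Corosync and Swarm	Tasks already running on the surviving node may continue, but nothing can be scheduled or changed until quorum is restored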

Verdict: Design supports N-1 failure tolerance. Very good for homelab.

SPECIAL CONSIDERATIONS FOR pve03

Storage Mystery: 21 Loop/Device-Mapper Devices

Questions to Investigate:

Does pve03 mount external NAS/SAN storage (e.g., Synology 10.0.0.249)?
Is there a RAID or LVM snapshot setup?
Were multiple physical drives present originally, now failed?

Action Items:

# From watchtower or pve03:
pvesh get /storage --output-format json   # List all Proxmox storage targets
zfs list                                  # If ZFS in use
lvs                                       # LVM volumes
pvdisplay                                 # LVM physical volumes
lsblk                                     # Block device tree (disks, partitions, LVM, loop)
losetup -a                                # Loop devices and their backing files

Implication: Until pve03's storage is clarified, it cannot be used as a template for the 3rd identical host.
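
One way to make sense of the 21 devices is to count them by type. The sketch below reads `NAME TYPE` pairs, the format produced by `lsblk -rno NAME,TYPE`, and tallies each type. The function name `count_block_types` is hypothetical, and the sample input is invented for illustration; it is not pve03's actual output.

```shell
# count_block_types: reads "NAME TYPE" pairs on stdin and prints
# one "TYPE COUNT" line per device type, sorted by type name.
count_block_types() {
  awk '{ count[$2]++ } END { for (t in count) print t, count[t] }' | sort
}

# Illustrative sample only (NOT real data from pve03):
count_block_types <<'EOF'
nvme0n1 disk
nvme0n1p1 part
nvme0n1p2 part
nvme0n1p3 part
pve-swap lvm
pve-root lvm
loop0 loop
EOF
```

On the node itself the same function would be fed with `lsblk -rno NAME,TYPE | count_block_types`. A result dominated by `lvm` entries would point to logical volumes on a single drive rather than additional physical storage.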

FINAL RECOMMENDATIONS

1. Short-Term (Immediate)

Action: Clarify pve03's storage architecture.

# SSH into pve03 via watchtower relay or direct if SSH key added
ssh root@10.0.0.203 "pvesh get /storage --output-format json"
ssh root@10.0.0.203 "lvs && pvs"
ssh root@10.0.0.203 "zfs list 2>/dev/null || echo 'ZFS not in use'"

If pve03 has external storage:

Note the configuration (NAS IP, mount method, capacity)
Plan to replicate in 3rd node

If pve03 is just a single drive:

Proceed with pve04 as template

2. Medium-Term (Before Final 3-Node Deployment)

Option A: Adopt pve04 as Template (RECOMMENDED)

Procurement: 3× machines with Intel i5-13500T, 16 GB RAM, 256 GB NVMe
Cost: ~$200–300 per node (retail Core i5 desktop equivalent)
Timeline: 1–2 weeks (sourcing)
Next step: Install Proxmox 9.x on 3rd node; cluster join

Option B: Backfill pve03 Config to pve04 & 3rd Node

Upgrade pve04 RAM from 15 GB → 24 GB (add 1× 8 GB SODIMM)
Verify pve03's external storage is documented
Replicate in pve04 and 3rd node
Cost: ~$30–50 per node (additional RAM)
Timeline: 1 week
Risk: Depends on clarifying pve03 fully

Recommendation Pick: Option A is cleaner. pve04 is fresher, faster, and has clear config.

3. Long-Term (Post-3-Node Commissioning)

Cluster Formation:

# On pve04 (assuming it is the first node of the new cluster):
pvecm create homelab

# On 3rd new node:
pvecm add <pve04_ip_or_hostname>

# Verify:
pvesh get /cluster/status

VM Provisioning:

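# Use your existing playbook, one run per node
# (repeating -e target_host on one command line keeps only the last value):
for host in pve04 pve0N; do   # pve0N = placeholder for the 3rd node
  ansible-playbook -i inventory/hosts.ini \
    playbooks/proxmox/provision_swarm_vms.yml \
    -e target_host="$host"
done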

Docker Swarm Init:

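# On swarm-manager-1 (e.g., 10.0.0.211):
docker swarm init --advertise-addr 10.0.0.211
docker swarm join-token manager   # Prints the join command, including <manager-token>

# On manager-2 & manager-3:
docker swarm join --token <manager-token> 10.0.0.211:2377

# On the workers, use the token from `docker swarm join-token worker`:
docker swarm join --token <worker-token> 10.0.0.211:2377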
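
After the managers have joined, it can be useful to confirm that a majority is healthy. The sketch below counts managers whose status is `Leader` or `Reachable` from `HOSTNAME STATUS` lines, the shape produced by `docker node ls --format '{{.Hostname}} {{.ManagerStatus}}'` (workers have an empty status). The function name and the sample hostnames are assumptions for illustration.

```shell
# count_reachable_managers: reads "HOSTNAME STATUS" lines on stdin and
# prints how many managers report Leader or Reachable.
count_reachable_managers() {
  awk '$2 == "Leader" || $2 == "Reachable" { n++ } END { print n + 0 }'
}

# Illustrative sample only: one manager down, one worker listed.
count_reachable_managers <<'EOF'
swarm-manager-1 Leader
swarm-manager-2 Reachable
swarm-manager-3 Unreachable
swarm-worker-1
EOF
```

With three managers, a count of 2 or more means the swarm still has quorum.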

APPENDIX: Hardware Specs Collected

pve03 (10.0.0.203) – Full Details

CPU:             10 cores, 1 socket, max 2885 MHz
Memory:          23.6 GB total, 12.4 GB free
Storage:         68 GB root LVM (59 GB free) + 21 dm/loop devices (TBD)
OS:              Debian trixie, kernel 6.17.2-1-pve
Proxmox:         9.1.6
Network:         6 interfaces (vmbr0, nic0, wlp0s20f3, tap301i0, tap302i0, lo)
Cluster Status:  Clustered (homelab), 3 VMs running
Uptime:          4 days

pve04 (10.0.0.204) – Full Details

CPU:             Intel Core i5-13500T, 14 cores, 1 socket, 20 vCPUs (HT), max 4600 MHz
Memory:          15.0 GB total, ~13.0 GB available, 8.0 GB swap
Storage:         238.5 GB NVMe SSD (nvme0n1), single drive
OS:              Debian trixie, kernel 6.17.2-1-pve
Proxmox:         9.1.1
Network:         4 interfaces (vmbr0, nic0, wlp0s20f3, lo)
Cluster Status:  Not clustered yet, 0 VMs
Uptime:          Fresh (just rebooted)

CONCLUSION

pve04 is the superior choice for replication to a 3-node cluster because of:

CPU performance: 4600 MHz vs 2885 MHz peak clock (~59% higher)
Storage clarity: Single 240 GB NVMe (vs pve03's mysterious setup)
Ballpark specifications: 15 GB RAM + 240 GB SSD = excellent value for Swarm workloads
Freshness: No legacy config debt

Immediate action: Clarify pve03's storage. Then either adopt pve04 as template or provide additional pve03 context to backfill.

Expected outcome: 3-node Proxmox cluster running 6 Docker Swarm nodes (3 managers, 3 workers) with excellent resilience, performance, and headroom for future growth.
