nathan e16f98a183 feat(bootstrap)!: introduce unified bootstrap system with modular libraries
BREAKING CHANGE: day0bootstrap.sh deprecated in favor of bootstrap.sh

- Add scripts/bootstrap.sh (488 lines): Unified entrypoint supporting multiple hardware types (Proxmox/Docker VMs/Pi)
- Create scripts/lib/ modular library system:
  - detection.sh: OS/hardware/container detection (362 lines)
  - fingerprint.sh: System fingerprinting and inventory (494 lines)
  - network.sh: IP configuration and VLAN placement (356 lines)
  - proxmox.sh: PVE post-install automation (453 lines)
  - validation.sh: Comprehensive pre-flight checks (510 lines)
- Add validation tools: validate-node.sh, onboarding.sh, pi_init.sh
- Deprecate scripts/day0bootstrap.sh with graceful redirect wrapper
- Document architecture in scripts/README.md (495 lines) and PROXMOX-COMPARISON.md
- Update SOP-002 with new bootstrap workflow
- Add nodes/watchtower/compose.yaml (Raspberry Pi 5 stack)

Migration: Existing day0bootstrap.sh users automatically redirected to new system after 5-second warning. No manual intervention required.

Ref: Infrastructure automation modernization per active-tasks.md
2026-04-12 22:48:19 -04:00

Homelab Bootstrap & Utility Scripts

This directory contains the unified day0 onboarding infrastructure and operational utility scripts for homelab hardware provisioning.

🚀 Quick Start

New Hardware Onboarding (Recommended):

# Auto-detect hardware type and configure
./bootstrap.sh

# Preview changes without applying
./bootstrap.sh --dry-run

# Force specific hardware type
./bootstrap.sh --hardware-type proxmox --target-ip 10.0.0.201

Health Check on Existing Nodes:

# Run comprehensive validation
./validate-node.sh

# JSON output for monitoring
./validate-node.sh --json

📁 Directory Structure

scripts/
├── bootstrap.sh              # ⭐ Unified day0 onboarding (NEW)
├── validate-node.sh          # ⭐ Standalone health check tool (NEW)
├── lib/                      # ⭐ Modular libraries (NEW)
│   ├── detection.sh          #    OS/hardware auto-detection
│   ├── network.sh            #    Network configuration & validation
│   ├── validation.sh         #    Comprehensive health checks
│   ├── fingerprint.sh        #    Hardware inventory collection
│   └── proxmox.sh            #    Proxmox VE post-install (comprehensive)
├── day0bootstrap.sh          # 🔻 DEPRECATED - use bootstrap.sh
├── pi_init.sh                # 🔻 DEPRECATED - use bootstrap.sh
├── onboarding.sh             # 🔻 DEPRECATED - use bootstrap.sh
└── README.md                 # This file

🛠️ Core Tools

bootstrap.sh — Unified Day0 Onboarding

Purpose: Intelligent initialization for all homelab hardware types. Auto-detects the OS and hardware, then applies the appropriate configuration.

Replaces: day0bootstrap.sh, pi_init.sh, onboarding.sh

Features:

  • Auto-detection (Proxmox, Docker VM, Raspberry Pi, physical servers, AI workstations)
  • Static IP configuration via netplan
  • Docker & Ansible installation (with Debian Trixie compatibility)
  • SSH key generation (ED25519 preferred, RSA fallback)
  • Comprehensive validation suite
  • Hardware fingerprinting with YAML/JSON output
  • Ansible inventory auto-discovery

Usage:

./bootstrap.sh [OPTIONS]

Options:
  --help                Show detailed help
  --dry-run             Preview actions without changes
  --hardware-type TYPE  Override auto-detection
                        (proxmox|docker-vm|pi|physical-docker|ai-workstation)
  --skip-network        Skip network configuration
  --skip-validation     Skip post-bootstrap validation
  --output-json         Generate JSON output instead of YAML
  --target-ip IP        Set static IP address (default: auto-assigned)
  --gateway IP          Set gateway IP (default: 10.0.0.1)
  --dns IP              Set DNS server (default: 10.0.0.2)
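
As a rough illustration of how the options above could be parsed, here is a minimal sketch. It is not the actual bootstrap.sh code: the variable names, the parse_args function, and the return code 2 for bad usage are assumptions made for this example, and --help is omitted for brevity. The gateway and DNS defaults come from the option list.

```shell
#!/usr/bin/env bash
# Hypothetical option parser; names are illustrative, not taken from bootstrap.sh.

DRY_RUN=false
HARDWARE_TYPE=""        # empty means "auto-detect"
SKIP_NETWORK=false
SKIP_VALIDATION=false
OUTPUT_FORMAT="yaml"
TARGET_IP=""            # empty means "auto-assigned"
GATEWAY="10.0.0.1"      # documented default
DNS="10.0.0.2"          # documented default

parse_args() {
    while [ "$#" -gt 0 ]; do
        case "$1" in
            --dry-run)         DRY_RUN=true ;;
            --skip-network)    SKIP_NETWORK=true ;;
            --skip-validation) SKIP_VALIDATION=true ;;
            --output-json)     OUTPUT_FORMAT="json" ;;
            --hardware-type|--target-ip|--gateway|--dns)
                # These options need a value after the flag.
                if [ "$#" -lt 2 ]; then
                    echo "Missing value for $1" >&2
                    return 2
                fi
                case "$1" in
                    --hardware-type) HARDWARE_TYPE="$2" ;;
                    --target-ip)     TARGET_IP="$2" ;;
                    --gateway)       GATEWAY="$2" ;;
                    --dns)           DNS="$2" ;;
                esac
                shift
                ;;
            *)
                echo "Unknown option: $1" >&2
                return 2
                ;;
        esac
        shift
    done
}
```

Calling parse_args "$@" at the top of a script would leave unset options at their defaults, so a later step can test for an empty HARDWARE_TYPE to decide whether to auto-detect.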

Examples:

# Auto-detect and configure with defaults
./bootstrap.sh

# Preview what would be done (safe to run)
./bootstrap.sh --dry-run

# Force Proxmox mode with custom IP
./bootstrap.sh --hardware-type proxmox --target-ip 10.0.0.201

# Skip network config (already configured manually)
./bootstrap.sh --skip-network

# Generate JSON for monitoring integration
./bootstrap.sh --output-json

Output Files:

  • ../ansible/archive/outputs/hardware-facts-{hostname}-{timestamp}.yml
  • ../ansible/archive/outputs/bootstrap-validation-{hostname}-{timestamp}.log
  • ../ansible/archive/inventory/discovered-hosts.yml (auto-discovery)

Workflow:

  1. Detection → Identify OS, hardware type, CPU, GPU
  2. Network Config → Apply static IP via netplan (SSH will disconnect)
  3. Package Install → Docker, Ansible, proxmoxer (as needed)
  4. SSH Keys → Generate/verify ED25519 keys
  5. Validation → Comprehensive health checks (disk, memory, network, etc.)
  6. Fingerprinting → Generate hardware inventory YAML

Network Warning: Network configuration will disconnect SSH. Reconnect to new IP after ~10 seconds.
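
One common way to make a workflow like this safe to preview is to route every state-changing command through a small wrapper that only prints it in dry-run mode. The sketch below shows that pattern. The run_cmd helper and the DRY_RUN variable are assumptions for illustration and may not match how bootstrap.sh implements --dry-run.

```shell
#!/usr/bin/env bash
# Hypothetical dry-run wrapper; not taken from bootstrap.sh.

DRY_RUN=true

# Print the command in dry-run mode, otherwise execute it.
run_cmd() {
    if [ "$DRY_RUN" = "true" ]; then
        echo "[dry-run] $*"
    else
        "$@"
    fi
}

# With DRY_RUN=true this only prints the command.
run_cmd netplan apply
```

Because the wrapper receives the command as separate arguments, quoting is preserved when the command is executed for real.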


validate-node.sh — Standalone Health Check Tool

Purpose: Comprehensive validation suite for operational readiness. Can be run post-bootstrap or ad-hoc on any managed node.

Features:

  • Disk space & performance checks
  • Memory & swap validation
  • Network routing & connectivity tests
  • NFS client configuration
  • Docker daemon health
  • Proxmox API accessibility (if applicable)
  • SSH security audit
  • Time synchronization verification
  • JSON output for monitoring integration

Usage:

./validate-node.sh [OPTIONS]

Options:
  --help          Show detailed help
  --json          Output results in JSON format
  --critical-only Show only critical errors
  --verbose       Show detailed output for each check

Examples:

# Run all checks with standard output
./validate-node.sh

# Only show critical errors (useful in scripts)
./validate-node.sh --critical-only

# JSON output for monitoring/alerting
./validate-node.sh --json | jq '.validation.errors'

# Verbose mode with detection summary
./validate-node.sh --verbose

Exit Codes:

  • 0 → All checks passed (or warnings only)
  • 1 → Critical errors found
  • 2 → Invalid usage

Integration Example:

# Use in deployment pipeline
if ./validate-node.sh --critical-only; then
    echo "Node healthy, proceeding with deployment"
else
    echo "Critical issues found, aborting"
    exit 1
fi
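
If a pipeline needs to tell the three documented exit codes apart, a case statement is one option. The sketch below is illustrative only: handle_validation_result is a hypothetical helper name, and the messages are placeholders.

```shell
#!/usr/bin/env bash
# Hypothetical helper that maps a validate-node.sh exit code to a decision.

handle_validation_result() {
    case "$1" in
        0) echo "proceed" ;;
        1) echo "abort: critical errors" ;;
        2) echo "abort: invalid usage" ;;
        *) echo "abort: unexpected exit code $1" ;;
    esac
}

# Intended use (commented out so the sketch has no side effects):
#   ./validate-node.sh --critical-only
#   handle_validation_result "$?"
```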

📚 Library Components

lib/detection.sh — System Detection

Functions:

  • detect_os_family() → debian, ubuntu, raspbian
  • detect_os_version() → trixie, bookworm, noble, etc.
  • detect_hardware_type() → proxmox, docker-vm, pi, physical-docker, ai-workstation
  • detect_cpu_vendor() → intel, amd, arm
  • detect_cpu_generation() → Intel generation number (12, 13, 14, etc.)
  • detect_gpu() → nvidia, amd, intel, none
  • detect_primary_interface() → Network interface name
  • get_current_ip() → Current IPv4 address

Usage:

source lib/detection.sh
HW_TYPE=$(detect_hardware_type)
echo "Hardware type: $HW_TYPE"
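
To give a feel for what a detection function might do internally, here is a minimal sketch that classifies the CPU vendor from /proc/cpuinfo content. It is an assumption about the approach, not the code in lib/detection.sh. The classify_cpu_vendor name and the "unknown" fallback are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical vendor classifier; reads /proc/cpuinfo-style text on stdin.

classify_cpu_vendor() {
    local cpuinfo
    cpuinfo="$(cat)"
    case "$cpuinfo" in
        *GenuineIntel*)      echo "intel" ;;
        *AuthenticAMD*)      echo "amd" ;;
        *"CPU implementer"*) echo "arm" ;;   # ARM kernels report an implementer ID
        *)                   echo "unknown" ;;
    esac
}

# Intended use on a real system:
#   classify_cpu_vendor < /proc/cpuinfo
```

Reading from stdin keeps the function easy to test with canned input instead of the live system.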

lib/network.sh — Network Configuration

Functions:

  • apply_static_ip() → Configure netplan with static IP
  • configure_network_safe() → Apply network changes with reconnection instructions
  • validate_connectivity() → Test gateway, internet, DNS
  • check_nfs_accessibility() → Test NFS server reachability
  • get_desired_vlan_ip() → Determine VLAN IP based on hardware type (future)

Usage:

source lib/network.sh
configure_network_safe "10.0.0.151" "10.0.0.1" "10.0.0.2"
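
The netplan configuration behind a static IP could look roughly like the output of the sketch below. This is an illustration under assumptions, not the contents of lib/network.sh: the render_netplan function, the eth0 interface name, and the /24 prefix are all hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical netplan renderer; prints YAML to stdout and changes nothing.

render_netplan() {
    local iface="$1" ip="$2" gateway="$3" dns="$4" prefix="${5:-24}"
    cat <<EOF
network:
  version: 2
  ethernets:
    ${iface}:
      dhcp4: false
      addresses: [${ip}/${prefix}]
      routes:
        - to: default
          via: ${gateway}
      nameservers:
        addresses: [${dns}]
EOF
}

render_netplan eth0 10.0.0.151 10.0.0.1 10.0.0.2
```

Printing the YAML instead of writing it directly makes it possible to review the result before it is placed under /etc/netplan/ and applied.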

lib/validation.sh — Health Checks

Functions:

  • check_disk_space() → Validate sufficient disk space
  • check_memory() → Validate RAM meets hardware type requirements
  • check_network_routes() → Validate routing configuration
  • check_docker_daemon() → Validate Docker installation and health
  • check_proxmox_api() → Validate Proxmox VE (if applicable)
  • run_validation_suite() → Run all checks and return summary

Severity Levels:

  • Critical → Blocks Ansible playbook execution
  • Warning → Manual review recommended
  • Info → Logged for reference
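
The three severity levels can be tied to the exit codes of validate-node.sh with a pair of counters. The following sketch shows one way to do that. record_result, validation_exit_code, and the counter variables are hypothetical names, not functions from lib/validation.sh.

```shell
#!/usr/bin/env bash
# Hypothetical severity bookkeeping; only critical findings fail the run.

CRITICAL_COUNT=0
WARNING_COUNT=0

record_result() {
    local severity="$1" message="$2"
    case "$severity" in
        critical) CRITICAL_COUNT=$((CRITICAL_COUNT + 1)) ;;
        warning)  WARNING_COUNT=$((WARNING_COUNT + 1)) ;;
        info)     ;;   # logged only
        *)        echo "unknown severity: $severity" >&2; return 2 ;;
    esac
    echo "[$severity] $message"
}

# Returns 1 when at least one critical finding was recorded, else 0.
validation_exit_code() {
    if [ "$CRITICAL_COUNT" -gt 0 ]; then
        return 1
    fi
    return 0
}
```

Note that record_result must be called directly rather than inside a command substitution, otherwise the counters are updated in a subshell and lost.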

lib/fingerprint.sh — Hardware Inventory

Functions:

  • collect_cpu_facts() → CPU model, cores, frequency
  • collect_memory_facts() → RAM and swap details
  • collect_disk_facts() → Storage capacity and type (SSD/HDD)
  • collect_network_facts() → Network interfaces and connectivity
  • generate_hardware_facts_yaml() → Full YAML output
  • generate_ansible_inventory_snippet() → Ansible inventory entry

Output Format (YAML):

hostname:
  os:
    family: "debian"
    version: "bookworm"
  cpu:
    model: "Intel Core i7-12700"
    cores: 12
  memory:
    total_gb: 32
  storage:
    root:
      total_gb: 500
      type: "ssd"
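
As an example of how one section of this YAML might be produced, the sketch below derives memory figures from /proc/meminfo. It is an assumption about the technique rather than the code in lib/fingerprint.sh: memory_facts_yaml is a hypothetical name, and the swap_gb key does not appear in the sample output above.

```shell
#!/usr/bin/env bash
# Hypothetical memory collector; reads /proc/meminfo-style text on stdin.

memory_facts_yaml() {
    awk '
        /^MemTotal:/  { mem = $2 }    # value is in kB
        /^SwapTotal:/ { swap = $2 }   # value is in kB
        END {
            printf "memory:\n"
            printf "  total_gb: %d\n", int(mem / 1048576 + 0.5)
            printf "  swap_gb: %d\n", int(swap / 1048576 + 0.5)
        }
    '
}

# Intended use on a real system:
#   memory_facts_yaml < /proc/meminfo
```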

lib/proxmox.sh — Proxmox VE Post-Install

Comprehensive Proxmox configuration (inspired by community-scripts/ProxmoxVE):

Repository Management:

  • configure_all_pve_repos() → Complete repository configuration
  • disable_pve_enterprise_repo() → Disable subscription-required repository
  • enable_pve_no_subscription_repo() → Enable free updates repository
  • configure_ceph_repos() → Configure Ceph storage repositories
  • add_pvetest_repo() → Add beta/test repository (disabled by default)

Subscription Nag Removal:

  • disable_subscription_nag() → Remove subscription warnings from UI
    • Patches web UI JavaScript (proxmoxlib.js)
    • Patches mobile UI templates
    • Creates apt hook to persist after updates

High Availability Management:

  • disable_ha_services() → Disable HA for single-node setups (saves resources)
  • enable_ha_services() → Enable HA for clustered environments
  • check_ha_status() → Check current HA service status

System Maintenance:

  • update_pve_system() → Run full system update (apt dist-upgrade)
  • check_reboot_required() → Check if reboot needed after updates

Comprehensive Routine:

  • run_proxmox_post_install() → All-in-one post-install configuration
    • Repository fixes (PVE 8.x and 9.x compatible)
    • Subscription nag removal
    • Auto-detect single vs. cluster and adjust HA accordingly
    • System update checks
    • Reboot requirement detection
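
One plausible way to tell a single node from a cluster member is to check for the cluster configuration file that Proxmox VE keeps at /etc/pve/corosync.conf once a node joins a cluster. The sketch below uses that check. It is a guess at the approach, not the logic in lib/proxmox.sh, and detect_pve_mode is a hypothetical name.

```shell
#!/usr/bin/env bash
# Hypothetical single-vs-cluster check based on the presence of corosync.conf.

detect_pve_mode() {
    local corosync_conf="${1:-/etc/pve/corosync.conf}"
    if [ -f "$corosync_conf" ]; then
        echo "cluster"
    else
        echo "single"
    fi
}

# Intended use:
#   [ "$(detect_pve_mode)" = "single" ] && disable_ha_services
```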

Validation:

  • validate_proxmox_config() → Verify Proxmox configuration
  • get_pve_version() → Get Proxmox VE version
  • is_pve_supported_version() → Check if version is supported (8.0-8.9.x, 9.0-9.1.x)
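
The supported range (8.0-8.9.x, 9.0-9.1.x) can be expressed with shell glob patterns. The sketch below is an illustration only. It uses the hypothetical name pve_version_in_supported_range to avoid implying it is the real is_pve_supported_version(), and it assumes the version has already been extracted as a plain string such as 8.2.4.

```shell
#!/usr/bin/env bash
# Hypothetical range check for a plain version string like "8.2.4".

pve_version_in_supported_range() {
    case "$1" in
        8.[0-9]|8.[0-9].*) return 0 ;;   # 8.0 through 8.9.x
        9.[01]|9.[01].*)   return 0 ;;   # 9.0 through 9.1.x
        *)                 return 1 ;;
    esac
}

# Intended use:
#   if pve_version_in_supported_range "$(get_pve_version)"; then ...; fi
```

Because the patterns must match the whole string, a value such as 8.10.1 falls outside the range instead of being mistaken for 8.1.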

Usage:

source lib/proxmox.sh

# Run complete post-install (auto mode)
run_proxmox_post_install "auto"

# Individual functions
disable_subscription_nag
disable_ha_services  # For single-node setups
validate_proxmox_config

What It Does (Compared to Community Script):

  • Repository management (enterprise/no-subscription/Ceph/pvetest)
  • Subscription nag removal (web + mobile UI)
  • HA service management (auto-detect single vs. cluster)
  • Supports PVE 8.x (.list format) and 9.x (.sources deb822 format)
  • System update and reboot detection
  • Automated (non-interactive) for use in bootstrap.sh

Integration: When bootstrap.sh detects Proxmox hardware, it automatically runs run_proxmox_post_install "auto" after package installation, providing the same post-install configuration as the community-scripts/ProxmoxVE script.


---

## 🔻 Deprecated Scripts

**These scripts are deprecated and will be removed in a future release. Use `bootstrap.sh` instead.**

### `day0bootstrap.sh` → `bootstrap.sh --hardware-type pi`
**Original Purpose:** Debian Trixie bootstrap for Raspberry Pi  
**Status:** Wrapper redirects to bootstrap.sh with deprecation warning

### `pi_init.sh` → `bootstrap.sh --hardware-type pi`
**Original Purpose:** Raspberry Pi initialization  
**Status:** Wrapper redirects to bootstrap.sh with deprecation warning

### `onboarding.sh` → `bootstrap.sh`
**Original Purpose:** Generic onboarding + Proxmox integration  
**Status:** Wrapper redirects to bootstrap.sh with deprecation warning

**Migration Path:**
All legacy scripts now display a deprecation warning and automatically redirect to `bootstrap.sh` with appropriate flags. You have 6 months to update your workflows before these wrappers are removed.
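
A redirect wrapper of this kind can be very small. The sketch below shows the general shape under stated assumptions: the function name, the `DEPRECATION_DELAY` and `BOOTSTRAP_CMD` variables, and the warning text are hypothetical, and the 5-second default mirrors the delay mentioned in the commit message rather than code taken from the wrappers themselves.

```shell
#!/usr/bin/env bash
# Hypothetical deprecation wrapper; not the actual content of the legacy scripts.

DEPRECATION_DELAY="${DEPRECATION_DELAY:-5}"
BOOTSTRAP_CMD="${BOOTSTRAP_CMD:-./bootstrap.sh}"

redirect_to_bootstrap() {
    echo "WARNING: this script is deprecated; redirecting to bootstrap.sh in ${DEPRECATION_DELAY}s" >&2
    sleep "$DEPRECATION_DELAY"
    # Forward any extra flags plus the caller's original arguments.
    "$BOOTSTRAP_CMD" "$@"
}

# A pi_init.sh wrapper could end with:
#   redirect_to_bootstrap --hardware-type pi "$@"
```

Sending the warning to stderr keeps it out of any output that callers capture from the redirected script.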

---

## 🎯 Common Workflows

### New Raspberry Pi Control Node
```bash
# 1. Transfer scripts to Pi
scp -r scripts/ chester@10.0.0.200:~/

# 2. SSH to Pi
ssh chester@10.0.0.200

# 3. Run bootstrap
cd scripts
./bootstrap.sh  # Auto-detects Pi hardware

# 4. SSH will disconnect - reconnect to new IP
ssh chester@10.0.0.200
```

New Proxmox Host

# On Proxmox console (pre-SSH setup)
curl -O https://git.castaldifamily.com/nathan/homelab/raw/branch/main/scripts/bootstrap.sh
chmod +x bootstrap.sh
./bootstrap.sh --hardware-type proxmox --target-ip 10.0.0.201

New Docker Swarm VM

# From control node (Watchtower)
scp -r scripts/ chester@10.0.0.211:~/
ssh chester@10.0.0.211
cd scripts
./bootstrap.sh --hardware-type docker-vm --target-ip 10.0.0.211

Health Check All Nodes (from Control Node)

# Create validation script
cat > validate-all.sh <<'EOF'
#!/bin/bash
for host in heimdall waldorf watchtower; do
    echo "=== $host ==="
    ssh "$host" "~/scripts/validate-node.sh --critical-only"
done
EOF

chmod +x validate-all.sh
./validate-all.sh

🧪 Testing & Development

Test Bootstrap in Dry-Run Mode

# Safe to run on production nodes
./bootstrap.sh --dry-run --hardware-type docker-vm

Manually Test Libraries

# Source library and test functions
source lib/detection.sh
print_detection_summary

source lib/validation.sh
run_validation_suite

Validate Script Syntax

# Check for syntax errors
bash -n bootstrap.sh
shellcheck bootstrap.sh lib/*.sh

📖 Documentation

Primary References:

Related Ansible Playbooks:


🐛 Troubleshooting

Bootstrap fails with "could not detect primary interface":

  • Check network cable/Wi-Fi connection
  • Run manually: ip -o link show to see interfaces
  • Override: INTERFACE=eth0 ./bootstrap.sh
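
When checking interfaces by hand, the interface that carries the default route is usually the one of interest. The sketch below extracts it from the output of ip route show default. It is a troubleshooting aid written for this example. parse_default_iface is a hypothetical name and may differ from how detect_primary_interface() works.

```shell
#!/usr/bin/env bash
# Hypothetical helper; reads "ip route show default" output on stdin
# and prints the device name that follows the "dev" keyword.

parse_default_iface() {
    awk '$1 == "default" {
        for (i = 2; i < NF; i++) {
            if ($i == "dev") { print $(i + 1); exit }
        }
    }'
}

# Intended use on a real system:
#   ip route show default | parse_default_iface
```

An empty result means there is no default route yet, which is consistent with the cable or Wi-Fi checks listed above.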

SSH disconnects and doesn't reconnect:

  • Wait 30 seconds and retry
  • Check console access if available
  • Verify IP in router DHCP leases

Validation reports critical errors:

  • Review log: cat ../ansible/archive/outputs/bootstrap-validation-*.log
  • Address critical issues (disk space, memory, etc.)
  • Re-run validation: ./validate-node.sh

Docker installation fails on Debian Trixie:

  • Script automatically uses Bookworm repos as fallback
  • Check logs: journalctl -u docker

Hardware fingerprinting shows "unknown" values:

  • Some VMs may not expose full hardware info
  • Run sudo dmidecode --type system for manual inspection

🔮 Future Enhancements

Planned (Not Yet Implemented):

  • PXE boot + cloud-init provisioning (eliminate manual OS install)
  • Active VLAN migration logic (10.0.0.x → 10.0.10.x/10.0.200.x)
  • Proxmox VM auto-provisioning via API
  • Node retirement/decommissioning automation
  • Integration with Prometheus node exporter metrics
  • Containerized control node for disaster recovery

VLAN Placeholder: The get_desired_vlan_ip() function contains placeholders for future VLAN segmentation. Current implementation uses flat 10.0.0.x network. See comments in lib/network.sh for migration plan.


📝 Version History

  • v4.0.0 (2026-04-12): Initial unified bootstrap release
    • Consolidated 3 legacy scripts into single entry point
    • Added modular library architecture
    • Comprehensive validation and fingerprinting
    • Auto-discovery with Ansible inventory generation

Questions? See documentation/TECHNICAL_RUNBOOK.md or review session memory in /memories/session/plan.md.

scripts

Automation utilities and helper scripts for homelab infrastructure management.


Inventory

| Script | Purpose | Status |
|---|---|---|
| onboarding.sh | Bootstrap Ansible control node for Proxmox management | 🟡 DRAFT - Testing Required |

onboarding.sh

Purpose: Automated setup of Ansible control node for Proxmox infrastructure management.

What it does:

  1. Installs Ansible and Proxmoxer Python library
  2. Detects or generates SSH keypair (ED25519 preferred, RSA fallback)
  3. Copies public key to Proxmox server for passwordless authentication
  4. Generates Ansible inventory file (hosts.ini) with Proxmox connection details

Prerequisites:

  • Debian/Ubuntu-based system (uses apt)
  • Network access to Proxmox server
  • Initial SSH password for target Proxmox server

Configuration: Edit the following variables at the top of the script:

PROXMOX_IP="192.168.1.100"       # Target Proxmox server IP
PROXMOX_USER="root"              # Proxmox SSH user

Usage:

cd ~/dev/homelab/scripts
chmod +x onboarding.sh
./onboarding.sh

Verification:

ansible proxmox_nodes -m ping -i hosts.ini
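
For reference, an inventory that satisfies the verification command above only needs a proxmox_nodes group with the target host in it. The sketch below generates such a file. It illustrates the format and is not the generation code from onboarding.sh. generate_hosts_ini is a hypothetical name, and the use of the ansible_user host variable is an assumption about what the script writes.

```shell
#!/usr/bin/env bash
# Hypothetical inventory generator; prints INI text to stdout.

generate_hosts_ini() {
    local ip="$1" user="$2"
    printf '[proxmox_nodes]\n%s ansible_user=%s\n' "$ip" "$user"
}

# Intended use with the variables from the Configuration section:
#   generate_hosts_ini "$PROXMOX_IP" "$PROXMOX_USER" > hosts.ini
```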

⚠️ Development Status

| Script | Testing Status | Known Issues |
|---|---|---|
| onboarding.sh | Untested in production | • Hardcoded Proxmox IP/user variables<br>• No error handling for failed SSH key copy<br>• Assumes Debian/Ubuntu package manager<br>• No validation of Proxmox connectivity |

DO NOT USE IN PRODUCTION until the following are addressed:

  1. Error Handling: Add validation checks for each step
  2. Idempotency: Verify script can be safely re-run
  3. Multi-OS Support: Test on RHEL/Arch variants or add OS detection
  4. Interactive Mode: Prompt for PROXMOX_IP/USER instead of manual editing
  5. Rollback: Add cleanup mechanism for failed installations

Contributing

When adding new scripts:

  1. Update the Inventory table with script name and purpose
  2. Document prerequisites, configuration, and usage
  3. Mark status as 🟡 DRAFT until production-tested
  4. Add to Development Status table with known issues