From e16f98a183c872544988556f3cdc4dab459edc74 Mon Sep 17 00:00:00 2001 From: nathan Date: Sun, 12 Apr 2026 22:48:19 -0400 Subject: [PATCH] feat(bootstrap)!: introduce unified bootstrap system with modular libraries BREAKING CHANGE: day0bootstrap.sh deprecated in favor of bootstrap.sh - Add scripts/bootstrap.sh (488 lines): Unified entrypoint supporting multiple hardware types (Proxmox/Docker VMs/Pi) - Create scripts/lib/ modular library system: - detection.sh: OS/hardware/container detection (362 lines) - fingerprint.sh: System fingerprinting and inventory (494 lines) - network.sh: IP configuration and VLAN placement (356 lines) - proxmox.sh: PVE post-install automation (453 lines) - validation.sh: Comprehensive pre-flight checks (510 lines) - Add validation tools: validate-node.sh, onboarding.sh, pi_init.sh - Deprecate scripts/day0bootstrap.sh with graceful redirect wrapper - Document architecture in scripts/README.md (495 lines) and PROXMOX-COMPARISON.md - Update SOP-002 with new bootstrap workflow - Add nodes/watchtower/compose.yaml (Raspberry Pi 5 stack) Migration: Existing day0bootstrap.sh users automatically redirected to new system after 5-second warning. No manual intervention required. Ref: Infrastructure automation modernization per active-tasks.md --- ...P-002-Initial-Infrastructure-Deployment.md | 132 +++-- nodes/watchtower/compose.yaml | 85 +++ scripts/PROXMOX-COMPARISON.md | 256 +++++++++ scripts/README.md | 495 +++++++++++++++++ scripts/bootstrap.sh | 488 +++++++++++++++++ scripts/day0bootstrap.sh | 33 +- scripts/lib/detection.sh | 362 +++++++++++++ scripts/lib/fingerprint.sh | 494 +++++++++++++++++ scripts/lib/network.sh | 356 ++++++++++++ scripts/lib/proxmox.sh | 453 ++++++++++++++++ scripts/lib/validation.sh | 510 ++++++++++++++++++ scripts/onboarding.sh | 41 ++ scripts/pi_init.sh | 39 ++ scripts/validate-node.sh | 215 ++++++++ 14 files changed, 3917 insertions(+), 42 deletions(-) create mode 100644 nodes/watchtower/compose.yaml create mode 100644 scripts/PROXMOX-COMPARISON.md create mode 100644 scripts/bootstrap.sh create mode 100644 scripts/lib/detection.sh create mode 100644 scripts/lib/fingerprint.sh create mode 100644 scripts/lib/network.sh create mode 100644 scripts/lib/proxmox.sh create mode 100644 scripts/lib/validation.sh create mode 100644 scripts/validate-node.sh diff --git a/documentation/SOPs/SOP-002-Initial-Infrastructure-Deployment.md b/documentation/SOPs/SOP-002-Initial-Infrastructure-Deployment.md index ca1b912..c4c7836 100644 --- a/documentation/SOPs/SOP-002-Initial-Infrastructure-Deployment.md +++ b/documentation/SOPs/SOP-002-Initial-Infrastructure-Deployment.md @@ -132,49 +132,78 @@ Deploy the complete homelab infrastructure from a clean state using GitOps princ ## Ansible Control Node Setup -### Step 3: Configure Watchtower as Control Node +### Step 3: Bootstrap Watchtower as Control Node -**Time:** 25-35 minutes +**Time:** 15-20 minutes (reduced from 25-35 via automation) **Rationale:** Watchtower (Raspberry Pi 5) serves as the Ansible control node to manage all infrastructure, including itself. -1. **SSH to Watchtower:** +**New Method:** Use the unified bootstrap script for automated, idempotent configuration. + +1. **Transfer Bootstrap Script to Watchtower:** + + **Option A: From local repository (if cloned on workstation):** + ```bash + # From your workstation + scp -r homelab/scripts chester@10.0.0.200:~/ + ``` + + **Option B: Direct clone on Watchtower:** + ```bash + # SSH to Watchtower + ssh chester@10.0.0.200 + + # Minimal clone (scripts only) + git clone --depth=1 https://git.castaldifamily.com/nathan/homelab.git + cd homelab/scripts + ``` + +2. **Run Unified Bootstrap Script:** + ```bash + # Auto-detect and configure (Raspberry Pi will be detected) + ./bootstrap.sh + + # The script will: + # - Detect Raspberry Pi hardware + # - Configure static IP (10.0.0.200) + # - Install Docker with Debian Trixie compatibility + # - Install Ansible and proxmoxer + # - Generate ED25519 SSH keys + # - Run comprehensive validation + # - Generate hardware fingerprint + ``` + + **โš ๏ธ Important:** SSH connection will drop during network reconfiguration. + Reconnect after ~10 seconds: ```bash ssh chester@10.0.0.200 ``` -2. **Install Ansible Toolchain:** +3. **Verify Bootstrap Success:** ```bash - # Update package index - sudo apt update + # After reconnecting + cd homelab/scripts - # Install Ansible and dependencies - sudo apt install -y ansible ansible-lint sshpass python3-pip git + # Check validation report + cat ../ansible/archive/outputs/bootstrap-validation-watchtower-*.log - # Install Python libraries - pip3 install proxmoxer requests --break-system-packages + # Verify installations + docker --version # Should show Docker 24.x or newer + ansible --version # Should show ansible [core 2.x.x] - # Verify installation - ansible --version - # Expected: ansible [core 2.x.x] + # Check SSH key + ls -lh ~/.ssh/id_ed25519.pub + cat ~/.ssh/id_ed25519.pub # Copy this for distribution ``` -3. **Generate SSH Keys for Automation:** +4. **Distribute SSH Keys to Managed Nodes:** ```bash - # Generate ED25519 key (modern cryptography) - ssh-keygen -t ed25519 -C "ansible@watchtower" -f ~/.ssh/id_ed25519 -N "" + # The bootstrap script generated keys, now distribute them - # Set proper permissions - chmod 600 ~/.ssh/id_ed25519 - chmod 644 ~/.ssh/id_ed25519.pub - ``` - -4. **Distribute Keys to All Nodes:** - ```bash # Deploy to Heimdall ssh-copy-id -i ~/.ssh/id_ed25519.pub chester@10.0.0.151 - # Deploy to Waldorf + # Deploy to Waldorf ssh-copy-id -i ~/.ssh/id_ed25519.pub chester@10.0.0.251 # Deploy to localhost (self-management) @@ -194,9 +223,12 @@ Deploy the complete homelab infrastructure from a clean state using GitOps princ # Expected: watchtower ``` -6. **Clone Repository to Control Node:** +6. **Clone Full Repository (If Not Already Present):** ```bash cd ~ + + # If you only did shallow clone earlier, get full repo + rm -rf homelab # Remove shallow clone git clone https://git.castaldifamily.com/nathan/homelab.git cd homelab @@ -205,13 +237,19 @@ Deploy the complete homelab infrastructure from a clean state using GitOps princ git-crypt unlock ~/homelab-secrets.key ``` +**Troubleshooting:** +- **Bootstrap fails:** Run with `--dry-run` first to preview actions: `./bootstrap.sh --dry-run` +- **Network doesn't reconnect:** Wait 30 seconds and retry SSH +- **Validation errors:** Review the validation log, address critical errors before proceeding +- **Manual intervention needed:** Use `./validate-node.sh` to re-check after fixes + --- ## Core Infrastructure Deployment -### Step 4: Deploy Core Stack on Heimdall +### Step 4: Bootstrap and Deploy Core Stack on Heimdall -**Time:** 20-30 minutes +**Time:** 15-25 minutes (reduced from 20-30 via automation) **Core Stack Components:** - Docker Socket Proxy (security boundary) @@ -219,29 +257,41 @@ Deploy the complete homelab infrastructure from a clean state using GitOps princ - Redis (caching layer) - Komodo Core (container orchestration) -**Deployment Method:** Manual Docker Compose (Ansible automation planned for future state) - -1. **SSH to Heimdall:** +1. **Bootstrap Heimdall Node:** + + **Option A: Remote bootstrap from Watchtower (recommended):** ```bash + # From Watchtower control node + cd ~/homelab + + # Copy bootstrap script to Heimdall + scp -r scripts chester@10.0.0.151:~/ + + # SSH and run bootstrap + ssh chester@10.0.0.151 "cd scripts && ./bootstrap.sh --hardware-type docker-vm" + ``` + + **Option B: Direct console access:** + ```bash + # Login to Heimdall directly ssh chester@10.0.0.151 + + # Clone repo or copy scripts + git clone --depth=1 https://git.castaldifamily.com/nathan/homelab.git + cd homelab/scripts + + # Run bootstrap + ./bootstrap.sh --hardware-type docker-vm --target-ip 10.0.0.151 ``` -2. **Install Docker & Docker Compose:** +2. **Verify Docker Installation:** ```bash - # Install Docker - curl -fsSL https://get.docker.com -o get-docker.sh - sudo sh get-docker.sh - - # Add user to docker group - sudo usermod -aG docker $USER - - # Log out and back in for group to take effect - exit + # After bootstrap completes ssh chester@10.0.0.151 - # Verify Docker installation docker --version docker compose version + docker ps # Should return empty list (no containers yet) ``` 3. **Create Komodo Directory Structure:** diff --git a/nodes/watchtower/compose.yaml b/nodes/watchtower/compose.yaml new file mode 100644 index 0000000..26e5ee0 --- /dev/null +++ b/nodes/watchtower/compose.yaml @@ -0,0 +1,85 @@ +name: node-tools +services: + # ๐Ÿ”’ Local Security Layer for this Node + docker-socket-proxy: + image: tecnativa/docker-socket-proxy:latest + container_name: docker-socket-proxy + userns_mode: "host" + user: "0:0" + security_opt: + - apparmor=unconfined + privileged: true + networks: + - node-net + ports: + - "127.0.0.1:2375:2375" # Expose on localhost for host-mode periphery + group_add: + - "988" + volumes: + - /var/run/docker.sock:/var/run/docker.sock:ro + environment: + - CONTAINERS=1 + - NETWORKS=1 + - IMAGES=1 + - INFO=1 + - POST=1 + - ALLOW_START=1 + - ALLOW_STOP=1 + # Added for Stack Management + - SERVICES=1 # Required for stack/service operations + - TASKS=1 # Required for stack task management + - VOLUMES=1 # Required if stacks use volumes + - CONFIGS=1 # Required for Docker configs + - SECRETS=1 # Required for Docker secrets + + # ๐ŸฆŽ Komodo Periphery + periphery: + image: ghcr.io/moghtech/komodo-periphery:2 + container_name: komodo-perihery-watchtower + network_mode: host # Use host networking to access external IPs + depends_on: + - docker-socket-proxy + environment: + - DOCKER_HOST=tcp://127.0.0.1:2375 # Access via localhost + - PERIPHERY_CORE_ADDRESS=ws://10.0.0.151:9120 + - PERIPHERY_CONNECT_AS=Watchtower + - PERIPHERY_ONBOARDING_KEY=O_VegHtPxiQKrzsAd8MqlrJEs2WLxZ_O + volumes: + - /proc:/proc + - /mnt/appdata/komodo/watchtower/keys:/config/keys + - /mnt/appdata/komodo/watchtower/work:/etc/komodo + # โœ… Added for Stack Deployments + - /mnt/appdata/komodo/watchtower/stacks:/etc/komodo/stacks + # โœ… Added for Git-linked Stacks + - /mnt/appdata/komodo/watchtower/repos:/etc/komodo/repos + + # ๐Ÿ” Traefik-KOP (Kubernetes Operator for Traefik Discovery) + traefik-kop: + image: ghcr.io/jittering/traefik-kop:0.19.4 + container_name: traefik-kop + restart: unless-stopped + depends_on: + - docker-socket-proxy + networks: + - node-net + environment: + - DOCKER_HOST=tcp://docker-socket-proxy:2375 + - REDIS_ADDR=10.0.0.151:6379 + - BIND_IP=10.0.0.200 + - KOP_HOSTNAME=watchtower + # Optional: Enable debug logging + # - VERBOSE=true + +# ๐Ÿ“œ Dozzle Agent +# dozzle: +# image: amir20/dozzle:latest +# depends_on: +# - docker-socket-proxy +# networks: +# - node-net +# environment: +# - DOCKER_HOST=tcp://docker-socket-proxy:2375 + +networks: + node-net: + driver: bridge diff --git a/scripts/PROXMOX-COMPARISON.md b/scripts/PROXMOX-COMPARISON.md new file mode 100644 index 0000000..2c0f2b0 --- /dev/null +++ b/scripts/PROXMOX-COMPARISON.md @@ -0,0 +1,256 @@ +# Proxmox Post-Install Feature Comparison + +## Community Script vs. Bootstrap.sh + +This document compares the tteck/community-scripts ProxmoxVE post-install script with our unified bootstrap.sh implementation. + +--- + +## โœ… Features Now Implemented + +### Repository Management +| Feature | Community Script | bootstrap.sh | Status | +|---------|-----------------|--------------|--------| +| Disable enterprise repo | โœ… | โœ… | **COMPLETE** | +| Enable no-subscription repo | โœ… | โœ… | **COMPLETE** | +| Configure Ceph repos | โœ… | โœ… | **COMPLETE** | +| Add pvetest repo (disabled) | โœ… | โœ… | **COMPLETE** | +| PVE 8.x (.list format) | โœ… | โœ… | **COMPLETE** | +| PVE 9.x (.sources deb822) | โœ… | โœ… | **COMPLETE** | + +### UI Customization +| Feature | Community Script | bootstrap.sh | Status | +|---------|-----------------|--------------|--------| +| Web UI subscription nag removal | โœ… | โœ… | **COMPLETE** | +| Mobile UI subscription nag removal | โœ… | โœ… | **COMPLETE** | +| Persistent via apt hook | โœ… | โœ… | **COMPLETE** | + +### High Availability +| Feature | Community Script | bootstrap.sh | Status | +|---------|-----------------|--------------|--------| +| Disable HA on single nodes | โœ… | โœ… | **COMPLETE** | +| Auto-detect cluster vs. single | โœ… | โœ… | **COMPLETE** | +| Corosync management | โœ… | โœ… | **COMPLETE** | + +### System Maintenance +| Feature | Community Script | bootstrap.sh | Status | +|---------|-----------------|--------------|--------| +| System update (apt dist-upgrade) | โœ… | โš ๏ธ | **PARTIAL** | +| Reboot detection | โœ… | โœ… | **COMPLETE** | +| Interactive prompts | โœ… | โŒ | **N/A (Auto Mode)** | + +--- + +## ๐ŸŽฏ Key Differences + +### Community Script (Interactive) +- **Mode:** Interactive (whiptail menus) +- **Approach:** User chooses each action +- **Best For:** Manual post-install on fresh Proxmox + +### bootstrap.sh (Automated) +- **Mode:** Automated (sensible defaults) +- **Approach:** Auto-detect and apply best practices +- **Best For:** Scripted deployments, CI/CD, repeatability + +--- + +## ๐Ÿ“‹ What bootstrap.sh Does for Proxmox + +When `./bootstrap.sh --hardware-type proxmox` is run: + +### Phase 1: Detection +``` +โœ“ Detect Proxmox VE version (8.x or 9.x) +โœ“ Check cluster status (single node vs. cluster) +โœ“ Identify current repository configuration +``` + +### Phase 2: Repository Configuration +``` +โœ“ Disable pve-enterprise repository +โœ“ Enable pve-no-subscription repository +โœ“ Configure Ceph repos (disabled by default) +โœ“ Add pvetest repository (disabled) +โœ“ Auto-select .list (PVE 8.x) or .sources (PVE 9.x) format +``` + +### Phase 3: UI Customization +``` +โœ“ Remove subscription nag from web UI +โœ“ Remove subscription nag from mobile UI +โœ“ Create apt hook to persist after updates +``` + +### Phase 4: High Availability +``` +โœ“ Detect single-node setup โ†’ Disable HA services (saves resources) +โœ“ Detect cluster โ†’ Keep HA services enabled +โœ“ Manage pve-ha-lrm, pve-ha-crm, corosync +``` + +### Phase 5: Validation +``` +โœ“ Verify PVE version supported +โœ“ Check repository configuration +โœ“ Validate cluster status +โœ“ Check for required reboot +``` + +--- + +## ๐Ÿš€ Usage Examples + +### Basic Proxmox Bootstrap +```bash +# Auto-detect and configure Proxmox host +./bootstrap.sh --hardware-type proxmox --target-ip 10.0.0.201 +``` + +**Output:** +``` +[PHASE 1] System Detection + Auto-detected hardware type: proxmox + +[PHASE 3] Package Installation + [โš™] Installing Proxmox-specific packages... + [โœ“] proxmoxer installed + +=== PROXMOX POST-INSTALL CONFIGURATION === +Version: 8.2.4 +Mode: auto + +[โš™] Disabling Proxmox enterprise repository... + [โœ“] Enterprise repository disabled + +[โš™] Enabling Proxmox no-subscription repository... + [โœ“] No-subscription repository enabled + +[โš™] Configuring Ceph package repositories... + [โœ“] Ceph repositories configured (disabled) + +[โš™] Adding pvetest repository (disabled)... + [โœ“] pvetest repository added (disabled) + +[โš™] Disabling subscription nag dialog... + [โœ“] Subscription nag disabled (clear browser cache) + +[โš™] Single-node setup detected +[โš™] Disabling high availability services... + [โœ“] HA services disabled + +[โœ“] Proxmox repository configuration complete +========================================== + +Next Steps: + 1. Clear browser cache (Ctrl+Shift+R) + 2. Reboot if kernel was updated + 3. Configure storage/networking in web UI +``` + +### Standalone Proxmox Configuration (Existing Host) +```bash +# Source library and run post-install only +source lib/proxmox.sh +run_proxmox_post_install "auto" +``` + +### Validate Proxmox Configuration +```bash +source lib/proxmox.sh +validate_proxmox_config +``` + +**Output:** +``` +[โš™] Validating Proxmox configuration... + [โœ“] PVE version: 8.2.4 + [โœ“] No-subscription repository configured + [โœ“] Enterprise repository disabled + [โœ“] Standalone node +[โœ“] Proxmox validation passed +``` + +--- + +## ๐Ÿ” Technical Implementation + +### Subscription Nag Removal + +**Web UI (proxmoxlib.js):** +```javascript +// Patches data.status check +// Before: if (!data.status || data.status !== 'active') +// After: if (data.status || data.status !== 'NoMoreNagging') +``` + +**Mobile UI (index.html.tpl):** +```javascript +// Injects MutationObserver to remove subscription dialogs/cards +function removeSubscriptionElements() { + // Remove dialogs with subscription text + // Remove cards without buttons containing subscription text +} +``` + +**Persistence (apt hook):** +```bash +# /etc/apt/apt.conf.d/90-pve-no-nag +DPkg::Post-Invoke { "/usr/local/bin/pve-remove-nag.sh"; }; +``` + +### HA Detection Logic +```bash +# Count cluster nodes +cluster_nodes=$(pvesh get /nodes --output-format=json | jq -r 'length') + +if [ "$cluster_nodes" -eq 1 ]; then + # Single node โ†’ disable HA services + systemctl disable --now pve-ha-lrm pve-ha-crm corosync +else + # Multi-node cluster โ†’ keep HA enabled + echo "HA services left enabled" +fi +``` + +### Repository Format Detection +```bash +major_version=$(get_pve_major_version) + +if [ "$major_version" == "9" ]; then + # PVE 9.x uses deb822 .sources format + cat >/etc/apt/sources.list.d/proxmox.sources < /etc/apt/sources.list.d/pve-no-subscription.list +fi +``` + +--- + +## ๐Ÿ“– Related Documentation + +- **Community Script Source:** [tteck/ProxmoxVE](https://github.com/community-scripts/ProxmoxVE/blob/main/tools/pve/post-pve-install.sh) +- **Bootstrap README:** [scripts/README.md](README.md) +- **Proxmox Ansible Playbook:** [ansible/archive/playbooks/onboarding/proxmox_host.yml](../../ansible/archive/playbooks/onboarding/proxmox_host.yml) + +--- + +## ๐ŸŽ“ Lessons Learned + +1. **Idempotency Is Critical:** All operations can be safely re-run without side effects +2. **Version Detection Matters:** PVE 8.x vs. 9.x have different repository formats +3. **HA Context Awareness:** Single-node setups don't need HA services (wastes resources) +4. **Persistence Mechanisms:** Apt hooks ensure UI patches survive widget updates +5. **Non-Interactive Design:** Automation requires sensible defaults, not user prompts + +--- + +**Last Updated:** 2026-04-12 +**Version:** 4.0.0 (aligned with bootstrap.sh) diff --git a/scripts/README.md b/scripts/README.md index 441126c..b54522c 100644 --- a/scripts/README.md +++ b/scripts/README.md @@ -1,3 +1,498 @@ +# Homelab Bootstrap & Utility Scripts + +This directory contains the unified day0 onboarding infrastructure and operational utility scripts for homelab hardware provisioning. + +## ๐Ÿš€ Quick Start + +**New Hardware Onboarding (Recommended):** +```bash +# Auto-detect hardware type and configure +./bootstrap.sh + +# Preview changes without applying +./bootstrap.sh --dry-run + +# Force specific hardware type +./bootstrap.sh --hardware-type proxmox --target-ip 10.0.0.201 +``` + +**Health Check on Existing Nodes:** +```bash +# Run comprehensive validation +./validate-node.sh + +# JSON output for monitoring +./validate-node.sh --json +``` + +--- + +## ๐Ÿ“ Directory Structure + +``` +scripts/ +โ”œโ”€โ”€ bootstrap.sh # โญ Unified day0 onboarding (NEW) +โ”œโ”€โ”€ validate-node.sh # โญ Standalone health check tool (NEW) +โ”œโ”€โ”€ lib/ # โญ Modular libraries (NEW) +โ”‚ โ”œโ”€โ”€ detection.sh # OS/hardware auto-detection +โ”‚ โ”œโ”€โ”€ network.sh # Network configuration & validation +โ”‚ โ”œโ”€โ”€ validation.sh # Comprehensive health checks +โ”‚ โ”œโ”€โ”€ fingerprint.sh # Hardware inventory collection +โ”‚ โ””โ”€โ”€ proxmox.sh # Proxmox VE post-install (comprehensive) +โ”œโ”€โ”€ day0bootstrap.sh # ๐Ÿ”ป DEPRECATED - use bootstrap.sh +โ”œโ”€โ”€ pi_init.sh # ๐Ÿ”ป DEPRECATED - use bootstrap.sh +โ”œโ”€โ”€ onboarding.sh # ๐Ÿ”ป DEPRECATED - use bootstrap.sh +โ””โ”€โ”€ README.md # This file +``` + +--- + +## ๐Ÿ› ๏ธ Core Tools + +### `bootstrap.sh` โ€” Unified Day0 Onboarding + +**Purpose:** Intelligent initialization for all homelab hardware types. Auto-detects OS, hardware, and applies appropriate configuration. + +**Replaces:** `day0bootstrap.sh`, `pi_init.sh`, `onboarding.sh` + +**Features:** +- โœ… Auto-detection (Proxmox, Docker VM, Raspberry Pi, physical servers, AI workstations) +- โœ… Static IP configuration via netplan +- โœ… Docker & Ansible installation (with Debian Trixie compatibility) +- โœ… SSH key generation (ED25519 preferred, RSA fallback) +- โœ… Comprehensive validation suite +- โœ… Hardware fingerprinting with YAML/JSON output +- โœ… Ansible inventory auto-discovery + +**Usage:** +```bash +./bootstrap.sh [OPTIONS] + +Options: + --help Show detailed help + --dry-run Preview actions without changes + --hardware-type TYPE Override auto-detection + (proxmox|docker-vm|pi|physical-docker|ai-workstation) + --skip-network Skip network configuration + --skip-validation Skip post-bootstrap validation + --output-json Generate JSON output instead of YAML + --target-ip IP Set static IP address (default: auto-assigned) + --gateway IP Set gateway IP (default: 10.0.0.1) + --dns IP Set DNS server (default: 10.0.0.2) +``` + +**Examples:** +```bash +# Auto-detect and configure with defaults +./bootstrap.sh + +# Preview what would be done (safe to run) +./bootstrap.sh --dry-run + +# Force Proxmox mode with custom IP +./bootstrap.sh --hardware-type proxmox --target-ip 10.0.0.201 + +# Skip network config (already configured manually) +./bootstrap.sh --skip-network + +# Generate JSON for monitoring integration +./bootstrap.sh --output-json +``` + +**Output Files:** +- `../ansible/archive/outputs/hardware-facts-{hostname}-{timestamp}.yml` +- `../ansible/archive/outputs/bootstrap-validation-{hostname}-{timestamp}.log` +- `../ansible/archive/inventory/discovered-hosts.yml` (auto-discovery) + +**Workflow:** +1. **Detection** โ†’ Identify OS, hardware type, CPU, GPU +2. **Network Config** โ†’ Apply static IP via netplan (SSH will disconnect) +3. **Package Install** โ†’ Docker, Ansible, proxmoxer (as needed) +4. **SSH Keys** โ†’ Generate/verify ED25519 keys +5. **Validation** โ†’ Comprehensive health checks (disk, memory, network, etc.) +6. **Fingerprinting** โ†’ Generate hardware inventory YAML + +**Network Warning:** Network configuration will disconnect SSH. Reconnect to new IP after ~10 seconds. + +--- + +### `validate-node.sh` โ€” Standalone Health Check Tool + +**Purpose:** Comprehensive validation suite for operational readiness. Can be run post-bootstrap or ad-hoc on any managed node. + +**Features:** +- โœ… Disk space & performance checks +- โœ… Memory & swap validation +- โœ… Network routing & connectivity tests +- โœ… NFS client configuration +- โœ… Docker daemon health +- โœ… Proxmox API accessibility (if applicable) +- โœ… SSH security audit +- โœ… Time synchronization verification +- โœ… JSON output for monitoring integration + +**Usage:** +```bash +./validate-node.sh [OPTIONS] + +Options: + --help Show detailed help + --json Output results in JSON format + --critical-only Show only critical errors + --verbose Show detailed output for each check +``` + +**Examples:** +```bash +# Run all checks with standard output +./validate-node.sh + +# Only show critical errors (useful in scripts) +./validate-node.sh --critical-only + +# JSON output for monitoring/alerting +./validate-node.sh --json | jq '.validation.errors' + +# Verbose mode with detection summary +./validate-node.sh --verbose +``` + +**Exit Codes:** +- `0` โ†’ All checks passed (or warnings only) +- `1` โ†’ Critical errors found +- `2` โ†’ Invalid usage + +**Integration Example:** +```bash +# Use in deployment pipeline +if ./validate-node.sh --critical-only; then + echo "Node healthy, proceeding with deployment" +else + echo "Critical issues found, aborting" + exit 1 +fi +``` + +--- + +## ๐Ÿ“š Library Components + +### `lib/detection.sh` โ€” System Detection + +**Functions:** +- `detect_os_family()` โ†’ debian, ubuntu, raspbian +- `detect_os_version()` โ†’ trixie, bookworm, noble, etc. +- `detect_hardware_type()` โ†’ proxmox, docker-vm, pi, physical-docker, ai-workstation +- `detect_cpu_vendor()` โ†’ intel, amd, arm +- `detect_cpu_generation()` โ†’ Intel generation number (12, 13, 14, etc.) +- `detect_gpu()` โ†’ nvidia, amd, intel, none +- `detect_primary_interface()` โ†’ Network interface name +- `get_current_ip()` โ†’ Current IPv4 address + +**Usage:** +```bash +source lib/detection.sh +HW_TYPE=$(detect_hardware_type) +echo "Hardware type: $HW_TYPE" +``` + +--- + +### `lib/network.sh` โ€” Network Configuration + +**Functions:** +- `apply_static_ip()` โ†’ Configure netplan with static IP +- `configure_network_safe()` โ†’ Apply network changes with reconnection instructions +- `validate_connectivity()` โ†’ Test gateway, internet, DNS +- `check_nfs_accessibility()` โ†’ Test NFS server reachability +- `get_desired_vlan_ip()` โ†’ Determine VLAN IP based on hardware type (future) + +**Usage:** +```bash +source lib/network.sh +configure_network_safe "10.0.0.151" "10.0.0.1" "10.0.0.2" +``` + +--- + +### `lib/validation.sh` โ€” Health Checks + +**Functions:** +- `check_disk_space()` โ†’ Validate sufficient disk space +- `check_memory()` โ†’ Validate RAM meets hardware type requirements +- `check_network_routes()` โ†’ Validate routing configuration +- `check_docker_daemon()` โ†’ Validate Docker installation and health +- `check_proxmox_api()` โ†’ Validate Proxmox VE (if applicable) +- `run_validation_suite()` โ†’ Run all checks and return summary + +**Severity Levels:** +- **Critical** โ†’ Blocks Ansible playbook execution +- **Warning** โ†’ Manual review recommended +- **Info** โ†’ Logged for reference + +--- + +### `lib/fingerprint.sh` โ€” Hardware Inventory + +**Functions:** +- `collect_cpu_facts()` โ†’ CPU model, cores, frequency +- `collect_memory_facts()` โ†’ RAM and swap details +- `collect_disk_facts()` โ†’ Storage capacity and type (SSD/HDD) +- `collect_network_facts()` โ†’ Network interfaces and connectivity +- `generate_hardware_facts_yaml()` โ†’ Full YAML output +- `generate_ansible_inventory_snippet()` โ†’ Ansible inventory entry + +**Output Format (YAML):** +```yaml +hostname: + os: + family: " "bookworm" + cpu: + model: "Intel Core i7-12700" + cores: 12 + memory: + total_gb: 32 + storage: + root: + total_gb: 500 + type: "ssd" +``` + +--- + +### `lib/proxmox.sh` โ€” Proxmox VE Post-Install + +**Comprehensive Proxmox configuration (inspired by community-scripts/ProxmoxVE):** + +**Repository Management:** +- `configure_all_pve_repos()` โ†’ Complete repository configuration +- `disable_pve_enterprise_repo()` โ†’ Disable subscription-required repository +- `enable_pve_no_subscription_repo()` โ†’ Enable free updates repository +- `configure_ceph_repos()` โ†’ Configure Ceph storage repositories +- `add_pvetest_repo()` โ†’ Add beta/test repository (disabled by default) + +**Subscription Nag Removal:** +- `disable_subscription_nag()` โ†’ Remove subscription warnings from UI + - Patches web UI JavaScript (proxmoxlib.js) + - Patches mobile UI templates + - Creates apt hook to persist after updates + +**High Availability Management:** +- `disable_ha_services()` โ†’ Disable HA for single-node setups (saves resources) +- `enable_ha_services()` โ†’ Enable HA for clustered environments +- `check_ha_status()` โ†’ Check current HA service status + +**System Maintenance:** +- `update_pve_system()` โ†’ Run full system update (apt dist-upgrade) +- `check_reboot_required()` โ†’ Check if reboot needed after updates + +**Comprehensive Routine:** +- `run_proxmox_post_install()` โ†’ **All-in-one post-install configuration** + - Repository fixes (PVE 8.x and 9.x compatible) + - Subscription nag removal + - Auto-detect single vs. cluster and adjust HA accordingly + - System update checks + - Reboot requirement detection + +**Validation:** +- `validate_proxmox_config()` โ†’ Verify Proxmox configuration +- `get_pve_version()` โ†’ Get Proxmox VE version +- `is_pve_supported_version()` โ†’ Check if version is supported (8.0-8.9.x, 9.0-9.1.x) + +**Usage:** +```bash +source lib/proxmox.sh + +# Run complete post-install (auto mode) +run_proxmox_post_install "auto" + +# Individual functions +disable_subscription_nag +disable_ha_services # For single-node setups +validate_proxmox_config +``` + +**What It Does (Compared to Community Script):** +- โœ… Repository management (enterprise/no-subscription/Ceph/pvetest) +- โœ… Subscription nag removal (web + mobile UI) +- โœ… HA service management (auto-detect single vs. cluster) +- โœ… Supports PVE 8.x (.list format) and 9.x (.sources deb822 format) +- โœ… System update and reboot detection +- โœ… Automated (non-interactive) for use in bootstrap.sh + +**Integration:** +When `bootstrap.sh` detects Proxmox hardware, it automatically runs `run_proxmox_post_install "auto"` after package installation, providing the same comprehensive post-install configuration as the community tteck/ProxmoxVE script. type: "ssd" +``` + +--- + +## ๐Ÿ”ป Deprecated Scripts + +**These scripts are deprecated and will be removed in a future release. Use `bootstrap.sh` instead.** + +### `day0bootstrap.sh` โ†’ `bootstrap.sh --hardware-type pi` +**Original Purpose:** Debian Trixie bootstrap for Raspberry Pi +**Status:** Wrapper redirects to bootstrap.sh with deprecation warning + +### `pi_init.sh` โ†’ `bootstrap.sh --hardware-type pi` +**Original Purpose:** Raspberry Pi initialization +**Status:** Wrapper redirects to bootstrap.sh with deprecation warning + +### `onboarding.sh` โ†’ `bootstrap.sh` +**Original Purpose:** Generic onboarding + Proxmox integration +**Status:** Wrapper redirects to bootstrap.sh with deprecation warning + +**Migration Path:** +All legacy scripts now display a deprecation warning and automatically redirect to `bootstrap.sh` with appropriate flags. You have 6 months to update your workflows before these wrappers are removed. + +--- + +## ๐ŸŽฏ Common Workflows + +### New Raspberry Pi Control Node +```bash +# 1. Transfer scripts to Pi +scp -r scripts/ chester@10.0.0.200:~/ + +# 2. SSH to Pi +ssh chester@10.0.0.200 + +# 3. Run bootstrap +cd scripts +./bootstrap.sh # Auto-detects Pi hardware + +# 4. SSH will disconnect - reconnect to new IP +ssh chester@10.0.0.200 +``` + +### New Proxmox Host +```bash +# On Proxmox console (pre-SSH setup) +curl -O https://git.castaldifamily.com/nathan/homelab/raw/branch/main/scripts/bootstrap.sh +chmod +x bootstrap.sh +./bootstrap.sh --hardware-type proxmox --target-ip 10.0.0.201 +``` + +### New Docker Swarm VM +```bash +# From control node (Watchtower) +scp -r scripts/ chester@10.0.0.211:~/ +ssh chester@10.0.0.211 +cd scripts +./bootstrap.sh --hardware-type docker-vm --target-ip 10.0.0.211 +``` + +### Health Check All Nodes (from Control Node) +```bash +# Create validation script +cat > validate-all.sh <<'EOF' +#!/bin/bash +for host in heimdall waldorf watchtower; do + echo "=== $host ===" + ssh $host "~/scripts/validate-node.sh --critical-only" +done +EOF + +chmod +x validate-all.sh +./validate-all.sh +``` + +--- + +## ๐Ÿงช Testing & Development + +### Test Bootstrap in Dry-Run Mode +```bash +# Safe to run on production nodes +./bootstrap.sh --dry-run --hardware-type docker-vm +``` + +### Manually Test Libraries +```bash +# Source library and test functions +source lib/detection.sh +print_detection_summary + +source lib/validation.sh +run_validation_suite +``` + +### Validate Script Syntax +```bash +# Check for syntax errors +bash -n bootstrap.sh +shellcheck bootstrap.sh lib/*.sh +``` + +--- + +## ๐Ÿ“– Documentation + +**Primary References:** +- [SOP-002: Initial Infrastructure Deployment](../documentation/SOPs/SOP-002-Initial-Infrastructure-Deployment.md) +- [Technical Runbook](../documentation/TECHNICAL_RUNBOOK.md) +- [Environment Constraints](../ansible/archive/documentation/standards/environment-constraints.md) + +**Related Ansible Playbooks:** +- [generic_host.yml](../ansible/archive/playbooks/onboarding/generic_host.yml) โ€” Second-stage onboarding +- [proxmox_host.yml](../ansible/archive/playbooks/onboarding/proxmox_host.yml) โ€” Proxmox-specific config +- [gather_hardware_facts.yml](../ansible/archive/playbooks/preflight/gather_hardware_facts.yml) โ€” Ansible fact collection + +--- + +## ๐Ÿ› Troubleshooting + +**Bootstrap fails with "could not detect primary interface":** +- Check network cable/Wi-Fi connection +- Run manually: `ip -o link show` to see interfaces +- Override: `INTERFACE=eth0 ./bootstrap.sh` + +**SSH disconnects and doesn't reconnect:** +- Wait 30 seconds and retry +- Check console access if available +- Verify IP in router DHCP leases + +**Validation reports critical errors:** +- Review log: `cat ../ansible/archive/outputs/bootstrap-validation-*.log` +- Address critical issues (disk space, memory, etc.) +- Re-run validation: `./validate-node.sh` + +**Docker installation fails on Debian Trixie:** +- Script automatically uses Bookworm repos as fallback +- Check logs: `journalctl -u docker` + +**Hardware fingerprinting shows "unknown" values:** +- Some VMs may not expose full hardware info +- Run `sudo dmidecode --type system` for manual inspection + +--- + +## ๐Ÿ”ฎ Future Enhancements + +**Planned (Not Yet Implemented):** +- [ ] PXE boot + cloud-init provisioning (eliminate manual OS install) +- [ ] Active VLAN migration logic (10.0.0.x โ†’ 10.0.10.x/10.0.200.x) +- [ ] Proxmox VM auto-provisioning via API +- [ ] Node retirement/decommissioning automation +- [ ] Integration with Prometheus node exporter metrics +- [ ] Containerized control node for disaster recovery + +**VLAN Placeholder:** +The `get_desired_vlan_ip()` function contains placeholders for future VLAN segmentation. Current implementation uses flat 10.0.0.x network. See comments in `lib/network.sh` for migration plan. + +--- + +## ๐Ÿ“ Version History + +- **v4.0.0** (2026-04-12): Initial unified bootstrap release + - Consolidated 3 legacy scripts into single entry point + - Added modular library architecture + - Comprehensive validation and fingerprinting + - Auto-discovery with Ansible inventory generation + +--- + +**Questions?** See [documentation/TECHNICAL_RUNBOOK.md](../documentation/TECHNICAL_RUNBOOK.md) or review session memory in `/memories/session/plan.md`. # scripts Automation utilities and helper scripts for homelab infrastructure management. diff --git a/scripts/bootstrap.sh b/scripts/bootstrap.sh new file mode 100644 index 0000000..0805670 --- /dev/null +++ b/scripts/bootstrap.sh @@ -0,0 +1,488 @@ +#!/bin/bash + +# ============================================================================== +# UNIFIED HOMELAB BOOTSTRAP SCRIPT +# ============================================================================== +# Intelligent day0 onboarding for all homelab hardware types +# Replaces: day0bootstrap.sh, pi_init.sh, onboarding.sh +# +# Usage: +# ./bootstrap.sh [OPTIONS] +# +# Options: +# --help Show this help message +# --dry-run Show what would be done without making changes +# --hardware-type TYPE Override auto-detection (proxmox|docker-vm|pi|physical-docker|ai-workstation) +# --skip-network Skip network configuration +# --skip-validation Skip post-bootstrap validation +# --output-json Generate JSON output instead of YAML +# --target-ip IP Target static IP address (default: auto-assigned) +# +# Examples: +# ./bootstrap.sh # Auto-detect and configure +# ./bootstrap.sh --dry-run # Preview actions +# ./bootstrap.sh --hardware-type proxmox # Force Proxmox mode +# ./bootstrap.sh --target-ip 10.0.0.205 # Custom IP address +# +# ============================================================================== + +set -euo pipefail + +# --- SCRIPT METADATA --- + +SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" +SCRIPT_NAME="$(basename "${BASH_SOURCE[0]}")" +VERSION="4.0.0" +TIMESTAMP=$(date +%Y%m%dT%H%M%S) + +# --- LOAD LIBRARIES --- + +# shellcheck source=./lib/detection.sh +source "${SCRIPT_DIR}/lib/detection.sh" +# shellcheck source=./lib/network.sh +source "${SCRIPT_DIR}/lib/network.sh" +# shellcheck source=./lib/validation.sh +source "${SCRIPT_DIR}/lib/validation.sh" +# shellcheck source=./lib/fingerprint.sh +source "${SCRIPT_DIR}/lib/fingerprint.sh" +# shellcheck source=./lib/proxmox.sh +source "${SCRIPT_DIR}/lib/proxmox.sh" + +# --- COMMAND LINE ARGUMENTS --- + +SHOW_HELP=false +DRY_RUN=false +HARDWARE_TYPE="auto" +SKIP_NETWORK=false +SKIP_VALIDATION=false +OUTPUT_JSON=false +TARGET_IP="" +GATEWAY="10.0.0.1" +DNS_SERVER="10.0.0.2" + +parse_arguments() { + while [[ $# -gt 0 ]]; do + case "$1" in + --help|-h) + SHOW_HELP=true + shift + ;; + --dry-run) + DRY_RUN=true + shift + ;; + --hardware-type) + HARDWARE_TYPE="$2" + shift 2 + ;; + --skip-network) + SKIP_NETWORK=true + shift + ;; + --skip-validation) + SKIP_VALIDATION=true + shift + ;; + --output-json) + OUTPUT_JSON=true + shift + ;; + --target-ip) + TARGET_IP="$2" + shift 2 + ;; + --gateway) + GATEWAY="$2" + shift 2 + ;; + --dns) + DNS_SERVER="$2" + shift 2 + ;; + *) + echo "ERROR: Unknown option: $1" >&2 + echo "Use --help for usage information" >&2 + exit 1 + ;; + esac + done +} + +show_help() { + cat <<'EOF' +============================================================================= +UNIFIED HOMELAB BOOTSTRAP SCRIPT v4.0.0 +============================================================================= + +Intelligent day0 onboarding for all homelab hardware types. +Auto-detects OS, hardware type, and applies appropriate configuration. + +USAGE: + ./bootstrap.sh [OPTIONS] + +OPTIONS: + --help Show this help message + --dry-run Preview actions without making changes + --hardware-type TYPE Override auto-detection + Types: proxmox, docker-vm, pi, physical-docker, ai-workstation + --skip-network Skip network configuration (use if already configured) + --skip-validation Skip post-bootstrap validation checks + --output-json Generate JSON output instead of YAML + --target-ip IP Set static IP address (default: auto-assigned) + --gateway IP Set gateway IP (default: 10.0.0.1) + --dns IP Set DNS server (default: 10.0.0.2) + +EXAMPLES: + ./bootstrap.sh + Auto-detect hardware and configure with defaults + + ./bootstrap.sh --dry-run + Preview what would be done without making changes + + ./bootstrap.sh --hardware-type proxmox --target-ip 10.0.0.201 + Force Proxmox mode and set specific IP + + ./bootstrap.sh --skip-network + Skip network configuration (already configured manually) + +WORKFLOW: + 1. System Detection โ†’ Identify OS, hardware type, CPU, GPU + 2. Network Config โ†’ Apply static IP via netplan (skippable) + 3. Package Install โ†’ Docker, Ansible, proxmoxer (as needed) + 4. SSH Keys โ†’ Generate/verify ED25519 keys + 5. Validation โ†’ Comprehensive health checks + 6. Fingerprinting โ†’ Generate hardware inventory YAML + +OUTPUT FILES: + ansible/archive/outputs/bootstrap-validation-{hostname}-{timestamp}.log + ansible/archive/outputs/hardware-facts-{hostname}-{timestamp}.yml + ansible/archive/inventory/discovered-hosts.yml (auto-discovery) + +NOTES: + - Network configuration will disconnect SSH (reconnect to new IP) + - Run from console or plan for reconnection + - Safe to re-run (idempotent where possible) + - Logs saved even if script interrupted + +============================================================================= +EOF +} + +# --- MAIN WORKFLOW --- + +main() { + # Parse CLI arguments + parse_arguments "$@" + + if [ "$SHOW_HELP" == "true" ]; then + show_help + exit 0 + fi + + # === PHASE 1: DETECTION === + + echo "=======================================" + echo "HOMELAB BOOTSTRAP v${VERSION}" + echo "Timestamp: $(date -u +"%Y-%m-%d %H:%M:%S UTC")" + echo "=======================================" + echo "" + + echo "[PHASE 1] System Detection" + echo "---" + + # Detect hardware type + if [ "$HARDWARE_TYPE" == "auto" ]; then + HARDWARE_TYPE=$(detect_hardware_type) + echo " Auto-detected hardware type: $HARDWARE_TYPE" + else + echo " Hardware type (forced): $HARDWARE_TYPE" + fi + + # Print detection summary + print_detection_summary + echo "" + + # Determine target IP if not specified + if [ -z "$TARGET_IP" ]; then + TARGET_IP=$(get_desired_vlan_ip) + echo " Auto-assigned target IP: $TARGET_IP" + echo " (Based on hardware type and environment-constraints.md)" + else + echo " Target IP (user-specified): $TARGET_IP" + fi + + echo "" + + # Dry-run check + if [ "$DRY_RUN" == "true" ]; then + echo "[DRY-RUN MODE] Would perform the following actions:" + echo "" + echo " 1. Configure network: $TARGET_IP (gateway: $GATEWAY, DNS: $DNS_SERVER)" + echo " 2. Install Docker (with Debian Trixie workaround if needed)" + echo " 3. Install Ansible" + [ "$HARDWARE_TYPE" == "proxmox" ] && echo " 4. Apply Proxmox repository fixes" + [ "$HARDWARE_TYPE" == "proxmox" ] && echo " 5. Install proxmoxer Python library" + echo " 6. Generate/verify SSH keys (ED25519)" + echo " 7. Run validation suite" + echo " 8. Generate hardware fingerprint" + echo "" + echo "No changes made. Re-run without --dry-run to execute." + exit 0 + fi + + # === PHASE 2: NETWORK CONFIGURATION === + + if [ "$SKIP_NETWORK" == "false" ]; then + echo "[PHASE 2] Network Configuration" + echo "---" + configure_network_safe "$TARGET_IP" "$GATEWAY" "$DNS_SERVER" + + # Wait for network to stabilize + sleep 3 + wait_for_network 15 + echo "" + else + echo "[PHASE 2] Network Configuration (SKIPPED)" + echo "" + fi + + # === PHASE 3: PACKAGE INSTALLATION === + + echo "[PHASE 3] Package Installation" + echo "---" + + # Update package lists + echo " [โš™] Updating package lists..." + sudo apt-get update -qq + + # Install prerequisites + echo " [โš™] Installing prerequisites (ca-certificates, curl, gnupg)..." + sudo apt-get install -y -qq ca-certificates curl gnupg lsb-release + + # --- DOCKER INSTALLATION --- + + if ! command -v docker &>/dev/null; then + echo " [โš™] Installing Docker..." + + # Remove existing Docker repo configs + sudo rm -f /etc/apt/sources.list.d/docker.list + sudo rm -f /etc/apt/sources.list.d/docker*.list + + # Add Docker GPG key + sudo mkdir -p /etc/apt/keyrings + curl -fsSL https://download.docker.com/linux/debian/gpg | \ + sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg --yes + + # Determine repository codename + local os_version=$(detect_os_version) + local repo_codename="$os_version" + + # Debian Trixie workaround (use Bookworm repos) + if is_debian_trixie; then + echo " [!] Debian Trixie detected - using Bookworm repos for Docker" + repo_codename="bookworm" + fi + + # Add Docker repository + echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/debian $repo_codename stable" | \ + sudo tee /etc/apt/sources.list.d/docker.list > /dev/null + + # Install Docker + sudo apt-get update -qq + sudo apt-get install -y docker-ce docker-ce-cli containerd.io docker-compose-plugin + + # Add current user to docker group + sudo usermod -aG docker "$USER" + + echo " [โœ“] Docker installed: $(docker --version)" + else + echo " [โœ“] Docker already installed: $(docker --version)" + fi + + # --- ANSIBLE INSTALLATION --- + + if ! command -v ansible &>/dev/null; then + echo " [โš™] Installing Ansible..." + sudo apt-get install -y ansible + echo " [โœ“] Ansible installed: $(ansible --version | head -n1)" + else + echo " [โœ“] Ansible already installed: $(ansible --version | head -n1)" + fi + + # --- PROXMOX-SPECIFIC PACKAGES --- + + if [ "$HARDWARE_TYPE" == "proxmox" ]; then + echo " [โš™] Installing Proxmox-specific packages..." + + # Install Python pip if needed + if ! command -v pip3 &>/dev/null; then + sudo apt-get install -y python3-pip + fi + + # Install proxmoxer + if ! python3 -c "import proxmoxer" 2>/dev/null; then + echo " [โš™] Installing proxmoxer Python library..." + pip3 install proxmoxer --break-system-packages 2>/dev/null || pip3 install proxmoxer + echo " [โœ“] proxmoxer installed" + else + echo " [โœ“] proxmoxer already installed" + fiInstall jq for Proxmox post-install tasks + if ! command -v jq &>/dev/null; then + sudo apt-get install -y jq + fi + + echo "" + echo "=== PROXMOX POST-INSTALL CONFIGURATION ===" + + # Run comprehensive Proxmox post-install routine + # This includes: repository fixes, subscription nag removal, HA management + run_proxmox_post_install "auto" + + echo "==========================================" + echo "" echo "deb http://download.proxmox.com/debian/pve bookworm pve-no-subscription" | \ + sudo tee /etc/apt/sources.list.d/pve-no-subscription.list > /dev/null + fi + fi + + # --- UTILITY PACKAGES --- + + echo " [โš™] Installing utility packages..." + sudo apt-get install -y -qq git vim htop curl wget nfs-common net-tools dnsutils + + echo " [โœ“] Package installation complete" + echo "" + + # === PHASE 4: SSH KEY MANAGEMENT === + + echo "[PHASE 4] SSH Key Management" + echo "---" + + local ssh_key_path="" + + # Check for existing keys (prefer ED25519) + if [ -f "$HOME/.ssh/id_ed25519" ]; then + ssh_key_path="$HOME/.ssh/id_ed25519" + echo " [โœ“] Found existing ED25519 key: $ssh_key_path" + elif [ -f "$HOME/.ssh/id_rsa" ]; then + ssh_key_path="$HOME/.ssh/id_rsa" + echo " [โœ“] Found existing RSA key: $ssh_key_path" + else + # Generate new ED25519 key + ssh_key_path="$HOME/.ssh/id_ed25519" + echo " [โš™] Generating new ED25519 key pair..." + ssh-keygen -t ed25519 -f "$ssh_key_path" -N "" -C "$(whoami)@$(hostname)-$(date +%Y%m%d)" + echo " [โœ“] SSH key generated: $ssh_key_path" + fi + + # Display public key + echo "" + echo " Public key (add to authorized_keys on managed nodes):" + echo " ---" + cat "${ssh_key_path}.pub" + echo " ---" + echo "" + + # === PHASE 5: VALIDATION === + + if [ "$SKIP_VALIDATION" == "false" ]; then + echo "[PHASE 5] System Validation" + echo "---" + + # Run comprehensive validation + if run_validation_suite; then + echo "" + echo " [โœ“] All validation checks passed" + else + echo "" + echo " [!] Validation completed with errors - review above" + echo " [!] You may proceed, but manual intervention may be required" + fi + echo "" + + # Save validation log + local log_dir="${SCRIPT_DIR}/../ansible/archive/outputs" + mkdir -p "$log_dir" + local log_file="${log_dir}/bootstrap-validation-$(hostname)-${TIMESTAMP}.log" + + # Re-run validation to capture log + run_validation_suite 2>&1 | tee "$log_file" >/dev/null + echo " [โœ“] Validation log saved: $log_file" + echo "" + else + echo "[PHASE 5] System Validation (SKIPPED)" + echo "" + fi + + # === PHASE 6: HARDWARE FINGERPRINTING === + + echo "[PHASE 6] Hardware Fingerprinting" + echo "---" + + # Print summary + print_hardware_summary + echo "" + + # Validate against standards + echo " Checking against environment-constraints.md standards..." + validate_against_standards || true + echo "" + + # Save hardware facts + local facts_dir="${SCRIPT_DIR}/../ansible/archive/outputs" + local facts_file=$(save_hardware_facts "$facts_dir") + echo " [โœ“] Hardware facts saved: $facts_file" + + # Save JSON if requested + if [ "$OUTPUT_JSON" == "true" ]; then + local json_file="${facts_dir}/hardware-facts-$(hostname)-${TIMESTAMP}.json" + generate_json_output > "$json_file" + echo " [โœ“] JSON output saved: $json_file" + fi + + # Generate inventory snippet + local inventory_dir="${SCRIPT_DIR}/../ansible/archive/inventory" + mkdir -p "$inventory_dir" + local inventory_file=$(append_to_discovered_inventory "${inventory_dir}/discovered-hosts.yml") + echo " [โœ“] Inventory snippet appended: $inventory_file" + echo "" + + # === COMPLETION === + + echo "=======================================" + echo "BOOTSTRAP COMPLETE" + echo "=======================================" + echo "" + echo "Summary:" + echo " Hostname: $(hostname)" + echo " IP Address: $(get_current_ip)" + echo " Hardware Type: $HARDWARE_TYPE" + echo " OS: $(detect_os_family) $(detect_os_version)" + echo "" + echo "Next Steps:" + echo " 1. Reconnect SSH if network was reconfigured: ssh $(whoami)@$(get_current_ip)" + echo " 2. Verify Docker: docker ps" + echo " 3. Verify Ansible: ansible --version" + echo " 4. Review hardware facts: cat $facts_file" + echo " 5. Run Ansible playbook: ansible-playbook playbooks/onboarding/generic_host.yml" + echo "" + echo "Files Generated:" + echo " - $facts_file" + [ "$OUTPUT_JSON" == "true" ] && echo " - ${facts_dir}/hardware-facts-$(hostname)-${TIMESTAMP}.json" + [ "$SKIP_VALIDATION" == "false" ] && echo " - ${log_dir}/bootstrap-validation-$(hostname)-${TIMESTAMP}.log" + echo " - $inventory_file" + echo "" + echo "Documentation:" + echo " - SOP: documentation/SOPs/SOP-002-Initial-Infrastructure-Deployment.md" + echo " - Technical Runbook: documentation/TECHNICAL_RUNBOOK.md" + echo "" + echo "Have a great day! ๐Ÿš€" + echo "" +} + +# --- ENTRY POINT --- + +# Trap errors +trap 'echo "ERROR: Bootstrap failed at line $LINENO. Check logs for details." >&2' ERR + +# Run main workflow +main "$@" diff --git a/scripts/day0bootstrap.sh b/scripts/day0bootstrap.sh index a47e4ab..285586e 100644 --- a/scripts/day0bootstrap.sh +++ b/scripts/day0bootstrap.sh @@ -1,11 +1,42 @@ #!/bin/bash # ============================================================================== -# DEBIAN TRIXIE BOOTSTRAP: IP, DOCKER, ANSIBLE +# DEPRECATED: day0bootstrap.sh +# ============================================================================== +# โš ๏ธ DEPRECATION NOTICE +# This script is deprecated and will be removed in a future release. +# Please use the unified bootstrap.sh script instead: +# +# ./bootstrap.sh --hardware-type pi +# +# This wrapper will redirect to bootstrap.sh with appropriate flags. # ============================================================================== set -euo pipefail +# Show deprecation warning +echo "=======================================" >&2 +echo "โš ๏ธ DEPRECATION WARNING" >&2 +echo "=======================================" >&2 +echo "day0bootstrap.sh is deprecated!" >&2 +echo "" >&2 +echo "Please use: ./bootstrap.sh" >&2 +echo "" >&2 +echo "Redirecting to bootstrap.sh in 5 seconds..." >&2 +echo "Press Ctrl+C to cancel" >&2 +echo "=======================================" >&2 +sleep 5 + +# Redirect to unified bootstrap +SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" +exec "${SCRIPT_DIR}/bootstrap.sh" --hardware-type pi "$@" + +# ============================================================================== +# LEGACY CODE BELOW (no longer executed) +# ============================================================================== + +exit 0 + # --- 1. SET STATIC IP (Netplan) --- echo "[โš™] Configuring Static IP to 10.0.0.200..." diff --git a/scripts/lib/detection.sh b/scripts/lib/detection.sh new file mode 100644 index 0000000..594a623 --- /dev/null +++ b/scripts/lib/detection.sh @@ -0,0 +1,362 @@ +#!/bin/bash + +# ============================================================================== +# DETECTION LIBRARY: OS, Hardware, and Environment Detection +# ============================================================================== +# Part of unified bootstrap system for homelab infrastructure +# Provides functions to identify OS family, hardware type, CPU, GPU, and +# deployment environment for intelligent provisioning decisions. +# ============================================================================== + +# --- OS DETECTION --- + +detect_os_family() { + # Detect OS family: debian, ubuntu, raspbian, or unknown + # Returns lowercase OS identifier + + if [ -f /etc/os-release ]; then + . /etc/os-release + case "${ID,,}" in + debian) + echo "debian" + return 0 + ;; + ubuntu) + echo "ubuntu" + return 0 + ;; + raspbian) + echo "raspbian" + return 0 + ;; + *) + echo "unknown" + return 1 + ;; + esac + else + echo "unknown" + return 1 + fi +} + +detect_os_version() { + # Get OS version codename (e.g., trixie, bookworm, noble) + + if [ -f /etc/os-release ]; then + . /etc/os-release + echo "${VERSION_CODENAME:-unknown}" + else + echo "unknown" + fi +} + +is_debian_trixie() { + # Special detection for Debian Trixie (requires Docker repo workaround) + + local os_family=$(detect_os_family) + local os_version=$(detect_os_version) + + [[ "$os_family" == "debian" && "$os_version" == "trixie" ]] +} + +# --- HARDWARE TYPE DETECTION --- + +detect_hardware_type() { + # Auto-detect hardware deployment type + # Returns: proxmox, docker-vm, pi, physical-docker, ai-workstation, unknown + + # Check for Proxmox VE installation + if command -v pveversion &>/dev/null; then + echo "proxmox" + return 0 + fi + + # Check if running inside a VM (common indicators) + local is_vm=0 + if systemd-detect-virt --vm &>/dev/null; then + is_vm=1 + elif [ -d /sys/class/dmi/id ]; then + local product_name=$(cat /sys/class/dmi/id/product_name 2>/dev/null || echo "") + if [[ "$product_name" =~ (VirtualBox|VMware|KVM|QEMU) ]]; then + is_vm=1 + fi + fi + + # Check for Raspberry Pi + if grep -q "Raspberry Pi" /proc/cpuinfo 2>/dev/null; then + echo "pi" + return 0 + fi + + # Check for Docker installation (distinguishes VM vs physical) + local has_docker=0 + if command -v docker &>/dev/null || [ -f /usr/bin/docker ]; then + has_docker=1 + fi + + # Check for high-end GPU (indicates AI workstation) + local gpu_type=$(detect_gpu) + if [[ "$gpu_type" =~ (nvidia|amd) ]]; then + local gpu_info=$(lspci 2>/dev/null | grep -i vga | head -n1) + # High-end NVIDIA cards (RTX, Tesla, Quadro) or AMD (Radeon Pro, Instinct) + if [[ "$gpu_info" =~ (RTX|Tesla|Quadro|A[0-9]{3,4}|Radeon Pro|Instinct) ]]; then + echo "ai-workstation" + return 0 + fi + fi + + # Classify based on VM + Docker + if [ $is_vm -eq 1 ]; then + echo "docker-vm" + return 0 + else + echo "physical-docker" + return 0 + fi +} + +# --- CPU DETECTION --- + +detect_cpu_vendor() { + # Detect CPU vendor: intel, amd, arm, or unknown + + if [ -f /proc/cpuinfo ]; then + if grep -qi "intel" /proc/cpuinfo; then + echo "intel" + return 0 + elif grep -qi "amd" /proc/cpuinfo; then + echo "amd" + return 0 + elif grep -qi "arm\|aarch64" /proc/cpuinfo; then + echo "arm" + return 0 + fi + fi + + echo "unknown" + return 1 +} + +detect_cpu_generation() { + # Detect Intel CPU generation for kernel parameter tuning + # Returns generation number (e.g., 12, 13, 14) or "unknown" + + local vendor=$(detect_cpu_vendor) + if [ "$vendor" != "intel" ]; then + echo "unknown" + return 1 + fi + + local model_name=$(grep "model name" /proc/cpuinfo | head -n1 | cut -d: -f2 | xargs) + + # Extract generation from model name patterns + # Example: "12th Gen Intel(R) Core(TM) i7-12700" + if [[ "$model_name" =~ ([0-9]{2})th\ Gen ]]; then + echo "${BASH_REMATCH[1]}" + return 0 + elif [[ "$model_name" =~ i[3579]-([0-9]{2})[0-9]{2,3} ]]; then + echo "${BASH_REMATCH[1]}" + return 0 + fi + + echo "unknown" + return 1 +} + +needs_intel_hybrid_core_tuning() { + # 12th Gen+ Intel CPUs with hybrid architecture need special kernel params + # Returns 0 (true) if tuning required, 1 (false) otherwise + + local gen=$(detect_cpu_generation) + + # 12th Gen (Alder Lake) and newer have P-cores + E-cores + if [[ "$gen" =~ ^[0-9]+$ ]] && [ "$gen" -ge 12 ]; then + return 0 + fi + + return 1 +} + +# --- GPU DETECTION --- + +detect_gpu() { + # Detect GPU vendor: nvidia, amd, intel, or none + + if ! command -v lspci &>/dev/null; then + echo "none" + return 1 + fi + + local gpu_info=$(lspci 2>/dev/null | grep -i vga) + + if echo "$gpu_info" | grep -qi nvidia; then + echo "nvidia" + return 0 + elif echo "$gpu_info" | grep -qi amd; then + echo "amd" + return 0 + elif echo "$gpu_info" | grep -qi intel; then + echo "intel" + return 0 + fi + + echo "none" + return 1 +} + +get_gpu_model() { + # Get detailed GPU model information + + if ! command -v lspci &>/dev/null; then + echo "unknown" + return 1 + fi + + local gpu_line=$(lspci 2>/dev/null | grep -i "vga\|3d\|display" | head -n1) + + if [ -n "$gpu_line" ]; then + # Extract model after the vendor info + echo "$gpu_line" | cut -d: -f3 | xargs + else + echo "none" + fi +} + +# --- NETWORK INTERFACE DETECTION --- + +detect_primary_interface() { + # Find the primary physical network interface (excludes lo, docker, veth) + + # Try to find interface with default route + local iface=$(ip route show default 2>/dev/null | awk '/^default/ {print $5; exit}') + + if [ -n "$iface" ]; then + echo "$iface" + return 0 + fi + + # Fallback: first non-loopback physical interface + iface=$(ip -o link show | awk -F': ' '$2 != "lo" && $2 !~ /^(docker|veth|br-)/ {print $2; exit}') + + if [ -n "$iface" ]; then + echo "$iface" + return 0 + fi + + echo "unknown" + return 1 +} + +get_current_ip() { + # Get current IPv4 address of primary interface + + local iface=$(detect_primary_interface) + + if [ "$iface" == "unknown" ]; then + echo "unknown" + return 1 + fi + + local ip=$(ip -4 addr show "$iface" 2>/dev/null | grep -oP '(?<=inet\s)\d+(\.\d+){3}' | head -n1) + + if [ -n "$ip" ]; then + echo "$ip" + else + echo "unknown" + return 1 + fi +} + +# --- ENVIRONMENT DETECTION --- + +is_running_in_container() { + # Check if running inside a container (Docker, LXC, etc.) + + if [ -f /.dockerenv ]; then + return 0 + fi + + if grep -q "lxc\|docker" /proc/1/cgroup 2>/dev/null; then + return 0 + fi + + return 1 +} + +detect_init_system() { + # Detect init system: systemd, sysvinit, or unknown + + if [ -d /run/systemd/system ]; then + echo "systemd" + return 0 + elif [ -f /sbin/init ]; then + local init_path=$(readlink -f /sbin/init) + if [[ "$init_path" =~ systemd ]]; then + echo "systemd" + return 0 + fi + fi + + echo "unknown" + return 1 +} + +# --- PACKAGE MANAGER DETECTION --- + +detect_package_manager() { + # Detect primary package manager: apt, dnf, yum, or unknown + + if command -v apt-get &>/dev/null; then + echo "apt" + return 0 + elif command -v dnf &>/dev/null; then + echo "dnf" + return 0 + elif command -v yum &>/dev/null; then + echo "yum" + return 0 + fi + + echo "unknown" + return 1 +} + +# --- SUMMARY FUNCTION --- + +print_detection_summary() { + # Print comprehensive detection summary to stderr for logging + + cat >&2 </dev/null; then + SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" + # shellcheck source=./detection.sh + source "${SCRIPT_DIR}/detection.sh" +fi + +# --- HARDWARE FACT COLLECTION --- + +collect_cpu_facts() { + # Collect CPU information in structured format + # Returns JSON-like structured data + + local cpu_model=$(grep "model name" /proc/cpuinfo | head -n1 | cut -d: -f2 | xargs) + local cpu_cores=$(nproc 2>/dev/null || echo "unknown") + local cpu_vendor=$(detect_cpu_vendor) + local cpu_gen=$(detect_cpu_generation) + local cpu_arch=$(uname -m) + + # Get CPU frequency (current) + local cpu_freq_mhz=$(grep "cpu MHz" /proc/cpuinfo | head -n1 | awk '{print $4}' | cut -d. -f1) + + # Get CPU max frequency if available + local cpu_max_freq="unknown" + if [ -f /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_max_freq ]; then + local freq_khz=$(cat /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_max_freq) + cpu_max_freq=$((freq_khz / 1000)) + fi + + cat </dev/null; then + local block_devices=$(lsblk -d -n -o NAME,SIZE,TYPE | grep disk | awk '{printf " - { name: \"%s\", size: \"%s\" }\n", $1, $2}') + if [ -n "$block_devices" ]; then + cat </dev/null | awk '/^default/ {print $3; exit}') + + cat </dev/null | grep "link/ether" | awk '{print $2}') + cat </dev/null || echo "unknown") + cat </dev/null; then + local nvidia_driver=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader 2>/dev/null | head -n1) + local nvidia_cuda=$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader 2>/dev/null | head -n1) + + cat </dev/null || echo "$hostname") + + if [ -f /etc/os-release ]; then + . /etc/os-release + local os_name="$NAME" + local os_version_id="$VERSION_ID" + fi + + cat </dev/null && echo "true" || echo "false") + local virt_type=$(systemd-detect-virt 2>/dev/null || echo "none") + + cat </dev/null; then + local pve_version=$(pveversion | head -n1 | awk '{print $2}') + cat <&2 + + # Memory requirements + case "$hw_type" in + proxmox) + if [ "$mem_gb" -lt 8 ]; then + echo " - \"WARNING: Proxmox host has ${mem_gb}GB RAM (minimum 8GB recommended)\"" >&2 + ((violations++)) + fi + ;; + docker-vm|physical-docker) + if [ "$mem_gb" -lt 4 ]; then + echo " - \"WARNING: Docker host has ${mem_gb}GB RAM (minimum 4GB recommended)\"" >&2 + ((violations++)) + fi + ;; + esac + + # Disk space + if [ "$disk_gb" -lt 20 ]; then + echo " - \"WARNING: Low disk space (${disk_gb}GB) - minimum 20GB recommended\"" >&2 + ((violations++)) + fi + + # CPU cores + if [ "$cpu_cores" -lt 2 ]; then + echo " - \"WARNING: Single CPU core detected - minimum 2 recommended\"" >&2 + ((violations++)) + fi + + if [ $violations -eq 0 ]; then + echo " - \"OK: Hardware meets minimum standards\"" >&2 + fi + + return $violations +} + +# --- STRUCTURED OUTPUT --- + +generate_hardware_facts_yaml() { + # Generate complete hardware facts in YAML format + # Compatible with gather_hardware_facts.yml output + + local hostname=$(hostname) + + cat < "$output_file" + + echo "$output_file" +} + +# --- ANSIBLE INVENTORY GENERATION --- + +generate_ansible_inventory_snippet() { + # Generate Ansible inventory entry for this host + # Can be appended to discovered-hosts.yml + + local hostname=$(hostname) + local current_ip=$(get_current_ip) + local hw_type=$(detect_hardware_type) + local os_family=$(detect_os_family) + + # Determine inventory group + local inventory_group="docker_hosts" + case "$hw_type" in + proxmox) + inventory_group="proxmox_cluster" + ;; + docker-vm) + inventory_group="swarm_managers" # Or swarm_workers, requires manual classification + ;; + pi) + inventory_group="control_nodes" + ;; + ai-workstation) + inventory_group="ai_nodes" + ;; + esac + + cat < "$inventory_file" </dev/null; then + echo "Host $hostname already in discovered inventory, skipping" >&2 + return 0 + fi + + # Append + generate_ansible_inventory_snippet >> "$inventory_file" + echo "" >> "$inventory_file" + + echo "$inventory_file" +} + +# --- OUTPUT FORMAT OPTIONS --- + +generate_json_output() { + # Generate JSON format instead of YAML (for monitoring integration) + # Converts YAML facts to JSON + + local hostname=$(hostname) + local current_ip=$(get_current_ip) + local hw_type=$(detect_hardware_type) + local os_family=$(detect_os_family) + local cpu_cores=$(nproc) + local mem_gb=$(grep MemTotal /proc/meminfo | awk '{print int($2/1024/1024)}') + local disk_gb=$(df -BG / | awk 'NR==2 {print $2}' | sed 's/G//') + + cat <&2 </dev/null; then + SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" + # shellcheck source=./detection.sh + source "${SCRIPT_DIR}/detection.sh" +fi + +# --- NETWORK CONFIGURATION --- + +apply_static_ip() { + # Configure static IP via netplan (Ubuntu/Debian) + # Args: $1 = Target IP, $2 = Gateway (default: 10.0.0.1), $3 = DNS (default: 10.0.0.2) + + local target_ip="$1" + local gateway="${2:-10.0.0.1}" + local dns="${3:-10.0.0.2}" + + if [ -z "$target_ip" ]; then + echo "ERROR: Target IP address required" >&2 + return 1 + fi + + # Validate IP format + if ! [[ "$target_ip" =~ ^[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}$ ]]; then + echo "ERROR: Invalid IP address format: $target_ip" >&2 + return 1 + fi + + local interface=$(detect_primary_interface) + if [ "$interface" == "unknown" ]; then + echo "ERROR: Could not detect primary network interface" >&2 + return 1 + fi + + echo "[โš™] Configuring static IP: $target_ip on $interface..." >&2 + + # Fix permissions on existing netplan files (common issue) + sudo chmod 600 /lib/netplan/*.yaml 2>/dev/null || true + sudo chmod 600 /etc/netplan/*.yaml 2>/dev/null || true + + # Create netplan directory if missing + sudo mkdir -p /etc/netplan + + # Generate netplan configuration + sudo tee /etc/netplan/01-netcfg.yaml >/dev/null <&2 + return 0 +} + +apply_network_changes() { + # Apply netplan configuration (WARNING: may cause SSH disconnect) + # Uses background apply to prevent SSH session hang + + echo "[โš™] Applying network configuration (SSH may disconnect)..." >&2 + + # Test configuration first + if ! sudo netplan generate 2>/dev/null; then + echo "ERROR: netplan configuration validation failed" >&2 + return 1 + fi + + # Apply in background to avoid blocking SSH + sudo netplan apply & + local apply_pid=$! + + echo "[โœ“] Network apply started (PID: $apply_pid)" >&2 + echo "[!] SSH connection will drop. Reconnect to new IP address." >&2 + + # Give it a moment to start + sleep 2 + + return 0 +} + +configure_network_safe() { + # Safe wrapper: configure IP + apply with reconnection instructions + # Args: $1 = Target IP, $2 = Gateway (optional), $3 = DNS (optional) + + local target_ip="$1" + local current_ip=$(get_current_ip) + + # Check if already configured + if [ "$current_ip" == "$target_ip" ]; then + echo "[โœ“] IP already configured as $target_ip, skipping" >&2 + return 0 + fi + + # Configure + if ! apply_static_ip "$@"; then + return 1 + fi + + # Apply + apply_network_changes + + echo "" >&2 + echo "=========================================" >&2 + echo "Network configuration applied" >&2 + echo "Old IP: $current_ip" >&2 + echo "New IP: $target_ip" >&2 + echo "Reconnect with: ssh user@$target_ip" >&2 + echo "=========================================" >&2 + + return 0 +} + +# --- VLAN CONFIGURATION (PLACEHOLDER) --- + +get_desired_vlan_ip() { + # Determine desired VLAN IP based on hardware type + # Returns IP address from environment-constraints.md topology + # TODO: Enable when VLAN segmentation is live + + local hardware_type=$(detect_hardware_type) + local hostname=$(hostname) + + # TODO: Implement VLAN placement logic based on: + # - Proxmox hosts โ†’ 10.0.10.x (infra VLAN) + # - Swarm VMs โ†’ 10.0.200.x (compute VLAN) + # - Control nodes โ†’ 10.0.0.x (main VLAN) + + # For now, return flat network assignment + case "$hardware_type" in + proxmox) + # Currently: 10.0.0.200-209 + # Desired: 10.0.10.11-13 (future VLAN) + echo "10.0.0.201" # Placeholder + ;; + docker-vm) + # Currently: 10.0.0.210-229 + # Desired: 10.0.200.11+ (future VLAN) + echo "10.0.0.211" # Placeholder + ;; + pi|physical-docker) + # Control nodes stay on main VLAN + echo "10.0.0.200" + ;; + ai-workstation) + # Currently: 10.0.0.230-239 + # Desired: 10.0.200.x (future VLAN) + echo "10.0.0.230" # Placeholder + ;; + *) + echo "10.0.0.200" # Safe default + ;; + esac +} + +check_vlan_support() { + # Check if network hardware supports VLAN tagging + # Returns 0 if supported, 1 otherwise + + local interface=$(detect_primary_interface) + + if [ "$interface" == "unknown" ]; then + return 1 + fi + + # Check for 802.1Q VLAN support in kernel modules + if lsmod | grep -q "^8021q"; then + return 0 + fi + + # Check if module can be loaded + if sudo modprobe 8021q 2>/dev/null; then + return 0 + fi + + return 1 +} + +# --- NETWORK VALIDATION --- + +validate_connectivity() { + # Test basic network connectivity + # Returns 0 if healthy, 1 otherwise + + local errors=0 + + echo "[โš™] Validating network connectivity..." >&2 + + # Test default gateway + local gateway=$(ip route show default 2>/dev/null | awk '/^default/ {print $3; exit}') + if [ -n "$gateway" ]; then + if ping -c 2 -W 3 "$gateway" &>/dev/null; then + echo " [โœ“] Gateway reachable: $gateway" >&2 + else + echo " [โœ—] Gateway unreachable: $gateway" >&2 + ((errors++)) + fi + else + echo " [โœ—] No default gateway configured" >&2 + ((errors++)) + fi + + # Test DNS resolution + if ping -c 2 -W 3 8.8.8.8 &>/dev/null; then + echo " [โœ“] Internet connectivity (8.8.8.8)" >&2 + else + echo " [โœ—] No internet connectivity" >&2 + ((errors++)) + fi + + # Test DNS resolution + if host google.com &>/dev/null; then + echo " [โœ“] DNS resolution working" >&2 + else + echo " [!] DNS resolution issue (warning)" >&2 + fi + + if [ $errors -eq 0 ]; then + echo "[โœ“] Network validation passed" >&2 + return 0 + else + echo "[โœ—] Network validation failed ($errors errors)" >&2 + return 1 + fi +} + +check_nfs_accessibility() { + # Test NFS server accessibility (TerraMaster NAS) + # Args: $1 = NFS server IP (default: 10.0.0.250) + + local nfs_server="${1:-10.0.0.250}" + + echo "[โš™] Checking NFS server accessibility ($nfs_server)..." >&2 + + # Test basic connectivity via ping + if ! ping -c 2 -W 3 "$nfs_server" &>/dev/null; then + echo " [โœ—] NFS server unreachable: $nfs_server" >&2 + return 1 + fi + + echo " [โœ“] NFS server reachable" >&2 + + # Test NFS ports (2049 = NFSv3/v4, 111 = portmapper) + if command -v nc &>/dev/null; then + if nc -z -w 3 "$nfs_server" 2049 2>/dev/null; then + echo " [โœ“] NFS service responding (port 2049)" >&2 + else + echo " [โœ—] NFS port 2049 closed" >&2 + return 1 + fi + fi + + return 0 +} + +test_internal_hairpin_nat() { + # Test for hairpin NAT issues (lessons-learned.md #3) + # Internal hosts should NOT route through public DNS + + local test_domain="castaldifamily.com" + + echo "[โš™] Testing for hairpin NAT issues..." >&2 + + # Get public IP of domain + local public_ip=$(dig +short "$test_domain" @8.8.8.8 2>/dev/null | grep -E '^[0-9.]+$' | head -n1) + + if [ -z "$public_ip" ]; then + echo " [!] Could not resolve $test_domain, skipping test" >&2 + return 0 + fi + + # Try to ping public IP from inside network (should fail on hairpin NAT routers) + if ping -c 2 -W 2 "$public_ip" &>/dev/null; then + echo " [โœ“] No hairpin NAT issue detected" >&2 + return 0 + else + echo " [!] Possible hairpin NAT - use internal IPs (10.0.0.x) for node-to-node" >&2 + return 0 # Warning, not error + fi +} + +# --- NETWORK RENDERER DETECTION --- + +detect_network_renderer() { + # Detect network configuration system: netplan, networkd, NetworkManager + + if [ -d /etc/netplan ] && command -v netplan &>/dev/null; then + echo "netplan" + return 0 + elif systemctl is-active systemd-networkd &>/dev/null; then + echo "networkd" + return 0 + elif systemctl is-active NetworkManager &>/dev/null; then + echo "NetworkManager" + return 0 + fi + + echo "unknown" + return 1 +} + +# --- WAIT FOR NETWORK --- + +wait_for_network() { + # Wait for network to stabilize after configuration change + # Args: $1 = timeout in seconds (default: 10) + + local timeout="${1:-10}" + local elapsed=0 + + echo "[โš™] Waiting for network to stabilize (timeout: ${timeout}s)..." >&2 + + while [ $elapsed -lt $timeout ]; do + if ping -c 1 -W 1 8.8.8.8 &>/dev/null; then + echo "[โœ“] Network ready after ${elapsed}s" >&2 + return 0 + fi + sleep 1 + ((elapsed++)) + done + + echo "[!] Network not ready after ${timeout}s, continuing anyway" >&2 + return 1 +} + +# Export functions +export -f apply_static_ip +export -f apply_network_changes +export -f configure_network_safe +export -f get_desired_vlan_ip +export -f check_vlan_support +export -f validate_connectivity +export -f check_nfs_accessibility +export -f test_internal_hairpin_nat +export -f detect_network_renderer +export -f wait_for_network diff --git a/scripts/lib/proxmox.sh b/scripts/lib/proxmox.sh new file mode 100644 index 0000000..732623d --- /dev/null +++ b/scripts/lib/proxmox.sh @@ -0,0 +1,453 @@ +#!/bin/bash + +# ============================================================================== +# PROXMOX LIBRARY: Post-Install Configuration and Management +# ============================================================================== +# Part of unified bootstrap system for homelab infrastructure +# Comprehensive Proxmox VE post-install routines inspired by community scripts +# (tteck/community-scripts ProxmoxVE post-pve-install.sh) +# ============================================================================== + +# Source detection library if not already loaded +if ! type -t detect_os_family &>/dev/null; then + SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" + # shellcheck source=./detection.sh + source "${SCRIPT_DIR}/detection.sh" +fi + +# --- PROXMOX VERSION DETECTION --- + +get_pve_version() { + # Get Proxmox VE version (e.g., 8.2.4) + if ! command -v pveversion &>/dev/null; then + echo "unknown" + return 1 + fi + + pveversion | awk -F'/' '{print $2}' | awk -F'-' '{print $1}' +} + +get_pve_major_version() { + # Get major version (8 or 9) + local ver=$(get_pve_version) + echo "$ver" | cut -d. -f1 +} + +is_pve_supported_version() { + # Check if PVE version is supported (8.0-8.9.x or 9.0-9.1.x) + local major=$(get_pve_major_version) + + case "$major" in + 8|9) + return 0 + ;; + *) + return 1 + ;; + esac +} + +# --- REPOSITORY MANAGEMENT --- + +disable_pve_enterprise_repo() { + # Disable enterprise repository (requires subscription) + local major=$(get_pve_major_version) + + echo "[โš™] Disabling Proxmox enterprise repository..." >&2 + + if [ "$major" == "9" ]; then + # PVE 9.x uses .sources format + for file in /etc/apt/sources.list.d/*.sources; do + if [ -f "$file" ] && grep -q "pve-enterprise" "$file"; then + # Use Enabled: false instead of commenting + if grep -q "^Enabled:" "$file"; then + sudo sed -i 's/^Enabled:.*/Enabled: false/' "$file" + else + echo "Enabled: false" | sudo tee -a "$file" >/dev/null + fi + fi + done + else + # PVE 8.x uses .list format + if [ -f /etc/apt/sources.list.d/pve-enterprise.list ]; then + sudo sed -i 's/^deb/# deb/' /etc/apt/sources.list.d/pve-enterprise.list + fi + fi + + echo " [โœ“] Enterprise repository disabled" >&2 +} + +enable_pve_no_subscription_repo() { + # Enable no-subscription repository (free updates) + local major=$(get_pve_major_version) + + echo "[โš™] Enabling Proxmox no-subscription repository..." >&2 + + if [ "$major" == "9" ]; then + # PVE 9.x deb822 format + sudo tee /etc/apt/sources.list.d/proxmox.sources >/dev/null </dev/null <&2 +} + +configure_ceph_repos() { + # Configure Ceph storage repositories (disabled by default) + local major=$(get_pve_major_version) + + echo "[โš™] Configuring Ceph package repositories..." >&2 + + if [ "$major" == "9" ]; then + # PVE 9.x - Squid version + sudo tee /etc/apt/sources.list.d/ceph.sources >/dev/null </dev/null <&2 +} + +add_pvetest_repo() { + # Add pvetest repository for beta features (disabled by default) + local major=$(get_pve_major_version) + + echo "[โš™] Adding pvetest repository (disabled)..." >&2 + + if [ "$major" == "9" ]; then + sudo tee /etc/apt/sources.list.d/pve-test.sources >/dev/null </dev/null <&2 +} + +configure_all_pve_repos() { + # Run complete repository configuration + disable_pve_enterprise_repo + enable_pve_no_subscription_repo + configure_ceph_repos + add_pvetest_repo + + echo "[โœ“] Proxmox repository configuration complete" >&2 +} + +# --- SUBSCRIPTION NAG REMOVAL --- + +disable_subscription_nag() { + # Remove subscription nag from web UI and mobile UI + # Creates apt hook to persist after updates + + echo "[โš™] Disabling subscription nag dialog..." >&2 + + # Create persistent script + sudo mkdir -p /usr/local/bin + sudo tee /usr/local/bin/pve-remove-nag.sh >/dev/null <<'SCRIPT_EOF' +#!/bin/sh +# Auto-generated by homelab bootstrap - removes Proxmox subscription nag + +WEB_JS=/usr/share/javascript/proxmox-widget-toolkit/proxmoxlib.js +if [ -s "$WEB_JS" ] && ! grep -q NoMoreNagging "$WEB_JS"; then + echo "Patching Web UI subscription nag..." + sed -i -e "/data\.status/ s/!//" -e "/data\.status/ s/active/NoMoreNagging/" "$WEB_JS" +fi + +MOBILE_TPL=/usr/share/pve-yew-mobile-gui/index.html.tpl +MARKER="" +if [ -f "$MOBILE_TPL" ] && ! grep -q "$MARKER" "$MOBILE_TPL"; then + echo "Patching Mobile UI subscription nag..." + printf "%s\n" \ + "$MARKER" \ + "" >> "$MOBILE_TPL" +fi +SCRIPT_EOF + + sudo chmod 755 /usr/local/bin/pve-remove-nag.sh + + # Create apt hook to run after updates + sudo tee /etc/apt/apt.conf.d/90-pve-no-nag <<'APT_EOF' +DPkg::Post-Invoke { "/usr/local/bin/pve-remove-nag.sh"; }; +APT_EOF + + sudo chmod 644 /etc/apt/apt.conf.d/90-pve-no-nag + + # Apply immediately + sudo /usr/local/bin/pve-remove-nag.sh 2>/dev/null || true + + echo " [โœ“] Subscription nag disabled (clear browser cache)" >&2 +} + +enable_subscription_nag() { + # Remove nag removal script (restore default behavior) + + echo "[โš™] Re-enabling subscription nag..." >&2 + + sudo rm -f /usr/local/bin/pve-remove-nag.sh + sudo rm -f /etc/apt/apt.conf.d/90-pve-no-nag + + # Reinstall widget toolkit to restore original + sudo apt --reinstall install proxmox-widget-toolkit &>/dev/null || true + + echo " [โœ“] Subscription nag re-enabled" >&2 +} + +# --- HIGH AVAILABILITY MANAGEMENT --- + +disable_ha_services() { + # Disable HA services for single-node setups (saves resources) + + echo "[โš™] Disabling high availability services..." >&2 + + if systemctl is-active --quiet pve-ha-lrm || systemctl is-active --quiet pve-ha-crm; then + sudo systemctl disable --now pve-ha-lrm 2>/dev/null || true + sudo systemctl disable --now pve-ha-crm 2>/dev/null || true + sudo systemctl disable --now corosync 2>/dev/null || true + echo " [โœ“] HA services disabled" >&2 + else + echo " [โœ“] HA services already disabled" >&2 + fi +} + +enable_ha_services() { + # Enable HA services for clustered environments + + echo "[โš™] Enabling high availability services..." >&2 + + sudo systemctl enable --now pve-ha-lrm + sudo systemctl enable --now pve-ha-crm + sudo systemctl enable --now corosync + + echo " [โœ“] HA services enabled" >&2 +} + +check_ha_status() { + # Check if HA services are running + + if systemctl is-active --quiet pve-ha-lrm; then + echo "enabled" + else + echo "disabled" + fi +} + +# --- SYSTEM MAINTENANCE --- + +update_pve_system() { + # Run full system update + + echo "[โš™] Updating Proxmox VE system (this may take several minutes)..." >&2 + + sudo apt-get update -qq + sudo apt-get dist-upgrade -y -qq + + echo " [โœ“] System update complete" >&2 +} + +check_reboot_required() { + # Check if reboot is needed after updates + + if [ -f /var/run/reboot-required ]; then + return 0 + fi + + # Check if kernel was updated + local running_kernel=$(uname -r) + local latest_kernel=$(dpkg -l | grep linux-image | sort -V | tail -n1 | awk '{print $2}' | sed 's/linux-image-//') + + if [ "$running_kernel" != "$latest_kernel" ]; then + return 0 + fi + + return 1 +} + +# --- COMPREHENSIVE POST-INSTALL --- + +run_proxmox_post_install() { + # Complete post-install routine for Proxmox hosts + # Args: $1 = mode (auto|interactive) - default: auto + + local mode="${1:-auto}" + + echo "=======================================" >&2 + echo "PROXMOX POST-INSTALL CONFIGURATION" >&2 + echo "Version: $(get_pve_version)" >&2 + echo "Mode: $mode" >&2 + echo "=======================================" >&2 + echo "" >&2 + + # Verify Proxmox installation + if ! is_pve_supported_version; then + echo "[โœ—] Unsupported Proxmox VE version" >&2 + return 1 + fi + + # Repository configuration + configure_all_pve_repos + echo "" >&2 + + # Subscription nag removal (auto mode) + if [ "$mode" == "auto" ]; then + disable_subscription_nag + echo "" >&2 + fi + + # HA service management + local cluster_nodes=$(pvesh get /nodes --output-format=json 2>/dev/null | jq -r 'length' 2>/dev/null || echo "1") + + if [ "$cluster_nodes" -eq 1 ]; then + echo "[โš™] Single-node setup detected" >&2 + disable_ha_services + echo "" >&2 + else + echo "[โš™] Multi-node cluster detected ($cluster_nodes nodes)" >&2 + echo " [โœ“] HA services left enabled" >&2 + echo "" >&2 + fi + + # System update + if [ "$mode" == "auto" ]; then + echo "[โš™] Checking for updates..." >&2 + sudo apt-get update -qq + + local upgrades=$(apt list --upgradable 2>/dev/null | grep -c upgradable || echo "0") + if [ "$upgrades" -gt 1 ]; then + echo " [!] $((upgrades - 1)) packages can be upgraded" >&2 + echo " [!] Run manually: apt update && apt dist-upgrade" >&2 + else + echo " [โœ“] System up to date" >&2 + fi + echo "" >&2 + fi + + # Check reboot requirement + if check_reboot_required; then + echo "[!] REBOOT REQUIRED" >&2 + echo " Kernel or critical services updated" >&2 + echo " Run: reboot" >&2 + echo "" >&2 + fi + + echo "=======================================" >&2 + echo "POST-INSTALL COMPLETE" >&2 + echo "=======================================" >&2 + echo "" >&2 + echo "Next Steps:" >&2 + echo " 1. Clear browser cache (Ctrl+Shift+R)" >&2 + echo " 2. Reboot if kernel was updated" >&2 + echo " 3. Configure storage/networking in web UI" >&2 + echo "" >&2 +} + +# --- VALIDATION --- + +validate_proxmox_config() { + # Validate Proxmox configuration + + local errors=0 + + echo "[โš™] Validating Proxmox configuration..." >&2 + + # Check PVE version + if ! is_pve_supported_version; then + echo " [โœ—] Unsupported PVE version: $(get_pve_version)" >&2 + ((errors++)) + else + echo " [โœ“] PVE version: $(get_pve_version)" >&2 + fi + + # Check no-subscription repo + if grep -rq "pve-no-subscription" /etc/apt/sources.list.d/ 2>/dev/null; then + echo " [โœ“] No-subscription repository configured" >&2 + else + echo " [โœ—] No-subscription repository not found" >&2 + ((errors++)) + fi + + # Check enterprise repo disabled + if grep -rq "^deb.*pve-enterprise" /etc/apt/sources.list.d/ 2>/dev/null; then + echo " [!] Enterprise repository still active (requires subscription)" >&2 + else + echo " [โœ“] Enterprise repository disabled" >&2 + fi + + # Check cluster status + local cluster_status=$(pvesh get /cluster/status --output-format=json 2>/dev/null | jq -r '.[0].type' 2>/dev/null || echo "unknown") + if [ "$cluster_status" == "cluster" ]; then + echo " [โœ“] Part of Proxmox cluster" >&2 + else + echo " [โœ“] Standalone node" >&2 + fi + + if [ $errors -eq 0 ]; then + echo "[โœ“] Proxmox validation passed" >&2 + return 0 + else + echo "[โœ—] Proxmox validation failed ($errors errors)" >&2 + return 1 + fi +} + +# Export functions +export -f get_pve_version +export -f get_pve_major_version +export -f is_pve_supported_version +export -f disable_pve_enterprise_repo +export -f enable_pve_no_subscription_repo +export -f configure_ceph_repos +export -f add_pvetest_repo +export -f configure_all_pve_repos +export -f disable_subscription_nag +export -f enable_subscription_nag +export -f disable_ha_services +export -f enable_ha_services +export -f check_ha_status +export -f update_pve_system +export -f check_reboot_required +export -f run_proxmox_post_install +export -f validate_proxmox_config diff --git a/scripts/lib/validation.sh b/scripts/lib/validation.sh new file mode 100644 index 0000000..6ac0302 --- /dev/null +++ b/scripts/lib/validation.sh @@ -0,0 +1,510 @@ +#!/bin/bash + +# ============================================================================== +# VALIDATION LIBRARY: Comprehensive System Health Checks +# ============================================================================== +# Part of unified bootstrap system for homelab infrastructure +# Provides comprehensive pre-flight and post-bootstrap validation with +# severity levels (critical, warning, info) for operational readiness. +# ============================================================================== + +# Source detection library if not already loaded +if ! type -t detect_os_family &>/dev/null; then + SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" + # shellcheck source=./detection.sh + source "${SCRIPT_DIR}/detection.sh" +fi + +# --- VALIDATION TRACKING --- + +declare -g VALIDATION_ERRORS=0 +declare -g VALIDATION_WARNINGS=0 +declare -g VALIDATION_PASSED=0 + +reset_validation_counters() { + VALIDATION_ERRORS=0 + VALIDATION_WARNINGS=0 + VALIDATION_PASSED=0 +} + +log_pass() { + local message="$1" + echo " [โœ“] $message" >&2 + ((VALIDATION_PASSED++)) +} + +log_warning() { + local message="$1" + echo " [!] WARNING: $message" >&2 + ((VALIDATION_WARNINGS++)) +} + +log_error() { + local message="$1" + echo " [โœ—] ERROR: $message" >&2 + ((VALIDATION_ERRORS++)) +} + +# --- DISK VALIDATION --- + +check_disk_space() { + # Validate sufficient disk space for Docker and system operations + # Critical: Root partition must have at least 10GB free + # Warning: Root partition should have at least 20GB free + + echo "[โš™] Checking disk space..." >&2 + + local root_avail=$(df -BG / | awk 'NR==2 {print $4}' | sed 's/G//') + + if [ -z "$root_avail" ]; then + log_error "Could not determine disk space" + return 1 + fi + + if [ "$root_avail" -lt 10 ]; then + log_error "Insufficient disk space on /: ${root_avail}GB (minimum 10GB required)" + return 1 + elif [ "$root_avail" -lt 20 ]; then + log_warning "Low disk space on /: ${root_avail}GB (recommend 20GB+)" + else + log_pass "Disk space on /: ${root_avail}GB available" + fi + + # Check for separate /var partition (common on servers) + if mountpoint -q /var 2>/dev/null; then + local var_avail=$(df -BG /var | awk 'NR==2 {print $4}' | sed 's/G//') + if [ "$var_avail" -lt 20 ]; then + log_warning "Low disk space on /var: ${var_avail}GB (Docker images will go here)" + else + log_pass "Disk space on /var: ${var_avail}GB available" + fi + fi + + return 0 +} + +check_disk_performance() { + # Check if root is on SSD vs HDD (performance indicator) + + echo "[โš™] Checking disk performance characteristics..." >&2 + + local root_device=$(df / | awk 'NR==2 {print $1}' | sed 's/[0-9]*$//') + local device_name=$(basename "$root_device") + + if [ -f "/sys/block/$device_name/queue/rotational" ]; then + local rotational=$(cat "/sys/block/$device_name/queue/rotational") + if [ "$rotational" -eq 0 ]; then + log_pass "Root filesystem on SSD ($device_name)" + else + log_warning "Root filesystem on HDD ($device_name) - SSD recommended for Docker" + fi + else + log_warning "Could not determine disk type for $device_name" + fi + + return 0 +} + +# --- MEMORY VALIDATION --- + +check_memory() { + # Validate sufficient RAM for deployment type + # Proxmox: 8GB minimum, Docker Swarm: 4GB minimum, Pi: 2GB minimum + + echo "[โš™] Checking memory..." >&2 + + local mem_total_kb=$(grep MemTotal /proc/meminfo | awk '{print $2}') + local mem_total_gb=$((mem_total_kb / 1024 / 1024)) + + if [ -z "$mem_total_gb" ] || [ "$mem_total_gb" -eq 0 ]; then + log_error "Could not determine system memory" + return 1 + fi + + local hardware_type=$(detect_hardware_type) + local min_required=2 + + case "$hardware_type" in + proxmox) + min_required=8 + ;; + docker-vm|physical-docker) + min_required=4 + ;; + ai-workstation) + min_required=16 + ;; + pi) + min_required=2 + ;; + esac + + if [ "$mem_total_gb" -lt "$min_required" ]; then + log_error "Insufficient RAM: ${mem_total_gb}GB (minimum ${min_required}GB for $hardware_type)" + return 1 + else + log_pass "Memory: ${mem_total_gb}GB (meets ${min_required}GB minimum for $hardware_type)" + fi + + return 0 +} + +check_swap() { + # Check swap configuration (important for memory-constrained systems) + + echo "[โš™] Checking swap configuration..." >&2 + + local swap_total_kb=$(grep SwapTotal /proc/meminfo | awk '{print $2}') + local swap_total_gb=$((swap_total_kb / 1024 / 1024)) + + local hardware_type=$(detect_hardware_type) + + if [ "$hardware_type" == "proxmox" ]; then + # Proxmox hosts should NOT have swap enabled (best practice) + if [ "$swap_total_gb" -gt 0 ]; then + log_warning "Swap enabled on Proxmox host (${swap_total_gb}GB) - consider disabling" + else + log_pass "Swap disabled (correct for Proxmox)" + fi + else + # Other systems benefit from swap + if [ "$swap_total_gb" -eq 0 ]; then + log_warning "No swap configured - may cause OOM issues under load" + else + log_pass "Swap: ${swap_total_gb}GB configured" + fi + fi + + return 0 +} + +# --- NETWORK VALIDATION --- + +check_network_routes() { + # Validate proper network routing configuration + + echo "[โš™] Checking network routing..." >&2 + + # Check for default route + if ip route show default &>/dev/null; then + local gateway=$(ip route show default | awk '/^default/ {print $3; exit}') + log_pass "Default gateway configured: $gateway" + else + log_error "No default gateway configured" + return 1 + fi + + # Check for DNS servers + if [ -f /etc/resolv.conf ]; then + local dns_count=$(grep -c "^nameserver" /etc/resolv.conf) + if [ "$dns_count" -gt 0 ]; then + log_pass "DNS servers configured ($dns_count entries)" + else + log_warning "No DNS servers in /etc/resolv.conf" + fi + else + log_error "/etc/resolv.conf missing" + return 1 + fi + + return 0 +} + +check_hostname_resolution() { + # Validate hostname resolution (important for cluster operations) + + echo "[โš™] Checking hostname resolution..." >&2 + + local hostname=$(hostname) + local fqdn=$(hostname -f 2>/dev/null || echo "") + + if [ -n "$hostname" ]; then + log_pass "Hostname: $hostname" + else + log_error "Hostname not set" + return 1 + fi + + # Check if hostname resolves + if host "$hostname" &>/dev/null || getent hosts "$hostname" &>/dev/null; then + log_pass "Hostname resolves" + else + log_warning "Hostname '$hostname' does not resolve - may cause cluster issues" + fi + + # Check /etc/hosts + if grep -q "127.0.1.1.*$hostname" /etc/hosts 2>/dev/null; then + log_pass "Hostname in /etc/hosts" + else + log_warning "Hostname not in /etc/hosts - adding is recommended" + fi + + return 0 +} + +# --- NFS CLIENT VALIDATION --- + +check_nfs_client() { + # Validate NFS client packages and kernel modules + + echo "[โš™] Checking NFS client..." >&2 + + # Check for NFS common package + if dpkg -l | grep -q "^ii.*nfs-common"; then + log_pass "nfs-common package installed" + else + log_warning "nfs-common not installed - required for NFS mounts" + return 1 + fi + + # Check for NFS kernel modules + if lsmod | grep -q "^nfs "; then + log_pass "NFS kernel module loaded" + else + # Try to load it + if sudo modprobe nfs 2>/dev/null; then + log_pass "NFS kernel module loaded successfully" + else + log_warning "Could not load NFS kernel module" + fi + fi + + return 0 +} + +# --- DOCKER VALIDATION --- + +check_docker_daemon() { + # Validate Docker installation and daemon health + + echo "[โš™] Checking Docker installation..." >&2 + + # Check if Docker is installed + if ! command -v docker &>/dev/null; then + log_warning "Docker not installed (will be installed during bootstrap)" + return 0 + fi + + log_pass "Docker binary found: $(docker --version 2>/dev/null | head -n1)" + + # Check if daemon is running + if systemctl is-active docker &>/dev/null; then + log_pass "Docker daemon running" + else + log_warning "Docker daemon not running" + return 0 + fi + + # Check Docker socket permissions + if [ -S /var/run/docker.sock ]; then + if sudo docker ps &>/dev/null; then + log_pass "Docker socket accessible" + else + log_warning "Docker socket exists but not accessible" + fi + else + log_warning "Docker socket not found" + fi + + # Check storage driver + local storage_driver=$(docker info 2>/dev/null | grep "Storage Driver" | awk '{print $3}') + if [ -n "$storage_driver" ]; then + log_pass "Storage driver: $storage_driver" + + # Warn about devicemapper (deprecated) + if [ "$storage_driver" == "devicemapper" ]; then + log_warning "devicemapper storage driver is deprecated - consider overlay2" + fi + fi + + return 0 +} + +# --- PROXMOX VALIDATION --- + +check_proxmox_api() { + # Validate Proxmox VE installation and API accessibility + + local hardware_type=$(detect_hardware_type) + + if [ "$hardware_type" != "proxmox" ]; then + return 0 # Skip if not Proxmox + fi + + echo "[โš™] Checking Proxmox VE..." >&2 + + # Check for pveversion + if command -v pveversion &>/dev/null; then + local pve_version=$(pveversion | head -n1) + log_pass "Proxmox installed: $pve_version" + else + log_error "Proxmox tools not found (pveversion missing)" + return 1 + fi + + # Check cluster status + if command -v pvesh &>/dev/null; then + if sudo pvesh get /cluster/status &>/dev/null; then + log_pass "Proxmox API accessible" + else + log_warning "Proxmox API not responding" + fi + fi + + # Check for required repositories + if [ -f /etc/apt/sources.list.d/pve-no-subscription.list ]; then + log_pass "No-subscription repository configured" + else + log_warning "No-subscription repository not configured" + fi + + return 0 +} + +# --- SECURITY VALIDATION --- + +check_ssh_security() { + # Basic SSH security validation + + echo "[โš™] Checking SSH security..." >&2 + + # Check if SSH is running + if systemctl is-active ssh &>/dev/null || systemctl is-active sshd &>/dev/null; then + log_pass "SSH service running" + else + log_error "SSH service not running" + return 1 + fi + + # Check for password authentication (should be disabled in production) + if [ -f /etc/ssh/sshd_config ]; then + if grep -q "^PasswordAuthentication no" /etc/ssh/sshd_config; then + log_pass "SSH password authentication disabled (secure)" + else + log_warning "SSH password authentication may be enabled - key-only is recommended" + fi + fi + + # Check for root login + if [ -f /etc/ssh/sshd_config ]; then + if grep -q "^PermitRootLogin no" /etc/ssh/sshd_config; then + log_pass "SSH root login disabled (secure)" + else + log_warning "SSH root login may be enabled - consider disabling" + fi + fi + + return 0 +} + +check_firewall() { + # Check firewall status (informational) + + echo "[โš™] Checking firewall..." >&2 + + if systemctl is-active ufw &>/dev/null; then + local ufw_status=$(sudo ufw status 2>/dev/null | head -n1) + log_pass "UFW active: $ufw_status" + elif systemctl is-active firewalld &>/dev/null; then + log_pass "firewalld active" + elif command -v iptables &>/dev/null; then + local iptables_rules=$(sudo iptables -L -n | wc -l) + if [ "$iptables_rules" -gt 8 ]; then + log_pass "iptables rules configured ($iptables_rules lines)" + else + log_warning "No firewall detected - consider enabling UFW" + fi + else + log_warning "No firewall detected" + fi + + return 0 +} + +# --- TIME SYNCHRONIZATION --- + +check_time_sync() { + # Validate NTP/timesyncd for cluster time synchronization + + echo "[โš™] Checking time synchronization..." >&2 + + if systemctl is-active systemd-timesyncd &>/dev/null; then + local ntp_status=$(timedatectl status 2>/dev/null | grep "synchronized" | awk '{print $3}') + if [ "$ntp_status" == "yes" ]; then + log_pass "Time synchronized via systemd-timesyncd" + else + log_warning "Time sync not confirmed" + fi + elif systemctl is-active ntp &>/dev/null || systemctl is-active ntpd &>/dev/null; then + log_pass "NTP service running" + else + log_warning "No time synchronization service detected - critical for clusters" + fi + + return 0 +} + +# --- COMPREHENSIVE VALIDATION SUITE --- + +run_validation_suite() { + # Run all validation checks and return summary + # Returns 0 if no critical errors, 1 if critical errors found + + reset_validation_counters + + echo "======================================" >&2 + echo "SYSTEM VALIDATION SUITE" >&2 + echo "======================================" >&2 + + # Run all checks (continue even if some fail) + check_disk_space || true + check_disk_performance || true + check_memory || true + check_swap || true + check_network_routes || true + check_hostname_resolution || true + check_nfs_client || true + check_docker_daemon || true + check_proxmox_api || true + check_ssh_security || true + check_firewall || true + check_time_sync || true + + # Summary + echo "======================================" >&2 + echo "VALIDATION SUMMARY" >&2 + echo " Passed: $VALIDATION_PASSED" >&2 + echo " Warnings: $VALIDATION_WARNINGS" >&2 + echo " Errors: $VALIDATION_ERRORS" >&2 + echo "======================================" >&2 + + if [ $VALIDATION_ERRORS -gt 0 ]; then + echo "[โœ—] CRITICAL: $VALIDATION_ERRORS validation errors - manual intervention required" >&2 + return 1 + elif [ $VALIDATION_WARNINGS -gt 0 ]; then + echo "[!] WARNINGS: $VALIDATION_WARNINGS issues detected - review recommended" >&2 + return 0 + else + echo "[โœ“] ALL CHECKS PASSED" >&2 + return 0 + fi +} + +# Export functions +export -f reset_validation_counters +export -f log_pass +export -f log_warning +export -f log_error +export -f check_disk_space +export -f check_disk_performance +export -f check_memory +export -f check_swap +export -f check_network_routes +export -f check_hostname_resolution +export -f check_nfs_client +export -f check_docker_daemon +export -f check_proxmox_api +export -f check_ssh_security +export -f check_firewall +export -f check_time_sync +export -f run_validation_suite diff --git a/scripts/onboarding.sh b/scripts/onboarding.sh index 101e9a8..cddc240 100644 --- a/scripts/onboarding.sh +++ b/scripts/onboarding.sh @@ -1,5 +1,46 @@ #!/bin/bash +# ============================================================================== +# DEPRECATED: onboarding.sh +# ============================================================================== +# โš ๏ธ DEPRECATION NOTICE +# This script is deprecated and will be removed in a future release. +# Please use the unified bootstrap.sh script instead: +# +# ./bootstrap.sh +# +# Note: Proxmox SSH key distribution is now handled by: +# 1. Run bootstrap.sh to generate keys +# 2. Manually copy keys to Proxmox: ssh-copy-id root@ +# 3. Run Ansible playbook: ansible-playbook playbooks/onboarding/proxmox_host.yml +# +# This wrapper will redirect to bootstrap.sh. +# ============================================================================== + +set -euo pipefail + +# Show deprecation warning +echo "=======================================" >&2 +echo "โš ๏ธ DEPRECATION WARNING" >&2 +echo "=======================================" >&2 +echo "onboarding.sh is deprecated!" >&2 +echo "" >&2 +echo "Please use: ./bootstrap.sh" >&2 +echo "" >&2 +echo "Note: Proxmox SSH key setup is now a separate step." >&2 +echo "See documentation/SOPs/SOP-002-Initial-Infrastructure-Deployment.md" >&2 +echo "" >&2 +echo "Redirecting to bootstrap.sh in 5 seconds..." >&2 +echo "Press Ctrl+C to cancel" >&2 +echo "=======================================" >&2 +sleep 5 + +# Redirect to unified bootstrap +SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" +exec "${SCRIPT_DIR}/bootstrap.sh" "$@" + +# ============================================================================== +# LEGACY CODE BELOW (no longer executed) # ============================================================================== # ENVIRONMENT VARIABLES # ============================================================================== diff --git a/scripts/pi_init.sh b/scripts/pi_init.sh index d076f41..811fbb7 100644 --- a/scripts/pi_init.sh +++ b/scripts/pi_init.sh @@ -1,5 +1,44 @@ #!/bin/bash +# ============================================================================== +# DEPRECATED: pi_init.sh +# ============================================================================== +# โš ๏ธ DEPRECATION NOTICE +# This script is deprecated and will be removed in a future release. +# Please use the unified bootstrap.sh script instead: +# +# ./bootstrap.sh --hardware-type pi +# +# This wrapper will redirect to bootstrap.sh with appropriate flags. +# ============================================================================== + +set -euo pipefail + +# Show deprecation warning +echo "=======================================" >&2 +echo "โš ๏ธ DEPRECATION WARNING" >&2 +echo "=======================================" >&2 +echo "pi_init.sh is deprecated!" >&2 +echo "" >&2 +echo "Please use: ./bootstrap.sh" >&2 +echo "" >&2 +echo "Redirecting to bootstrap.sh in 5 seconds..." >&2 +echo "Press Ctrl+C to cancel" >&2 +echo "=======================================" >&2 +sleep 5 + +# Redirect to unified bootstrap +SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" +exec "${SCRIPT_DIR}/bootstrap.sh" --hardware-type pi "$@" + +# ============================================================================== +# LEGACY CODE BELOW (no longer executed) +# ============================================================================== + +exit 0 + +#!/bin/bash + # ============================================================================== # SINGLE-COMMAND BOOTSTRAP: IP, DOCKER, ANSIBLE # Target: Ubuntu / Debian diff --git a/scripts/validate-node.sh b/scripts/validate-node.sh new file mode 100644 index 0000000..1b46e8c --- /dev/null +++ b/scripts/validate-node.sh @@ -0,0 +1,215 @@ +#!/bin/bash + +# ============================================================================== +# STANDALONE NODE VALIDATION TOOL +# ============================================================================== +# Comprehensive health check utility for homelab nodes +# Can be run independently on any managed host (post-bootstrap or ad-hoc) +# +# Usage: +# ./validate-node.sh [OPTIONS] +# +# Options: +# --help Show this help message +# --json Output results in JSON format +# --critical-only Show only critical errors (exit code 1 if any found) +# --verbose Show detailed output for each check +# +# Exit Codes: +# 0 - All checks passed or warnings only +# 1 - Critical errors found +# 2 - Invalid usage +# +# ============================================================================== + +set -euo pipefail + +# --- SCRIPT METADATA --- + +SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" +SCRIPT_NAME="$(basename "${BASH_SOURCE[0]}")" +VERSION="1.0.0" + +# --- LOAD LIBRARIES --- + +# shellcheck source=./lib/detection.sh +source "${SCRIPT_DIR}/lib/detection.sh" +# shellcheck source=./lib/validation.sh +source "${SCRIPT_DIR}/lib/validation.sh" +# shellcheck source=./lib/network.sh +source "${SCRIPT_DIR}/lib/network.sh" + +# --- COMMAND LINE ARGUMENTS --- + +SHOW_HELP=false +OUTPUT_JSON=false +CRITICAL_ONLY=false +VERBOSE=false + +parse_arguments() { + while [[ $# -gt 0 ]]; do + case "$1" in + --help|-h) + SHOW_HELP=true + shift + ;; + --json) + OUTPUT_JSON=true + shift + ;; + --critical-only) + CRITICAL_ONLY=true + shift + ;; + --verbose|-v) + VERBOSE=true + shift + ;; + *) + echo "ERROR: Unknown option: $1" >&2 + echo "Use --help for usage information" >&2 + exit 2 + ;; + esac + done +} + +show_help() { + cat <<'EOF' +============================================================================= +STANDALONE NODE VALIDATION TOOL v1.0.0 +============================================================================= + +Comprehensive health check for homelab infrastructure nodes. +Can be run on any managed host to verify operational readiness. + +USAGE: + ./validate-node.sh [OPTIONS] + +OPTIONS: + --help Show this help message + --json Output results in JSON format (for monitoring integration) + --critical-only Show only critical errors (suppress warnings) + --verbose Show detailed output for each check + +EXIT CODES: + 0 - All checks passed (or warnings only) + 1 - Critical errors found + 2 - Invalid usage + +CHECKS PERFORMED: + โ€ข Disk Space & Performance + โ€ข Memory & Swap Configuration + โ€ข Network Routes & Connectivity + โ€ข Hostname Resolution + โ€ข NFS Client Configuration + โ€ข Docker Daemon Health + โ€ข Proxmox API (if applicable) + โ€ข SSH Security Configuration + โ€ข Firewall Status + โ€ข Time Synchronization + +EXAMPLES: + ./validate-node.sh + Run all checks with standard output + + ./validate-node.sh --critical-only + Show only critical errors (useful in scripts) + + ./validate-node.sh --json + Output JSON for monitoring/alerting systems + + ./validate-node.sh --verbose + Show detailed information for each check + +INTEGRATION: + JSON output can be consumed by monitoring systems (Prometheus, Grafana, etc.) + Example: ./validate-node.sh --json | jq '.errors' + +============================================================================= +EOF +} + +# --- JSON OUTPUT FUNCTIONS --- + +generate_json_report() { + # Generate JSON output for monitoring integration + + local hostname=$(hostname) + local timestamp=$(date -u +"%Y-%m-%dT%H:%M:%SZ") + + cat <