homelab/scripts/README.md
nathan e16f98a183 feat(bootstrap)!: introduce unified bootstrap system with modular libraries
BREAKING CHANGE: day0bootstrap.sh deprecated in favor of bootstrap.sh

- Add scripts/bootstrap.sh (488 lines): Unified entrypoint supporting multiple hardware types (Proxmox/Docker VMs/Pi)
- Create scripts/lib/ modular library system:
  - detection.sh: OS/hardware/container detection (362 lines)
  - fingerprint.sh: System fingerprinting and inventory (494 lines)
  - network.sh: IP configuration and VLAN placement (356 lines)
  - proxmox.sh: PVE post-install automation (453 lines)
  - validation.sh: Comprehensive pre-flight checks (510 lines)
- Add validation tools: validate-node.sh, onboarding.sh, pi_init.sh
- Deprecate scripts/day0bootstrap.sh with graceful redirect wrapper
- Document architecture in scripts/README.md (495 lines) and PROXMOX-COMPARISON.md
- Update SOP-002 with new bootstrap workflow
- Add nodes/watchtower/compose.yaml (Raspberry Pi 5 stack)

Migration: Existing day0bootstrap.sh users automatically redirected to new system after 5-second warning. No manual intervention required.

Ref: Infrastructure automation modernization per active-tasks.md
2026-04-12 22:48:19 -04:00

569 lines
17 KiB
Markdown

# Homelab Bootstrap & Utility Scripts
This directory contains the unified day0 onboarding infrastructure and operational utility scripts for homelab hardware provisioning.
## 🚀 Quick Start
**New Hardware Onboarding (Recommended):**
```bash
# Auto-detect hardware type and configure
./bootstrap.sh
# Preview changes without applying
./bootstrap.sh --dry-run
# Force specific hardware type
./bootstrap.sh --hardware-type proxmox --target-ip 10.0.0.201
```
**Health Check on Existing Nodes:**
```bash
# Run comprehensive validation
./validate-node.sh
# JSON output for monitoring
./validate-node.sh --json
```
---
## 📁 Directory Structure
```
scripts/
├── bootstrap.sh # ⭐ Unified day0 onboarding (NEW)
├── validate-node.sh # ⭐ Standalone health check tool (NEW)
├── lib/ # ⭐ Modular libraries (NEW)
│ ├── detection.sh # OS/hardware auto-detection
│ ├── network.sh # Network configuration & validation
│ ├── validation.sh # Comprehensive health checks
│ ├── fingerprint.sh # Hardware inventory collection
│ └── proxmox.sh # Proxmox VE post-install (comprehensive)
├── day0bootstrap.sh # 🔻 DEPRECATED - use bootstrap.sh
├── pi_init.sh # 🔻 DEPRECATED - use bootstrap.sh
├── onboarding.sh # 🔻 DEPRECATED - use bootstrap.sh
└── README.md # This file
```
---
## 🛠️ Core Tools
### `bootstrap.sh` — Unified Day0 Onboarding
**Purpose:** Intelligent initialization for all homelab hardware types. Auto-detects OS, hardware, and applies appropriate configuration.
**Replaces:** `day0bootstrap.sh`, `pi_init.sh`, `onboarding.sh`
**Features:**
- ✅ Auto-detection (Proxmox, Docker VM, Raspberry Pi, physical servers, AI workstations)
- ✅ Static IP configuration via netplan
- ✅ Docker & Ansible installation (with Debian Trixie compatibility)
- ✅ SSH key generation (ED25519 preferred, RSA fallback)
- ✅ Comprehensive validation suite
- ✅ Hardware fingerprinting with YAML/JSON output
- ✅ Ansible inventory auto-discovery
**Usage:**
```bash
./bootstrap.sh [OPTIONS]
Options:
--help Show detailed help
--dry-run Preview actions without changes
--hardware-type TYPE Override auto-detection
(proxmox|docker-vm|pi|physical-docker|ai-workstation)
--skip-network Skip network configuration
--skip-validation Skip post-bootstrap validation
--output-json Generate JSON output instead of YAML
--target-ip IP Set static IP address (default: auto-assigned)
--gateway IP Set gateway IP (default: 10.0.0.1)
--dns IP Set DNS server (default: 10.0.0.2)
```
**Examples:**
```bash
# Auto-detect and configure with defaults
./bootstrap.sh
# Preview what would be done (safe to run)
./bootstrap.sh --dry-run
# Force Proxmox mode with custom IP
./bootstrap.sh --hardware-type proxmox --target-ip 10.0.0.201
# Skip network config (already configured manually)
./bootstrap.sh --skip-network
# Generate JSON for monitoring integration
./bootstrap.sh --output-json
```
**Output Files:**
- `../ansible/archive/outputs/hardware-facts-{hostname}-{timestamp}.yml`
- `../ansible/archive/outputs/bootstrap-validation-{hostname}-{timestamp}.log`
- `../ansible/archive/inventory/discovered-hosts.yml` (auto-discovery)
**Workflow:**
1. **Detection** → Identify OS, hardware type, CPU, GPU
2. **Network Config** → Apply static IP via netplan (SSH will disconnect)
3. **Package Install** → Docker, Ansible, proxmoxer (as needed)
4. **SSH Keys** → Generate/verify ED25519 keys
5. **Validation** → Comprehensive health checks (disk, memory, network, etc.)
6. **Fingerprinting** → Generate hardware inventory YAML
**Network Warning:** Network configuration will disconnect SSH. Reconnect to new IP after ~10 seconds.
---
### `validate-node.sh` — Standalone Health Check Tool
**Purpose:** Comprehensive validation suite for operational readiness. Can be run post-bootstrap or ad-hoc on any managed node.
**Features:**
- ✅ Disk space & performance checks
- ✅ Memory & swap validation
- ✅ Network routing & connectivity tests
- ✅ NFS client configuration
- ✅ Docker daemon health
- ✅ Proxmox API accessibility (if applicable)
- ✅ SSH security audit
- ✅ Time synchronization verification
- ✅ JSON output for monitoring integration
**Usage:**
```bash
./validate-node.sh [OPTIONS]
Options:
--help Show detailed help
--json Output results in JSON format
--critical-only Show only critical errors
--verbose Show detailed output for each check
```
**Examples:**
```bash
# Run all checks with standard output
./validate-node.sh
# Only show critical errors (useful in scripts)
./validate-node.sh --critical-only
# JSON output for monitoring/alerting
./validate-node.sh --json | jq '.validation.errors'
# Verbose mode with detection summary
./validate-node.sh --verbose
```
**Exit Codes:**
- `0` → All checks passed (or warnings only)
- `1` → Critical errors found
- `2` → Invalid usage
**Integration Example:**
```bash
# Use in deployment pipeline
if ./validate-node.sh --critical-only; then
echo "Node healthy, proceeding with deployment"
else
echo "Critical issues found, aborting"
exit 1
fi
```
---
## 📚 Library Components
### `lib/detection.sh` — System Detection
**Functions:**
- `detect_os_family()` → debian, ubuntu, raspbian
- `detect_os_version()` → trixie, bookworm, noble, etc.
- `detect_hardware_type()` → proxmox, docker-vm, pi, physical-docker, ai-workstation
- `detect_cpu_vendor()` → intel, amd, arm
- `detect_cpu_generation()` → Intel generation number (12, 13, 14, etc.)
- `detect_gpu()` → nvidia, amd, intel, none
- `detect_primary_interface()` → Network interface name
- `get_current_ip()` → Current IPv4 address
**Usage:**
```bash
source lib/detection.sh
HW_TYPE=$(detect_hardware_type)
echo "Hardware type: $HW_TYPE"
```
---
### `lib/network.sh` — Network Configuration
**Functions:**
- `apply_static_ip()` → Configure netplan with static IP
- `configure_network_safe()` → Apply network changes with reconnection instructions
- `validate_connectivity()` → Test gateway, internet, DNS
- `check_nfs_accessibility()` → Test NFS server reachability
- `get_desired_vlan_ip()` → Determine VLAN IP based on hardware type (future)
**Usage:**
```bash
source lib/network.sh
configure_network_safe "10.0.0.151" "10.0.0.1" "10.0.0.2"
```
---
### `lib/validation.sh` — Health Checks
**Functions:**
- `check_disk_space()` → Validate sufficient disk space
- `check_memory()` → Validate RAM meets hardware type requirements
- `check_network_routes()` → Validate routing configuration
- `check_docker_daemon()` → Validate Docker installation and health
- `check_proxmox_api()` → Validate Proxmox VE (if applicable)
- `run_validation_suite()` → Run all checks and return summary
**Severity Levels:**
- **Critical** → Blocks Ansible playbook execution
- **Warning** → Manual review recommended
- **Info** → Logged for reference
---
### `lib/fingerprint.sh` — Hardware Inventory
**Functions:**
- `collect_cpu_facts()` → CPU model, cores, frequency
- `collect_memory_facts()` → RAM and swap details
- `collect_disk_facts()` → Storage capacity and type (SSD/HDD)
- `collect_network_facts()` → Network interfaces and connectivity
- `generate_hardware_facts_yaml()` → Full YAML output
- `generate_ansible_inventory_snippet()` → Ansible inventory entry
**Output Format (YAML):**
```yaml
hostname:
os:
family: " "bookworm"
cpu:
model: "Intel Core i7-12700"
cores: 12
memory:
total_gb: 32
storage:
root:
total_gb: 500
type: "ssd"
```
---
### `lib/proxmox.sh` — Proxmox VE Post-Install
**Comprehensive Proxmox configuration (inspired by community-scripts/ProxmoxVE):**
**Repository Management:**
- `configure_all_pve_repos()` → Complete repository configuration
- `disable_pve_enterprise_repo()` → Disable subscription-required repository
- `enable_pve_no_subscription_repo()` → Enable free updates repository
- `configure_ceph_repos()` → Configure Ceph storage repositories
- `add_pvetest_repo()` → Add beta/test repository (disabled by default)
**Subscription Nag Removal:**
- `disable_subscription_nag()` → Remove subscription warnings from UI
- Patches web UI JavaScript (proxmoxlib.js)
- Patches mobile UI templates
- Creates apt hook to persist after updates
**High Availability Management:**
- `disable_ha_services()` → Disable HA for single-node setups (saves resources)
- `enable_ha_services()` → Enable HA for clustered environments
- `check_ha_status()` → Check current HA service status
**System Maintenance:**
- `update_pve_system()` → Run full system update (apt dist-upgrade)
- `check_reboot_required()` → Check if reboot needed after updates
**Comprehensive Routine:**
- `run_proxmox_post_install()`**All-in-one post-install configuration**
- Repository fixes (PVE 8.x and 9.x compatible)
- Subscription nag removal
- Auto-detect single vs. cluster and adjust HA accordingly
- System update checks
- Reboot requirement detection
**Validation:**
- `validate_proxmox_config()` → Verify Proxmox configuration
- `get_pve_version()` → Get Proxmox VE version
- `is_pve_supported_version()` → Check if version is supported (8.0-8.9.x, 9.0-9.1.x)
**Usage:**
```bash
source lib/proxmox.sh
# Run complete post-install (auto mode)
run_proxmox_post_install "auto"
# Individual functions
disable_subscription_nag
disable_ha_services # For single-node setups
validate_proxmox_config
```
**What It Does (Compared to Community Script):**
- ✅ Repository management (enterprise/no-subscription/Ceph/pvetest)
- ✅ Subscription nag removal (web + mobile UI)
- ✅ HA service management (auto-detect single vs. cluster)
- ✅ Supports PVE 8.x (.list format) and 9.x (.sources deb822 format)
- ✅ System update and reboot detection
- ✅ Automated (non-interactive) for use in bootstrap.sh
**Integration:**
When `bootstrap.sh` detects Proxmox hardware, it automatically runs `run_proxmox_post_install "auto"` after package installation, providing the same comprehensive post-install configuration as the community tteck/ProxmoxVE script. type: "ssd"
```
---
## 🔻 Deprecated Scripts
**These scripts are deprecated and will be removed in a future release. Use `bootstrap.sh` instead.**
### `day0bootstrap.sh` → `bootstrap.sh --hardware-type pi`
**Original Purpose:** Debian Trixie bootstrap for Raspberry Pi
**Status:** Wrapper redirects to bootstrap.sh with deprecation warning
### `pi_init.sh` → `bootstrap.sh --hardware-type pi`
**Original Purpose:** Raspberry Pi initialization
**Status:** Wrapper redirects to bootstrap.sh with deprecation warning
### `onboarding.sh` → `bootstrap.sh`
**Original Purpose:** Generic onboarding + Proxmox integration
**Status:** Wrapper redirects to bootstrap.sh with deprecation warning
**Migration Path:**
All legacy scripts now display a deprecation warning and automatically redirect to `bootstrap.sh` with appropriate flags. You have 6 months to update your workflows before these wrappers are removed.
---
## 🎯 Common Workflows
### New Raspberry Pi Control Node
```bash
# 1. Transfer scripts to Pi
scp -r scripts/ chester@10.0.0.200:~/
# 2. SSH to Pi
ssh chester@10.0.0.200
# 3. Run bootstrap
cd scripts
./bootstrap.sh # Auto-detects Pi hardware
# 4. SSH will disconnect - reconnect to new IP
ssh chester@10.0.0.200
```
### New Proxmox Host
```bash
# On Proxmox console (pre-SSH setup)
curl -O https://git.castaldifamily.com/nathan/homelab/raw/branch/main/scripts/bootstrap.sh
chmod +x bootstrap.sh
./bootstrap.sh --hardware-type proxmox --target-ip 10.0.0.201
```
### New Docker Swarm VM
```bash
# From control node (Watchtower)
scp -r scripts/ chester@10.0.0.211:~/
ssh chester@10.0.0.211
cd scripts
./bootstrap.sh --hardware-type docker-vm --target-ip 10.0.0.211
```
### Health Check All Nodes (from Control Node)
```bash
# Create validation script
cat > validate-all.sh <<'EOF'
#!/bin/bash
for host in heimdall waldorf watchtower; do
echo "=== $host ==="
ssh $host "~/scripts/validate-node.sh --critical-only"
done
EOF
chmod +x validate-all.sh
./validate-all.sh
```
---
## 🧪 Testing & Development
### Test Bootstrap in Dry-Run Mode
```bash
# Safe to run on production nodes
./bootstrap.sh --dry-run --hardware-type docker-vm
```
### Manually Test Libraries
```bash
# Source library and test functions
source lib/detection.sh
print_detection_summary
source lib/validation.sh
run_validation_suite
```
### Validate Script Syntax
```bash
# Check for syntax errors
bash -n bootstrap.sh
shellcheck bootstrap.sh lib/*.sh
```
---
## 📖 Documentation
**Primary References:**
- [SOP-002: Initial Infrastructure Deployment](../documentation/SOPs/SOP-002-Initial-Infrastructure-Deployment.md)
- [Technical Runbook](../documentation/TECHNICAL_RUNBOOK.md)
- [Environment Constraints](../ansible/archive/documentation/standards/environment-constraints.md)
**Related Ansible Playbooks:**
- [generic_host.yml](../ansible/archive/playbooks/onboarding/generic_host.yml) — Second-stage onboarding
- [proxmox_host.yml](../ansible/archive/playbooks/onboarding/proxmox_host.yml) — Proxmox-specific config
- [gather_hardware_facts.yml](../ansible/archive/playbooks/preflight/gather_hardware_facts.yml) — Ansible fact collection
---
## 🐛 Troubleshooting
**Bootstrap fails with "could not detect primary interface":**
- Check network cable/Wi-Fi connection
- Run manually: `ip -o link show` to see interfaces
- Override: `INTERFACE=eth0 ./bootstrap.sh`
**SSH disconnects and doesn't reconnect:**
- Wait 30 seconds and retry
- Check console access if available
- Verify IP in router DHCP leases
**Validation reports critical errors:**
- Review log: `cat ../ansible/archive/outputs/bootstrap-validation-*.log`
- Address critical issues (disk space, memory, etc.)
- Re-run validation: `./validate-node.sh`
**Docker installation fails on Debian Trixie:**
- Script automatically uses Bookworm repos as fallback
- Check logs: `journalctl -u docker`
**Hardware fingerprinting shows "unknown" values:**
- Some VMs may not expose full hardware info
- Run `sudo dmidecode --type system` for manual inspection
---
## 🔮 Future Enhancements
**Planned (Not Yet Implemented):**
- [ ] PXE boot + cloud-init provisioning (eliminate manual OS install)
- [ ] Active VLAN migration logic (10.0.0.x → 10.0.10.x/10.0.200.x)
- [ ] Proxmox VM auto-provisioning via API
- [ ] Node retirement/decommissioning automation
- [ ] Integration with Prometheus node exporter metrics
- [ ] Containerized control node for disaster recovery
**VLAN Placeholder:**
The `get_desired_vlan_ip()` function contains placeholders for future VLAN segmentation. Current implementation uses flat 10.0.0.x network. See comments in `lib/network.sh` for migration plan.
---
## 📝 Version History
- **v4.0.0** (2026-04-12): Initial unified bootstrap release
- Consolidated 3 legacy scripts into single entry point
- Added modular library architecture
- Comprehensive validation and fingerprinting
- Auto-discovery with Ansible inventory generation
---
**Questions?** See [documentation/TECHNICAL_RUNBOOK.md](../documentation/TECHNICAL_RUNBOOK.md) or review session memory in `/memories/session/plan.md`.
# scripts
Automation utilities and helper scripts for homelab infrastructure management.
---
## Inventory
| Script | Purpose | Status |
|--------|---------|--------|
| [onboarding.sh](onboarding.sh) | Bootstrap Ansible control node for Proxmox management | 🟡 **DRAFT** - Testing Required |
---
## onboarding.sh
**Purpose:** Automated setup of Ansible control node for Proxmox infrastructure management.
**What it does:**
1. Installs Ansible and Proxmoxer Python library
2. Detects or generates SSH keypair (ED25519 preferred, RSA fallback)
3. Copies public key to Proxmox server for passwordless authentication
4. Generates Ansible inventory file (`hosts.ini`) with Proxmox connection details
**Prerequisites:**
- Debian/Ubuntu-based system (uses `apt`)
- Network access to Proxmox server
- Initial SSH password for target Proxmox server
**Configuration:**
Edit the following variables at the top of the script:
```bash
PROXMOX_IP="192.168.1.100" # Target Proxmox server IP
PROXMOX_USER="root" # Proxmox SSH user
```
**Usage:**
```bash
cd ~/dev/homelab/scripts
chmod +x onboarding.sh
./onboarding.sh
```
**Verification:**
```bash
ansible proxmox_nodes -m ping -i hosts.ini
```
---
## ⚠️ Development Status
| Script | Testing Status | Known Issues |
|--------|---------------|--------------|
| onboarding.sh | ❌ Untested in production | • Hardcoded Proxmox IP/user variables<br>• No error handling for failed SSH key copy<br>• Assumes Debian/Ubuntu package manager<br>• No validation of Proxmox connectivity |
**DO NOT USE IN PRODUCTION** until the following are addressed:
1. **Error Handling:** Add validation checks for each step
2. **Idempotency:** Verify script can be safely re-run
3. **Multi-OS Support:** Test on RHEL/Arch variants or add OS detection
4. **Interactive Mode:** Prompt for PROXMOX_IP/USER instead of manual editing
5. **Rollback:** Add cleanup mechanism for failed installations
---
## Contributing
When adding new scripts:
1. Update the **Inventory** table with script name and purpose
2. Document prerequisites, configuration, and usage
3. Mark status as 🟡 DRAFT until production-tested
4. Add to **Development Status** table with known issues