# SOP-002: Initial Infrastructure Deployment
**Status:** Active
**Created:** April 12, 2026
**Last Updated:** April 12, 2026
**Owner:** Nathan Castaldi
**Applies To:** Fresh homelab deployments and disaster recovery scenarios
---
## Purpose
Deploy the complete homelab infrastructure from a clean state using GitOps principles and automation. This SOP covers:
- Secure repository setup with encrypted secrets
- Ansible control node configuration
- Core service deployment (Docker Socket Proxy, Traefik, Redis, Komodo Core)
- Validation and health checks
**Use Cases:**
- New homelab initialization
- Disaster recovery (full infrastructure rebuild)
- Node replacement or migration
---
## Prerequisites
### Required Access
- [ ] Physical or console access to all nodes (Heimdall, Waldorf, Watchtower)
- [ ] GitHub account with access to `homelab` repository
- [ ] Gitea credentials (if repository already hosted locally)
- [ ] Root/sudo privileges on all nodes
### Required Infrastructure
- [ ] Nodes have base OS installed (Debian/Ubuntu recommended)
- [ ] Network connectivity between all nodes
- [ ] NFS storage accessible at `10.0.0.250:/Volume1/appdata`
- [ ] DNS/hosts file configured for node resolution
- [ ] Internet access for package installation
### Security Requirements
- [ ] Git-crypt symmetric key (if repository already encrypted)
- [ ] Password manager for storing credentials
- [ ] Secure workstation for handling keys and secrets
---
## Security & Pre-Deployment Setup
### Step 1: Prepare Your Workstation
**Time:** 15-20 minutes
1. **Install Required Tools:**
**Linux/macOS:**
```bash
# Install git-crypt
brew install git-crypt # macOS
# OR
sudo apt install git-crypt # Debian/Ubuntu
# Verify installation
git-crypt --version
```
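**Windows (Git Bash/WSL):**
```bash
# Git Bash: download the git-crypt binary (the target directory may not exist yet)
mkdir -p /usr/local/bin
curl -L https://github.com/AGWA/git-crypt/releases/download/0.7.0/git-crypt-0.7.0-x86_64.exe -o /usr/local/bin/git-crypt
chmod +x /usr/local/bin/git-crypt
# WSL: install with apt as shown for Debian/Ubuntu instead of the Windows binary
```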
2. **Configure Git Identity:**
```bash
git config --global user.name "Your Name"
git config --global user.email "your.email@domain.com"
git config --global core.autocrlf true # Windows only
```
---
### Step 2: Clone Repository & Initialize Secrets
**Time:** 10-15 minutes
1. **Clone from Source:**
**Option A: GitHub (Initial Clone):**
```bash
cd ~/dev # Or your preferred code directory
git clone https://github.com/your-username/homelab.git
cd homelab
```
**Option B: Gitea (Production Environment):**
```bash
cd ~/dev
git clone https://git.castaldifamily.com/nathan/homelab.git
cd homelab
```
2. **Unlock Encrypted Secrets (If Repository Already Uses Git-crypt):**
```bash
# Import the symmetric key (retrieve from password manager)
git-crypt unlock /path/to/homelab-secrets.key
# Verify decryption without printing the secrets
file nodes/heimdall/core/.env.secrets
# Expect a text type (e.g. "ASCII text"); a still-encrypted file is reported as "data"
```
**⚠️ Security Warning:** Store `homelab-secrets.key` only in:
- Password manager (1Password, Bitwarden, etc.)
- Encrypted backup drive

**NEVER** commit the key to the repository.
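
To check decryption across several files without printing their contents, the following is a minimal sketch. It assumes git-crypt's on-disk format, in which an encrypted file begins with the 10-byte header `\0GITCRYPT\0`. The function name `is_gitcrypt_encrypted` is hypothetical and not part of the repository's scripts. Example use: `is_gitcrypt_encrypted nodes/heimdall/core/.env.secrets && echo "still encrypted"`.

```bash
# is_gitcrypt_encrypted FILE
# Returns 0 if FILE still starts with the git-crypt header (it is NOT decrypted),
# and non-zero otherwise.
is_gitcrypt_encrypted() {
  # Hex dump of the first 10 bytes, compared against: NUL "GITCRYPT" NUL
  [ "$(head -c 10 "$1" | od -An -tx1 | tr -d ' \n')" = "00474954435259505400" ]
}
```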
3. **Initialize Git-crypt (First-Time Setup Only):**
```bash
# If repository is NOT yet encrypted
git-crypt init
git-crypt export-key ~/homelab-secrets.key
# Secure the key immediately
chmod 600 ~/homelab-secrets.key
```
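After `git-crypt init`, nothing is encrypted until `.gitattributes` names the files to protect. The sketch below appends such a rule idempotently. The helper name `add_gitcrypt_rule` is hypothetical, and the pattern `.env.secrets` is an assumption based on the secrets file referenced earlier in this step. Example use from the repository root: `add_gitcrypt_rule '.env.secrets'`, then commit `.gitattributes` before adding any secrets and confirm with `git-crypt status`.

```bash
# add_gitcrypt_rule PATTERN [ATTR_FILE]
# Appends a git-crypt filter rule for PATTERN to ATTR_FILE (default: .gitattributes)
# unless an identical line is already present.
add_gitcrypt_rule() {
  local rule="$1 filter=git-crypt diff=git-crypt"
  local file="${2:-.gitattributes}"
  touch "$file"
  grep -qxF -- "$rule" "$file" || printf '%s\n' "$rule" >> "$file"
}
```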
---
## Ansible Control Node Setup
### Step 3: Bootstrap Watchtower as Control Node
**Time:** 15-20 minutes (reduced from 25-35 via automation)
**Rationale:** Watchtower (Raspberry Pi 5) serves as the Ansible control node to manage all infrastructure, including itself.
**New Method:** Use the unified bootstrap script for automated, idempotent configuration.
1. **Transfer Bootstrap Script to Watchtower:**
**Option A: From local repository (if cloned on workstation):**
```bash
# From your workstation
scp -r homelab/scripts chester@10.0.0.200:~/
```
**Option B: Direct clone on Watchtower:**
```bash
# SSH to Watchtower
ssh chester@10.0.0.200
# Shallow clone (latest commit only; the full working tree is still checked out)
git clone --depth=1 https://git.castaldifamily.com/nathan/homelab.git
cd homelab/scripts
```
2. **Run Unified Bootstrap Script:**
```bash
# Auto-detect and configure (Raspberry Pi will be detected)
./bootstrap.sh
# The script will:
# - Detect Raspberry Pi hardware
# - Configure static IP (10.0.0.200)
# - Install Docker with Debian Trixie compatibility
# - Install Ansible and proxmoxer
# - Generate ED25519 SSH keys
# - Run comprehensive validation
# - Generate hardware fingerprint
```
**⚠️ Important:** The SSH connection will drop during network reconfiguration.
Reconnect after about 10 seconds (allow up to 30 seconds on a slow link):
```bash
ssh chester@10.0.0.200
```
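Rather than guessing how long the network takes to come back, the reconnect can be retried in a loop. This is a minimal sketch of a generic retry helper. The name `retry` and its argument order are hypothetical and not part of `bootstrap.sh`. Example use: `retry 6 5 ssh -o ConnectTimeout=5 chester@10.0.0.200 true` makes up to 6 attempts, 5 seconds apart.

```bash
# retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or ATTEMPTS is exhausted.
# Returns 0 on the first success, 1 if every attempt failed.
retry() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    # Sleep between attempts, but not after the final one
    [ "$i" -lt "$attempts" ] && sleep "$delay"
    i=$((i + 1))
  done
  return 1
}
```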
3. **Verify Bootstrap Success:**
```bash
# After reconnecting
cd ~/homelab/scripts # Use ~/scripts instead if you copied the scripts via Option A
# Check validation report
cat ../ansible/archive/outputs/bootstrap-validation-watchtower-*.log
# Verify installations
docker --version # Should show Docker 24.x or newer
ansible --version # Should show ansible [core 2.x.x]
# Check SSH key
ls -lh ~/.ssh/id_ed25519.pub
cat ~/.ssh/id_ed25519.pub # Copy this for distribution
```
4. **Distribute SSH Keys to Managed Nodes:**
```bash
# The bootstrap script generated keys, now distribute them
# Deploy to Heimdall
ssh-copy-id -i ~/.ssh/id_ed25519.pub chester@10.0.0.151
# Deploy to Waldorf
ssh-copy-id -i ~/.ssh/id_ed25519.pub chester@10.0.0.251
# Deploy to localhost (self-management)
ssh-copy-id -i ~/.ssh/id_ed25519.pub chester@localhost
```
5. **Validate Passwordless Authentication:**
```bash
# Test each node
ssh -i ~/.ssh/id_ed25519 chester@10.0.0.151 "hostname"
# Expected: heimdall
ssh -i ~/.ssh/id_ed25519 chester@10.0.0.251 "hostname"
# Expected: waldorf
ssh -i ~/.ssh/id_ed25519 chester@localhost "hostname"
# Expected: watchtower
```
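The three checks above can be run as one loop. This is a sketch only: the node list repeats the addresses used in this step, while the names `NODES`, `remote_hostname`, and `verify_nodes` are hypothetical. `BatchMode=yes` makes `ssh` fail instead of prompting, so a node that still asks for a password shows up as a failure. Run it with `verify_nodes`.

```bash
# Expected hostname and SSH target for each node, as used in the steps above.
NODES="heimdall=chester@10.0.0.151
waldorf=chester@10.0.0.251
watchtower=chester@localhost"

# remote_hostname TARGET: prints the hostname reported by TARGET.
remote_hostname() {
  ssh -n -o BatchMode=yes -o ConnectTimeout=5 -i ~/.ssh/id_ed25519 "$1" hostname
}

# verify_nodes: prints one OK/FAIL line per node and
# returns non-zero if any node is unreachable or reports an unexpected hostname.
verify_nodes() {
  local entry name target actual failed=0
  for entry in $NODES; do
    name="${entry%%=*}"
    target="${entry#*=}"
    actual="$(remote_hostname "$target" 2>/dev/null)" || actual=""
    if [ "$actual" = "$name" ]; then
      echo "OK   $name"
    else
      echo "FAIL $name (got '${actual:-no response}')"
      failed=1
    fi
  done
  return "$failed"
}
```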
6. **Clone Full Repository (If Not Already Present):**
```bash
cd ~
# If you made a shallow clone earlier, convert it to a full clone in place
git -C homelab fetch --unshallow
# If the scripts were copied via scp instead, clone the repository:
# git clone https://git.castaldifamily.com/nathan/homelab.git
cd homelab
# Unlock secrets (if using git-crypt)
# Transfer key securely via scp from workstation
git-crypt unlock ~/homelab-secrets.key
**Troubleshooting:**
- **Bootstrap fails:** Run with `--dry-run` first to preview actions: `./bootstrap.sh --dry-run`
- **Network doesn't reconnect:** Wait 30 seconds and retry SSH
- **Validation errors:** Review the validation log, address critical errors before proceeding
- **Manual intervention needed:** Use `./validate-node.sh` to re-check after fixes
---
## Core Infrastructure Deployment
### Step 4: Bootstrap and Deploy Core Stack on Heimdall
**Time:** 15-25 minutes (reduced from 20-30 via automation)
**Core Stack Components:**
- Docker Socket Proxy (security boundary)
- Traefik (reverse proxy with automatic SSL)
- Redis (caching layer)
- Komodo Core (container orchestration)
1. **Bootstrap Heimdall Node:**
**Option A: Remote bootstrap from Watchtower (recommended):**
```bash
# From Watchtower control node
cd ~/homelab
# Copy bootstrap script to Heimdall
scp -r scripts chester@10.0.0.151:~/
# SSH and run bootstrap
ssh chester@10.0.0.151 "cd scripts && ./bootstrap.sh --hardware-type docker-vm"
```
**Option B: Direct console access:**
```bash
# Login to Heimdall directly
ssh chester@10.0.0.151
# Clone repo or copy scripts
git clone --depth=1 https://git.castaldifamily.com/nathan/homelab.git
cd homelab/scripts
# Run bootstrap
./bootstrap.sh --hardware-type docker-vm --target-ip 10.0.0.151
```
2. **Verify Docker Installation:**
```bash
# After bootstrap completes
ssh chester@10.0.0.151
docker --version
docker compose version
docker ps # Should return empty list (no containers yet)
```
3. **Create Komodo Directory Structure:**
```bash
sudo mkdir -p /etc/komodo/{stacks,repos,volumes}
sudo chown -R $USER:$USER /etc/komodo
```
4. **Mount NFS Storage (If Required):**
```bash
# Install NFS client
sudo apt install -y nfs-common
# Create mount point
sudo mkdir -p /mnt/nas
# Add to /etc/fstab (persistent mount); skipped if the entry already exists
grep -qF "10.0.0.250:/Volume1/appdata" /etc/fstab || \
  echo "10.0.0.250:/Volume1/appdata /mnt/nas nfs defaults,nfsvers=3 0 0" | sudo tee -a /etc/fstab
# Mount immediately
sudo mount -a
# Verify mount
df -h | grep nas
```
5. **Clone Repository to Heimdall:**
```bash
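cd ~
# Skip the clone if ~/homelab already exists from the bootstrap step (Option B)
[ -d homelab ] || git clone https://git.castaldifamily.com/nathan/homelab.git
cd homelab
# Unlock secrets if repository uses git-crypt
# (git-crypt must be installed here and the key copied securely from your workstation)
git-crypt unlock ~/homelab-secrets.key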
6. **Deploy Core Stack:**
```bash
cd ~/homelab/nodes/heimdall/core
# Review configuration
cat compose.yaml
file .env.secrets # Verify secrets are decrypted (text type, not "data") without printing them
# Pull images
docker compose pull
# Start services in detached mode
docker compose up -d
# Monitor logs
docker compose logs -f
# Press Ctrl+C to exit log streaming
```
7. **Verify Core Services:**
```bash
# Check running containers
docker ps
# Expected containers:
# - dockerproxy
# - traefik
# - redis
# - komodo-core
# Check health
docker compose ps
# All services should show "running" status
```
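Comparing the running containers against the expected list by eye is error-prone, so the check can be scripted. This is a minimal sketch. The helper name `missing_containers` is hypothetical, and the container names are assumed to match the list above, which depends on `container_name` being set in `compose.yaml`. Example use: `docker ps --format '{{.Names}}' | missing_containers dockerproxy traefik redis komodo-core`.

```bash
# missing_containers NAME...
# Reads running container names on stdin (one per line) and prints each
# expected NAME that is absent. Returns non-zero if anything is missing.
missing_containers() {
  local running name missing=0
  running="$(cat)"
  for name in "$@"; do
    if ! printf '%s\n' "$running" | grep -qxF -- "$name"; then
      echo "$name"
      missing=1
    fi
  done
  return "$missing"
}
```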
---
## Validation & Health Checks
### Step 5: Service Verification
**Time:** 15-20 minutes
1. **Test Internal Connectivity:**
```bash
# From Heimdall
# Test Komodo Core
curl -I http://localhost:9000
# Expected: HTTP/1.1 200 OK
# Test Redis
docker exec -it redis redis-cli ping
# Expected: PONG
# Test Docker Socket Proxy
curl http://localhost:2375/version
# Expected: JSON response with Docker version
```
2. **Test External Access (From Workstation):**
```bash
# Test Traefik dashboard (if exposed)
curl -I https://traefik.castaldifamily.com
# Test Komodo Core UI
curl -I https://komodo.castaldifamily.com
# Expected: HTTP/2 200
```
3. **Verify Traefik SSL Certificates:**
```bash
# SSH to Heimdall
ssh chester@10.0.0.151
# Check Traefik logs for ACME certificate retrieval
docker logs traefik 2>&1 | grep -i "certificate"
# Verify cert storage
ls -lh /etc/komodo/volumes/traefik/acme.json
```
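Traefik is known to reject a file-based ACME store whose permissions are more open than `600`, so the storage check can go one step beyond listing the file. The sketch below assumes GNU `stat` (as on Debian/Ubuntu), and the helper name `check_acme_store` is hypothetical. A non-empty file with mode `600` is a necessary condition only; it does not prove a certificate was issued. Example use: `check_acme_store /etc/komodo/volumes/traefik/acme.json` (prefix with `sudo` if the directory is not readable by your user).

```bash
# check_acme_store FILE
# Succeeds only if FILE exists, is non-empty, and has mode 600.
check_acme_store() {
  local file="$1" mode
  [ -s "$file" ] || { echo "missing or empty: $file" >&2; return 1; }
  mode="$(stat -c '%a' "$file")" # GNU stat syntax
  [ "$mode" = "600" ] || { echo "mode is $mode, expected 600: $file" >&2; return 1; }
}
```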
4. **Komodo Core Initial Configuration:**
- Navigate to `https://komodo.castaldifamily.com` in browser
- Complete first-time setup wizard
- Create admin account
- Add server nodes (Heimdall, Waldorf, Watchtower)
---
## Post-Deployment Configuration
### Step 6: Configure GitOps Integration
**Time:** 20-25 minutes
1. **Install Komodo Periphery on Remote Nodes:**
**On Waldorf (10.0.0.251):**
```bash
ssh chester@10.0.0.251
# Install Docker
curl -fsSL https://get.docker.com -o get-docker.sh
sudo sh get-docker.sh
sudo usermod -aG docker $USER # Log out and back in for the group change to take effect
# Create Komodo directory
sudo mkdir -p /etc/komodo/{stacks,repos}
sudo chown -R $USER:$USER /etc/komodo
# Deploy Periphery (via Komodo UI or manually)
# See Komodo documentation for Periphery setup
```
**On Watchtower (10.0.0.200):**
```bash
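# Docker is already installed here by bootstrap.sh (Step 3), so skip the Docker install
# Repeat only the Komodo directory and Periphery steps from Waldorf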
```
2. **Configure Repository Cloning in Komodo:**
In Komodo UI:
- Navigate to **Settings** → **Git Providers**
- Add Gitea provider:
  - **URL:** `https://git.castaldifamily.com`
  - **Token:** Generate from Gitea Settings → Applications
- Test connection
3. **Create Git-Linked Stacks:**
For each service (Plex, Tunarr, etc.):
- Navigate to **Stacks** → **New Stack**
- Select **Git Repository** as source
- Configure:
  - **Repo:** `nathan/homelab`
  - **Branch:** `main`
  - **Path:** `nodes/{node-name}/{service-name}`
  - **Compose File:** `compose.yaml`
- Enable **Auto-Deploy on Push**
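
To enumerate the values that fit the **Path** field, the repository can be scanned for directories that follow the `nodes/{node-name}/{service-name}` layout and contain a `compose.yaml`. This is a minimal sketch, and the helper name `list_stack_paths` is hypothetical. Example use: `list_stack_paths ~/homelab`.

```bash
# list_stack_paths [REPO_ROOT]
# Prints every nodes/<node>/<service> directory under REPO_ROOT (default: .)
# that contains a compose.yaml, one per line, sorted.
list_stack_paths() {
  local root="${1:-.}"
  (
    cd "$root" || exit 1
    find nodes -mindepth 3 -maxdepth 3 -type f -name compose.yaml \
      | sed 's|/compose.yaml$||' \
      | sort
  )
}
```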
4. **Configure Gitea Webhooks:**
In Gitea repository settings:
- Navigate to **Settings** → **Webhooks**
- Add webhook:
  - **URL:** `https://komodo.castaldifamily.com/api/webhook/pull-stack/{stack-id}`
  - **Secret:** From Komodo stack configuration
  - **Events:** Push events only
  - **Active:** Enabled
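
The webhook secret is what lets the receiver trust a delivery. As a sketch of the mechanism (not of Komodo's internals): Gitea signs each request body with HMAC-SHA256 using the configured secret and sends the hex digest in the `X-Gitea-Signature` header. The hypothetical helper `sign_payload` below computes the same digest with `openssl`, which can help when comparing against a delivery shown in Gitea's webhook history. Which header Komodo validates is an assumption to confirm against its documentation.

```bash
# sign_payload SECRET PAYLOAD
# Prints the hex HMAC-SHA256 of PAYLOAD keyed with SECRET.
sign_payload() {
  # openssl prefixes the digest with a label such as "(stdin)= "; keep the last field
  printf '%s' "$2" | openssl dgst -sha256 -hmac "$1" | awk '{print $NF}'
}
```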
---
## Troubleshooting
### Common Issues
**Issue:** `git-crypt unlock` fails with "File is not encrypted"
**Resolution:**
- Verify you're in the correct repository directory
- Check if repository is actually using git-crypt: `git-crypt status`
- Ensure `.gitattributes` file exists and defines encryption rules
---
**Issue:** SSH key authentication fails to nodes
**Resolution:**
```bash
# Verify key permissions
ls -lh ~/.ssh/id_ed25519
# Should be: -rw------- (600)
# Test manual SSH with verbose logging
ssh -vvv -i ~/.ssh/id_ed25519 chester@10.0.0.151
# Check authorized_keys on target node
ssh chester@10.0.0.151 "cat ~/.ssh/authorized_keys"
```
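If the permissions turn out to be wrong, they can be reset in one step. The sketch below applies the conventional modes (`700` on the directory, `600` on the private key and `authorized_keys`) that the OpenSSH client and the server's `StrictModes` check expect. The helper name `fix_ssh_perms` is hypothetical. Run `fix_ssh_perms` on the control node, and on the target node if `authorized_keys` is the problem.

```bash
# fix_ssh_perms [SSH_DIR]
# Resets permissions on SSH_DIR (default: ~/.ssh) and the key files inside it.
fix_ssh_perms() {
  local dir="${1:-$HOME/.ssh}"
  chmod 700 "$dir" || return 1
  [ -f "$dir/id_ed25519" ] && chmod 600 "$dir/id_ed25519"
  [ -f "$dir/authorized_keys" ] && chmod 600 "$dir/authorized_keys"
  return 0
}
```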
---
**Issue:** Docker Compose fails with "network not found"
**Resolution:**
```bash
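# Remove stale, unused networks, then let Compose recreate the stack's networks
docker network prune -f
docker compose up -d --force-recreate
# If compose.yaml marks the network as external, create it first: docker network create <name>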
```
---
**Issue:** NFS mount fails with "Operation not permitted"
**Resolution:**
```bash
# Check NFS server exports
showmount -e 10.0.0.250
# Force NFSv3 (avoid ID mapping issues)
sudo mount -t nfs -o nfsvers=3 10.0.0.250:/Volume1/appdata /mnt/nas
# Update fstab with explicit version
# 10.0.0.250:/Volume1/appdata /mnt/nas nfs defaults,nfsvers=3 0 0
```
---
## Emergency Rollback
### Complete Stack Teardown
If deployment fails and rollback is required:
```bash
# On Heimdall
cd ~/homelab/nodes/heimdall/core
# Choose ONE of the following:
# Option 1: stop and remove containers, preserving data
docker compose down
# Option 2: also remove volumes (DESTRUCTIVE, data is lost)
# docker compose down -v
# Remove repository clone
cd ~
rm -rf homelab
### Restore Previous State
```bash
# Re-clone repository at specific commit
git clone https://git.castaldifamily.com/nathan/homelab.git
cd homelab
git checkout {commit-hash} # Hash before failed deployment
# Unlock secrets and redeploy
git-crypt unlock ~/homelab-secrets.key
cd nodes/heimdall/core
docker compose up -d
```
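Finding a value for `{commit-hash}` can be scripted. The sketch below assumes the failed deployment corresponds to the most recent commit that touched the stack directory, and returns the commit before it. The helper name `previous_commit_for` is hypothetical. Review `git log --oneline -- nodes/heimdall/core` first if that assumption may not hold. Example use from the repository root: `git checkout "$(previous_commit_for nodes/heimdall/core)"`.

```bash
# previous_commit_for PATH
# Prints the hash of the second-most-recent commit that touched PATH,
# i.e. the state of PATH before its latest change. Prints nothing if
# PATH has fewer than two commits.
previous_commit_for() {
  git log -n 2 --format=%H -- "$1" | sed -n '2p'
}
```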
---
## Success Criteria
Deployment is **complete** when:
- [ ] All core services running on Heimdall (Komodo, Traefik, Redis, Docker Proxy)
- [ ] Komodo Periphery agents connected on Waldorf and Watchtower
- [ ] Traefik SSL certificates issued and valid
- [ ] Komodo UI accessible at `https://komodo.castaldifamily.com`
- [ ] Git-linked stacks successfully pull from Gitea
- [ ] Webhooks trigger automatic deployments on push
- [ ] NFS mounts stable across all nodes
- [ ] Ansible control node (Watchtower) can execute playbooks against all nodes
---
## Next Steps
After successful deployment:
1. **Deploy Application Stacks:**
- Use [SOP-001: Migrate Stack from UI to Git](SOP-001-Migrate-Stack-from-UI-to-Git.md) for each service
- Prioritize critical services: Plex, Gitea, Tunarr
2. **Configure Backups:**
- Implement automated Gitea repository backups
- Schedule NFS snapshot retention policy
- Export Komodo configuration regularly
3. **Security Hardening:**
- Enable Traefik authentication for internal services
- Configure fail2ban for SSH protection
- Implement network segmentation (VLANs)
4. **Monitoring & Observability:**
- Deploy Prometheus/Grafana stack
- Configure health check endpoints
- Set up uptime monitoring (Uptime Kuma)
---
## Related Documentation
- [SOP-001: Migrate Stack from UI to Git](SOP-001-Migrate-Stack-from-UI-to-Git.md) - Convert existing services to GitOps
- [KBA-001: Komodo GitOps Deployment Failures](../KBAs/KBA-001-Komodo-GitOps-Stack-Deployment-Failures.md) - Troubleshooting guide
- [plan-ansibleSetup.md](../plans/plan-ansibleSetup.md) - Detailed Ansible control node configuration
- [plan-gitcryptMigration.md](../plans/plan-gitcryptMigration.md) - Comprehensive git-crypt setup guide
- [TECHNICAL_RUNBOOK.md](../TECHNICAL_RUNBOOK.md) - Emergency procedures and reference
---
## Revision History
| Date | Version | Change Description |
|------|---------|-------------------|
| 2026-04-12 | 1.0 | Initial SOP creation |