diff --git a/documentation/TECHNICAL_RUNBOOK.md b/documentation/TECHNICAL_RUNBOOK.md
new file mode 100644
index 0000000..5cb87dd
--- /dev/null
+++ b/documentation/TECHNICAL_RUNBOOK.md
@@ -0,0 +1,271 @@
+# Technical Runbook: Castaldi Family Lab
+
+**Status:** ACTIVE & OPERATIONAL
+**Last Updated:** April 11, 2026
+**Maintainer:** Nathan Castaldi
+
+---
+
+## Table of Contents
+
+1. [Infrastructure Overview](#infrastructure-overview)
+2. [Critical Fixes](#critical-fixes)
+3. [Lessons Learned](#lessons-learned)
+4. [Network Map](#network-map)
+5. [Active Tasks](#active-tasks)
+6. [Emergency Procedures](#emergency-procedures)
+
+---
+
+## Infrastructure Overview
+
+### Node Inventory
+
+| Node | IP Address | Hardware | Services |
+|------|------------|----------|----------|
+| **Heimdall** | 10.0.0.151 | Proxmox VM | Komodo Core, Gitea, Traefik |
+| **Waldorf** | 10.0.0.XXX | NVIDIA GTX 1060 | Plex, Tunarr |
+| **Watchtower** | 10.0.0.200 | Raspberry Pi | Komodo Periphery |
+| **TerraMaster** | 10.0.0.250 | NAS | NFS Storage (`/Volume1/appdata`) |
+
+### Repository Structure
+
+```text
+/nodes
+  /heimdall
+    /core      # Komodo Core
+    /gitea     # Git Repository Server
+  /waldorf
+    /plex      # Media Server (NVIDIA Optimized)
+    /tunarr    # Channel Management (GPU Passthrough)
+  /watchtower
+    # Komodo Periphery
+```
+
+---
+
+## Critical Fixes
+
+> ⚠️ **DO NOT REVERT THESE CONFIGURATIONS**
+
+### 1. NFS Mount: Watchtower (Raspberry Pi)
+
+**Problem:** Permission denied on `/mnt/appdata` despite matching UIDs.
+
+**Root Cause:** NFSv4 ID-domain mismatch between the Pi and the TerraMaster NAS.
+
+**Solution:**
+
+```bash
+# /etc/fstab entry (Force NFSv3)
+10.0.0.250:/Volume1/appdata /mnt/appdata nfs rw,nfsvers=3,hard,intr,x-systemd.automount,nolock 0 0
+```
+
+**Mount Point Ownership:**
+
+```bash
+# Set ownership WHILE UNMOUNTED
+sudo chown chester:chester /mnt/appdata
+```
+
+---
+
+### 2. Komodo Periphery Connectivity
+
+**Problem:** Hairpin NAT prevents `*.castaldifamily.com` access from internal nodes.
+
+**Solution:**
+
+- **Core URL (Internal):** `ws://10.0.0.151:9120`
+- **Key Paths:** `/config/keys/periphery.pub`
+- **Environment Variable:** `file:/config/keys/periphery.pub`
+
+---
+
+### 3. Gitea & GitOps
+
+**Problem:** SSH Key Exchange (Kex) errors on Windows (`diffie-hellman-group1-sha1`).
+
+**Solution:**
+
+```bash
+# Use HTTPS instead of SSH
+git clone https://git.castaldifamily.com/nathan/homelab.git
+
+# Windows Credential Storage
+git config --global credential.helper wincred
+
+# Cross-Platform Line Endings
+git config --global core.autocrlf true
+
+# Trust repositories on network shares (skips Git's ownership check for every path)
+git config --global --add safe.directory "*"
+```
+
+---
+
+### 4. GPU Passthrough (Plex/Waldorf)
+
+**Problem:** Plex sees the GPU but doesn't use it for hardware transcoding.
+
+**Solution:**
+
+```yaml
+# compose.yaml
+services:
+  plex:
+    runtime: nvidia
+    deploy:
+      resources:
+        reservations:
+          devices:
+            - driver: nvidia
+              count: all
+              capabilities: [gpu]
+```
+
+**Verification:**
+
+- Monitor the Plex Dashboard for `(hw)` status during transcoding.
+
+---
+
+## Lessons Learned
+
+### "NFSv4 is too smart"
+
+Modern NFS (v4) maps user identities by name across a "Domain" instead of trusting raw UID numbers. If the Pi and the NAS don't agree on the domain name, file ownership falls back to `nobody`.
+
+**Fix:** Force NFSv3, which only compares numeric UIDs (1000 here).
+
+---
+
+### "Naked Mount Point"
+
+If the local folder (`/mnt/appdata`) is owned by `root`, you can't "pass through" to see NAS data once it mounts.
+
+**Fix:** `chown` the mount point to the user **while unmounted**.
+
+---
+
+### "Hairpin NAT"
+
+Many routers won't let internal traffic go out to a public IP and then back in (Hairpinning).
+
+**Fix:** Use **Internal IPs** (`10.0.0.X`) for node-to-node communication.
+
+---
+
+### "GPU Passthrough"
+
+Docker isolation is strict. Simply having drivers on the host isn't enough.
+
+**Fix:** Use the `deploy: resources: reservations` block in Compose to "hand the keys" of the hardware to the container.
+
+---
+
+## Network Map
+
+| Service | Protocol | Internal Address | External URL |
+|---------|----------|------------------|--------------|
+| **Komodo Core** | HTTP | `10.0.0.151:9000` | `komodo.castaldifamily.com` |
+| **Gitea** | HTTPS | `10.0.0.151:3000` | `git.castaldifamily.com` |
+| **Plex** | Host Network | `10.0.0.XXX:32400` | `plex.castaldifamily.com` |
+| **Tunarr** | HTTP | `10.0.0.XXX:8000` | `tunarr.castaldifamily.com` |
+
+---
+
+## Active Tasks
+
+### Current Focus
+
+1. **Git-ify Stacks**
+   - ✅ `plex` and `tunarr` pushed to Gitea
+   - ⏳ Convert remaining "Manual" stacks to "Git" sources in Komodo
+
+2. **Webhooks**
+   - ⏳ Ensure Gitea Webhooks fire to Komodo Stack URLs for auto-deployment
+
+3. **Hardware Transcoding**
+   - ⏳ Monitor Waldorf for `(hw)` status in Plex
+
+---
+
+## Emergency Procedures
+
+### 🔥 NFS Mount Failure (Watchtower)
+
+```bash
+# Check NFS Server
+ping 10.0.0.250
+
+# Remount NFS Share
+sudo umount /mnt/appdata
+sudo mount -a
+
+# Verify Mount
+df -h | grep appdata
+```
+
+---
+
+### 🔥 Komodo Periphery Offline
+
+```bash
+# Check Core Connectivity (HTTP probe; many curl builds do not support ws:// URLs)
+curl -v http://10.0.0.151:9120
+
+# Restart Periphery Container
+docker restart komodo-periphery
+docker logs -f komodo-periphery
+```
+
+---
+
+### 🔥 Plex Not Using GPU
+
+```bash
+# Verify NVIDIA Runtime
+docker info | grep -i nvidia
+
+# Check GPU Access in Container
+docker exec -it plex nvidia-smi
+```
+
+---
+
+### 🔥 Git Authentication Failure
+
+```bash
+# Regenerate Gitea App Token
+# Settings > Applications > Generate New Token
+
+# Update Credential Helper
+git config --global credential.helper wincred
+
+# Test Clone
+git clone https://git.castaldifamily.com/nathan/homelab.git
+```
+
+---
+
+## Credential Management
+
+- ❌ **DO NOT** store passwords in `compose.yaml` in the Git repo
+- ✅ **DO** use Komodo Stack "Environment Variables" to inject secrets
+- ✅ **DO** use Gitea **App Tokens** for Git authentication (iPad/Windows)
+
+---
+
+## Maintenance Schedule
+
+| Task | Frequency | Notes |
+|------|-----------|-------|
+| Update Docker Images | Weekly | Via Komodo or Watchtower |
+| Backup Gitea | Weekly | `/data/gitea` directory |
+| Backup Plex Metadata | Monthly | `/config/Library` directory |
+| Check NFS Mount Health | Monthly | `df -h`, verify permissions |
+
+---
+
+**End of Runbook**