Add technical runbook & handover documentation

nathan 2026-04-11 22:01:18 -04:00
parent 73132766f2
commit 1311e97dc9


@@ -0,0 +1,271 @@
# Technical Runbook: Castaldi Family Lab
**Status:** ACTIVE & OPERATIONAL
**Last Updated:** April 11, 2026
**Maintainer:** Nathan Castaldi

---
## Table of Contents
1. [Infrastructure Overview](#infrastructure-overview)
2. [Critical Fixes](#critical-fixes)
3. [Lessons Learned](#lessons-learned)
4. [Network Map](#network-map)
5. [Active Tasks](#active-tasks)
6. [Emergency Procedures](#emergency-procedures)
---
## Infrastructure Overview
### Node Inventory
| Node | IP Address | Hardware | Services |
|------|------------|----------|----------|
| **Heimdall** | 10.0.0.151 | Proxmox VM | Komodo Core, Gitea, Traefik |
| **Waldorf** | 10.0.0.XXX | NVIDIA GTX 1060 | Plex, Tunarr |
| **Watchtower** | 10.0.0.200 | Raspberry Pi | Komodo Periphery |
| **TerraMaster** | 10.0.0.250 | NAS | NFS Storage (`/Volume1/appdata`) |
### Repository Structure
```text
/nodes
  /heimdall
    /core      # Komodo Core
    /gitea     # Git Repository Server
  /waldorf
    /plex      # Media Server (NVIDIA Optimized)
    /tunarr    # Channel Management (GPU Passthrough)
  /watchtower
    # Komodo Periphery
```
---
## Critical Fixes
> ⚠️ **DO NOT REVERT THESE CONFIGURATIONS**
### 1. NFS Mount: Watchtower (Raspberry Pi)
**Problem:** Permission Denied on `/mnt/appdata` despite matching UIDs.
**Root Cause:** NFSv4 ID-domain mismatch between Pi and TerraMaster NAS.
**Solution:**
```bash
# /etc/fstab entry (Force NFSv3)
10.0.0.250:/Volume1/appdata /mnt/appdata nfs rw,nfsvers=3,hard,intr,x-systemd.automount,nolock 0 0
```
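To confirm that the share really mounted as NFSv3, the negotiated options can be read back from the kernel's mount table. The sketch below is illustrative only: the function name `nfs_version_of` and its optional second argument (an alternate mounts file, handy for dry runs) are assumptions, not part of the original setup.

```bash
# Print the NFS protocol version negotiated for a mount point.
# $1 = mount point, $2 = mounts table (defaults to /proc/self/mounts)
nfs_version_of() {
  local mount_point="$1" table="${2:-/proc/self/mounts}"
  awk -v mp="$mount_point" '
    # Only look at real NFS entries; the autofs placeholder is skipped.
    $2 == mp && $3 ~ /^nfs/ {
      n = split($4, opts, ",")
      for (i = 1; i <= n; i++)
        if (opts[i] ~ /^(nfs)?vers=/) {
          sub(/^(nfs)?vers=/, "", opts[i])
          print opts[i]
          exit
        }
    }' "$table"
}
# Usage on Watchtower (a result of 3 means the fstab override took effect):
#   nfs_version_of /mnt/appdata
```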
**Mount Point Ownership:**
```bash
# Set ownership WHILE UNMOUNTED
sudo chown chester:chester /mnt/appdata
```
---
### 2. Komodo Periphery Connectivity
**Problem:** The router does not support hairpin NAT, so internal nodes cannot reach `*.castaldifamily.com` addresses.
**Solution:**
- **Core URL (Internal):** `ws://10.0.0.151:9120`
- **Key Paths:** `/config/keys/periphery.pub`
- **Environment Variable:** `file:/config/keys/periphery.pub`
---
### 3. Gitea & GitOps
**Problem:** SSH Key Exchange (Kex) errors on Windows (`diffie-hellman-group1-sha1`).
**Solution:**
```bash
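# Use HTTPS instead of SSH
git clone https://git.castaldifamily.com/nathan/homelab.git
# Windows credential storage (newer Git for Windows installs may ship
# Git Credential Manager instead: credential.helper manager)
git config --global credential.helper wincred
# Normalize line endings: CRLF in the Windows working tree, LF in the repo
git config --global core.autocrlf true
# Trust repositories on network shares. "*" disables the ownership check
# for every repository, so use it only on trusted machines.
git config --global --add safe.directory "*"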
```
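Clones that were created over SSH before this change still point at the SSH remote. A minimal sketch of converting such a remote to HTTPS follows; the helper name `ssh_remote_to_https` is hypothetical, and it assumes the HTTPS endpoint is served on the default port under the same hostname.

```bash
# Convert an SSH-style Git remote to its HTTPS equivalent.
# Handles scp-like (git@host:path) and ssh:// URL forms.
ssh_remote_to_https() {
  local url="$1" host path rest
  case "$url" in
    https://*|http://*)
      printf '%s\n' "$url" ;;          # already HTTP(S): leave untouched
    ssh://*)
      url="${url#ssh://}"              # drop the scheme
      url="${url#*@}"                  # drop "user@"
      host="${url%%/*}"
      path="${url#*/}"
      host="${host%%:*}"               # drop an optional ":port"
      printf 'https://%s/%s\n' "$host" "$path" ;;
    *@*:*)
      rest="${url#*@}"
      printf 'https://%s/%s\n' "${rest%%:*}" "${rest#*:}" ;;
    *)
      printf '%s\n' "$url" ;;          # unknown form: leave untouched
  esac
}
# Usage inside an existing clone:
#   git remote set-url origin "$(ssh_remote_to_https "$(git remote get-url origin)")"
```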
---
### 4. GPU Passthrough (Plex/Waldorf)
**Problem:** Plex sees GPU but doesn't use it for hardware transcoding.
**Solution:**
```yaml
# compose.yaml
services:
  plex:
    runtime: nvidia
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
```
**Verification:**
- Monitor Plex Dashboard for `(hw)` status during transcoding.
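
If `(hw)` never appears, one possible cause (not confirmed for this lab) is that the container's `NVIDIA_DRIVER_CAPABILITIES` value omits `video`, which the NVIDIA container runtime uses to expose the encode/decode libraries. The sketch below checks a capabilities string; the function name `caps_allow_video` is hypothetical.

```bash
# Succeeds when a comma-separated capabilities string permits
# video encode/decode (either "video" or "all" is present).
caps_allow_video() {
  local caps="$1" cap
  local IFS=','
  for cap in $caps; do
    cap="${cap// /}"                   # tolerate spaces after commas
    case "$cap" in
      video|all) return 0 ;;
    esac
  done
  return 1
}
# Usage on Waldorf (container name "plex" as used elsewhere in this runbook):
#   caps_allow_video "$(docker exec plex printenv NVIDIA_DRIVER_CAPABILITIES)" \
#     || echo "video capability missing"
```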
---
## Lessons Learned
### "NFSv4 is too smart"
NFSv4 identifies users by `name@domain` strings rather than raw UID numbers. If the Pi and the NAS do not agree on the domain name, the mapping fails and files appear to be owned by `nobody`.
**Fix:** Force NFSv3, which compares only numeric UIDs (here, 1000).

---
### "Naked Mount Point"
If the local folder (`/mnt/appdata`) is owned by `root`, the regular user cannot traverse it to reach the NAS data once the share mounts.
**Fix:** `chown` the mount point to the user **while unmounted**.

---
### "Hairpin NAT"
Many routers will not route internal traffic out to their own public IP and back in again (hairpinning), so public hostnames fail from inside the LAN.
**Fix:** Use **Internal IPs** (`10.0.0.X`) for node-to-node communication.
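
As a guard against a public hostname slipping into node-to-node configuration, an address can be checked against the RFC 1918 private ranges before it is used. This is a loose sketch (it does not validate that each octet is at most 255), and the function name `is_private_ipv4` is hypothetical.

```bash
# Succeeds when the argument is an IPv4 address in an RFC 1918
# private range: 10.0.0.0/8, 172.16.0.0/12, or 192.168.0.0/16.
is_private_ipv4() {
  local ip="$1" a b
  local re='^([0-9]{1,3})\.([0-9]{1,3})\.[0-9]{1,3}\.[0-9]{1,3}$'
  [[ "$ip" =~ $re ]] || return 1       # hostnames and malformed input fail here
  a="${BASH_REMATCH[1]}"
  b="${BASH_REMATCH[2]}"
  # 10# forces base 10 so leading zeros are not read as octal
  if (( 10#$a == 10 )); then return 0; fi
  if (( 10#$a == 172 && 10#$b >= 16 && 10#$b <= 31 )); then return 0; fi
  if (( 10#$a == 192 && 10#$b == 168 )); then return 0; fi
  return 1
}
# Usage:
#   is_private_ipv4 10.0.0.151 && echo "internal address, unaffected by hairpin NAT"
```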
---
### "GPU Passthrough"
Docker isolation is strict. Simply having drivers on the host isn't enough.
**Fix:** Use the `deploy.resources.reservations.devices` block in Compose to "hand the keys" to the hardware over to the container.

---
## Network Map
| Service | Protocol | Internal Address | External URL |
|---------|----------|------------------|--------------|
| **Komodo Core** | HTTP | `10.0.0.151:9000` | `komodo.castaldifamily.com` |
| **Gitea** | HTTPS | `10.0.0.151:3000` | `git.castaldifamily.com` |
| **Plex** | Host Network | `10.0.0.XXX:32400` | `plex.castaldifamily.com` |
| **Tunarr** | HTTP | `10.0.0.XXX:8000` | `tunarr.castaldifamily.com` |
---
## Active Tasks
### Current Focus
1. **Git-ify Stacks**
- ✅ `plex` and `tunarr` pushed to Gitea
- ⏳ Convert remaining "Manual" stacks to "Git" sources in Komodo
2. **Webhooks**
- ⏳ Ensure Gitea Webhooks fire to Komodo Stack URLs for auto-deployment
3. **Hardware Transcoding**
- ⏳ Monitor Waldorf for `(hw)` status in Plex
---
## Emergency Procedures
### 🔥 NFS Mount Failure (Watchtower)
```bash
# Check NFS Server
ping 10.0.0.250
# Remount NFS Share
sudo umount /mnt/appdata
sudo mount -a
# Verify Mount
df -h | grep appdata
```
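Because the fstab entry uses `x-systemd.automount`, an `autofs` placeholder occupies `/mnt/appdata` even while the NFS share itself is not mounted. A check that looks specifically for an `nfs` filesystem type avoids mistaking the placeholder for a healthy mount. The function name `nfs_is_mounted` and its optional mounts-file argument are assumptions made for this sketch.

```bash
# Succeeds only when a real NFS filesystem (not just the autofs
# placeholder) is mounted at the given path.
# $1 = mount point, $2 = mounts table (defaults to /proc/self/mounts)
nfs_is_mounted() {
  local mount_point="$1" table="${2:-/proc/self/mounts}"
  awk -v mp="$mount_point" '
    $2 == mp && $3 ~ /^nfs/ { found = 1 }
    END { exit !found }' "$table"
}
# Usage on Watchtower:
#   nfs_is_mounted /mnt/appdata || echo "NFS share is not mounted"
```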
---
### 🔥 Komodo Periphery Offline
```bash
# Check Core Connectivity (many curl builds reject ws:// URLs, so probe the same port over HTTP)
curl -v http://10.0.0.151:9120
# Restart Periphery Container
docker restart komodo-periphery
docker logs -f komodo-periphery
```
---
### 🔥 Plex Not Using GPU
```bash
# Verify NVIDIA Runtime
docker info | grep -i nvidia
# Check GPU Access in Container
docker exec -it plex nvidia-smi
```
---
### 🔥 Git Authentication Failure
```bash
# Regenerate Gitea App Token
# Settings > Applications > Generate New Token
# Update Credential Helper
git config --global credential.helper wincred
# Test Clone
git clone https://git.castaldifamily.com/nathan/homelab.git
```
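Before re-entering credentials everywhere, it can help to confirm that the regenerated token itself works. The sketch below queries Gitea's authenticated-user API endpoint and interprets the HTTP status. `GITEA_TOKEN` is a hypothetical environment variable holding the app token, both function names are illustrative, and the 403 interpretation is a typical case rather than a guarantee.

```bash
# Map an HTTP status from the Gitea API to a short diagnosis.
explain_token_status() {
  case "$1" in
    200) echo "token accepted" ;;
    401) echo "token rejected: regenerate it" ;;
    403) echo "token recognized but lacks a required scope" ;;
    000) echo "no HTTP response: check DNS and network" ;;
    *)   echo "unexpected status $1" ;;
  esac
}

# Call the authenticated-user endpoint and print only the status code.
# curl reports 000 when no HTTP response was received at all.
gitea_token_status() {
  curl -s -o /dev/null -w '%{http_code}' \
    -H "Authorization: token ${GITEA_TOKEN}" \
    "https://git.castaldifamily.com/api/v1/user"
}
# Usage:
#   GITEA_TOKEN=... ; explain_token_status "$(gitea_token_status)"
```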
---
## Credential Management
- ❌ **DO NOT** store passwords in any `compose.yaml` committed to the Git repo
- ✅ **DO** use Komodo Stack "Environment Variables" to inject secrets
- ✅ **DO** use Gitea **App Tokens** for Git authentication (iPad/Windows)
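
A quick scan before committing can catch a secret that was typed straight into a Compose file. The sketch below is a rough heuristic, not a complete secret scanner: the function name `find_inline_secrets` and the keyword list are assumptions, and values written as `${VAR}` references are treated as safe.

```bash
# List lines (with line numbers) in a Compose file that appear to
# hardcode a secret. Returns non-zero when nothing suspicious is found.
find_inline_secrets() {
  # keyword, optional suffix, ":" or "=", optional quote, then any
  # character that does not start a ${VAR} reference
  local pattern="(password|secret|token|api_key)[a-z0-9_]*[[:space:]]*[:=][[:space:]]*[\"']?[^[:space:]\$\"']"
  grep -nEi "$pattern" "$1"
}
# Usage from the repository root:
#   find_inline_secrets nodes/waldorf/plex/compose.yaml && echo "move these into Komodo"
```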
---
## Maintenance Schedule
| Task | Frequency | Notes |
|------|-----------|-------|
| Update Docker Images | Weekly | Via Komodo or Watchtower |
| Backup Gitea | Weekly | `/data/gitea` directory |
| Backup Plex Metadata | Monthly | `/config/Library` directory |
| Check NFS Mount Health | Monthly | `df -h`, verify permissions |
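
The backup rows above could be scripted along the following lines. This is a sketch under assumptions: the function name `backup_dir` and the destination path in the usage comment are hypothetical, and archiving a live Gitea data directory may capture files mid-write, so stopping the container first (or using Gitea's own `gitea dump` command) is likely the safer route.

```bash
# Archive a directory into a dated tarball and print the archive path.
# $1 = directory to back up, $2 = destination directory
backup_dir() {
  local src="${1%/}" dest="$2" archive
  archive="$dest/$(basename "$src")-$(date +%F).tar.gz"
  mkdir -p "$dest" || return 1
  # -C keeps paths inside the archive relative to the parent directory
  tar -czf "$archive" -C "$(dirname "$src")" "$(basename "$src")" || return 1
  printf '%s\n' "$archive"
}
# Usage (destination path is hypothetical):
#   backup_dir /data/gitea /mnt/appdata/backups
```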
---
**End of Runbook**