Add technical runbook & handover documentation
parent 73132766f2
commit 1311e97dc9
271
documentation/TECHNICAL_RUNBOOK.md
Normal file
@ -0,0 +1,271 @@
# Technical Runbook: Castaldi Family Lab

**Status:** ACTIVE & OPERATIONAL
**Last Updated:** April 11, 2026
**Maintainer:** Nathan Castaldi

---

## Table of Contents

1. [Infrastructure Overview](#infrastructure-overview)
2. [Critical Fixes](#critical-fixes)
3. [Lessons Learned](#lessons-learned)
4. [Network Map](#network-map)
5. [Active Tasks](#active-tasks)
6. [Emergency Procedures](#emergency-procedures)
7. [Credential Management](#credential-management)
8. [Maintenance Schedule](#maintenance-schedule)

---

## Infrastructure Overview

### Node Inventory

| Node | IP Address | Hardware | Services |
|------|------------|----------|----------|
| **Heimdall** | 10.0.0.151 | Proxmox VM | Komodo Core, Gitea, Traefik |
| **Waldorf** | 10.0.0.XXX | NVIDIA GTX 1060 | Plex, Tunarr |
| **Watchtower** | 10.0.0.200 | Raspberry Pi | Komodo Periphery |
| **TerraMaster** | 10.0.0.250 | NAS | NFS Storage (`/Volume1/appdata`) |

### Repository Structure

```text
/nodes
  /heimdall
    /core       # Komodo Core
    /gitea      # Git Repository Server
  /waldorf
    /plex       # Media Server (NVIDIA Optimized)
    /tunarr     # Channel Management (GPU Passthrough)
  /watchtower
    # Komodo Periphery
```

---

## Critical Fixes

> ⚠️ **DO NOT REVERT THESE CONFIGURATIONS**

### 1. NFS Mount: Watchtower (Raspberry Pi)

**Problem:** Permission Denied on `/mnt/appdata` despite matching UIDs.

**Root Cause:** NFSv4 ID-domain mismatch between Pi and TerraMaster NAS.

**Solution:**

```bash
# /etc/fstab entry (Force NFSv3)
10.0.0.250:/Volume1/appdata /mnt/appdata nfs rw,nfsvers=3,hard,intr,x-systemd.automount,nolock 0 0
```

**Mount Point Ownership:**

```bash
# Set ownership WHILE UNMOUNTED
sudo chown chester:chester /mnt/appdata
```

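
To confirm that the share is still mounted as NFSv3 (for example after a reboot or an fstab edit), the active mount options can be inspected. This is a minimal sketch, assuming `findmnt` from util-linux is available on the Pi; the function name `nfs_version_from_opts` is illustrative and not part of the original setup.

```bash
#!/usr/bin/env bash
# Print the NFS protocol version found in a comma-separated mount-option string.
# Prints nothing and returns 1 when no vers=/nfsvers= option is present.
nfs_version_from_opts() {
  local opts="$1" opt
  local IFS=','          # split the option string on commas
  for opt in $opts; do
    case "$opt" in
      vers=*|nfsvers=*)
        printf '%s\n' "${opt#*=}"
        return 0
        ;;
    esac
  done
  return 1
}

# Example (run on Watchtower while the share is mounted):
#   nfs_version_from_opts "$(findmnt -n -t nfs,nfs4 -o OPTIONS /mnt/appdata)"
```
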
---

### 2. Komodo Periphery Connectivity

**Problem:** Hairpin NAT prevents `*.castaldifamily.com` access from internal nodes.

**Solution:** Point Periphery at Core's internal address instead of the public hostname.

- **Core URL (Internal):** `ws://10.0.0.151:9120`
- **Key Path:** `/config/keys/periphery.pub`
- **Environment Variable (value):** `file:/config/keys/periphery.pub`

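
Whether an address is safe for node-to-node use comes down to whether it is a private (RFC 1918) address; if a hostname resolves to a public address from inside the LAN, the connection depends on hairpin NAT. The helper below is a rough sketch of that check. The name `is_private_ipv4` is hypothetical, and the `getent` lookup in the usage comment assumes a glibc-based system.

```bash
#!/usr/bin/env bash
# Return 0 if the IPv4 address is in an RFC 1918 private range,
# 1 if it is a valid address outside those ranges, 2 if it is malformed.
is_private_ipv4() {
  local ip="$1" a b c d octet
  IFS='.' read -r a b c d <<< "$ip"
  # Reject anything that is not four dot-separated decimal octets.
  for octet in "$a" "$b" "$c" "$d"; do
    [[ "$octet" =~ ^[0-9]{1,3}$ ]] && (( 10#$octet <= 255 )) || return 2
  done
  (( 10#$a == 10 )) && return 0                                  # 10.0.0.0/8
  (( 10#$a == 172 && 10#$b >= 16 && 10#$b <= 31 )) && return 0   # 172.16.0.0/12
  (( 10#$a == 192 && 10#$b == 168 )) && return 0                 # 192.168.0.0/16
  return 1
}

# Example: check what a public hostname resolves to from an internal node.
#   ip="$(getent ahostsv4 komodo.castaldifamily.com | awk 'NR==1 {print $1}')"
#   is_private_ipv4 "$ip" || echo "resolves to a public IP; use 10.0.0.X instead"
```
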
---

### 3. Gitea & GitOps

**Problem:** SSH key exchange (KEX) errors on Windows (`diffie-hellman-group1-sha1`).

**Solution:**

```bash
# Use HTTPS instead of SSH
git clone https://git.castaldifamily.com/nathan/homelab.git

# Windows credential storage
git config --global credential.helper wincred

# Cross-platform line endings (CRLF on checkout, LF on commit)
git config --global core.autocrlf true

# Network share permissions ("*" disables Git's ownership check for every path)
git config --global safe.directory "*"
```

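
If the KEX error needs to be diagnosed rather than bypassed, it can help to see which key-exchange algorithms the local OpenSSH client is configured to offer. This is a minimal sketch, assuming an OpenSSH client whose `ssh -G` output includes a `kexalgorithms` line; `kex_enabled` is a hypothetical helper, and what the Gitea server itself offers cannot be determined from this runbook.

```bash
#!/usr/bin/env bash
# Read `ssh -G <host>` output on stdin and return 0 if the named
# key-exchange algorithm is in the client's effective kexalgorithms list.
kex_enabled() {
  local algo="$1" key value
  while read -r key value; do
    if [ "$key" = "kexalgorithms" ]; then
      # Wrap in commas so "group1" does not match "group14".
      case ",$value," in
        *",$algo,"*) return 0 ;;
      esac
    fi
  done
  return 1
}

# Example:
#   ssh -G git.castaldifamily.com | kex_enabled diffie-hellman-group1-sha1 \
#     || echo "client does not offer diffie-hellman-group1-sha1"
```
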
---

### 4. GPU Passthrough (Plex/Waldorf)

**Problem:** Plex sees GPU but doesn't use it for hardware transcoding.

**Solution:**

```yaml
# compose.yaml
services:
  plex:
    runtime: nvidia
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
```

**Verification:**

- Monitor Plex Dashboard for `(hw)` status during transcoding.

---

## Lessons Learned

### "NFSv4 is too smart"

NFSv4 maps user identities by name within an ID-mapping "domain." If the Pi and the NAS don't agree on the domain name, file owners are mapped to `nobody`.

**Fix:** Force NFSv3, which compares only numeric UIDs (here, 1000).

---

### "Naked Mount Point"

If the local folder (`/mnt/appdata`) is owned by `root`, you can't "pass through" to see NAS data once it mounts.

**Fix:** `chown` the mount point to the user **while unmounted**.

---

### "Hairpin NAT"

Many routers won't let internal traffic go out to a public IP and then back in (hairpinning).

**Fix:** Use **internal IPs** (`10.0.0.X`) for node-to-node communication.

---

### "GPU Passthrough"

Docker isolation is strict. Simply having drivers on the host isn't enough.

**Fix:** Use the `deploy.resources.reservations.devices` block in Compose to "hand the keys" of the hardware to the container.

---

## Network Map

| Service | Protocol | Internal Address | External URL |
|---------|----------|------------------|--------------|
| **Komodo Core** | HTTP | `10.0.0.151:9000` | `komodo.castaldifamily.com` |
| **Gitea** | HTTPS | `10.0.0.151:3000` | `git.castaldifamily.com` |
| **Plex** | Host Network | `10.0.0.XXX:32400` | `plex.castaldifamily.com` |
| **Tunarr** | HTTP | `10.0.0.XXX:8000` | `tunarr.castaldifamily.com` |

---

## Active Tasks

### Current Focus

1. **Git-ify Stacks**
   - ✅ `plex` and `tunarr` pushed to Gitea
   - ⏳ Convert remaining "Manual" stacks to "Git" sources in Komodo

2. **Webhooks**
   - ⏳ Ensure Gitea Webhooks fire to Komodo Stack URLs for auto-deployment

3. **Hardware Transcoding**
   - ⏳ Monitor Waldorf for `(hw)` status in Plex

---

## Emergency Procedures

### 🔥 NFS Mount Failure (Watchtower)

```bash
# Check NFS Server (-c 4 stops after four packets instead of running until Ctrl+C)
ping -c 4 10.0.0.250

# Remount NFS Share
sudo umount /mnt/appdata
sudo mount -a

# Verify Mount
df -h | grep appdata
```

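
Because the fstab entry uses `x-systemd.automount`, an `autofs` placeholder can sit on `/mnt/appdata` even when the NFS share itself is not mounted. The sketch below checks specifically for an NFS filesystem at the mount point by reading `/proc/mounts`; the function name `nfs_mounted_at` and its optional second argument are illustrative assumptions.

```bash
#!/usr/bin/env bash
# Return 0 if an NFS filesystem is mounted at the given mount point.
# The second argument (default /proc/mounts) exists so the check can be
# run against a sample file.
nfs_mounted_at() {
  local target="$1" mounts_file="${2:-/proc/mounts}"
  # /proc/mounts columns: device, mount point, fs type, options, dump, pass.
  awk -v t="$target" \
    '$2 == t && $3 ~ /^nfs4?$/ { found = 1 } END { exit !found }' \
    "$mounts_file"
}

# Example:
#   nfs_mounted_at /mnt/appdata && echo "NFS share is mounted" || echo "NOT mounted"
```
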
---

### 🔥 Komodo Periphery Offline

```bash
# Check Core Connectivity
# (many curl builds reject ws:// URLs, so probe the same port over HTTP)
curl -v http://10.0.0.151:9120

# Restart Periphery Container
docker restart komodo-periphery
docker logs -f komodo-periphery
```

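
When `curl` is not installed on a node, plain TCP reachability of Core can still be tested with bash alone. This is a minimal sketch, assuming bash was built with `/dev/tcp` support (the default on Debian-based systems) and that coreutils `timeout` is present; `tcp_port_open` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Return 0 if a TCP connection to host:port succeeds within the timeout
# (third argument, in seconds, default 3).
tcp_port_open() {
  local host="$1" port="$2" seconds="${3:-3}"
  # The child shell opens the socket on fd 3 and exits, which closes it again.
  timeout "$seconds" bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$host" "$port" 2>/dev/null
}

# Example (from Watchtower):
#   tcp_port_open 10.0.0.151 9120 && echo "Core port reachable" || echo "Core port unreachable"
```
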
---

### 🔥 Plex Not Using GPU

```bash
# Verify NVIDIA Runtime
docker info | grep -i nvidia

# Check GPU Access in Container
docker exec -it plex nvidia-smi
```

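
A plain `grep -i nvidia` also matches unrelated lines such as driver paths. A stricter check, sketched below, looks for a runtime that is actually registered under the name `nvidia`. It assumes `docker info --format '{{json .Runtimes}}'` prints the runtimes as a JSON object keyed by runtime name; `has_nvidia_runtime` is a hypothetical helper.

```bash
#!/usr/bin/env bash
# Read the JSON runtimes map on stdin and return 0 if it has an "nvidia" key.
has_nvidia_runtime() {
  grep -Eq '"nvidia"[[:space:]]*:'
}

# Example (on Waldorf):
#   docker info --format '{{json .Runtimes}}' | has_nvidia_runtime \
#     || echo "nvidia runtime is not registered with Docker"
```
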
---

### 🔥 Git Authentication Failure

```bash
# Regenerate Gitea App Token
# Settings > Applications > Generate New Token

# Update Credential Helper
git config --global credential.helper wincred

# Test Clone
git clone https://git.castaldifamily.com/nathan/homelab.git
```

---

## Credential Management

- ❌ **DO NOT** store passwords in any `compose.yaml` committed to the Git repo
- ✅ **DO** use Komodo Stack "Environment Variables" to inject secrets
- ✅ **DO** use Gitea **App Tokens** for Git authentication (iPad/Windows)

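
Before pushing a stack to Gitea, a quick scan can catch secrets that were typed directly into a Compose file. The sketch below is a heuristic only: the keyword list is an assumption, values written as `${VAR}` interpolation are skipped, and it can produce both false positives and misses. `find_hardcoded_secrets` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Print (with line numbers) lines of a Compose file where a secret-looking key
# is assigned a literal value. Returns 1 when nothing suspicious is found.
find_hardcoded_secrets() {
  local file="$1"
  # Keyword, optional suffix, ":" or "=", then a value that does not start with "$".
  grep -nEi \
    '(password|passwd|secret|token|api_?key)[A-Za-z0-9_]*[[:space:]]*[:=][[:space:]]*[^[:space:]$]' \
    "$file"
}

# Example:
#   find_hardcoded_secrets nodes/waldorf/plex/compose.yaml && echo "review before committing"
```
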
---

## Maintenance Schedule

| Task | Frequency | Notes |
|------|-----------|-------|
| Update Docker Images | Weekly | Via Komodo or Watchtower |
| Backup Gitea | Weekly | `/data/gitea` directory |
| Backup Plex Metadata | Monthly | `/config/Library` directory |
| Check NFS Mount Health | Monthly | `df -h`, verify permissions |

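
The directory backups in the table can be scripted as timestamped archives. This is a rough sketch, not the lab's actual backup job: `backup_dir` and the destination path in the usage comment are hypothetical, and archiving a live data directory may capture an inconsistent state, so stopping the container first (or using Gitea's own `gitea dump` command) is the safer route.

```bash
#!/usr/bin/env bash
# Create a timestamped .tar.gz archive of a directory and print the archive path.
backup_dir() {
  local src="$1" dest_dir="$2"
  local name stamp archive
  name="$(basename "$src")"
  stamp="$(date +%Y%m%d-%H%M%S)"
  archive="$dest_dir/$name-$stamp.tar.gz"
  mkdir -p "$dest_dir"
  # -C switches to the parent directory so the archive stores relative paths.
  tar -czf "$archive" -C "$(dirname "$src")" "$name"
  printf '%s\n' "$archive"
}

# Example (destination is a placeholder):
#   backup_dir /data/gitea /mnt/appdata/backups
```
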
---

**End of Runbook**