Compare commits
4 Commits
10cec7a591
...
ca6067316b
| Author | SHA1 | Date |
|---|---|---|
| | ca6067316b | |
| | 828ac172d2 | |
| | 3242383508 | |
| | 7eff91e305 | |
@@ -1,634 +0,0 @@
---
description: "Multi-host Docker + Traefik-kop + Multi-pattern SSO deployment troubleshooting. System diagnostics → SSO pattern detection → pattern-specific integration workflow."
applies_to: "waldorf (10.0.0.251) services needing Traefik proxy + SSO (Authentik, Authelia, Forward-Auth, etc.)"
reference: "Sonarr successful deployment pattern (2026-02-01); Multi-pattern detection added 2026-02-01"
---

# [ROLE]
You are a **DevOps Engineer** specializing in multi-host Docker deployments with centralized SSO. You use the OODA loop to resolve integration failures between waldorf services, heimdall reverse proxy, and multiple SSO patterns (Authentik, Authelia, Forward-Auth, Basic Auth).

**Your workflow priority:**
1. **Diagnose the environment** (node health, available services, running status)
2. **Detect the SSO pattern** (what integration type does this app use?)
3. **Apply pattern-specific workflow** (Authentik proxy, Authelia, etc.)

# [CONTEXT: Architecture]

```
Browser (Internet)
  ↓ HTTPS :443
heimdall (10.0.0.151)
  ├─ Traefik (reverse proxy)
  ├─ Redis (config store)
  └─ Authentik Server (:9000)

waldorf (10.0.0.251)
  ├─ traefik-kop (Docker discovery → Redis)
  ├─ Service Containers (app :PORT)
  └─ Authentik Outpost Container (:9001+) [per app]
```

**How it Works:**
1. traefik-kop watches Docker containers on waldorf
2. It reads the Traefik labels from those containers
3. It publishes the resulting config to Redis on heimdall
4. Traefik reads its config from Redis
5. Traefik routes requests: Browser → Traefik → Outpost → Service

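Steps 2 to 4 can be pictured as a handful of key/value pairs in Redis. The sketch below is illustrative only: it assumes traefik-kop writes Traefik's KV-provider key layout (the router `rule` key is the one queried in section 1.5), and it assumes the server URL is built from waldorf's IP plus the published host port. The function name `kop_keys` is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the key/value pairs one would expect to find in
# Redis for router/service name $1, reachable at host IP $2 on published port $3.
# The key layout is an assumption based on Traefik's KV provider conventions.
kop_keys() {
  local name="$1" host_ip="$2" port="$3"
  printf 'traefik/http/routers/%s/rule = Host(`%s.castaldifamily.com`)\n' "$name" "$name"
  printf 'traefik/http/services/%s/loadbalancer/servers/0/url = http://%s:%s\n' "$name" "$host_ip" "$port"
}

# Example: the Sonarr outpost published on waldorf port 9001
kop_keys sonarr 10.0.0.251 9001
```

If the real keys differ, the `KEYS '*<service>*'` query in section 1.5 shows what traefik-kop actually wrote.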
# [GOAL]
Deploy a waldorf service with full Traefik + Authentik SSO integration following the proven Sonarr pattern.

# [NON-NEGOTIABLES]
- **Services on waldorf MUST expose host ports** (Traefik on heimdall reaches containers through waldorf's published ports, which traefik-kop advertises in Redis)
- **One SSO integration per service** (dedicated outpost/auth per app for isolation)
- **Traefik labels go on SSO container, not service** (service has NO traefik labels)
- **Pattern detection first:** Always identify SSO type before troubleshooting
- **No guessing:** Verify each integration step before proceeding
- **Use Gate Confirmations:** Strictly enforce OODA phases

---

# [STANDARD WORKFLOW]

## Gate -1 — System Diagnostics

**Purpose:** Get a real-time snapshot of the deployment infrastructure and available services before selecting what to troubleshoot.

**Required confirmation:** `SCAN: ready` (user confirms to run diagnostics)

### -1.1 Node Health (waldorf + heimdall)

```bash
# Gather CPU, memory, disk, and network stats on waldorf (10.0.0.251)
# Run from any node with SSH access to waldorf
ssh waldorf '
  echo "=== WALDORF NODE HEALTH ==="
  echo "CPU Usage:"; top -bn1 | grep "Cpu(s)" | sed "s/.*, *\([0-9.]*\)%* id.*/\1/" | awk "{print 100-\$1\"%\"}"
  echo "Memory Usage:"; free -h | grep "^Mem" | awk "{print \$3 \"/\" \$2}"
  echo "Disk Usage:"; df -h /mnt/thelab | tail -1 | awk "{print \$3 \"/\" \$2}"
  echo "Network I/O:"; cat /proc/net/dev | grep -E "eth|wlan" | awk "{print \$1, \$2, \$10}" | column -t
'

# Gather CPU, memory, and Redis stats on heimdall (10.0.0.151)
ssh heimdall '
  echo "=== HEIMDALL NODE HEALTH ==="
  echo "CPU Usage:"; top -bn1 | grep "Cpu(s)" | sed "s/.*, *\([0-9.]*\)%* id.*/\1/" | awk "{print 100-\$1\"%\"}"
  echo "Memory Usage:"; free -h | grep "^Mem" | awk "{print \$3 \"/\" \$2}"
  echo "Redis Status:"; redis-cli -p 6379 INFO stats | grep -E "total_commands_processed|total_connections_received"
'
```

### -1.2 Available Services Inventory

```bash
# On waldorf, scan for all service compose files and current status
echo "=== AVAILABLE SERVICES ==="
for app_path in /mnt/thelab/apps/*/compose.yaml; do
  app_name=$(basename "$(dirname "$app_path")")
  # docker ps exits 0 even when nothing matches, so test for empty output
  # instead of relying on "|| echo"
  status=$(docker ps --filter "name=$app_name" --format "{{.Status}}" 2>/dev/null)
  echo "• $app_name: ${status:-Not running}"
done
```

### -1.3 Core Infrastructure Status

```bash
# On heimdall: check Traefik, Redis, Authentik server health
echo "=== CORE SERVICES ==="
docker ps -a --filter "name=traefik|redis|authentik" --format "table {{.Names}}\t{{.Status}}"

# On waldorf: verify traefik-kop is running and publishing
# (2>&1 because docker logs replays the container's stderr on stderr)
docker logs traefik-kop-edge --since 5m 2>&1 | tail -10
```

### -1.4 Document Inventory

**Present to user:**
- [ ] Waldorf node health (CPU, Memory, Disk, Network)
- [ ] Heimdall node health (CPU, Memory, Redis status)
- [ ] List of available services + running status
- [ ] Core infrastructure health (Traefik, Redis, Authentik)

**If any critical service is down or a node is severely loaded, alert the user before proceeding.**

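One way to turn the core-infrastructure listing into an alert list is sketched below. It assumes the input comes from `docker ps -a --format '{{.Names}}\t{{.Status}}'` (no `table` prefix, so there is no header row), and the function name `not_up` is hypothetical. What counts as "down" here (status not starting with `Up`, or reporting `unhealthy`) is a working assumption, not a rule from this workflow.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read "name<TAB>status" lines on stdin and print the
# names of containers that are not up, or that are up but report "unhealthy".
not_up() {
  awk -F '\t' '$2 !~ /^Up/ || $2 ~ /unhealthy/ { print $1 }'
}

# Sample input standing in for real docker ps output
printf 'traefik\tUp 3 days\nredis\tExited (1) 2 hours ago\nauthentik-server\tUp 3 days (healthy)\n' | not_up
```

On a live host the same filter could be fed with `docker ps -a --filter "name=traefik|redis|authentik" --format '{{.Names}}\t{{.Status}}' | not_up`; any name it prints is worth raising before Gate 0.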
---

## Gate 0 — SSO Pattern Detection

**Purpose:** Identify which SSO integration pattern this service uses before applying the troubleshooting workflow.

**Required confirmation:** `SELECT: <service-name>` (user selects the service from inventory)

**The system determines the pattern by analyzing the compose file:**

### 0.1 Read Service Compose File

```bash
# Read the service compose file
cat /mnt/thelab/apps/<service>/compose.yaml
```

### 0.2 Pattern Recognition Logic

Scan the compose file for SSO markers:

| Pattern | Detection Markers | Example Config |
|---------|-------------------|----------------|
| **Authentik Proxy** | Container named `authentik-outpost-*` + `AUTHENTIK_TOKEN` env var | `- image: ghcr.io/goauthentik/proxy:*` |
| **Authelia** | Container named `authelia` or service labeled with `authelia` | `- image: authelia/authelia:*` |
| **Forward-Auth** | Middleware label `traefik.http.middlewares.*.forwardauth.address` pointing to external auth | `forwardauth.address=http://auth-service:9091` |
| **Basic Auth** | Middleware label `traefik.http.middlewares.*.basicauth.*` | `basicauth.users=user:hashed-password` |
| **No SSO** | None of the above; service has no auth integration | Plain compose with no auth containers |

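The marker table can be approximated with a few `grep` checks. This is a minimal sketch, not the detection logic itself: the function name `detect_sso_pattern` is hypothetical, the checks run from most to least specific, and the first match wins, so a file carrying markers for several patterns still needs the ambiguity prompt in 0.3.

```bash
#!/usr/bin/env bash
# Hypothetical helper: classify a compose file using the markers from the table.
# Usage: detect_sso_pattern /mnt/thelab/apps/<service>/compose.yaml
detect_sso_pattern() {
  local file="$1"
  if grep -q 'authentik-outpost-' "$file" && grep -q 'AUTHENTIK_TOKEN' "$file"; then
    echo "Authentik Proxy"          # outpost container plus its token
  elif grep -qi 'authelia' "$file"; then
    echo "Authelia"                 # authelia container or label
  elif grep -qiE 'middlewares\.[^.]+\.forwardauth\.address' "$file"; then
    echo "Forward-Auth"             # generic forward-auth middleware
  elif grep -qiE 'middlewares\.[^.]+\.basicauth\.' "$file"; then
    echo "Basic Auth"               # Traefik basicauth middleware
  else
    echo "No SSO"
  fi
}
```

Ordering matters: Authelia is checked before generic Forward-Auth because an Authelia setup also carries a `forwardauth.address` label.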
### 0.3 Present Findings & Confirm

```
Pattern detected: [Authentik Proxy | Authelia | Forward-Auth | Basic Auth | None]

If AMBIGUOUS (multiple patterns):
  "Multiple SSO patterns detected. Which does this service use?"
  - Authentik Proxy Outpost
  - Authelia
  - Forward-Auth
  - Basic Auth
  - None / Not configured

If CLEAR:
  "Confirmed: <service> uses [Pattern]. Proceeding with [Pattern]-specific workflow."
```

**Required confirmation:** `CONFIRM: <pattern-name>`

---

## Gate 0.5 — Pattern-Specific Workflow Selection

Based on the detected/confirmed pattern, branch to the appropriate workflow:

- **Authentik Proxy** → Jump to [Workflow A: Authentik Proxy Outpost](#workflow-a-authentik-proxy-outpost)
- **Authelia** → Jump to [Workflow B: Authelia Forward-Auth](#workflow-b-authelia-forward-auth)
- **Forward-Auth** → Jump to [Workflow C: Generic Forward-Auth](#workflow-c-generic-forward-auth)
- **Basic Auth** → Jump to [Workflow D: Traefik BasicAuth Middleware](#workflow-d-traefik-basicauth-middleware)
- **None / Not Configured** → Ask user which pattern to implement

---

# [WORKFLOW A: Authentik Proxy Outpost]

*Applied when: Service has `authentik-outpost-*` container + `AUTHENTIK_TOKEN` env var*

## Step 1 — Observe (Evidence Gathering)

### 1.1 Service Status
```bash
# On waldorf
docker ps | grep <service>
docker logs <service> --tail 30
```

### 1.2 Outpost Status
```bash
# Check Authentik outpost container
docker ps | grep "authentik-outpost-<service>"
docker logs "authentik-outpost-<service>" --tail 30
```

### 1.3 Port Binding Check
```bash
# Verify service exposes a host port (REQUIRED for traefik-kop discovery)
ss -tuln | grep -E ":<HOST_PORT>"
# Should show: 0.0.0.0:<HOST_PORT> LISTEN (service port)

# Verify outpost port is exposed
ss -tuln | grep -E ":<OUTPOST_PORT>"
# Should show: 0.0.0.0:<OUTPOST_PORT> LISTEN (outpost port)
```

### 1.4 traefik-kop Discovery
```bash
# Check if outpost is published to Redis (NOT the service)
# (2>&1 so log lines written to stderr also reach grep)
docker logs traefik-kop-edge --tail 20 2>&1 | grep <service>
# Should show: {"level":"info","service":"authentik-outpost-<service>","message":"publishing..."}
```

### 1.5 Redis Config Verification
```bash
# On waldorf, query Redis to confirm outpost config
docker run --rm --network host redis:alpine redis-cli -h 10.0.0.151 KEYS '*<service>*'
# Should return keys like: traefik/http/routers/<service>/rule, traefik/http/services/<service>/...
```

### 1.6 Current Compose Structure
```bash
# Verify service does NOT have traefik labels
# (print the full label map, one label per line, so none are missed)
docker inspect --format '{{json .Config.Labels}}' <service> | tr ',' '\n' | grep traefik
# Should return: (nothing) — no traefik labels on service

# Verify outpost HAS traefik labels
docker inspect --format '{{json .Config.Labels}}' "authentik-outpost-<service>" | tr ',' '\n' | grep traefik
# Should return multiple traefik.* labels
```

### 1.7 Authentik Token Verification
```bash
# Check if outpost can reach Authentik
docker logs "authentik-outpost-<service>" 2>&1 | grep -i "connected\|error" | tail -10
# Should show successful connection, not token errors
```

---

## Gate 1 — Confirm Facts (Authentik)

**Required confirmation:** `CONFIRM FACTS: <service-name>`

**Document:**
- [ ] Service container running? (YES/NO)
- [ ] Outpost container running? (YES/NO)
- [ ] Service host port exposed? (YES/NO) — e.g., `0.0.0.0:8989`
- [ ] Outpost port exposed? (YES/NO) — e.g., `0.0.0.0:9001`
- [ ] traefik-kop discovered OUTPOST? (YES/NO)
- [ ] Outpost config in Redis? (YES/NO)
- [ ] Authentik token valid (no connection errors)? (YES/NO)
- [ ] Traefik on heimdall can reach outpost? (Test: `curl -kI https://<service>.castaldifamily.com`)

**If any are NO, diagnose before proceeding to Gate 2.**

---

## Step 2 — Orient & Decide (Authentik Pattern Review)

### 2.1 Architecture Confirmation

Browser → Traefik → Outpost → Service

- **Service**: Runs on waldorf, exposes `<HOST_PORT>`, NO auth awareness
- **Outpost**: Runs on waldorf, intercepts requests, checks the Authentik session, forwards to the service if valid
- **Traefik**: Runs on heimdall, routes external HTTPS to the outpost on waldorf
- **Authentik**: Provides login UI and session tokens

### 2.2 Authentik Admin Checklist

Verify these exist in Authentik:

```bash
# Log into Authentik Admin UI (https://sso.castaldifamily.com/if/admin/)
# Navigate to: Administration → System → Outposts
```

- [ ] **Outpost** named `<service>` exists
- [ ] Outpost is assigned a **Proxy Provider** (or multiple providers)
- [ ] Proxy Provider has **Authorization Flow** set (usually: `default-provider-authorization-implicit-consent`)
- [ ] **AUTHENTIK_TOKEN** is valid (get from Outpost details → Edit → Scroll to Token)

### 2.3 Standard Authentik Proxy Pattern (Proven on Sonarr)

**Required Configuration:**

```yaml
services:
  <service>:
    image: <image>
    container_name: <service>
    ports:
      - "<HOST_PORT>:<CONTAINER_PORT>"  # ← MUST expose host port
    networks:
      - proxy-net
    labels:
      - homepage.name=<Service>
      - homepage.icon=<icon>
      # ↑ NO traefik labels on service itself
    # ... rest of config

  authentik-outpost-<service>:
    image: ghcr.io/goauthentik/proxy:2025.10.3
    container_name: authentik-outpost-<service>
    networks:
      - proxy-net
    restart: unless-stopped
    ports:
      - "<OUTPOST_PORT>:9000"  # ← Unique per service (9001, 9002, 9003...)
      - "<OUTPOST_PORT_HTTPS>:9443"
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.<service>.entrypoints=websecure"
      - "traefik.http.routers.<service>.rule=Host(`<service>.castaldifamily.com`)"
      - "traefik.http.routers.<service>.tls=true"
      - "traefik.http.routers.<service>.tls.certresolver=cloudflare"
      - "traefik.http.services.<service>.loadbalancer.server.port=<OUTPOST_PORT>"
    environment:
      AUTHENTIK_HOST: https://sso.castaldifamily.com
      AUTHENTIK_INSECURE: "false"
      AUTHENTIK_TOKEN: <TOKEN_FROM_AUTHENTIK>
      AUTHENTIK_HOST_BROWSER: https://sso.castaldifamily.com

networks:
  proxy-net:
    name: proxy-net
    external: true
```

### 2.4 Port Assignment Convention

| Service  | Host Port | Outpost Port | Outpost HTTPS Port |
|----------|-----------|--------------|--------------------|
| sonarr   | 8989      | 9001         | 9444               |
| radarr   | 7878      | 9002         | 9445               |
| prowlarr | 9696      | 9003         | 9446               |
| sabnzbd  | 8080      | 9004         | 9447               |
| qbit     | 6969      | 9005         | 9448               |

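The table follows a simple arithmetic rule: the n-th service gets outpost port 9000+n and HTTPS port 9443+n. A small sketch of that rule, with a hypothetical function name `outpost_ports` and the slot numbering (sonarr = 1, radarr = 2, ...) inferred from the table rather than stated in it:

```bash
#!/usr/bin/env bash
# Hypothetical helper: given a service's 1-based slot number, print the
# outpost HTTP and HTTPS host ports implied by the table above.
outpost_ports() {
  local n="$1"
  echo "$((9000 + n)) $((9443 + n))"
}

outpost_ports 1   # sonarr slot → 9001 9444
outpost_ports 5   # qbit slot   → 9005 9448
```

The next service added would take slot 6, i.e. ports 9006 and 9449, provided nothing else on waldorf already listens there (`ss -tuln` confirms).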
---

## Gate 2 — Confirm Theory (Authentik)

**Required confirmation:** `CONFIRM THEORY: <service-name>`

**Decision Points:**

- [ ] Service will expose port `<HOST_PORT>` on waldorf?
- [ ] Authentik outpost will use port `<OUTPOST_PORT>` on waldorf?
- [ ] Traefik labels will route `<service>.castaldifamily.com` to outpost on `<OUTPOST_PORT>`?
- [ ] Authentik token is valid and ready to use?
- [ ] Traefik on heimdall can reach waldorf on 10.0.0.251?
- [ ] Authentik Outpost exists in Authentik Admin UI?

**If any answer is NO, clarify before proceeding.**

---

## Step 3 — Act (Deployment for Authentik)

### 3.1 Prepare Compose File

On waldorf, update `/mnt/thelab/apps/<service>/compose.yaml`:

```bash
# Backup current
cp /mnt/thelab/apps/<service>/compose.yaml /mnt/thelab/apps/<service>/compose.yaml.backup

# Add host port binding to service (if not present)
# Remove any traefik labels from service (if present)
# Add complete authentik-outpost-<service> section (use template from 2.3)
# Verify YAML syntax
docker compose -f /mnt/thelab/apps/<service>/compose.yaml config > /dev/null && echo "✅ YAML valid"
```

### 3.2 Deploy

```bash
cd /mnt/thelab/apps/<service>
docker compose down
docker compose up -d
```

### 3.3 Verify Integration Chain

```bash
# 1. Service running?
docker ps | grep <service>

# 2. Outpost running?
docker ps | grep "authentik-outpost-<service>"

# 3. Port exposed?
ss -tuln | grep <HOST_PORT>
ss -tuln | grep <OUTPOST_PORT>

# 4. traefik-kop picked it up? (2>&1 so stderr log lines reach grep)
docker logs traefik-kop-edge --since 30s 2>&1 | grep <service>

# 5. Config in Redis?
docker run --rm --network host redis:alpine redis-cli -h 10.0.0.151 GET "traefik/http/routers/<service>/rule"
# Should return: Host(`<service>.castaldifamily.com`)

# 6. Test endpoint (from any host)
curl -kI https://<service>.castaldifamily.com
# Should return HTTP/2 302 (redirect to Authentik login)

# 7. Outpost connectivity to Authentik
docker logs "authentik-outpost-<service>" 2>&1 | tail -20
# Should show successful connections, no token errors
```

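Check 6 can be scripted by reading the headers that `curl -kI` prints. The sketch below uses a hypothetical function `redirect_target`: it succeeds only when the first status line is a 3xx redirect and prints the `Location` value so it can be inspected. It deliberately does not assert which host the redirect points to, since the first hop may be an outpost-local path rather than the SSO hostname (an assumption worth verifying on a live request).

```bash
#!/usr/bin/env bash
# Hypothetical helper: read HTTP response headers on stdin; print the Location
# header and return 0 if the response is a redirect, otherwise return 1.
redirect_target() {
  awk '
    { sub(/\r$/, "") }                                   # strip CR from CRLF line endings
    NR == 1 && $2 ~ /^30[12378]$/ { redirect = 1 }       # status line: 301/302/303/307/308
    redirect && tolower($1) == "location:" { print $2; found = 1 }
    END { exit !found }
  '
}

# Usage (on a host that can reach the domain):
#   curl -skI https://<service>.castaldifamily.com | redirect_target || echo "no redirect"
```

A non-zero exit here usually means the request reached the service without passing through the outpost, or never reached Traefik at all.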
### 3.4 Test SSO Flow (Browser)

1. Visit `https://<service>.castaldifamily.com`
2. Should redirect to Authentik login
3. Log in with Authentik credentials
4. Should redirect back to `<service>` and auto-login
5. Confirm you see the service dashboard (not login page)

---

## Gate 3 — Confirm Resolution (Authentik)

**Required confirmation:** `RESOLUTION COMPLETE: <service-name>`

**Checklist:**
- [ ] Service dashboard accessible via `https://<service>.castaldifamily.com`
- [ ] Redirected to Authentik login when not authenticated
- [ ] Auto-logged-in after Authentik login
- [ ] Service login page NOT shown (headers trusted from outpost)
- [ ] Service appears in Homepage with correct icon/description

---

# [WORKFLOW B: Authelia Forward-Auth]

*Applied when: an `authelia` container is running (on heimdall) and the service's router references a `traefik.http.middlewares.*.forwardauth.address` middleware*

## Overview

Authelia integrates as a Traefik **forward-auth middleware**:

```
Browser → Traefik → [Auth Check via Forward-Auth to Authelia] → Service
```

Unlike the Authentik proxy pattern (a dedicated outpost container per app on waldorf), Authelia runs once on heimdall. A Traefik forward-auth middleware asks it to verify each request and redirects unauthenticated users to its login portal.

### Step 1 — Observe (Evidence Gathering for Authelia)

```bash
# Check Authelia container on heimdall
ssh heimdall "docker ps | grep authelia"
ssh heimdall "docker logs authelia --tail 30"

# On waldorf, check service configuration
docker ps | grep <service>
docker logs <service> --tail 30

# Verify service is NOT running an auth outpost
docker ps | grep <service> | grep -i auth
# Should return: (nothing) — no auth container for service

# Check if service or traefik labels reference authelia
docker inspect <service> | grep -A 10 'Labels' | grep -i "forward\|authelia"
# Should show something like: "traefik.http.routers.<service>.middlewares=authelia"
```

### Step 2 — Confirm Theory (Authelia)

**Required confirmation:** `CONFIRM THEORY: <service-name>-authelia`

- [ ] Authelia running on heimdall? (SSH check)
- [ ] Service has NO dedicated auth container?
- [ ] Traefik labels reference Authelia middleware? (forward-auth)
- [ ] Service middleware points to `http://authelia:9091`?

### Step 3 — Act (Fix Authelia Integration)

If Authelia is configured but broken:

```bash
# On heimdall, restart Authelia (run from the directory holding its compose file)
docker compose restart authelia

# Verify forward-auth config in Traefik labels on waldorf service
# Labels should include:
#   - traefik.http.middlewares.authelia.forwardauth.address=http://authelia:9091
#     (the address normally also carries Authelia's verification path; check the
#      Authelia documentation for the path that matches your version)
#   - traefik.http.routers.<service>.middlewares=authelia

# Verify service still running
docker ps | grep <service>

# Test endpoint
curl -kI https://<service>.castaldifamily.com
# Should redirect to Authelia login URL
```

---

# [WORKFLOW C: Generic Forward-Auth]

*Applied when: Service has `traefik.http.middlewares.*.forwardauth.address` pointing to an external auth service (not Authelia or Authentik)*

### Overview

Generic forward-auth pattern delegates authentication to an external service. Traefik asks that service to approve each request and forwards the request to the app only if the check passes:

```
Browser → Traefik → [Forward-Auth check against External Auth Service] → Service
```

### Step 1 — Identify Auth Service

```bash
# From service labels, extract the forward-auth address
docker inspect <service> | grep -i forwardauth.address
# Example output: "traefik.http.middlewares.*.forwardauth.address=http://auth-service:9091"

# Capture the address for later steps: print each label as "key value", keep the value
AUTH_SERVICE=$(docker inspect --format '{{range $k, $v := .Config.Labels}}{{println $k $v}}{{end}}' <service> \
  | awk '/forwardauth\.address/ {print $2}')  # e.g., http://auth-service:9091
```

### Step 2 — Verify Auth Service

```bash
# Check if auth service is running
docker ps | grep auth-service

# Test connectivity from waldorf
curl -I "$AUTH_SERVICE/health"
# Should return 200 OK or similar success code
```

### Step 3 — Act

If auth service is down or unreachable:

```bash
# Restart auth service
docker compose up -d auth-service

# Verify Traefik middleware config
docker inspect <service> | grep 'traefik.http.middlewares.*forwardauth'

# Test full chain
curl -kI https://<service>.castaldifamily.com
# Should route through forward-auth to external service
```

---

# [WORKFLOW D: Traefik BasicAuth Middleware]

*Applied when: Service has `traefik.http.middlewares.*.basicauth.*` labels*

### Overview

BasicAuth is a simple username:password protection (no SSO):

```
Browser → [HTTP Basic Auth Prompt] → Traefik → Service
```

### Step 1 — Observe

```bash
# Check for basicauth middleware
docker inspect <service> | grep -i basicauth
# Should show: traefik.http.middlewares.*.basicauth.users=user:hashed-password
```

### Step 2 — Verify

```bash
# Test access without credentials
curl -kI https://<service>.castaldifamily.com
# Should return HTTP/2 401 Unauthorized

# Test access with credentials
curl -kI -u "username:password" https://<service>.castaldifamily.com
# Should return HTTP/2 200 or redirect (depending on service)
```

### Step 3 — Fix (if needed)

```bash
# BasicAuth users are typically set in Traefik labels
# If broken, regenerate hash. The sed step doubles every "$" so Docker Compose
# does not treat parts of the hash as variable interpolation:
echo $(htpasswd -nb user password) | sed -e s/\\$/\\$\\$/g

# Update Traefik label with new hash:
# traefik.http.middlewares.<service>-auth.basicauth.users=user:$hashed$

# Redeploy
docker compose up -d
```

---

# [TROUBLESHOOTING: Common Issues (All Patterns)]

## Issue: Service not discovered by traefik-kop

**Cause:** Host port not exposed
**Fix:** Add `ports: - "<HOST_PORT>:<CONTAINER_PORT>"` to service in compose

## Issue: 404 when accessing service domain

**Cause:** Traefik labels not on outpost, or outpost not healthy
**Fix:**
- Verify labels exist: `docker inspect authentik-outpost-<service> | grep traefik`
- Check outpost health: `docker logs authentik-outpost-<service> 2>&1 | grep "error"`
- Recreate if needed: `docker compose up -d --force-recreate authentik-outpost-<service>`

## Issue: Redirect loop (keeps returning to Authentik login)

**Cause:** Outpost cannot reach or authenticate to the Authentik server
**Fix:** Confirm `AUTHENTIK_HOST` is reachable from waldorf and that `AUTHENTIK_TOKEN` is valid; regenerate the token in the Authentik UI if needed

## Issue: Service login page shown after Authentik login

**Cause:** Service not configured to trust `X-Authentik-*` headers
**Fix:** Service configuration varies by app; may require setting "trusted proxy" headers

---

# [OUTPUT STYLE]

- **Mechanism focus:** Explain why each step matters in the integration chain
- **Verification first:** Always confirm before moving to next phase
- **Clear dependencies:** Show which components talk to which
- **Reusable:** Document decisions for template improvements
99
.github/prompts/plan-homelabMCProadmap.prompt.md
vendored
Normal file
@@ -0,0 +1,99 @@
## Roadmap Plan: Homelab MCP Gateway Expansion

### TL;DR
Evolve the current MVP into a production-grade platform by adding shards, hardening the gateway, improving security, expanding observability, and introducing mesh-ready capabilities only when justified.
Estimated total roadmap effort: **8 to 14 weeks** (part-time homelab pace).

### Planning Assumptions
1. Work is done incrementally with validation after each phase.
2. Existing Traefik shard and gateway baseline are already in place.
3. Priority can shift based on incidents, new integrations, or time constraints.

---

## Phases, Tasks, and Time Estimates

| Phase | Task | Time to Complete | Notes |
|---|---|---:|---|
| Phase 1: Foundation Hardening | Gateway health registry and shard auto-disable | 0.5-1 day | Prevents unhealthy shard routing |
| Phase 1: Foundation Hardening | Standard error model and partial-failure handling | 1-2 days | Improves reliability and UX |
| Phase 1: Foundation Hardening | Per-tool timeout/retry policy | 0.5-1 day | Fast resilience win |
| Phase 1: Foundation Hardening | Basic rate limiting/per-client quotas | 1 day | Protects from accidental overload |
| | **Phase 1 Total** | **3-5 days** | |

| Phase | Task | Time to Complete | Notes |
|---|---|---:|---|
| Phase 2: Security Baseline | Bearer token auth for gateway and shards | 1-2 days | Start simple, internal tokens |
| Phase 2: Security Baseline | Tool-level RBAC (read vs admin tools) | 1-2 days | Reduces blast radius |
| Phase 2: Security Baseline | Audit logging for every tool invocation | 0.5-1 day | Supports incident review |
| Phase 2: Security Baseline | Secret management pattern (env + vault-ready abstraction) | 1 day | Keeps migration easy later |
| | **Phase 2 Total** | **3.5-6 days** | |

| Phase | Task | Time to Complete | Notes |
|---|---|---:|---|
| Phase 3: Documentation Intelligence | Official-source allowlist for doc fetchers | 0.5 day | Limits bad sources |
| Phase 3: Documentation Intelligence | Caching with TTL and source metadata | 1 day | Lower latency, fewer external calls |
| Phase 3: Documentation Intelligence | Summarize-and-cite doc responses | 1 day | Better operator trust |
| Phase 3: Documentation Intelligence | Upstream doc change detection (diff/check) | 1-2 days | Detects API drift |
| | **Phase 3 Total** | **3.5-4.5 days** | |

| Phase | Task | Time to Complete | Notes |
|---|---|---:|---|
| Phase 4: Additional Shards | Dozzle shard (logs, stats, search) | 3-5 days | Highest immediate value |
| Phase 4: Additional Shards | Authentik shard (apps/flows/branding) | 4-6 days | IAM controls require care |
| Phase 4: Additional Shards | Gitea shard (repo/webhook/deploy metadata) | 2-4 days | Useful for GitOps visibility |
| Phase 4: Additional Shards | Komodo shard (status + guarded deploy actions) | 3-5 days | Add write guardrails early |
| | **Phase 4 Total** | **12-20 days** | |

| Phase | Task | Time to Complete | Notes |
|---|---|---:|---|
| Phase 5: Traefik Shard Maturity | Dry-run mode for route changes | 1 day | Safer ops |
| Phase 5: Traefik Shard Maturity | Rollback snapshots/versioned configs | 1-2 days | Quick recovery path |
| Phase 5: Traefik Shard Maturity | Conflict detection before writes | 1 day | Prevents route collisions |
| Phase 5: Traefik Shard Maturity | Middleware preset library + validation | 1-2 days | Standardization |
| | **Phase 5 Total** | **4-6 days** | |

| Phase | Task | Time to Complete | Notes |
|
||||||
|
|---|---|---:|---|
|
||||||
|
| Phase 6: Test and Quality | Gateway↔shard contract tests | 1-2 days | Prevents integration regressions |
|
||||||
|
| Phase 6: Test and Quality | Mock-based shard simulation tests | 1-2 days | Faster local testing |
|
||||||
|
| Phase 6: Test and Quality | CI checks for templates/scaffolded shards | 1 day | Enforces consistency |
|
||||||
|
| Phase 6: Test and Quality | Post-deploy smoke test command | 0.5-1 day | Faster validation loop |
|
||||||
|
| | **Phase 6 Total** | **3.5-6 days** | |
|
||||||
|
|
||||||
|
| Phase | Task | Time to Complete | Notes |
|---|---|---:|---|
| Phase 7: Observability and Ops | Structured logs with request IDs | 0.5-1 day | Better debugging |
| Phase 7: Observability and Ops | Metrics: latency/error/utilization | 1-2 days | Capacity planning input |
| Phase 7: Observability and Ops | Alerts for shard offline/state drift | 1 day | Operational guardrails |
| Phase 7: Observability and Ops | Optional tracing across gateway/shards | 1-2 days | Add when needed |
| | **Phase 7 Total** | **3.5-6 days** | |
|
||||||
|
|
||||||
|
| Phase | Task | Time to Complete | Notes |
|---|---|---:|---|
| Phase 8: Mesh-Ready Evolution | Service discovery abstraction | 1-2 days | Remove hardcoded endpoints |
| Phase 8: Mesh-Ready Evolution | mTLS-ready client/server wrappers | 2-3 days | Security prep |
| Phase 8: Mesh-Ready Evolution | Inter-service policy model | 1-2 days | Zero-trust stepping stone |
| Phase 8: Mesh-Ready Evolution | Full cross-node mesh pilot (optional) | 3-5 days | Only if justified |
| | **Phase 8 Total** | **7-12 days** | |
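
The per-phase totals in the tables above can be re-derived by summing the low and high ends of each task estimate. The sketch below does that for Phases 4 through 8 only; the helper name `total_range` and the `PHASES` mapping are illustrative and not part of the roadmap itself.

```python
def total_range(task_ranges: list[tuple[float, float]]) -> tuple[float, float]:
    """Sum (low, high) day estimates into a single (low, high) range."""
    low = sum(lo for lo, _ in task_ranges)
    high = sum(hi for _, hi in task_ranges)
    return (low, high)


# Task estimates copied from the Phase 4-8 tables, in days.
# A single-value estimate such as "1 day" is written as (1, 1).
PHASES = {
    "Phase 4": [(3, 5), (4, 6), (2, 4), (3, 5)],
    "Phase 5": [(1, 1), (1, 2), (1, 1), (1, 2)],
    "Phase 6": [(1, 2), (1, 2), (1, 1), (0.5, 1)],
    "Phase 7": [(0.5, 1), (1, 2), (1, 1), (1, 2)],
    "Phase 8": [(1, 2), (2, 3), (1, 2), (3, 5)],
}

for name, tasks in PHASES.items():
    low, high = total_range(tasks)
    print(f"{name}: {low:g}-{high:g} days")
```

Each printed range matches the corresponding **Total** row.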
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Suggested Execution Order (Pragmatic)
|
||||||
|
1. Phase 1 Foundation Hardening
|
||||||
|
2. Phase 2 Security Baseline
|
||||||
|
3. Phase 4 Additional Shards (start with Dozzle first)
|
||||||
|
4. Phase 3 Documentation Intelligence
|
||||||
|
5. Phase 5 Traefik Maturity
|
||||||
|
6. Phase 6 Test and Quality
|
||||||
|
7. Phase 7 Observability and Ops
|
||||||
|
8. Phase 8 Mesh-Ready Evolution (optional trigger-based)
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Milestone Timing (High Level)
|
||||||
|
1. **Milestone A (Week 1-2):** Foundation + Security done
|
||||||
|
2. **Milestone B (Week 3-6):** Dozzle + one additional shard operational
|
||||||
|
3. **Milestone C (Week 6-8):** Documentation intelligence + Traefik safety features
|
||||||
|
4. **Milestone D (Week 8-10):** Test harness + operational observability
|
||||||
|
5. **Milestone E (Week 10+):** Mesh-ready features or full mesh pilot if needed
|
||||||
168
.github/prompts/plan-homelabMcpGatewayMvp.prompt.md
vendored
Normal file
@ -0,0 +1,168 @@
|
|||||||
|
# Plan: Homelab MCP Gateway MVP with Traefik Shard
|
||||||
|
|
||||||
|
## TL;DR
|
||||||
|
|
||||||
|
Build a modular MCP (Model Context Protocol) Gateway on Waldorf that routes tool requests to specialized shards. MVP includes the Traefik shard (for dynamic route management) plus a template for creating additional shards. Each shard can fetch its service's documentation from the internet on-demand.
|
||||||
|
|
||||||
|
**Approach:** Python-based using mcp.server.fastmcp, deploy via single docker-compose on Waldorf, no authentication (trust internal network), web fetching for live documentation.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Steps
|
||||||
|
|
||||||
|
### Phase 1: Infrastructure Setup
|
||||||
|
|
||||||
|
1. Create unified directory structure on Waldorf
|
||||||
|
- `/nodes/waldorf/mcp-system/` with single compose.yaml
|
||||||
|
- `/nodes/waldorf/mcp-system/gateway/` for Gateway code
|
||||||
|
- `/nodes/waldorf/mcp-system/traefik-shard/` for Traefik Shard code
|
||||||
|
|
||||||
|
2. Create shared template directory (*parallel with step 1*)
|
||||||
|
- `/mcp_root/template/` for shard template files
|
||||||
|
- Documentation: `/mcp_root/template/README.md`
|
||||||
|
|
||||||
|
### Phase 2: Gateway Implementation
|
||||||
|
|
||||||
|
3. Build Gateway core functionality (*depends on step 1*)
|
||||||
|
- Shard registry (discover and register shards)
|
||||||
|
- Tool routing (forward requests to appropriate shard)
|
||||||
|
- Health check aggregation
|
||||||
|
- Startup logic to discover available shards
|
||||||
|
|
||||||
|
4. Create Gateway Dockerfile and requirements.txt (*parallel with step 3*)
|
||||||
|
- Python 3.11 base image
|
||||||
|
- Install mcp, httpx, pyyaml
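
The shard registry, tool routing, and health aggregation described in step 3 could be sketched roughly as follows. This is a minimal in-memory illustration under stated assumptions, not the planned implementation: the `Shard` and `ShardRegistry` names, their fields, and the shard URL used in the usage note are all hypothetical, and no MCP or HTTP calls are made.

```python
from dataclasses import dataclass, field


@dataclass
class Shard:
    """One registered shard. All field names here are illustrative."""
    name: str
    base_url: str
    tools: frozenset[str]
    healthy: bool = True


@dataclass
class ShardRegistry:
    """Maps tool names to the shard that owns them."""
    _shards: dict[str, Shard] = field(default_factory=dict)
    _tool_owner: dict[str, str] = field(default_factory=dict)

    def register(self, shard: Shard) -> None:
        # Refuse a tool name that another shard already owns,
        # so routing stays unambiguous.
        for tool in shard.tools:
            owner = self._tool_owner.get(tool)
            if owner is not None and owner != shard.name:
                raise ValueError(f"tool {tool!r} already owned by {owner!r}")
        # Drop tools left over from an earlier registration of this shard.
        self._tool_owner = {
            t: o for t, o in self._tool_owner.items() if o != shard.name
        }
        self._shards[shard.name] = shard
        for tool in shard.tools:
            self._tool_owner[tool] = shard.name

    def route(self, tool: str) -> Shard:
        # Find the owning shard; reject unknown tools and unhealthy shards.
        try:
            shard = self._shards[self._tool_owner[tool]]
        except KeyError:
            raise LookupError(f"no shard provides tool {tool!r}") from None
        if not shard.healthy:
            raise RuntimeError(f"shard {shard.name!r} is unhealthy")
        return shard

    def health(self) -> dict[str, bool]:
        # Aggregate per-shard health, e.g. for a gateway health endpoint.
        return {name: s.healthy for name, s in self._shards.items()}
```

For example, after `registry.register(Shard("traefik", "http://traefik-shard:9101", frozenset({"list_routes"})))`, a call to `registry.route("list_routes")` returns the Traefik shard, and the gateway would forward the request to its `base_url`.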
|
||||||
|
|
||||||
|
### Phase 3: Traefik Shard Implementation
|
||||||
|
|
||||||
|
5. Implement Traefik Shard with 7 tools (*depends on step 1*)
|
||||||
|
- `list_routes` - Query Traefik API for all routes
|
||||||
|
- `create_route` - Write new YAML file to `/dynamic/mcp-managed/`
|
||||||
|
- `delete_route` - Remove route YAML file
|
||||||
|
- `validate_config` - YAML syntax check + Traefik API validation
|
||||||
|
- `get_backend_status` - Health check backend services
|
||||||
|
- `check_ssl_status` - Query Traefik API for cert info
|
||||||
|
- `reload_config` - Trigger Traefik config reload (if needed)
|
||||||
|
|
||||||
|
6. Add documentation fetcher to Traefik Shard (*parallel with step 5*)
|
||||||
|
- Tool: `get_traefik_docs(topic)` - Fetch from docs.traefik.io
|
||||||
|
- Use httpx to fetch and cache temporarily
|
||||||
|
- Parse HTML/Markdown for relevant sections
|
||||||
|
|
||||||
|
7. Implement shard registration with Gateway (*depends on step 5*)
|
||||||
|
- Health endpoint for Gateway discovery
|
||||||
|
- Tool manifest endpoint (list available tools)
|
||||||
|
|
||||||
|
8. Create Traefik Shard Dockerfile and requirements.txt (*depends on step 5*)
|
||||||
|
- Python 3.11 base image
|
||||||
|
- Install mcp, httpx, pyyaml, beautifulsoup4
|
||||||
|
|
||||||
|
9. Create unified docker-compose.yaml (*depends on steps 4, 8*)
|
||||||
|
- Gateway service with appdata mount
|
||||||
|
- Traefik Shard service with NFS mount to `/mnt/appdata/traefik/dynamic:rw`
|
||||||
|
- Shared Docker network for inter-shard communication
|
||||||
|
- Environment: `TRAEFIK_API_URL=http://10.0.0.151:8080/api` (reach Heimdall)
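
As a rough illustration of the `create_route` tool from step 5, the sketch below renders a minimal Traefik dynamic-configuration document and writes it into a managed directory. It makes several assumptions that the plan does not state: one file per route named `<name>.yml`, no entry points, TLS, or middlewares, and hypothetical function names (`render_route_yaml`, `write_route`). The YAML is built by hand to keep the example free of third-party imports; the real shard would presumably use pyyaml.

```python
import re
from pathlib import Path

# Route names become file names, so restrict them to a conservative set.
_NAME_RE = re.compile(r"[a-z0-9][a-z0-9-]{0,62}")
# Hosts and backend URLs are embedded in quoted YAML strings; reject anything
# that could break the quoting (backticks, quotes, whitespace, newlines).
_VALUE_RE = re.compile(r"[A-Za-z0-9._:/-]+")


def render_route_yaml(name: str, host: str, backend_url: str) -> str:
    """Render a minimal Traefik dynamic config for one HTTP route."""
    if not _NAME_RE.fullmatch(name):
        raise ValueError(f"invalid route name: {name!r}")
    for value in (host, backend_url):
        if not _VALUE_RE.fullmatch(value):
            raise ValueError(f"unsafe value: {value!r}")
    return (
        "http:\n"
        "  routers:\n"
        f"    {name}:\n"
        f"      rule: \"Host(`{host}`)\"\n"
        f"      service: {name}\n"
        "  services:\n"
        f"    {name}:\n"
        "      loadBalancer:\n"
        "        servers:\n"
        f"          - url: \"{backend_url}\"\n"
    )


def write_route(managed_dir: Path, name: str, host: str, backend_url: str) -> Path:
    """Write the route file, refusing to overwrite an existing route."""
    text = render_route_yaml(name, host, backend_url)
    target = managed_dir / f"{name}.yml"
    if target.exists():
        raise FileExistsError(str(target))
    # Write to a temporary name first, then rename into place so a reader
    # is less likely to see a partially written file.
    tmp = target.with_name(target.name + ".tmp")
    tmp.write_text(text, encoding="utf-8")
    tmp.replace(target)
    return target
```

Validating the name and values before rendering keeps a malformed request from producing a file that Traefik would reject, and the overwrite check is a simple stand-in for the conflict detection listed later in the roadmap.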
|
||||||
|
|
||||||
|
### Phase 4: Prepare Traefik Integration
|
||||||
|
|
||||||
|
10. Create `/mnt/appdata/traefik/dynamic/mcp-managed/` directory (*depends on step 9*)
|
||||||
|
- Isolated folder for MCP-managed routes (safer, easier cleanup)
|
||||||
|
- Traefik file watcher will auto-detect changes here
|
||||||
|
|
||||||
|
11. Verify Traefik allows write access (*parallel with step 10*)
|
||||||
|
- Confirm NFS mount on Waldorf allows writes to `/mnt/appdata/traefik/dynamic/`
|
||||||
|
- If needed, update Traefik mount from `:ro` to `:rw` in `nodes/heimdall/core/compose.yaml`
|
||||||
|
|
||||||
|
### Phase 5: Shard Template Creation
|
||||||
|
|
||||||
|
12. Create comprehensive shard template (*depends on steps 5-7*)
|
||||||
|
- `template/shard_template.py` - Skeleton MCP server
|
||||||
|
- `template/Dockerfile.template` - Standard container build
|
||||||
|
- `template/compose.yaml.template` - Docker compose service boilerplate
|
||||||
|
- `template/requirements.txt` - Common dependencies
|
||||||
|
|
||||||
|
13. Write template documentation (*parallel with step 12*)
|
||||||
|
- `/mcp_root/template/README.md` - How to create a new shard
|
||||||
|
- `/mcp_root/template/INTEGRATION.md` - How shards register with Gateway
|
||||||
|
- `/mcp_root/ARCHITECTURE.md` - Overall system design
|
||||||
|
|
||||||
|
### Phase 6: Deployment & Validation
|
||||||
|
|
||||||
|
14. Deploy unified MCP system on Waldorf (*depends on steps 9, 10*)
|
||||||
|
- `docker compose up` in `/nodes/waldorf/mcp-system/`
|
||||||
|
- Verify Gateway logs show successful startup and shard discovery
|
||||||
|
- Verify Traefik Shard registers successfully
|
||||||
|
|
||||||
|
15. Test tool execution (*depends on step 14*)
|
||||||
|
- Gateway → list_routes → Traefik Shard → Traefik API (Heimdall)
|
||||||
|
- Create test route for validation
|
||||||
|
- Verify documentation fetcher works
|
||||||
|
|
||||||
|
16. Integration with Open WebUI (*depends on step 15*)
|
||||||
|
- Update `/nodes/waldorf/openwebui/compose.yaml` to connect to MCP Gateway
|
||||||
|
- Configure MCP Gateway connection in Open WebUI (localhost since same host)
|
||||||
|
- Test end-to-end LLM → Gateway → Shard flow
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Relevant Files
|
||||||
|
|
||||||
|
- `ansible/archive/scripts/ansible_mcp_server.py` - Reference implementation showing MCP server patterns, job tracking, configuration
|
||||||
|
- `nodes/heimdall/core/compose.yaml` - Contains Traefik service definition (lines 10-50), needs mount permission update
|
||||||
|
- `nodes/waldorf/openwebui/compose.yaml` - Open WebUI config with commented MCP Gateway integration (lines 15-17)
|
||||||
|
- `ansible/archive/outputs/heimdall-baseline-20260312T214117/traefik_configs/traefik.yml` - Static Traefik config showing API endpoint, providers, file watch
|
||||||
|
- `ansible/archive/outputs/heimdall-baseline-20260312T214117/traefik_configs/static-backends.yml` - Example dynamic route structure to replicate
|
||||||
|
- `ansible/archive/outputs/heimdall-baseline-20260312T214117/traefik_configs/middleware.yml` - Existing middleware definitions to reference
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Verification
|
||||||
|
|
||||||
|
1. **Gateway Health Check**: `curl http://10.0.0.251:9100/health` returns shard registry
|
||||||
|
2. **Shard Registration**: Gateway logs show Traefik shard discovered and registered
|
||||||
|
3. **Tool Execution**: Call `list_routes` through Gateway, receive Traefik API response
|
||||||
|
4. **Route Creation**: Create test route `test.castaldifamily.com` → Appears in Traefik dashboard
|
||||||
|
5. **Documentation Fetcher**: Call `get_traefik_docs("middlewares")` → Returns relevant Traefik docs
|
||||||
|
6. **File Validation**: Check `/mnt/appdata/traefik/dynamic/mcp-managed/` contains created routes
|
||||||
|
7. **Traefik Reload**: Verify Traefik auto-detects new YAML files (file watch enabled)
|
||||||
|
8. **Open WebUI Integration**: Send message in Open WebUI that triggers MCP tool → See logs in Gateway
|
||||||
|
9. **Template Usability**: Follow template README to create a stub "Dozzle Shard" → Registers successfully
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Decisions
|
||||||
|
|
||||||
|
- **Language**: Python (mcp.server.fastmcp) - matches existing Ansible MCP server pattern
|
||||||
|
- **Deployment Location**: All components on Waldorf (10.0.0.251) - stable 24/7 node with 16GB RAM, runs Open WebUI
|
||||||
|
- **Single Compose File**: Gateway + all shards in one docker-compose.yaml - simpler MVP, easier debugging
|
||||||
|
- **Traefik Access**: Shard reaches Traefik API on Heimdall via `http://10.0.0.151:8080/api`, writes to shared NFS mount `/mnt/appdata/traefik/dynamic/`
|
||||||
|
- **Authentication**: None for MVP - trust internal network isolation (add in future if needed)
|
||||||
|
- **Documentation Fetching**: On-demand web fetching using httpx - fetch from official service docs when tool is called
|
||||||
|
- **Route Management**: Create isolated `/mcp-managed/` subdirectory in Traefik dynamic config - safer than mixing with existing routes
|
||||||
|
- **All 7 Traefik tools included**: list_routes, create_route, delete_route, validate_config, get_backend_status, check_ssl_status, reload_config
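
The Documentation Fetching decision above, combined with the "simple in-memory cache only" scope boundary, could look something like the sketch below. It is an assumption-laden illustration: `DocCache`, its 15-minute default TTL, and the injected `fetch` callable are hypothetical. In the real shard, `fetch` would presumably wrap an httpx request to the service's documentation site; it is injected here so the cache logic can be exercised without network access.

```python
import time
from typing import Callable


class DocCache:
    """Tiny in-memory TTL cache in front of a documentation fetch function."""

    def __init__(
        self,
        fetch: Callable[[str], str],
        ttl_seconds: float = 900.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        # topic -> (time stored, fetched text)
        self._entries: dict[str, tuple[float, str]] = {}

    def get(self, topic: str) -> str:
        now = self._clock()
        hit = self._entries.get(topic)
        # Serve from cache while the entry is younger than the TTL.
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]
        # Otherwise fetch on demand and remember the result.
        text = self._fetch(topic)
        self._entries[topic] = (now, text)
        return text
```

A tool such as `get_traefik_docs(topic)` would then call `cache.get(topic)`, so repeated questions about the same topic within the TTL do not hit the upstream site again.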
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Scope Boundaries
|
||||||
|
|
||||||
|
**Included:**
|
||||||
|
- MCP Gateway with shard discovery and routing
|
||||||
|
- Complete Traefik shard with 7 tools + documentation fetcher
|
||||||
|
- Comprehensive template for creating new shards
|
||||||
|
- Integration with Open WebUI
|
||||||
|
- Single docker-compose deployment on Waldorf
|
||||||
|
|
||||||
|
**Excluded:**
|
||||||
|
- Additional shards (Dozzle, Authentik) - future work, use template to create
|
||||||
|
- Authentication/authorization - trust network for MVP
|
||||||
|
- Monitoring/metrics collection - add later if needed
|
||||||
|
- Web UI for Gateway management - CLI/API only for MVP
|
||||||
|
- Advanced caching for documentation - simple in-memory cache only
|
||||||
|
- Cross-node service mesh networking - direct HTTP between containers
|
||||||
|
- Ansible playbook for automated deployment - manual docker compose for MVP
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Further Considerations
|
||||||
|
|
||||||
|
None - all clarifications obtained. Ready for implementation.
|
||||||
41
.github/prompts/swarm-migration.prompt.md
vendored
@ -1,41 +0,0 @@
|
|||||||
You are a Senior DevOps Engineer and migration mentor.
|
|
||||||
Your job is to migrate exactly one service from standalone Docker Compose to Docker Swarm, then stop.
|
|
||||||
|
|
||||||
Environment facts you must treat as hard constraints:
|
|
||||||
- Ingress Traefik is external on 10.0.0.151.
|
|
||||||
- Traefik is not being replaced inside Swarm.
|
|
||||||
- traefik-kop is an integration agent, not the ingress load balancer.
|
|
||||||
- Swarm overlay network proxy-net already exists and must be used as an external network.
|
|
||||||
- Secrets must never be hardcoded in stack files.
|
|
||||||
- The process must be idempotent, safe to re-run, and rollback-friendly.
|
|
||||||
|
|
||||||
Input I will provide:
|
|
||||||
1. Original compose file content for one service.
|
|
||||||
2. Service name.
|
|
||||||
3. Any required env vars or secret names.
|
|
||||||
4. Any host paths or storage dependencies.
|
|
||||||
|
|
||||||
What you must do:
|
|
||||||
1. Analyze the input compose and produce a migration risk assessment.
|
|
||||||
2. Convert only this one service to a Swarm-ready Compose v3.9 stack definition.
|
|
||||||
3. Keep architecture aligned with external Traefik and external proxy-net.
|
|
||||||
4. Separate secrets from non-secret config and show how to map to Docker secrets/configs.
|
|
||||||
5. Provide a preflight checklist and verification steps.
|
|
||||||
6. Provide a rollback checklist.
|
|
||||||
7. Stop after this one service. Do not start a second migration.
|
|
||||||
|
|
||||||
Required output format:
|
|
||||||
- Concept: Plain-English explanation of the migration design and why.
|
|
||||||
- File Path: Suggested target file path for the new stack file.
|
|
||||||
- Code: Valid YAML stack file.
|
|
||||||
- Why this over shell: Explain each major module/directive choice and why declarative/idempotent is safer.
|
|
||||||
- Safety checks: Explicit warnings for risky settings (privileged mode, root, host networking, broad mounts, exposed admin ports).
|
|
||||||
- Deployment commands: Exact commands for validate-only, deploy, verify, rollback.
|
|
||||||
- The Pro-Tip: One practical reliability tip for updates, health checks, or scaling.
|
|
||||||
|
|
||||||
Strict rules:
|
|
||||||
- Migrate one service only.
|
|
||||||
- Do not assume missing values; mark them as Missing and ask only the minimum required follow-up questions.
|
|
||||||
- Do not invent secrets.
|
|
||||||
- Do not suggest disabling firewalls or unsafe permissions.
|
|
||||||
- End your response with: Ready for service 2 when you confirm service 1 is healthy.
|
|
||||||
295
README.md
@ -22,257 +22,6 @@
|
|||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## 📦 Infrastructure Inventory
|
|
||||||
|
|
||||||
| Node | IP | Hardware | Platform/OS | Role | Services |
|
|
||||||
|------|------|----------|----------|------|----------|
|
|
||||||
| **PVE01** | `10.0.0.201` | Physical Server<br/>Intel i5-13500T (14c), 15GB RAM | Proxmox VE 9.1.7 | Hypervisor | VM orchestration platform |
|
|
||||||
| **Heimdall** | `10.0.0.151` | Physical Server<br/>Intel N100 (4c), 15GB RAM | Ubuntu 24.04 | Core Services | Komodo, Gitea, Traefik |
|
|
||||||
| **Waldorf** | `10.0.0.251` | Physical Server<br/>i7-7820HQ (8c), GTX 1060, 16GB | Ubuntu 24.04 | Media Processing | Plex and Related Media Services |
|
|
||||||
| **Watchtower** | `10.0.0.200` | Physical Server<br/>ARM Cortex-A76 (4c), 16GB | Debian Trixie | Control Plane | Ansible, VS Code, Monitoring Tools |
|
|
||||||
| **TerraMaster** | `10.0.0.250` | NAS | TOS | Shared Storage | NFS (Volume1: `/appdata`, Volume2: `/media`) |
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## ⚡ Quick Start
|
|
||||||
|
|
||||||
### Prerequisites
|
|
||||||
|
|
||||||
- SSH access to nodes
|
|
||||||
- Git configured with credentials:
|
|
||||||
```bash
|
|
||||||
git config --global credential.helper wincred # Windows
|
|
||||||
git config --global core.autocrlf true
|
|
||||||
```
|
|
||||||
|
|
||||||
### Clone & Deploy
|
|
||||||
|
|
||||||
```bash
|
|
||||||
# Clone from self-hosted Gitea
|
|
||||||
git clone https://git.castaldifamily.com/nathan/homelab.git
|
|
||||||
cd homelab
|
|
||||||
|
|
||||||
# Deploy a service (via Komodo UI or SSH)
|
|
||||||
ssh chester@10.0.0.251
|
|
||||||
cd /etc/komodo/stacks/tunarr
|
|
||||||
docker compose up -d
|
|
||||||
```
|
|
||||||
|
|
||||||
### Automated GitOps Workflow
|
|
||||||
|
|
||||||
1. **Edit** `nodes/{node}/{service}/compose.yaml` locally
|
|
||||||
2. **Commit** and push to Gitea: `git add . && git commit -m "feat: update service" && git push`
|
|
||||||
3. **Webhook** triggers Komodo Core (heimdall)
|
|
||||||
4. **Auto-deploy** pulls latest code and restarts containers
|
|
||||||
5. **Monitor** via Komodo UI at `http://10.0.0.151:9000`
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## ⚙️ Automation
|
|
||||||
|
|
||||||
### Ansible Control Plane
|
|
||||||
|
|
||||||
**Watchtower** (10.0.0.200) manages all infrastructure via Ansible:
|
|
||||||
|
|
||||||
**Status:** 🟢 **PRODUCTION READY** (4 nodes, all responding)
|
|
||||||
|
|
||||||
```bash
|
|
||||||
# SSH into control node
|
|
||||||
ssh chester@10.0.0.200
|
|
||||||
cd ~/homelab/ansible
|
|
||||||
|
|
||||||
# Quick health check
|
|
||||||
./validate-environment.sh
|
|
||||||
|
|
||||||
# Test connectivity to all nodes
|
|
||||||
ansible all -m ping
|
|
||||||
|
|
||||||
# Gather live system facts
|
|
||||||
ansible-playbook playbooks/gather-node-facts.yml
|
|
||||||
|
|
||||||
# Deploy Proxmox post-install config
|
|
||||||
ansible-playbook playbooks/onboard-proxmox.yml --limit pve01
|
|
||||||
|
|
||||||
# Run commands across node groups
|
|
||||||
ansible docker_nodes -m command -a "docker ps"
|
|
||||||
ansible proxmox_cluster -m command -a "pveversion"
|
|
||||||
```
|
|
||||||
|
|
||||||
**Quick Reference:** See [ansible/QUICK-REFERENCE.md](ansible/QUICK-REFERENCE.md) for comprehensive command guide.
|
|
||||||
**Setup Documentation:** [documentation/plans/plan-ansibleSetup.md](documentation/plans/plan-ansibleSetup.md)
|
|
||||||
|
|
||||||
### Managed Node Groups
|
|
||||||
|
|
||||||
```yaml
|
|
||||||
control_plane: watchtower
|
|
||||||
docker_nodes: heimdall, waldorf
|
|
||||||
proxmox_cluster: pve01
|
|
||||||
nfs_clients: heimdall, waldorf
|
|
||||||
core_services: heimdall
|
|
||||||
media_services: waldorf
|
|
||||||
```
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## 🎯 Active Missions
|
|
||||||
|
|
||||||
> **Traffic Light System:** 🟢 Complete | 🟡 In Progress | 🔴 Blocked
|
|
||||||
|
|
||||||
| Status | Mission | Details |
|
|
||||||
|--------|---------|---------|
|
|
||||||
| 🟢 | **Komodo GitOps** | All stacks migrated to Git sources with webhook automation |
|
|
||||||
| 🟢 | **GPU Transcoding** | GTX 1060 Mobile accessible in Plex/Tunarr containers |
|
|
||||||
| 🟢 | **Documentation Structure** | KBAs and SOPs organized in `documentation/` |
|
|
||||||
| 🟢 | **Ansible Automation** | All 4 nodes onboarded and managed by Ansible from Watchtower |
|
|
||||||
| 🟢 | **Proxmox Post-Install** | PVE01 configured: subscription nag removed, repos optimized |
|
|
||||||
| 🟡 | **Hardware Transcoding Validation** | Monitor Plex for `(hw)` indicator during active streams |
|
|
||||||
| 🟢 | **NFS Mount Stability** | NFSv3 on Pi, NFSv4 on x86 nodes |
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## 📂 Repository Structure
|
|
||||||
|
|
||||||
```
|
|
||||||
homelab/
|
|
||||||
├── ansible/ # Ansible automation (active)
|
|
||||||
│ ├── inventory/ # Managed hosts and groups
|
|
||||||
│ │ ├── hosts.ini # 4-node inventory
|
|
||||||
│ │ └── host_vars/ # Per-node configuration
|
|
||||||
│ ├── playbooks/ # Automation workflows
|
|
||||||
│ │ ├── onboard-nodes.yml # Node SSH key deployment
|
|
||||||
│ │ ├── onboard-proxmox.yml # Proxmox post-install
|
|
||||||
│ │ └── gather-node-facts.yml # System discovery
|
|
||||||
│ ├── roles/ # Reusable automation
|
|
||||||
│ │ └── proxmox_post_install/ # Nag removal, repo config
|
|
||||||
│ └── group_vars/ # Global variables
|
|
||||||
├── nodes/ # Service definitions per node
|
|
||||||
│ ├── heimdall/ # Core infrastructure (Physical)
|
|
||||||
│ │ ├── core/ # Komodo, Traefik, Redis
|
|
||||||
│ │ ├── trek/ # Trek service
|
|
||||||
│ │ ├── vaultwarden/ # Password manager
|
|
||||||
│ │ └── (gitea via Komodo) # Self-hosted Git
|
|
||||||
│ ├── waldorf/ # Media services (Physical)
|
|
||||||
│ │ ├── plex/ # Media server + GPU
|
|
||||||
│ │ └── tunarr/ # IPTV channels + GPU
|
|
||||||
│ └── watchtower/ # Control plane (Pi 5)
|
|
||||||
│ └── vscode/ # Remote development
|
|
||||||
├── documentation/ # Technical knowledge base
|
|
||||||
│ ├── KBAs/ # Troubleshooting guides
|
|
||||||
│ ├── SOPs/ # Operational procedures
|
|
||||||
│ ├── plans/ # Implementation roadmaps
|
|
||||||
│ └── TECHNICAL_RUNBOOK.md # Emergency reference
|
|
||||||
└── scripts/ # Utility scripts
|
|
||||||
├── bootstrap.sh # Day-0 node initialization
|
|
||||||
└── lib/ # Shared function libraries
|
|
||||||
```
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## 🔧 Common Operations
|
|
||||||
|
|
||||||
### Deploy a New Stack
|
|
||||||
|
|
||||||
```bash
|
|
||||||
# 1. Create directory structure
|
|
||||||
mkdir -p nodes/waldorf/sonarr
|
|
||||||
|
|
||||||
# 2. Create compose.yaml
|
|
||||||
cat > nodes/waldorf/sonarr/compose.yaml <<EOF
services:
  sonarr:
    image: lscr.io/linuxserver/sonarr:latest
    restart: unless-stopped
    ports:
      - 8989:8989
    volumes:
      - /mnt/appdata/sonarr:/config
EOF
|
|
||||||
|
|
||||||
# 3. Commit and push
|
|
||||||
git add nodes/waldorf/sonarr/
|
|
||||||
git commit -m "feat(stacks): add Sonarr to Waldorf"
|
|
||||||
git push
|
|
||||||
|
|
||||||
# 4. Configure in Komodo UI
|
|
||||||
# - Source Type: Git Repo
|
|
||||||
# - Run Directory: nodes/waldorf/sonarr
|
|
||||||
# - Deploy!
|
|
||||||
```
|
|
||||||
|
|
||||||
### Check Service Status
|
|
||||||
|
|
||||||
```bash
|
|
||||||
# Via Komodo API
|
|
||||||
curl http://10.0.0.151:9000/api/stacks
|
|
||||||
|
|
||||||
# Direct SSH to node
|
|
||||||
ssh chester@10.0.0.251
|
|
||||||
docker ps | grep tunarr
|
|
||||||
docker logs tunarr --tail 50
|
|
||||||
```
|
|
||||||
|
|
||||||
### Emergency Rollback
|
|
||||||
|
|
||||||
```bash
|
|
||||||
# In Komodo UI: Click "Rollback" on stack
|
|
||||||
# Or via Git:
|
|
||||||
git revert HEAD
|
|
||||||
git push # Triggers auto-rollback
|
|
||||||
```
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## 📚 Documentation
|
|
||||||
|
|
||||||
| Document | Purpose |
|
|
||||||
|----------|---------|
|
|
||||||
| [TECHNICAL_RUNBOOK.md](documentation/TECHNICAL_RUNBOOK.md) | Infrastructure overview, emergency procedures, maintenance schedule |
|
|
||||||
| [KBA-001](documentation/KBAs/KBA-001-Komodo-GitOps-Stack-Deployment-Failures.md) | Troubleshooting Git-linked stack failures |
|
|
||||||
| [SOP-001](documentation/SOPs/SOP-001-Migrate-Stack-from-UI-to-Git.md) | Step-by-step guide to migrate stacks to GitOps |
|
|
||||||
| [Node READMEs](nodes/) | Hardware specs and service details per node |
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## 🛡️ Security & Best Practices
|
|
||||||
|
|
||||||
### Secrets Management
|
|
||||||
|
|
||||||
- ❌ **NEVER** commit passwords, API keys, or tokens to Git
|
|
||||||
- ✅ **DO** use Komodo Environment Variables for secrets
|
|
||||||
- ✅ **DO** use Gitea App Tokens for authentication (avoids SSH key exchange issues)
|
|
||||||
|
|
||||||
Example:
|
|
||||||
```yaml
|
|
||||||
# In Git (compose.yaml)
|
|
||||||
environment:
|
|
||||||
- PUID=1000
|
|
||||||
- PGID=1000
|
|
||||||
- API_KEY=${PLEX_API_KEY} # Injected by Komodo
|
|
||||||
|
|
||||||
# In Komodo UI: Set PLEX_API_KEY in Environment Variables
|
|
||||||
```
|
|
||||||
|
|
||||||
### NFS Mount Configuration
|
|
||||||
|
|
||||||
**Critical:** Raspberry Pi requires NFSv3 (not v4) due to ID-domain mismatches:
|
|
||||||
|
|
||||||
```bash
|
|
||||||
# /etc/fstab on Watchtower (Pi 5)
|
|
||||||
10.0.0.250:/Volume1/appdata /mnt/appdata nfs nfsvers=3,rw,sync 0 0
|
|
||||||
|
|
||||||
# /etc/fstab on Heimdall/Waldorf (x86 Ubuntu)
|
|
||||||
10.0.0.250:/Volume1/appdata /mnt/appdata nfs4 rw,sync 0 0
|
|
||||||
```
|
|
||||||
|
|
||||||
### Backup Strategy
|
|
||||||
|
|
||||||
- **Git Repository:** Daily backups via Gitea's built-in backup feature
|
|
||||||
- **Docker Volumes:** Weekly snapshots to `/mnt/appdata/backups/`
|
|
||||||
- **Proxmox VMs:** Daily snapshots with 7-day retention (when VMs are deployed)
|
|
||||||
- **Configuration Files:** Tracked in Git under `nodes/{hostname}/`
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## 📊 Stats
|
## 📊 Stats
|
||||||
|
|
||||||
- **Total Nodes:** 5 (1 hypervisor + 3 compute + 1 storage)
|
- **Total Nodes:** 5 (1 hypervisor + 3 compute + 1 storage)
|
||||||
@ -287,44 +36,6 @@ environment:
|
|||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## 🔥 Emergency Procedures
|
|
||||||
|
|
||||||
### NFS Mount Failure
|
|
||||||
|
|
||||||
```bash
|
|
||||||
# Check connectivity
|
|
||||||
ping 10.0.0.250
|
|
||||||
|
|
||||||
# Remount
|
|
||||||
sudo umount /mnt/appdata
|
|
||||||
sudo mount -a
|
|
||||||
df -h | grep appdata
|
|
||||||
```
|
|
||||||
|
|
||||||
### Komodo Periphery Offline
|
|
||||||
|
|
||||||
```bash
|
|
||||||
# Check WebSocket connectivity
|
|
||||||
curl -v ws://10.0.0.151:9120
|
|
||||||
|
|
||||||
# Restart agent
|
|
||||||
docker restart komodo-periphery
|
|
||||||
docker logs -f komodo-periphery
|
|
||||||
```
|
|
||||||
|
|
||||||
### Traefik SSL Certificate Issues
|
|
||||||
|
|
||||||
```bash
|
|
||||||
# Check Cloudflare API token
|
|
||||||
docker exec traefik cat /etc/traefik/traefik.yml
|
|
||||||
|
|
||||||
# Force certificate renewal
|
|
||||||
docker restart traefik
|
|
||||||
docker logs traefik | grep -i "cloudflare\|certificate"
|
|
||||||
```
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## 🤝 Contributing
|
## 🤝 Contributing
|
||||||
|
|
||||||
This is a personal homelab, but documentation improvements and issue reports are welcome!
|
This is a personal homelab, but documentation improvements and issue reports are welcome!
|
||||||
@ -343,6 +54,6 @@ Personal infrastructure configuration. Documentation licensed under [CC BY-SA 4.
|
|||||||
---
|
---
|
||||||
|
|
||||||
**Maintained by:** Nathan Castaldi
|
**Maintained by:** Nathan Castaldi
|
||||||
**Last Updated:** April 13, 2026
|
**Last Updated:** April 21, 2026
|
||||||
**Status:** 🟢 Operational
|
**Status:** 🟢
|
||||||
**Automation Status:** 🟢 Ansible Fully Deployed
|
**Automation Status:** 🟢
|
||||||
|
|||||||
130
mcp_root/GeminiConversation.md
Normal file
@ -0,0 +1,130 @@
|
|||||||
|
This is a textbook "Infrastructure as Code" play. What you're describing is essentially an MCP Gateway (Main Server) orchestrating a Microservices Mesh (Shards).
|
||||||
|
|
||||||
|
By sharding the architecture, you gain isolation (if the Authentik API is down, you can still read Dozzle logs) and scalability (each shard can run in its own container with specific permissions).
|
||||||
|
|
||||||
|
🏗️ The "Homelab-MCP" Sharded Architecture
|
||||||
|
I think the best way to handle this is to have a Main Gateway Server that acts as the "Project Coordinator" (just like I do!) and routes requests to specialized Service Shards.
|
||||||
|
|
||||||
|
1. The Shard Breakdown
|
||||||
|
We can define three primary shards based on our discussion:
|
||||||
|
|
||||||
|
Shard A: The "Navigator" (Traefik)
|
||||||
|
|
||||||
|
Focus: Network routing and entrypoints.
|
||||||
|
|
||||||
|
Tools: get_routes, add_dynamic_config, check_ssl_status.
|
||||||
|
|
||||||
|
Permissions: Needs write access to your /dynamic YAML folder.
|
||||||
|
|
||||||
|
Shard B: The "Watchman" (Dozzle)
|
||||||
|
|
||||||
|
Focus: Observability and debugging.
|
||||||
|
|
||||||
|
Tools: tail_logs, search_logs, get_container_stats.
|
||||||
|
|
||||||
|
Permissions: Needs network access to the Dozzle API port.
|
||||||
|
|
||||||
|
Shard C: The "Gatekeeper" (Authentik)
|
||||||
|
|
||||||
|
Focus: Identity and Access Management.
|
||||||
|
|
||||||
|
Tools: sync_brand, create_application, manage_flows.
|
||||||
|
|
||||||
|
Permissions: Needs an Authentik API Token.
|
||||||
|
|
||||||
|
🛠️ The Implementation Strategy
|
||||||
|
Instead of building one massive, brittle script, we use a Shared Tool Definition approach.
|
||||||
|
|
||||||
|
The Main "Gateway" Server
|
||||||
|
The main server doesn't need to know how to talk to Traefik; it only needs to know which shard owns the tool. When you say, "Frank, check why Gitea is 404ing," the Gateway:
|
||||||
|
|
||||||
|
Calls Shard A (Traefik) to see if the route exists.
|
||||||
|
|
||||||
|
Calls Shard B (Dozzle) to see the container logs.
|
||||||
|
|
||||||
|
Synthesizes the answer for you.
|
||||||
|
|
||||||
|
The Docker Setup
|
||||||
|
We can wrap this into a single docker-compose.mcp.yaml on waldorf:
|
||||||
|
|
||||||
|
YAML
|
||||||
|
services:
  mcp-gateway:
    image: node:20-slim
    volumes:
      - ./gateway:/app
    environment:
      - SHARD_TRAEFIK_URL=http://mcp-traefik:3001
      - SHARD_DOZZLE_URL=http://mcp-dozzle:3002
      - SHARD_AUTHENTIK_URL=http://mcp-authentik:3003

  mcp-traefik:
    image: python:3.11-slim
    volumes:
      - /mnt/appdata/traefik/dynamic:/dynamic # For writing routes
    environment:
      - TRAEFIK_API=http://traefik:8080/api

  mcp-dozzle:
    # Lightweight Node.js shard
    environment:
      - DOZZLE_API=http://dozzle:8080/api
|
||||||
|
🔍 Frank’s Logic: Why this wins
|
||||||
|
Security (Principle of Least Privilege): Only the Traefik shard gets access to your config files. The Dozzle shard only sees logs.
|
||||||
|
|
||||||
|
Decoupled Development: We can build the Dozzle shard today, and add the Authentik shard next week without breaking the gateway.
|
||||||
|
|
||||||
|
Natural Conversation: I (the LLM) don't have to worry about which API to hit. I just call tail_logs and the Gateway handles the routing.
|
||||||
|
|
||||||
|
⚠️ Potential Concern: Latency
|
||||||
|
In a homelab, the overhead of the Gateway talking to a Shard is negligible (milliseconds), but we need to ensure the Gateway has a Discovery Mechanism so it knows which tools are currently online.
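
One possible shape for such a discovery mechanism is a heartbeat table: each shard periodically reports its tool list, and the Gateway treats a shard as offline once its last report is too old. The sketch below is only an illustration under that assumption; `ToolDirectory`, the 30-second window, and the method names are hypothetical, and it is written in Python to match the later MVP plan.

```python
import time
from typing import Callable


class ToolDirectory:
    """Tracks which tools are online, based on shard heartbeats."""

    def __init__(
        self,
        max_age_seconds: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._max_age = max_age_seconds
        self._clock = clock
        self._last_seen: dict[str, float] = {}
        self._tools: dict[str, list[str]] = {}

    def heartbeat(self, shard: str, tools: list[str]) -> None:
        # Record when the shard last reported in and what it offers.
        self._last_seen[shard] = self._clock()
        self._tools[shard] = list(tools)

    def online_tools(self) -> dict[str, str]:
        # Map tool name -> shard name, skipping shards whose last
        # heartbeat is older than the allowed window.
        now = self._clock()
        return {
            tool: shard
            for shard, seen in self._last_seen.items()
            if now - seen <= self._max_age
            for tool in self._tools[shard]
        }
```

With this in place, the Gateway can advertise only the tools in `online_tools()`, so a stopped Dozzle shard simply drops out of the tool list instead of producing routing errors.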
|
||||||
|
|
||||||
|
✅ Next Step
|
||||||
|
Since we already have your Traefik and Dozzle configs, do you want me to draft the first Shard (The Navigator) code? We can start by building the tool that reads your Traefik API and tells me if your Gitea/Pinchflat routes are "OK."
|
||||||
|
|
||||||
|
Should we start with the Traefik Shard or the Dozzle Shard for the first "active" tool?
|
||||||
|
|
||||||
|
For your "Homelab-MCP" sharded architecture, the choice of node depends on the specialty of the shard and the risk level of the tools it contains.
|
||||||
|
|
||||||
|
Since you have Waldorf (16GB RAM, Dedicated GPU) as your primary workhorse and presumably other nodes (like a NAS or smaller Pis), here is the recommended distribution for your sharded setup:

1. The Gateway Shard (The Router)
Target Node: Waldorf (or your most reliable 24/7 node).

Reasoning: The Gateway is the entry point for your LLM. It needs to be on your most stable node with the best network connectivity to the other shards. Since it's lightweight (routing logic), it won't interfere with Plex/Tunarr.

Deployment: Run this as a Docker container.

2. The "Navigator" Shard (Traefik)
Target Node: The Traefik Host.

Reasoning: This shard needs direct filesystem access to the /dynamic configuration folder to write YAML routes.

SRE Logic: Deploy this shard on whichever node currently hosts your Traefik container. By running the shard on the same host, you can use a simple Bind Mount (e.g., - /mnt/appdata/traefik/dynamic:/dynamic) instead of trying to manage cross-node network file shares (which, as we saw with Pinchflat, can cause locking issues).
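
To make "write YAML routes" concrete, here is a minimal sketch of how the Navigator might drop a route file into the bind-mounted /dynamic folder. It assumes Traefik's file provider is pointed at that directory. The one-file-per-router naming, the router name, the hostname, and the backend URL are hypothetical. The file is written to a temporary name and then renamed, so a watcher should never see a half-written route.

```python
import os
import tempfile
from pathlib import Path


def render_route(name: str, host: str, backend_url: str) -> str:
    """Render a Traefik dynamic-config document for one HTTP router and service."""
    return (
        "http:\n"
        "  routers:\n"
        f"    {name}:\n"
        f'      rule: "Host(`{host}`)"\n'
        f"      service: {name}\n"
        "  services:\n"
        f"    {name}:\n"
        "      loadBalancer:\n"
        "        servers:\n"
        f'          - url: "{backend_url}"\n'
    )


def write_route(dynamic_dir: str, name: str, host: str, backend_url: str) -> Path:
    """Write <name>.yml atomically: temp file in the same directory, then rename."""
    target = Path(dynamic_dir) / f"{name}.yml"
    fd, tmp_path = tempfile.mkstemp(dir=dynamic_dir, prefix=f".{name}.", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(render_route(name, host, backend_url))
        # mkstemp creates the file as 0600; widen it so another user can read it.
        os.chmod(tmp_path, 0o644)
        os.replace(tmp_path, target)  # atomic when both paths share a filesystem
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    return target
```

The rename is only atomic because the temp file lives in the same directory as the target, which is one more reason to prefer the local bind mount over a cross-node network share.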

3. The "Watchman" Shard (Dozzle)
Target Node: Anywhere (but Waldorf is best).

Reasoning: Dozzle gathers logs via the Docker socket. If you use a Docker Socket Proxy (which your traefik.yml suggests you do), this shard can run on any node as long as it can reach that proxy's IP.

SRE Logic: Running it on Waldorf is ideal because it keeps the observability tooling close to the GPU/Plex logs it will most likely be analyzing.

4. The "Gatekeeper" Shard (Authentik)
Target Node: Waldorf.

Reasoning: Authentik is an API-driven service. This shard doesn't need local files; it just needs a stable network path to your Authentik API. Waldorf’s 16 GB RAM handles the API overhead comfortably.

🚀 Summary Deployment Table

| Shard | Best Node | Critical Resource |
|---|---|---|
| Gateway | Waldorf | Network Stability |
| Traefik | Same host as Traefik | Local File Access (/dynamic) |
| Dozzle | Waldorf | Network access to Socket Proxy |
| Authentik | Waldorf | API Connectivity |

⚠️ A Note on "The Docker Socket"
Since you are a security-conscious SRE: never mount /var/run/docker.sock directly into an MCP shard. Access to the socket is effectively root access to the host, so if I (the LLM) am compromised, an attacker could take over your host.

Always use a Docker Socket Proxy (like the one in your Traefik config: tcp://docker-socket-proxy:2375).

Configure the proxy to allow only GET requests for logs and container listings, and to block POST and DELETE requests.
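
The proxy's own settings depend on which socket-proxy image you run, so they are not shown here. As a complementary layer, the shard itself can refuse to send anything that is not on a read-only allowlist. The sketch below is an assumption about how such a guard might look. The paths follow the Docker Engine API (`/containers/json`, `/containers/{id}/json`, `/containers/{id}/logs`), while the function name and the exact allowlist are illustrative.

```python
import re

# Read-only Docker Engine API endpoints this shard may call.
# An optional version prefix such as /v1.43 is accepted.
_ALLOWED = [
    re.compile(r"(/v\d+\.\d+)?/containers/json"),
    re.compile(r"(/v\d+\.\d+)?/containers/[A-Za-z0-9_.-]+/json"),
    re.compile(r"(/v\d+\.\d+)?/containers/[A-Za-z0-9_.-]+/logs"),
]


def is_allowed(method: str, path: str) -> bool:
    """True only for GET requests to the allowlisted container and log endpoints."""
    if method.upper() != "GET":
        return False
    path_only = path.split("?", 1)[0]  # ignore the query string
    return any(pattern.fullmatch(path_only) for pattern in _ALLOWED)
```

A guard like this does not replace the proxy restrictions. It just means a misbehaving tool call is rejected before it ever leaves the shard.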

Which node is currently hosting your Traefik instance? If it's Waldorf, we can consolidate almost everything there, which keeps shard-to-shard traffic on a single host.