5 changed files with 1497 additions and 0 deletions
--- a/.github/prompts/plan-homelabMCProadmap.prompt.md
+++ b/.github/prompts/plan-homelabMCProadmap.prompt.md
@ -0,0 +1,99 @@
 ## Roadmap Plan: Homelab MCP Gateway Expansion
 ### TL;DR
 Evolve the current MVP into a production-grade platform by adding shards, hardening the gateway, improving security, expanding observability, and introducing mesh-ready capabilities only when justified.  
 Estimated total roadmap effort: **8 to 14 weeks** (part-time homelab pace).
 ### Planning Assumptions
 1. Work is done incrementally with validation after each phase.
 2. Existing Traefik shard and gateway baseline are already in place.
 3. Priority can shift based on incidents, new integrations, or time constraints.
 ---
 ## Phases, Tasks, and Time Estimates
 | Phase | Task | Time to Complete | Notes |
 |---|---|---:|---|
 | Phase 1: Foundation Hardening | Gateway health registry and shard auto-disable | 0.5-1 day | Prevents unhealthy shard routing |
 | Phase 1: Foundation Hardening | Standard error model and partial-failure handling | 1-2 days | Improves reliability and UX |
 | Phase 1: Foundation Hardening | Per-tool timeout/retry policy | 0.5-1 day | Fast resilience win |
 | Phase 1: Foundation Hardening | Basic rate limiting/per-client quotas | 1 day | Protects from accidental overload |
 |  | **Phase 1 Total** | **3-5 days** |  |
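
A minimal sketch of the Phase 1 health registry with shard auto-disable. It assumes a shard is disabled after a fixed number of consecutive failed health probes and re-enabled on the next success. The class names, the threshold of three, and the `dozzle` example shard are illustrative assumptions, not part of the current gateway.

```python
from dataclasses import dataclass, field


@dataclass
class ShardHealth:
    """Tracks consecutive health-probe failures for one shard."""
    consecutive_failures: int = 0
    enabled: bool = True


@dataclass
class HealthRegistry:
    # Assumption: three consecutive failed probes disable a shard.
    failure_threshold: int = 3
    shards: dict[str, ShardHealth] = field(default_factory=dict)

    def record(self, shard: str, healthy: bool) -> None:
        state = self.shards.setdefault(shard, ShardHealth())
        if healthy:
            # Assumption: one successful probe re-enables the shard.
            state.consecutive_failures = 0
            state.enabled = True
        else:
            state.consecutive_failures += 1
            if state.consecutive_failures >= self.failure_threshold:
                state.enabled = False

    def routable(self) -> list[str]:
        """Shards the gateway may still route tool calls to."""
        return sorted(name for name, s in self.shards.items() if s.enabled)


registry = HealthRegistry()
registry.record("traefik", healthy=True)
for _ in range(3):
    registry.record("dozzle", healthy=False)
print(registry.routable())  # ['traefik']
```
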
 | Phase | Task | Time to Complete | Notes |
 |---|---|---:|---|
 | Phase 2: Security Baseline | Bearer token auth for gateway and shards | 1-2 days | Start simple, internal tokens |
 | Phase 2: Security Baseline | Tool-level RBAC (read vs admin tools) | 1-2 days | Reduces blast radius |
 | Phase 2: Security Baseline | Audit logging for every tool invocation | 0.5-1 day | Supports incident review |
 | Phase 2: Security Baseline | Secret management pattern (env + vault-ready abstraction) | 1 day | Keeps migration easy later |
 |  | **Phase 2 Total** | **3.5-6 days** |  |
 | Phase | Task | Time to Complete | Notes |
 |---|---|---:|---|
 | Phase 3: Documentation Intelligence | Official-source allowlist for doc fetchers | 0.5 day | Limits bad sources |
 | Phase 3: Documentation Intelligence | Caching with TTL and source metadata | 1 day | Lower latency, fewer external calls |
 | Phase 3: Documentation Intelligence | Summarize-and-cite doc responses | 1 day | Better operator trust |
 | Phase 3: Documentation Intelligence | Upstream doc change detection (diff/check) | 1-2 days | Detects API drift |
 |  | **Phase 3 Total** | **3.5-4.5 days** |  |
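
One possible shape for the "caching with TTL and source metadata" task, sketched with the standard library only. `DocCache`, `CachedDoc`, and the injected clock are hypothetical names chosen for illustration; the 900-second default TTL is an assumption.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class CachedDoc:
    body: str
    source_url: str    # where the document was fetched from
    fetched_at: float  # clock reading at fetch time


class DocCache:
    def __init__(self, ttl_seconds: float = 900.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl = ttl_seconds
        self._clock = clock
        self._entries: dict[str, CachedDoc] = {}

    def put(self, topic: str, body: str, source_url: str) -> CachedDoc:
        entry = CachedDoc(body, source_url, self._clock())
        self._entries[topic] = entry
        return entry

    def get(self, topic: str) -> Optional[CachedDoc]:
        entry = self._entries.get(topic)
        if entry is None:
            return None
        if self._clock() - entry.fetched_at > self.ttl:
            # Expired: drop the entry so the caller refetches.
            del self._entries[topic]
            return None
        return entry


# A fake clock keeps the example deterministic.
now = [0.0]
cache = DocCache(ttl_seconds=60.0, clock=lambda: now[0])
cache.put("middlewares", "<html>...</html>", "https://doc.traefik.io/traefik/")
now[0] = 30.0
fresh = cache.get("middlewares")    # still within the TTL
now[0] = 61.0
expired = cache.get("middlewares")  # past the TTL, returns None
```
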
 | Phase | Task | Time to Complete | Notes |
 |---|---|---:|---|
 | Phase 4: Additional Shards | Dozzle shard (logs, stats, search) | 3-5 days | Highest immediate value |
 | Phase 4: Additional Shards | Authentik shard (apps/flows/branding) | 4-6 days | IAM controls require care |
 | Phase 4: Additional Shards | Gitea shard (repo/webhook/deploy metadata) | 2-4 days | Useful for GitOps visibility |
 | Phase 4: Additional Shards | Komodo shard (status + guarded deploy actions) | 3-5 days | Add write guardrails early |
 |  | **Phase 4 Total** | **12-20 days** |  |
 | Phase | Task | Time to Complete | Notes |
 |---|---|---:|---|
 | Phase 5: Traefik Shard Maturity | Dry-run mode for route changes | 1 day | Safer ops |
 | Phase 5: Traefik Shard Maturity | Rollback snapshots/versioned configs | 1-2 days | Quick recovery path |
 | Phase 5: Traefik Shard Maturity | Conflict detection before writes | 1 day | Prevents route collisions |
 | Phase 5: Traefik Shard Maturity | Middleware preset library + validation | 1-2 days | Standardization |
 |  | **Phase 5 Total** | **4-6 days** |  |
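
Conflict detection before writes could start as simply as the sketch below. It assumes a conflict means either a reused router name or an identical rule string after whitespace normalization; real Traefik rule overlap (priorities, path prefixes) is richer than this. `find_conflicts` and the example hostnames are hypothetical.

```python
def find_conflicts(existing: dict[str, str], name: str, rule: str) -> list[str]:
    """Return human-readable conflicts for a proposed router.

    `existing` maps router name to its rule string.
    """
    conflicts = []
    if name in existing:
        conflicts.append(f"router name '{name}' already exists")
    normalized = " ".join(rule.split())
    for other, other_rule in existing.items():
        if other != name and " ".join(other_rule.split()) == normalized:
            conflicts.append(f"rule duplicates router '{other}'")
    return conflicts


existing = {"grafana": "Host(`grafana.example.internal`)"}
collision = find_conflicts(existing, "metrics", "Host(`grafana.example.internal`)")
clean = find_conflicts(existing, "metrics", "Host(`metrics.example.internal`)")
print(collision)
print(clean)
```
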
 | Phase | Task | Time to Complete | Notes |
 |---|---|---:|---|
 | Phase 6: Test and Quality | Gateway↔shard contract tests | 1-2 days | Prevents integration regressions |
 | Phase 6: Test and Quality | Mock-based shard simulation tests | 1-2 days | Faster local testing |
 | Phase 6: Test and Quality | CI checks for templates/scaffolded shards | 1 day | Enforces consistency |
 | Phase 6: Test and Quality | Post-deploy smoke test command | 0.5-1 day | Faster validation loop |
 |  | **Phase 6 Total** | **3.5-6 days** |  |
 | Phase | Task | Time to Complete | Notes |
 |---|---|---:|---|
 | Phase 7: Observability and Ops | Structured logs with request IDs | 0.5-1 day | Better debugging |
 | Phase 7: Observability and Ops | Metrics: latency/error/utilization | 1-2 days | Capacity planning input |
 | Phase 7: Observability and Ops | Alerts for shard offline/state drift | 1 day | Operational guardrails |
 | Phase 7: Observability and Ops | Optional tracing across gateway/shards | 1-2 days | Add when needed |
 |  | **Phase 7 Total** | **3.5-6 days** |  |
 | Phase | Task | Time to Complete | Notes |
 |---|---|---:|---|
 | Phase 8: Mesh-Ready Evolution | Service discovery abstraction | 1-2 days | Remove hardcoded endpoints |
 | Phase 8: Mesh-Ready Evolution | mTLS-ready client/server wrappers | 2-3 days | Security prep |
 | Phase 8: Mesh-Ready Evolution | Inter-service policy model | 1-2 days | Zero-trust stepping stone |
 | Phase 8: Mesh-Ready Evolution | Full cross-node mesh pilot (optional) | 3-5 days | Only if justified |
 |  | **Phase 8 Total** | **7-12 days** |  |
 ---
 ## Suggested Execution Order (Pragmatic)
 1. Phase 1 Foundation Hardening  
 2. Phase 2 Security Baseline  
 3. Phase 4 Additional Shards (start with Dozzle)  
 4. Phase 3 Documentation Intelligence  
 5. Phase 5 Traefik Maturity  
 6. Phase 6 Test and Quality  
 7. Phase 7 Observability and Ops  
 8. Phase 8 Mesh-Ready Evolution (optional trigger-based)
 ---
 ## Milestone Timing (High Level)
 1. **Milestone A (Week 1-2):** Foundation + Security done  
 2. **Milestone B (Week 3-6):** Dozzle + one additional shard operational  
 3. **Milestone C (Week 6-8):** Documentation intelligence + Traefik safety features  
 4. **Milestone D (Week 8-10):** Test harness + operational observability  
 5. **Milestone E (Week 10+):** Mesh-ready features or full mesh pilot if needed
--- a/.github/prompts/plan-homelabMcpGatewayMvp.prompt.md
+++ b/.github/prompts/plan-homelabMcpGatewayMvp.prompt.md
@ -0,0 +1,168 @@
 # Plan: Homelab MCP Gateway MVP with Traefik Shard
 ## TL;DR
 Build a modular MCP (Model Context Protocol) Gateway on Waldorf that routes tool requests to specialized shards. The MVP includes the Traefik shard (for dynamic route management) plus a template for creating additional shards. Each shard can fetch its service's documentation from the internet on demand.
 **Approach:** Python-based using mcp.server.fastmcp, deploy via single docker-compose on Waldorf, no authentication (trust internal network), web fetching for live documentation.
 ---
 ## Steps
 ### Phase 1: Infrastructure Setup
 1. Create unified directory structure on Waldorf
   - `/nodes/waldorf/mcp-system/` with single compose.yaml
   - `/nodes/waldorf/mcp-system/gateway/` for Gateway code
   - `/nodes/waldorf/mcp-system/traefik-shard/` for Traefik Shard code
 2. Create shared template directory (*parallel with step 1*)
   - `/mcp_root/template/` for shard template files
   - Documentation: `/mcp_root/template/README.md`
 ### Phase 2: Gateway Implementation
 3. Build Gateway core functionality (*depends on step 1*)
   - Shard registry (discover and register shards)
   - Tool routing (forward requests to appropriate shard)
   - Health check aggregation
   - Startup logic to discover available shards
 4. Create Gateway Dockerfile and requirements.txt (*parallel with step 3*)
   - Python 3.11 base image
   - Install mcp, httpx, pyyaml
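The shard registry and tool routing described in step 3 can be sketched as a plain lookup from tool name to owning shard. This is a standard-library illustration only: `ShardRegistry`, its methods, and the `http://traefik-shard:9101` URL are assumptions, and the actual HTTP forwarding (for example with httpx) is left out.

```python
class ShardRegistry:
    """Maps tool names to the shard that provides them."""

    def __init__(self) -> None:
        self._tool_to_shard: dict[str, str] = {}
        self._shard_urls: dict[str, str] = {}

    def register(self, shard: str, base_url: str, tools: list[str]) -> None:
        self._shard_urls[shard] = base_url
        for tool in tools:
            owner = self._tool_to_shard.get(tool)
            if owner is not None and owner != shard:
                # Two shards exposing the same tool name would make routing ambiguous.
                raise ValueError(f"tool '{tool}' already registered by '{owner}'")
            self._tool_to_shard[tool] = shard

    def resolve(self, tool: str) -> tuple[str, str]:
        """Return (shard name, base URL) for the shard that owns `tool`."""
        try:
            shard = self._tool_to_shard[tool]
        except KeyError:
            raise LookupError(f"no shard provides tool '{tool}'") from None
        return shard, self._shard_urls[shard]


registry = ShardRegistry()
# Hypothetical shard URL on the shared Docker network.
registry.register("traefik", "http://traefik-shard:9101", ["list_routes", "create_route"])
print(registry.resolve("list_routes"))
```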
 ### Phase 3: Traefik Shard Implementation
 5. Implement Traefik Shard with 7 tools (*depends on step 1*)
   - `list_routes` - Query Traefik API for all routes
   - `create_route` - Write new YAML file to `/dynamic/mcp-managed/`
   - `delete_route` - Remove route YAML file
   - `validate_config` - YAML syntax check + Traefik API validation
   - `get_backend_status` - Health check backend services
   - `check_ssl_status` - Query Traefik API for cert info
   - `reload_config` - Trigger Traefik config reload (if needed)
 6. Add documentation fetcher to Traefik Shard (*parallel with step 5*)
   - Tool: `get_traefik_docs(topic)` - Fetch from doc.traefik.io
   - Use httpx to fetch and cache temporarily
   - Parse HTML/Markdown for relevant sections
 7. Implement shard registration with Gateway (*depends on step 5*)
   - Health endpoint for Gateway discovery
   - Tool manifest endpoint (list available tools)
 8. Create Traefik Shard Dockerfile and requirements.txt (*depends on step 5*)
   - Python 3.11 base image
   - Install mcp, httpx, pyyaml, beautifulsoup4
 9. Create unified docker-compose.yaml (*depends on steps 4, 8*)
   - Gateway service with appdata mount
   - Traefik Shard service with NFS mount to `/mnt/appdata/traefik/dynamic:rw`
   - Shared Docker network for inter-shard communication
   - Environment: `TRAEFIK_API_URL=http://10.0.0.151:8080/api` (reach Heimdall)
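As a rough illustration of the `create_route` tool from step 5, the sketch below renders a Traefik file-provider document (an `http` section with one router and one load-balanced service) and writes it with a temp-file-then-rename step so a watcher is less likely to read a half-written file. The function names, the name-validation regex, and the example host and backend URL are assumptions; the demo writes to a temporary directory rather than `/mnt/appdata/traefik/dynamic/mcp-managed/`.

```python
import re
import tempfile
from pathlib import Path

# Assumption: route names are restricted to lowercase letters, digits, and hyphens.
_SAFE_NAME = re.compile(r"^[a-z0-9][a-z0-9-]*$")


def render_route_yaml(name: str, host: str, backend_url: str) -> str:
    if not _SAFE_NAME.match(name):
        raise ValueError(f"unsafe route name: {name!r}")
    return (
        "http:\n"
        "  routers:\n"
        f"    {name}:\n"
        f'      rule: "Host(`{host}`)"\n'
        f"      service: {name}\n"
        "  services:\n"
        f"    {name}:\n"
        "      loadBalancer:\n"
        "        servers:\n"
        f'          - url: "{backend_url}"\n'
    )


def create_route(managed_dir: Path, name: str, host: str, backend_url: str) -> Path:
    content = render_route_yaml(name, host, backend_url)
    managed_dir.mkdir(parents=True, exist_ok=True)
    target = managed_dir / f"{name}.yml"
    tmp = managed_dir / f"{name}.yml.tmp"
    tmp.write_text(content, encoding="utf-8")
    tmp.replace(target)  # rename within the same directory
    return target


with tempfile.TemporaryDirectory() as d:
    path = create_route(Path(d) / "mcp-managed", "test-app",
                        "test.example.internal", "http://10.0.0.50:8080")
    written = path.read_text(encoding="utf-8")
print(written)
```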
 ### Phase 4: Prepare Traefik Integration
 10. Create `/mnt/appdata/traefik/dynamic/mcp-managed/` directory (*depends on step 9*)
    - Isolated folder for MCP-managed routes (safer, easier cleanup)
    - Traefik file watcher will auto-detect changes here
 11. Verify Traefik allows write access (*parallel with step 10*)
    - Confirm NFS mount on Waldorf allows writes to `/mnt/appdata/traefik/dynamic/`
    - If needed, update Traefik mount from `:ro` to `:rw` in `nodes/heimdall/core/compose.yaml`
 ### Phase 5: Shard Template Creation
 12. Create comprehensive shard template (*depends on steps 5-7*)
    - `template/shard_template.py` - Skeleton MCP server
    - `template/Dockerfile.template` - Standard container build
    - `template/compose.yaml.template` - Docker compose service boilerplate
    - `template/requirements.txt` - Common dependencies
 13. Write template documentation (*parallel with step 12*)
    - `/mcp_root/template/README.md` - How to create a new shard
    - `/mcp_root/template/INTEGRATION.md` - How shards register with Gateway
    - `/mcp_root/ARCHITECTURE.md` - Overall system design
 ### Phase 6: Deployment & Validation
 14. Deploy unified MCP system on Waldorf (*depends on steps 9, 10*)
    - `docker compose up` in `/nodes/waldorf/mcp-system/`
    - Verify Gateway logs show successful startup and shard discovery
    - Verify Traefik Shard registers successfully
 15. Test tool execution (*depends on step 14*)
    - Gateway → list_routes → Traefik Shard → Traefik API (Heimdall)
    - Create test route for validation
    - Verify documentation fetcher works
 16. Integration with Open WebUI (*depends on step 15*)
    - Update `/nodes/waldorf/openwebui/compose.yaml` to connect to MCP Gateway
    - Configure MCP Gateway connection in Open WebUI (localhost since same host)
    - Test end-to-end LLM → Gateway → Shard flow
 ---
 ## Relevant Files
 - `ansible/archive/scripts/ansible_mcp_server.py` - Reference implementation showing MCP server patterns, job tracking, configuration
 - `nodes/heimdall/core/compose.yaml` - Contains Traefik service definition (lines 10-50), needs mount permission update
 - `nodes/waldorf/openwebui/compose.yaml` - Open WebUI config with commented MCP Gateway integration (lines 15-17)
 - `ansible/archive/outputs/heimdall-baseline-20260312T214117/traefik_configs/traefik.yml` - Static Traefik config showing API endpoint, providers, file watch
 - `ansible/archive/outputs/heimdall-baseline-20260312T214117/traefik_configs/static-backends.yml` - Example dynamic route structure to replicate
 - `ansible/archive/outputs/heimdall-baseline-20260312T214117/traefik_configs/middleware.yml` - Existing middleware definitions to reference
 ---
 ## Verification
 1. **Gateway Health Check**: `curl http://10.0.0.251:9100/health` returns shard registry
 2. **Shard Registration**: Gateway logs show Traefik shard discovered and registered
 3. **Tool Execution**: Call `list_routes` through Gateway, receive Traefik API response
 4. **Route Creation**: Create test route `test.castaldifamily.com` → Appears in Traefik dashboard
 5. **Documentation Fetcher**: Call `get_traefik_docs("middlewares")` → Returns relevant Traefik docs
 6. **File Validation**: Check `/mnt/appdata/traefik/dynamic/mcp-managed/` contains created routes
 7. **Traefik Reload**: Verify Traefik auto-detects new YAML files (file watch enabled)
 8. **Open WebUI Integration**: Send message in Open WebUI that triggers MCP tool → See logs in Gateway
 9. **Template Usability**: Follow template README to create a stub "Dozzle Shard" → Registers successfully
 ---
 ## Decisions
 - **Language**: Python (mcp.server.fastmcp) - matches existing Ansible MCP server pattern
 - **Deployment Location**: All components on Waldorf (10.0.0.251) - stable 24/7 node with 16GB RAM, runs Open WebUI
 - **Single Compose File**: Gateway + all shards in one docker-compose.yaml - simpler MVP, easier debugging
 - **Traefik Access**: Shard reaches Traefik API on Heimdall via `http://10.0.0.151:8080/api`, writes to shared NFS mount `/mnt/appdata/traefik/dynamic/`
 - **Authentication**: None for MVP - trust internal network isolation (add in future if needed)
 - **Documentation Fetching**: On-demand web fetching using httpx - fetch from official service docs when tool is called
 - **Route Management**: Create isolated `/mcp-managed/` subdirectory in Traefik dynamic config - safer than mixing with existing routes
 - **All 7 Traefik tools included**: list_routes, create_route, delete_route, validate_config, get_backend_status, check_ssl_status, reload_config
 ---
 ## Scope Boundaries
 **Included:**
 - MCP Gateway with shard discovery and routing
 - Complete Traefik shard with 7 tools + documentation fetcher
 - Comprehensive template for creating new shards
 - Integration with Open WebUI
 - Single docker-compose deployment on Waldorf
 **Excluded:**
 - Additional shards (Dozzle, Authentik) - future work, use template to create
 - Authentication/authorization - trust network for MVP
 - Monitoring/metrics collection - add later if needed
 - Web UI for Gateway management - CLI/API only for MVP
 - Advanced caching for documentation - simple in-memory cache only
 - Cross-node service mesh networking - direct HTTP between containers
 - Ansible playbook for automated deployment - manual docker compose for MVP
 ---
 ## Further Considerations
 None - all clarifications obtained. Ready for implementation.
--- a/documentation/homelab-mcp-spec-frank-addendum.md
+++ b/documentation/homelab-mcp-spec-frank-addendum.md
@ -0,0 +1,621 @@
 ---
 title: "Homelab MCP Server — Frank v6 Integration Addendum"
 description: "Layers Frank v6 personality, DevOps/Data Analysis/Prompt Engineering specialties, and CoT/ToT/RAG reasoning techniques onto the base MCP server spec v1.0."
 version: "1.0"
 baseSpec: "homelab-mcp-spec.txt v1.0"
 frankVersion: "6.0"
 date: "May 2026"
 ---
 # Homelab MCP Server — Frank v6 Integration Addendum
 **Version**: 1.0 · May 2026  
 **Applies to**: [homelab-mcp-spec.txt](homelab-mcp-spec.txt) v1.0  
 **Frank version**: v6.0 (source: `.github/`)
 This addendum enriches the base MCP server spec with Frank v6's personality, three specialty domains (DevOps, Data Analysis, Prompt Engineering), and integrated reasoning techniques (CoT, ToT, RAG). The base spec is the authoritative source for transport, security, tool catalogue, and architecture. This document describes **behavioural layers** that run on top of that foundation — no base-spec decisions are altered.
 ---
 ## 1. Scope of Changes
 | Base spec section | Change type | Summary |
 |---|---|---|
 | §4 Architecture | Additive | Session system prompt composition architecture |
 | §3.4 Observability tools | Enriched behaviour | CoT/ToT reasoning in `run_diagnostic`; SCoT in `query_logs` / `get_resource_usage` |
 | §3.2 Ansible tools | Enriched behaviour | DevOps SRE workflow gates on `draft_playbook`, `dry_run_playbook` |
 | §3.3 Docker tools | Enriched behaviour | DevOps SRE gather→propose→verify loop on `compose_up`, `get_container_logs` |
 | §5.2 config.yaml | Additive | `frank:` top-level block |
 | §3 Tool catalogue | Additive | Four new tools: `analyze_logs`, `suggest_fix`, `generate_sop`, `search_runbooks` |
 | New | New section | Frank command registry (§6 of this document) |
 | New | New section | Reasoning technique selection logic (§6.2 and §7 of this document) |
 ---
 ## 2. Personality Layer
 ### 2.1 Personas Active in This Deployment
 Frank's full roster of seven core personas is loaded at every session. The four most relevant to homelab operations are given priority routing by the Project Manager persona:
 | Frank persona | Homelab role | Triggered by |
 |---|---|---|
 | **DevOps SRE (Docker & Compose)** | Container troubleshooting, Compose validation | Container, image, network, volume keywords; `/docker`, `/compose` |
 | **DevOps SRE (Ansible & IaC)** | Playbook authoring, idempotency review | Ansible, playbook, inventory, role keywords; `/ansible` |
 | **DataAnalystX** | Log analysis, resource trending, anomaly detection | Loki, query, metrics, CPU, memory, disk, trend keywords; `/analyze` |
 | **Senior Prompt Engineer** | System prompt construction, technique selection | `/craft`, `/optimize`, `/reason`, `/evaluate` |
 | **Technical Writer** | SOP, KBA, and runbook generation | `/generate_sop`, `/document` |
 | **Senior Business Analyst** | Strategic infrastructure decisions | `/consult` |
 | **QA Analyst** | Playbook and Compose file review | `/review` |
 ### 2.2 Tone Profile
 The overall communication style follows Frank.core's tone contract, tuned for operational context:
 - **Default (idle/read)**: Upbeat, mentoring-first, collaborative. Explains the "why" behind suggestions.
 - **Diagnostic/incident mode**: Calm, precise, and efficient. No unnecessary hedging. Steps are numbered. Uncertainty is quantified ("most likely cause: X — 80% confidence").
 - **Playbook authoring mode**: Methodical. Every proposed change is accompanied by a rollback instruction and a validation command.
 - **Log analysis mode**: Analytical and rigorous. Shows reasoning before conclusions. Flags assumptions explicitly.
 ### 2.3 Behavioural Guardrails (Anti-Patterns)
 The following failure modes from Frank's Prompt Engineering specialty are baked into the system prompt as hard constraints:
 | Anti-pattern | Guard |
 |---|---|
 | **Everything Persona** | Never adopt a persona that claims unlimited omniscience; scope is homelab + loaded specialties only |
 | **Implicit Context** | Always surface what topology/session data is being used before acting on it |
 | **Vague Success Criteria** | Every proposed action must include a concrete "how to verify this worked" step |
 | **Missing Error Handling** | Diagnostic suggestions always include a "if this doesn't fix it, try..." fallback |
 | **Reasoning Overkill** | ToT is only triggered for multi-hypothesis failures; CoT for single-path; plain output for simple reads |
 ---
 ## 3. Session System Prompt Architecture
 ### 3.1 Composition Order
 The server assembles the system prompt at session start in this exact order. Later layers override earlier ones only where explicitly stated.
 ```
 [1] Frank.core personality block
    — Persona roster, tone profile, guardrails, core command definitions
 [2] Topology summary (base spec §4.4 — unchanged)
    — Compact: node count, subnet list, roles, service index
 [3] Active specialty context (see §3.2 below)
    — Injected once per session; rotated if specialties change in config
 [4] Reasoning technique declarations
    — Which techniques are active this session (see §7)
 [5] Session state
    — Approval state, active session ID, last audit entry (base spec §4.3)
 ```
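The composition order above could be enforced with a small assembly function like the one below. This is a sketch under assumptions: the layer keys and `compose_system_prompt` are hypothetical names, and the layer bodies are placeholders rather than real prompt text.

```python
# Hypothetical keys for the five layers, in composition order.
LAYER_ORDER = ["core", "topology", "specialties", "reasoning", "session_state"]


def compose_system_prompt(layers: dict[str, str]) -> str:
    """Join the layers in the fixed order, failing loudly if one is missing."""
    missing = [name for name in LAYER_ORDER if name not in layers]
    if missing:
        raise KeyError(f"missing prompt layers: {missing}")
    return "\n\n".join(layers[name].strip() for name in LAYER_ORDER)


prompt = compose_system_prompt({
    "session_state": "[SESSION] approval: none",
    "core": "[CORE] persona roster, tone, guardrails",
    "reasoning": "[REASONING] CoT, ToT, RAG",
    "topology": "[TOPOLOGY] compact summary",
    "specialties": "[SPECIALTY: DevOps & SRE active]",
})
print(prompt)
```

Keeping the order in one list means the dictionary passed in can be built in any order without changing the resulting prompt.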
 ### 3.2 Specialty Activation
 Specialties are always-on and config-driven (not query-intent-detected). This ensures predictable behaviour regardless of phrasing. The active set is read from `frank.specialties` in `config.yaml` at server start.
 Each specialty contributes a condensed capability block to the system prompt (not its full text). Full specialty context is available to the LLM via internal reference but is not re-injected on every turn.
 **Condensed block format** (example for DevOps):
 ```
 [SPECIALTY: DevOps & SRE active]
 Personas: DevOps SRE (Docker), DevOps SRE (Ansible), Container Platform Architect
 Commands: /docker, /ansible, /compose, /traefik
 Philosophy: Smallest Viable Diff | Explicit Verification | Rollback-First | Idempotency
 ```
 ### 3.3 Token Budget Strategy
 | Context tier | When injected | Approximate token cost |
 |---|---|---|
 | Frank.core personality block | Every session | ~600 tokens |
 | Compact topology summary | Every session | ~200–400 tokens (scales with node count) |
 | Active specialty condensed blocks | Every session (3 specialties) | ~300 tokens total |
 | Reasoning technique declarations | Every session | ~100 tokens |
 | Full topology | On demand via `get_topology_full` | ~1,000–3,000 tokens |
 | Full specialty text | Never injected; referenced internally | — |
 | RAG runbook chunks | Per `run_diagnostic` / `suggest_fix` call | ~500–1,500 tokens per retrieval |
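
Adding up the always-injected tiers from the table gives the per-session baseline, and each RAG retrieval adds its own range on top. The sketch below just does that arithmetic with the approximate figures above; the dictionary keys and the `budget` helper are illustrative names.

```python
# (low, high) approximate token costs taken from the table above.
PER_SESSION_TIERS = {
    "core_personality": (600, 600),
    "topology_summary": (200, 400),
    "specialty_blocks": (300, 300),
    "reasoning_declarations": (100, 100),
}
RAG_RETRIEVAL = (500, 1500)


def budget(retrievals: int = 0) -> tuple[int, int]:
    """Return the (low, high) token estimate for a session."""
    low = sum(lo for lo, _ in PER_SESSION_TIERS.values()) + retrievals * RAG_RETRIEVAL[0]
    high = sum(hi for _, hi in PER_SESSION_TIERS.values()) + retrievals * RAG_RETRIEVAL[1]
    return low, high


print(budget())   # (1200, 1400)
print(budget(1))  # (1700, 2900)
```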
 ---
 ## 4. DevOps Specialty Integration
 *Source: `.github/specialties/specialty.devops.instructions.md`*  
 *Enriches base spec §3.2 (Ansible tools) and §3.3 (Docker tools)*
 ### 4.1 Core Philosophy Applied
 Every DevOps-context tool call follows these constraints, in order:
 1. **Smallest Viable Diff** — prefer environment variable changes over image rebuilds; prefer task additions over role rewrites
 2. **Explicit Verification** — every proposed change includes the exact command to verify it worked
 3. **Rollback Planning** — every write operation documents how to undo it before execution
 4. **No Secret Persistence** — never write secrets into Compose files, playbook vars, or inventory; always redirect to Vault or env injection
 5. **Idempotency First** — all Ansible tasks must be idempotent; warn if a proposed task is not
 ### 4.2 Enriched Docker Tool Behaviours
 #### `get_container_logs` — enriched output format
 When called in a diagnostic context (i.e., following a failed `get_stack_status` or during `/docker`), the response is structured by the DevOps SRE persona:
 ```
 [SYMPTOM]      One-line description of the observed failure
 [LOGS]         Relevant log excerpt (tail, with line numbers)
 [INTERPRETATION] What the log indicates (CoT reasoning shown)
 [HYPOTHESIS]   Most likely root cause
 [PROPOSED FIX] Smallest viable change with exact commands
 [VERIFY]       Command to confirm the fix worked
 [ROLLBACK]     Command to undo the change if it made things worse
 ```
 #### `compose_up` — pre-flight gate
 Before executing `compose_up` (which is a write operation), the server runs an implicit pre-flight sequence:
 1. Validate Compose syntax: `docker compose config` (dry-run equivalent)
 2. Check external network dependencies (e.g., `proxy-net`) exist
 3. Pull image digests and compare to currently running version
 4. Present a structured change summary to the LLM, which surfaces it inline before requesting write approval (base spec §4.3)
 If any pre-flight step fails, `compose_up` is blocked and the failure is surfaced as a diagnostic rather than a write error.
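A minimal sketch of the blocking behaviour described above: checks run in order, and the first failure stops the sequence and is reported as a diagnostic. The checks are injected as callables so nothing here shells out to Docker; `run_preflight`, `PreflightResult`, and the step names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Each check returns (passed, detail message).
Check = Callable[[], tuple[bool, str]]


@dataclass
class PreflightResult:
    approved_for_review: bool    # True only if every step passed
    failed_step: Optional[str]   # name of the first failing step, if any
    details: list[str]           # messages from the steps that ran


def run_preflight(steps: list[tuple[str, Check]]) -> PreflightResult:
    details: list[str] = []
    for name, check in steps:
        ok, detail = check()
        details.append(f"{name}: {detail}")
        if not ok:
            # Block compose_up; later steps are not run.
            return PreflightResult(False, name, details)
    return PreflightResult(True, None, details)


# Stubbed checks standing in for the real pre-flight steps.
steps = [
    ("compose_config", lambda: (True, "syntax valid")),
    ("external_networks", lambda: (False, "network 'proxy-net' not found")),
    ("image_digests", lambda: (True, "no digest change")),
]
result = run_preflight(steps)
print(result.failed_step, result.details)
```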
 #### `compose_down` — safety check
 Before executing `compose_down`, the server checks for dependent services in other stacks that reference the same networks. If found, it warns rather than proceeding silently.
 ### 4.3 Enriched Ansible Tool Behaviours
 #### `dry_run_playbook` — enriched output format
 The check-mode output is parsed and structured as:
 ```
 [PLAYBOOK]     Name and path
 [INVENTORY]    Target host group and count
 [TASKS]
  ✓ [ok]       task_name — would not change
  △ [changed]  task_name — what would change (diff shown)
  ✗ [failed]   task_name — predicted failure reason
 [RISK LEVEL]   Low / Medium / High (based on changed + failed task count)
 [ROLLBACK]     How to restore previous state if live run goes wrong
 [RECOMMENDATION] Proceed / Proceed with caution / Do not proceed
 ```
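The addendum says the risk level is based on changed and failed task counts but does not give thresholds. The sketch below therefore uses assumed cut-offs (any predicted failure is High, more than three changed tasks is Medium) and an assumed one-to-one mapping from risk level to recommendation; treat both as placeholders to tune.

```python
# Assumed mapping; the source lists both scales but not how they pair up.
RECOMMENDATION = {
    "Low": "Proceed",
    "Medium": "Proceed with caution",
    "High": "Do not proceed",
}


def assess_risk(changed: int, failed: int, medium_changed_over: int = 3) -> tuple[str, str]:
    """Return (risk level, recommendation) for a check-mode run."""
    if failed > 0:
        level = "High"
    elif changed > medium_changed_over:
        level = "Medium"
    else:
        level = "Low"
    return level, RECOMMENDATION[level]


print(assess_risk(changed=1, failed=0))
print(assess_risk(changed=5, failed=0))
print(assess_risk(changed=0, failed=1))
```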
 #### `draft_playbook` — authoring constraints
 When the LLM generates playbook content for `draft_playbook`, it applies these constraints:
 - Tasks use `ansible.builtin.*` FQCN module names
 - Plays and tasks set `become: true` only where strictly required
 - Variables with sensitive values use `{{ vault_* }}` references; never inline literals
 - Every task has a `name:` that describes the outcome, not the action ("Ensure nginx package is present" not "Install nginx")
 - A `handlers:` block is included when tasks trigger service restarts
 - The draft includes a `# REVIEW NOTES:` header block explaining what the playbook does, why each major task exists, and what the operator should verify before merging
 #### `run_playbook` — unchanged (base spec §3.2 hard limit honoured)
 The base spec hard limit is preserved without modification: `run_playbook` permanently blocks any playbook not on the merged `main` branch. Frank does not attempt to bypass or soft-code this gate.
 ### 4.4 Docker Diagnostic Scenarios
 The DevOps SRE persona follows specific diagnostic sequences for common failure patterns, applied within the `run_diagnostic` enriched behaviour (see §8.1):
 | Scenario | Diagnostic sequence |
 |---|---|
 | Container restart loop | `get_container_logs` → inspect exit code → check healthcheck config → check resource limits |
 | Network unreachable | `check_port` → `probe_http` → `docker network inspect` → check Traefik router labels |
 | Volume mount failure | `list_directory` on host path → check container user/UID → inspect volume definition |
 | Image pull failure | `http_fetch` registry endpoint → `check_package_version` for tag existence → check registry credentials (env ref only) |
 | TLS/cert failure | `probe_http` with status check → `get_container_logs` on Traefik → verify ACME challenge reachability via `dns_lookup` |
 ### 4.5 Ansible Diagnostic Scenarios
 | Scenario | Diagnostic sequence |
 |---|---|
 | SSH connection failure | `ping_host` → `check_port` port 22 → verify inventory host_vars → check SSH key mount |
 | Privilege escalation failure | `get_service_status` sudo → inspect `become_method` in playbook → check `ansible_become_password` vault ref |
 | Idempotency failure | `dry_run_playbook` twice → compare changed-task diff → identify non-idempotent task → propose `creates:` or `when:` guard |
 | Variable not found | `get_playbook` → trace variable source (inventory → group_vars → host_vars → role defaults) → identify missing declaration |
 ---
 ## 5. Data Analysis Specialty Integration
 *Source: `.github/specialties/specialty.data-analysis.instructions.md`*  
 *Enriches base spec §3.4 (Observability tools)*
 ### 5.1 SCoT Framework Applied to Homelab Observability
 The DataAnalystX persona applies the **Structured Chain-of-Thought (SCoT)** 6-phase methodology to all observability tool calls that involve interpretation (not just data retrieval):
 | SCoT phase | Homelab mapping |
 |---|---|
 | **1. Clarify & Define** | Restate the observable symptom; identify relevant nodes, services, and time window |
 | **2. Repository Check** | Check existing topology for known service relationships; reference previous audit log entries |
 | **3. Plan & Methodology** | Outline the query sequence before executing (which tools, what time range, what labels) |
 | **4. Execute** | Run `query_logs` / `get_resource_usage` with structured LogQL |
 | **5. Validate & Fallbacks** | Check for missing data, sparse label sets, clock skew between nodes |
 | **6. Insight & Recommendation** | Plain-English summary of findings + ranked action items |
 ### 5.2 Enriched `query_logs` Behaviour
 When called with a natural-language intent (e.g., "find OOM events in the last 6 hours"), the DataAnalystX persona:
 1. Translates the intent to a LogQL expression (shown to the LLM before execution)
 2. Executes the query
 3. Structures the output: total hit count, time distribution, top-3 message patterns, node breakdown
 4. Applies SCoT phase 6 to produce a plain-English insight block
 **LogQL pattern library** for common homelab scenarios:
 ```logql
 # Container crash / restart loop
 {job="docker"} |= "exited with" | regexp `exited with code (?P<code>\d+)`
 # Out-of-memory kill
 {job=~"docker|kernel"} |~ "OOMKilled|oom_kill_process"
 # Slow HTTP response (Traefik access log)
 {job="traefik"} | json | duration > 2s
 # Authentication failure
 {job=~"authentik|vaultwarden"} |~ "Login failed|Failed login"
 # Disk I/O saturation
 {job="node-exporter"} | json | io_wait > 20
 # Ansible run failure
 {job="mcp-audit"} |= "playbook" |= "FAILED"
 ```
 ### 5.3 Enriched `get_resource_usage` Behaviour
 When called without a specific diagnostic intent (i.e., routine health check), the DataAnalystX persona:
 1. Retrieves CPU, memory, and disk metrics for the target node
 2. Compares against thresholds defined in `frank.analysis.thresholds` (see §9.2)
 3. Flags anomalies with severity: **INFO** / **WARNING** / **CRITICAL**
 4. If anomalies are found, proposes a `query_logs` follow-up to correlate with log events
 **Default thresholds** (overridable in config):
 | Metric | WARNING | CRITICAL |
 |---|---|---|
 | CPU (1-min avg) | > 75% | > 90% |
 | Memory used | > 80% | > 95% |
 | Disk used | > 75% | > 90% |
 | Swap used | > 20% | > 50% |
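
The default thresholds translate directly into a classification helper. In this sketch the metric keys and the `classify` function are illustrative names, and treating any value at or below the WARNING threshold as INFO is an assumption, since the table only defines the two upper bands.

```python
# (warning, critical) percentages from the default thresholds table.
DEFAULT_THRESHOLDS = {
    "cpu": (75.0, 90.0),
    "memory": (80.0, 95.0),
    "disk": (75.0, 90.0),
    "swap": (20.0, 50.0),
}


def classify(metric: str, percent_used: float,
             thresholds: dict[str, tuple[float, float]] = DEFAULT_THRESHOLDS) -> str:
    """Map a usage percentage to INFO, WARNING, or CRITICAL."""
    warning, critical = thresholds[metric]
    if percent_used > critical:
        return "CRITICAL"
    if percent_used > warning:
        return "WARNING"
    return "INFO"


print(classify("cpu", 82.0))     # WARNING
print(classify("memory", 96.5))  # CRITICAL
print(classify("swap", 5.0))     # INFO
```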
 ### 5.4 New Tool: `analyze_logs` *(additive — not in base spec §3)*
 A compound workflow tool that chains `query_logs` → structure → trend → insight in a single LLM-facing call.
 **Tool definition**:
 | Field | Value |
 |---|---|
 | Name | `analyze_logs` |
 | Parameters | `intent` (natural language), `node?`, `service?`, `window?` (default: 1h) |
 | Auth required | Read — free |
 | Returns | Structured analysis: query used, hit count, time distribution, top patterns, SCoT insight block |
 This tool is the primary entry point for DataAnalystX-mode log investigation. It replaces the manual chain of `query_logs` → interpret for conversational use cases.
 ---
 ## 6. Prompt Engineering Specialty Integration
 *Source: `.github/specialties/specialty.prompt-engineering.instructions.md`*  
 *Applies to: server's own session prompt construction (§3 of this document)*
 ### 6.1 C.R.A.F.T. Applied to System Prompt Construction
 The server builds each session's system prompt using the C.R.A.F.T. framework:
 | C.R.A.F.T. component | Homelab MCP mapping |
 |---|---|
 | **Context** | Current topology summary + session approval state + recent audit entries |
 | **Role** | Frank persona block + active specialty condensed blocks |
 | **Action** | Available tool catalogue for this session (read/write gated by approval state) |
 | **Format** | Output format expectations: structured tool responses, Markdown for prose, code blocks for commands |
 | **Tone/Audience** | Frank tone profile (§2.2) + homelab operator assumed as single technically fluent user |
 ### 6.2 Reasoning Technique Selection Logic
 The Senior Prompt Engineer persona governs which reasoning technique is applied per interaction. Selection is automatic based on problem complexity signals:
 | Trigger signal | Technique selected | Rationale |
 |---|---|---|
 | Single-service failure, clear error message | **CoT** | Linear diagnosis; step-by-step is sufficient |
 | Multi-service failure, ambiguous root cause | **ToT** | Multiple hypotheses need exploration and backtracking |
 | "How did I set up X before?" / "What does the SOP say?" | **RAG** | Prior runbook/KBA knowledge is the authoritative source |
 | Playbook authoring, structured output needed | **CoT** | Sequential construction of idempotent tasks |
 | Novel failure with no prior context | **ToT → RAG** | Explore hypotheses first, then ground in documentation |
 | Simple reads (`list_nodes`, `get_stack_status`) | **None** | No reasoning overhead for data retrieval |
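
The selection table can be read as a short decision function. The sketch below is one possible encoding: the keyword arguments are hypothetical signal names, and the order in which the checks are applied is an assumption, since the table does not define precedence when several signals are present.

```python
def select_technique(*, simple_read: bool = False,
                     asks_prior_knowledge: bool = False,
                     has_prior_context: bool = True,
                     services_affected: int = 1,
                     clear_error: bool = True) -> str:
    """Pick a reasoning technique from coarse complexity signals."""
    if simple_read:
        return "None"        # plain data retrieval, no reasoning overhead
    if asks_prior_knowledge:
        return "RAG"         # runbook or SOP is the authoritative source
    if not has_prior_context:
        return "ToT → RAG"   # novel failure: explore, then ground in docs
    if services_affected > 1 or not clear_error:
        return "ToT"         # multiple hypotheses to explore
    return "CoT"             # single path, including playbook authoring


print(select_technique(simple_read=True))
print(select_technique(services_affected=3, clear_error=False))
print(select_technique())
```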
 ### 6.3 Anti-Pattern Guardrails (System Prompt Level)
 These constraints are injected into the system prompt and apply to every LLM response:
 ```
 CONSTRAINTS:
 - Scope claims strictly to homelab + loaded specialties. Never claim general omniscience.
 - Always state which topology data or session context you are acting on before proposing changes.
 - Every proposed write action must include: what changes, verification command, rollback command.
 - Never output credentials, tokens, or keys — reference env var names only.
 - If a diagnostic conclusion requires assumptions, state them explicitly before the conclusion.
 ```
 ---
 ## 7. Reasoning Techniques
 *Sources: `.github/skills/style.cot.instructions.md`, `style.tot.instructions.md`, `style.rag.instructions.md`*  
 *Enriches base spec §2.4 (troubleshoot mode — "guided" → "reasoning-augmented")*
 ### 7.1 Chain-of-Thought (CoT) in Diagnostics
 **Applied in**: `run_diagnostic`, `analyze_logs`, `dry_run_playbook` output interpretation, playbook authoring
 The server's CoT implementation uses **Zero-Shot CoT** ("think step by step") for single-service diagnostics and **Few-Shot CoT** for recurring failure patterns where exemplar chains exist in the knowledge base.
 **Standard diagnostic CoT chain**:
 ```
 Step 1: What is the observed symptom? (exact error or behaviour)
 Step 2: What service/container/host is directly involved?
 Step 3: What are the last N log lines showing?
 Step 4: What changed recently? (audit log + topology refresh delta)
 Step 5: What is the most likely immediate cause?
 Step 6: What is the smallest change that addresses the cause?
 Step 7: How do I verify the fix worked without restarting unaffected services?
 ```
 ### 7.2 Tree-of-Thought (ToT) in Multi-Hypothesis Failures
 **Triggered when**: `run_diagnostic` is called on a failure that spans more than one service, or when initial CoT analysis does not converge on a single root cause after two steps.
 **ToT structure for homelab diagnostics**:
 ```
 Root observation: [symptom]
 │
 ├── Hypothesis A: [e.g., upstream dependency down]
 │   ├── Evidence for: [check X]
 │   ├── Evidence against: [check Y]
 │   └── Confidence: [0–100%]
 │
 ├── Hypothesis B: [e.g., configuration drift]
 │   ├── Evidence for: [check X]
 │   ├── Evidence against: [check Y]
 │   └── Confidence: [0–100%]
 │
 └── Hypothesis C: [e.g., resource exhaustion]
    ├── Evidence for: [check X]
    ├── Evidence against: [check Y]
    └── Confidence: [0–100%]
 Selected hypothesis: [highest confidence after evidence gathering]
 Backtrack condition: if selected hypothesis disproven, move to next candidate
 ```
 The LLM surfaces the top-3 hypotheses with their confidence scores inline in chat before requesting approval for any diagnostic action. This satisfies the guided troubleshoot mode in base spec §2.4, which requires that the human decides the action.
 ### 7.3 Retrieval-Augmented Generation (RAG) from Homelab Knowledge Base
 **Triggered when**: `run_diagnostic` is called on a service that has a matching KBA/SOP, or when `/generate_sop` needs to extend an existing runbook.
 **Knowledge base sources** (indexed at server start, refreshed on config reload):
 | Source directory | Content type | Retrieval label |
 |---|---|---|
 | `documentation/SOPs/` | Step-by-step operational procedures | `type:sop` |
 | `documentation/KBAs/` | Known error → fix mappings | `type:kba` |
 | `documentation/TECHNICAL_RUNBOOK.md` | Infrastructure topology narrative | `type:runbook` |
 | Ansible playbooks in `ansible/playbooks/` | Automation implementations | `type:playbook` |
 **Chunking strategy**: Documents are split at heading boundaries (H2/H3). Each chunk carries metadata: `source_file`, `heading_path`, `type`. Retrieval uses semantic similarity against the diagnostic query.
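 A rough sketch of heading-boundary chunking with the three metadata fields named above. `chunk_markdown` and `Chunk` are hypothetical names. As an assumption beyond the text, H1 is also treated as a boundary so that introductory text keeps its own heading path, and `#` lines inside code fences are ignored.

```python
import re
from dataclasses import dataclass

HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")
FENCE = "`" * 3  # Markdown code-fence marker


@dataclass
class Chunk:
    source_file: str
    heading_path: str
    type: str
    text: str


def chunk_markdown(text: str, source_file: str, doc_type: str) -> list[Chunk]:
    chunks: list[Chunk] = []
    path: dict[int, str] = {}  # heading level -> current title
    current_path = ""
    lines: list[str] = []
    in_fence = False

    def flush() -> None:
        body = "\n".join(lines).strip()
        if body:
            chunks.append(Chunk(source_file, current_path, doc_type, body))

    for line in text.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence  # do not treat '#' in code as headings
        match = None if in_fence else HEADING.match(line)
        if match and len(match.group(1)) <= 3:
            level = len(match.group(1))
            flush()  # close the chunk in progress
            lines = []
            # Drop headings at this level or deeper, then record the new one.
            path = {k: v for k, v in path.items() if k < level}
            path[level] = match.group(2)
            current_path = " > ".join(path[k] for k in sorted(path))
        else:
            lines.append(line)  # H4+ headings stay inside the chunk body
    flush()
    return chunks
```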
 **RAG injection format** (prepended to CoT or ToT chain when relevant chunks are found):
 ```
 [KNOWLEDGE BASE — retrieved context]
 Source: documentation/KBAs/KBA-001-Komodo-GitOps-Stack-Deployment-Failures.md
 Relevance: 0.87
 [excerpt from KBA]
 [END RETRIEVED CONTEXT]
 ```
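 The injection block above is simple enough to render with one helper. This is only a sketch: `format_retrieved_context` is a hypothetical name, and rounding the relevance score to two decimals is an assumption based on the example value.

```python
def format_retrieved_context(source: str, relevance: float, excerpt: str) -> str:
    """Render one retrieved chunk in the injection format shown above."""
    return "\n".join([
        "[KNOWLEDGE BASE — retrieved context]",
        f"Source: {source}",
        f"Relevance: {relevance:.2f}",
        excerpt.strip(),
        "[END RETRIEVED CONTEXT]",
    ])
```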
 ---
 ## 8. Enriched Tool Behaviours Summary
 ### 8.1 `run_diagnostic` — full enriched behaviour
 *Base spec §2.4: "Guided — server runs diagnostic sequences, presents findings and suggested fixes; human decides action"*
 The enriched implementation:
 1. **Technique selection** (§6.2 of this document): determines CoT vs ToT based on symptom scope
 2. **RAG retrieval**: searches knowledge base for matching KBA/SOP; injects if found
 3. **Diagnostic execution**: runs the tool sequence for the matched scenario (§4.4 / §4.5)
 4. **SCoT phases 1–5** (DataAnalystX) applied to any log query steps
 5. **Output**: structured findings block with hypothesis tree (ToT) or linear chain (CoT), ranked recommendations, and explicit human decision point
 Every `run_diagnostic` response ends with:
 ```
 [NEXT STEPS — your decision required]
 Option 1: [specific action] — run: [exact command/tool call]
 Option 2: [alternative action] — run: [exact command/tool call]
 Option 3: Escalate — generate SOP: /generate_sop [service]
 ```
 ### 8.2 New Tool: `suggest_fix` *(additive)*
 A read-only tool that generates a fix recommendation without executing anything.
 | Field | Value |
 |---|---|
 | Name | `suggest_fix` |
 | Parameters | `symptom` (free text), `service?`, `node?` |
 | Auth required | Read — free |
 | Returns | CoT or ToT reasoning chain + ranked fix options with verification and rollback commands |
 | Reasoning | Applies the full §6.2 technique selection and the matching §7 technique; retrieves RAG context if available |
 ### 8.3 New Tool: `generate_sop` *(additive)*
 Creates a structured SOP or KBA document from a resolved incident or playbook execution.
 | Field | Value |
 |---|---|
 | Name | `generate_sop` |
 | Parameters | `title`, `service`, `type` (sop \| kba \| runbook), `session_id?` |
 | Auth required | Write (creates file) |
 | Returns | Drafted document in `documentation/SOPs/` or `documentation/KBAs/` per `type`; follows existing naming convention (e.g. `KBA-002-...`) |
 | Template | Frank Technical Writer persona applies ITIL-adjacent structure: Symptom → Cause → Resolution → Verification → Prevention |
 The draft is written to the filesystem via `write_file` and is **not** automatically committed. The operator reviews and commits manually, consistent with the base spec's human-in-the-loop principle.
 ### 8.4 New Tool: `search_runbooks` *(additive)*
 Exposes the RAG knowledge base retrieval as a direct LLM-callable tool.
 | Field | Value |
 |---|---|
 | Name | `search_runbooks` |
 | Parameters | `query` (natural language), `type?` (sop \| kba \| runbook \| playbook), `limit?` (default 3) |
 | Auth required | Read — free |
 | Returns | Top-N matching chunks with source file, heading path, relevance score, and excerpt |
 ---
 ## 9. Frank Command Registry
 All commands available when the three specialties are loaded. Commands marked (core) are always available regardless of specialty configuration.
 | Command | Persona | Homelab implementation | Underlying tools |
 |---|---|---|---|
 | `/quickstart <goal>` | Project Manager (core) | Deploy a new stack or playbook from one sentence | `draft_playbook` or `compose_up` pre-flight |
 | `/create` | Information Architect (core) | Guided Compose file or playbook creation wizard | `draft_playbook`, `write_file` |
 | `/review <path>` | QA Analyst (core) | Audit a Compose file or playbook for correctness, security, idempotency | `get_playbook`, `read_file` |
 | `/refactor <path>` | Lead Technical Editor (core) | Restructure a Compose file or playbook; apply best practices | `get_playbook`, `draft_playbook` |
 | `/document <service>` | Technical Writer (core) | Generate a README or operational guide for a service | `get_stack_status`, `search_runbooks`, `write_file` |
 | `/communicate <audience> <channel> <subject>` | Stakeholder Comms Lead (core) | Recast incident report or change summary for non-technical audience | `get_audit_log` |
 | `/consult <question>` | Senior Business Analyst (core) | Strategic infrastructure decision support | `get_topology_full`, `get_resource_usage` |
 | `/help` | Frank.core (core) | List available commands and active specialties | Session info only |
 | `/docker <symptom>` | DevOps SRE (Docker) | Full Docker/Compose troubleshooting workflow (§4.2–§4.4) | `get_container_logs`, `get_stack_status`, `check_port`, `probe_http`, `run_diagnostic` |
 | `/ansible <symptom>` | DevOps SRE (Ansible) | Full Ansible diagnostic workflow (§4.3–§4.5) | `list_playbooks`, `dry_run_playbook`, `get_service_status`, `run_diagnostic` |
 | `/compose <path>` | Container Platform Architect | Validate and optimise a Compose file; pre-flight check | `compose_up` pre-flight, `read_file` |
 | `/traefik <symptom>` | DevOps SRE (Docker) | Traefik routing, middleware, and TLS diagnosis | `probe_http`, `check_port`, `dns_lookup`, `get_container_logs` |
 | `/analyze <intent>` | DataAnalystX | Log analysis + resource trending with SCoT (§5.1) | `analyze_logs`, `query_logs`, `get_resource_usage` |
 | `/query <logql>` | DataAnalystX | Direct LogQL query with structured output | `query_logs` |
 | `/optimize <prompt>` | Senior Prompt Engineer | Evaluate and improve a system prompt or instructions file | Internal only — no MCP tools |
 | `/craft` | Senior Prompt Engineer | Build a new prompt/specialty using C.R.A.F.T. guided workflow | `write_file` (optional) |
 | `/reason <problem>` | Senior Prompt Engineer | Select and apply CoT/ToT/RAG to a stated problem | `suggest_fix`, `search_runbooks` |
 | `/generate_sop <service>` | Technical Writer | Draft SOP or KBA from current session context | `generate_sop` |
 | `/search_runbooks <query>` | Technical Writer | Search knowledge base for existing documentation | `search_runbooks` |
 ---
 ## 10. Configuration Additions
 The following block is appended to `config.yaml` (base spec §5.2). All keys are optional unless marked required.
 ```yaml
 frank:
  # Personality
  personality:
    tone: "operational"           # operational | mentoring | verbose
    incident_mode_threshold: 1    # switch to incident tone if N+ write approvals active
  # Specialties — always-on list; remove to disable a specialty
  specialties:
    - devops
    - data-analysis
    - prompt-engineering
  # Reasoning technique configuration
  reasoning:
    cot_enabled: true
    tot_enabled: true
    tot_trigger: "multi_service"  # multi_service | always | never
    tot_max_hypotheses: 3
    rag_enabled: true
  # Knowledge base for RAG retrieval
  knowledge_base:
    paths:
      - /data/homelab/documentation/SOPs
      - /data/homelab/documentation/KBAs
      - /data/homelab/documentation/TECHNICAL_RUNBOOK.md
      - /data/homelab/ansible/playbooks
    chunk_at: "heading"           # heading | paragraph | fixed_tokens
    refresh_on_startup: true
    index_path: /data/frank-kb-index
  # DataAnalystX resource thresholds
  analysis:
    thresholds:
      cpu_warning: 75             # percent
      cpu_critical: 90
      memory_warning: 80
      memory_critical: 95
      disk_warning: 75
      disk_critical: 90
      swap_warning: 20
      swap_critical: 50
 ```
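 As an illustration of how the `analysis.thresholds` keys might be consumed, the sketch below maps a percentage to a severity label. The `classify` function is hypothetical, and treating a value equal to the threshold as having crossed it is an assumption the config does not state.

```python
# Mirrors the default values of analysis.thresholds in the YAML above.
DEFAULT_THRESHOLDS = {
    "cpu_warning": 75, "cpu_critical": 90,
    "memory_warning": 80, "memory_critical": 95,
    "disk_warning": 75, "disk_critical": 90,
    "swap_warning": 20, "swap_critical": 50,
}


def classify(metric: str, percent: float,
             thresholds: dict = DEFAULT_THRESHOLDS) -> str:
    """Return 'ok', 'warning' or 'critical' for a usage percentage.

    Assumption: the comparison is inclusive (75.0 counts as a warning).
    """
    if percent >= thresholds[f"{metric}_critical"]:
        return "critical"
    if percent >= thresholds[f"{metric}_warning"]:
        return "warning"
    return "ok"
```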
 ### 10.1 Container Volume Additions
 The following volume mounts are added to `docker-compose.yml` to support Frank's knowledge base and index:
 ```yaml
 volumes:
  # Existing (base spec)
  - ./config:/config:ro
  - /path/to/homelab:/data/homelab:ro    # homelab repo, read-only
  # New
  - frank-kb-index:/data/frank-kb-index  # persistent RAG index volume
 ```
 ---
 ## 11. Open Questions
 Carries forward all items from base spec §8, plus Frank-specific additions.
 ### 11.1 Carried Forward (base spec §8)
 - Topology refresh interval (default 60 min — confirm for lab cadence)
 - SSH key mount path vs agent forwarding
 - Web UI authentication (API key as query param vs separate credential)
 - Rate limiting on public IP exposure
 - Multi-user session approval state isolation
 - Backup/restore of topology cache across container restarts
 ### 11.2 Frank-Specific Deferred Decisions
 | Item | Question | Default assumed |
 |---|---|---|
 | **Specialty activation** | Config-driven is specified; confirm intent-detection is not wanted even as an optional mode | Config-driven only |
 | **RAG index persistence** | Should the KB index survive container restarts, or be rebuilt on each start? Rebuilding is simpler; persisting is faster at scale | Rebuild on start (configurable) |
 | **RAG knowledge base scope** | Should Ansible playbooks be indexed for "how did I deploy X before?" queries? Adds significant retrieval power but requires careful chunking of YAML | Included in default paths |
 | **Tone switching granularity** | Frank switches tone modes (§2.2) based on context signals. Should the operator be able to force a tone via a command or config flag? | Not forced; auto-switched |
 | **`generate_sop` auto-commit** | Should the generated SOP be auto-committed to a review branch (like `draft_playbook`), or left uncommitted for manual review? | Uncommitted; manual review |
 | **ToT confidence scoring** | Confidence scores on hypotheses are LLM-generated (not computed). Should the UI display a caveat that scores are indicative, not statistical? | Yes — caveat in UI and inline |
 | **Web UI Frank panel** | The base spec §7 admin UI has four panels. Should a fifth panel show "Frank session context" (active specialties, technique in use, current persona)? | Recommended addition |
 ---
 *Homelab MCP Server — Frank v6 Integration Addendum v1.0 · May 2026*  
 *Read in conjunction with [homelab-mcp-spec.txt](homelab-mcp-spec.txt)*
--- a/documentation/homelab-mcp-spec.txt
+++ b/documentation/homelab-mcp-spec.txt
@ -0,0 +1,479 @@
 Homelab MCP Server
 Full Design Specification
 Version 1.0  ·  May 2026
 SCOPE
 	A Python MCP server that acts as the LLM's point of entry into a self-hosted homelab — providing infrastructure management, observability, network awareness, and internet access, with a semi-automatic approval model and full audit trail.
 1. Overview
 This document captures all design decisions for a Model Context Protocol (MCP) server purpose-built for homelab management. It is the authoritative spec from which implementation begins. Every decision recorded here was made explicitly during the design session.
 The server exposes five capability domains to an LLM client:
 * Shell & local environment access
 * Infrastructure deployment (Ansible + Docker)
 * Observability & guided troubleshooting
 * Network topology awareness
 * Internet access with audit logging
 A semi-automatic approval gate governs all write operations. Reads are unrestricted. The server is containerised, reverse-proxied, and ships its audit stream to the same Loki instance used for homelab logs.
 2. Decisions summary
 Complete record of every decision made during the design session.
 2.1 Foundation
 Decision	Value
 Transport	Both stdio (Claude Desktop) and SSE/HTTP (networked daemon)
 Hosting	Docker container — always-on, volume-mounted
 Access methods	Local LAN and public IP (behind reverse proxy)
 Language	Python
 2.2 Security
 Decision	Value
 Authentication	Static API key — bearer token in Authorization header
 TLS	Terminated at reverse proxy (Nginx / Caddy / Traefik) — server binds plaintext internally
 Approval granularity	Per session — approve writes once; free for the remainder of that session
 Approval channel	Inline in chat — Claude surfaces pending action, user replies yes/no
 2.3 Infrastructure
 Decision	Value
 Ansible source	Git repository — server clones/pulls on demand
 Ansible connection	SSH key (mounted into container)
 Playbook write policy	LLM may draft new playbooks; drafts go to review queue; never auto-executed
 Review queue mechanism	Git PR / branch — draft pushed to review branch, promoted by human merge
 Docker Compose scope	All nodes — local and remote via SSH + Docker context
 Hard limits	Always dry-run Ansible before live execution; never auto-run unreviewed playbooks
 2.4 Observability
 Decision	Value
 Log source	Loki (central aggregator) — single query interface
 Remote log method	All nodes ship to Loki; server queries Loki API only
 Health metrics	systemd service status, Docker container state, CPU/memory/disk, open ports, ICMP ping, HTTP endpoint probes, SMART disk health
 Troubleshoot mode	Guided — server runs diagnostic sequences, presents findings and suggested fixes; human decides action
 2.5 Network topology
 Decision	Value
 Source of truth	Static YAML base enriched by Ansible fact-gathering on refresh
 Refresh schedule	Scheduled interval (configurable, default 1 hour)
 Fields captured	Node IPs, subnets/VLANs, services per node, node roles, OS & hardware, open ports, service dependencies
 Session injection	Summary injected into every session context automatically; full topology available on demand
 2.6 Internet access
 Decision	Value
 Permitted uses	Package/image version checks, documentation fetching, CVE lookups, outbound webhooks, Git pulls
 Restriction policy	Open but logged — any URL permitted; every outbound request recorded to audit log
 Built-in integrations	GitHub/Gitea, Docker Hub/GHCR, Cloudflare (DNS + tunnel), Ntfy (push notifications)
 2.7 Server operations
 Decision	Value
 Audit log destination	Shipped to Loki — same instance as homelab logs, labelled job=mcp-audit
 Config source	Env vars for secrets; YAML config file (volume-mounted) for everything else
 Admin interface	Lightweight web UI — session log, pending approvals, server status
 Playbook review queue	Git PR — LLM pushes draft to a review branch; promotion = merging the PR
 3. Tool catalogue
 All tools exposed to the LLM via MCP. Read-only tools execute freely. Write tools require active session approval before first use.
 3.1 Shell & environment
 Tool name	Parameters	Auth required
 run_command	cmd, node?, timeout?	Write (executes)
 read_file	path, node?	Read — free
 write_file	path, content, node?	Write (modifies FS)
 list_directory	path, node?	Read — free
 get_env	key?, node?	Read — free
 3.2 Infrastructure — Ansible
 Tool name	Parameters	Auth required
 list_playbooks	—	Read — free
 get_playbook	name	Read — free
 dry_run_playbook	name, inventory?, extra_vars?	Read — free (check mode)
 run_playbook	name, inventory?, extra_vars?	Write (deploys)
 draft_playbook	name, content	Write (creates PR)
 list_inventory	—	Read — free
 refresh_inventory	—	Write (pulls git)
 3.3 Infrastructure — Docker
 Tool name	Parameters	Auth required
 list_stacks	node?	Read — free
 get_stack_status	stack, node?	Read — free
 compose_up	stack, node?	Write (deploys)
 compose_down	stack, node?	Write (stops)
 compose_pull	stack, node?	Write (pulls images)
 get_container_logs	container, node?, lines?	Read — free
 list_images	node?	Read — free
 3.4 Observability
 Tool name	Parameters	Auth required
 query_logs	logql, start?, end?, limit?	Read — free
 get_service_status	service, node?	Read — free
 get_resource_usage	node?	Read — free
 check_port	host, port	Read — free
 ping_host	host, count?	Read — free
 probe_http	url, expected_status?	Read — free
 get_smart_health	device, node?	Read — free
 run_diagnostic	service, node?	Read — free
 3.5 Network topology
 Tool name	Parameters	Auth required
 get_topology_summary	—	Read — free (always in context)
 get_topology_full	—	Read — free
 find_service	name	Read — free
 find_node	name_or_ip	Read — free
 list_nodes	role?	Read — free
 refresh_topology	—	Write (runs Ansible facts)
 3.6 Internet
 Tool name	Parameters	Auth required
 http_fetch	url, method?, headers?, body?	Read — logged
 dns_lookup	hostname, type?	Read — logged
 check_package_version	package, ecosystem	Read — logged
 cve_lookup	cve_id or package	Read — logged
 git_pull	repo_url, path?	Write (clones/pulls)
 send_notification	message, title?, priority?	Write (Ntfy)
 cloudflare_dns	action, zone, record?	Write (DNS change)
 3.7 Session & approval
 Tool name	Parameters	Auth required
 get_session_info	—	Read — free
 approve_writes	—	Activates write approval for session
 revoke_writes	—	Revokes write approval for session
 get_audit_log	limit?, since?	Read — free
 list_pending_playbooks	—	Read — free
 promote_playbook	name	Write (merges PR)
 4. Architecture
 4.1 Container layout
 The server runs as a single Docker container on the control node. Volume mounts provide access to SSH keys, the YAML config, and the topology file. All secrets are injected as environment variables; the container itself contains no credentials at build time.
 mcp-server/
  Dockerfile
  docker-compose.yml
  config/
    config.yaml          # non-secret configuration
    topology.yaml        # static topology base
  src/
    server.py            # MCP entry point (stdio + SSE)
    tools/               # one module per domain
    approval.py          # session approval gate
    topology.py          # topology loader + refresher
    audit.py             # Loki shipper
    integrations/        # GitHub, Dockerhub, Cloudflare, Ntfy
 4.2 Transport modes
 The server supports both transport modes simultaneously:
 * stdio — invoked directly by Claude Desktop via the MCP stdio protocol. The process is spawned per session. Suitable for local desktop use.
 * SSE/HTTP — the server also binds an HTTP listener (default port 8765) serving the MCP SSE transport. The reverse proxy terminates TLS and forwards to this port. Any MCP-compatible client can connect with a valid API key.
 4.3 Approval flow
 Write operations follow a strict gate:
 1. LLM calls a write tool.
 2. Server checks session approval state. If not approved, returns a structured pending response to the LLM.
 3. LLM surfaces the pending action inline in chat with a clear description of what will be executed.
 4. User replies yes/no. On yes, the approve_writes tool is called, unlocking writes for the session.
 5. The original write tool is retried and executed. All subsequent writes in this session are free.
 6. Every action (approved or denied) is written to the audit log and shipped to Loki.
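 The six steps above can be sketched as a small session gate. This is an illustration under assumptions, not the contents of approval.py: the Session class, the call_write_tool helper and the response dictionaries are hypothetical, and the audit list stands in for the Loki shipper.

```python
from dataclasses import dataclass, field


@dataclass
class Session:
    session_id: str
    writes_approved: bool = False
    audit: list = field(default_factory=list)  # stand-in for the Loki audit stream


def call_write_tool(session: Session, tool: str, run):
    """Steps 1-2 and 5: execute if approved, else return a pending response."""
    if not session.writes_approved:
        session.audit.append((tool, "pending"))
        return {"status": "pending_approval", "tool": tool}
    session.audit.append((tool, "executed"))
    return {"status": "ok", "result": run()}


def approve_writes(session: Session) -> None:
    """Step 4: the user said yes, so writes are unlocked for this session."""
    session.writes_approved = True
    session.audit.append(("approve_writes", "approved"))
```

 After a pending response the LLM retries the same tool call once approve_writes has run; later write tools in the same session pass the gate without a further prompt.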
 4.4 Topology lifecycle
 On startup the server loads topology.yaml as the static base. A background scheduler (default: every 60 minutes) runs an Ansible facts gather against the inventory and merges the result into the live topology object. The refresh interval is configurable in config.yaml.
 At session start a compact topology summary (node count, subnet list, roles, service index) is prepended to the system prompt. The full topology is available via get_topology_full at any time.
 4.5 Playbook lifecycle
 When the LLM drafts a new playbook:
 * The draft is written to a local staging area inside the container.
 * The server commits it to a dedicated review branch on the configured Git remote and opens a pull request.
 * The draft is listed in list_pending_playbooks but is blocked from execution.
 * When the human merges the PR, the playbook enters the live playbook library.
 * promote_playbook can also be called directly to fast-track a merge from within the chat.
 HARD LIMIT
 	run_playbook will reject any playbook whose name does not exist in the merged main branch. Draft-branch playbooks are permanently blocked from execution regardless of session approval state.
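 A minimal sketch of that check, assuming the server can list the playbook names present on the merged main branch. The can_run_playbook function and its return shape are hypothetical; the point is that the merged-branch test runs before, and independently of, the session approval test.

```python
def can_run_playbook(name: str, merged_playbooks: set[str],
                     writes_approved: bool) -> tuple[bool, str]:
    """Decide whether run_playbook may execute `name`.

    The hard limit is evaluated first, so an approved session still
    cannot run a playbook that only exists on a draft branch.
    """
    if name not in merged_playbooks:
        return False, "playbook is not in the merged main branch"
    if not writes_approved:
        return False, "session writes are not approved"
    return True, "ok"
```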
 5. Configuration reference
 5.1 Environment variables (secrets)
 Variable	Required?	Purpose
 MCP_API_KEY	Required	Bearer token for network clients
 LOKI_URL	Required	Loki push/query base URL
 LOKI_USER / LOKI_PASSWORD	Optional	Basic auth if Loki requires it
 GIT_TOKEN	Required	Token for Ansible repo + PR creation
 CLOUDFLARE_API_TOKEN	Optional	Cloudflare integration
 NTFY_URL / NTFY_TOKEN	Optional	Ntfy push notification endpoint
 DOCKERHUB_TOKEN	Optional	Authenticated image version checks
 5.2 config.yaml structure
 server:
  http_port: 8765
  log_level: info
 topology:
  static_file: /config/topology.yaml
  refresh_interval_minutes: 60
 ansible:
  repo_url: git@github.com:you/homelab-playbooks.git
  local_path: /data/playbooks
  ssh_key_path: /secrets/id_ed25519
  review_branch_prefix: mcp-draft/
 docker:
  compose_dir: /data/stacks
  remote_contexts:   # name: ssh://user@host
    node2: ssh://admin@192.168.1.12
    node3: ssh://admin@192.168.1.13
 loki:
  audit_labels:
    job: mcp-audit
    env: homelab
 integrations:
  github_base_url: https://api.github.com   # or your Gitea URL
  dockerhub_registry: https://hub.docker.com
  cloudflare_zone_id: ~   # optional
 6. Security model
 6.1 Network perimeter
 * TLS is terminated at the reverse proxy. The server never handles raw TLS.
 * The server should bind to 127.0.0.1 or a Docker internal network only — the proxy is the sole external listener.
 * LAN clients reach the server through the proxy on the internal network.
 * Public IP access is also routed through the proxy; the same API key is required regardless of source.
 6.2 Authentication
 * All SSE/HTTP connections must supply Authorization: Bearer <MCP_API_KEY>.
 * stdio connections are inherently local and bypass key auth.
 * Invalid or missing keys return 401 immediately with no information leakage.
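 One way to implement the header check, shown as a sketch. The is_authorized function is a hypothetical helper; it uses the standard library's constant-time comparison so the check does not leak how much of the key matched, and the caller is assumed to answer 401 with an empty body whenever it returns False.

```python
import hmac


def is_authorized(authorization_header, api_key: str) -> bool:
    """Validate an 'Authorization: Bearer <MCP_API_KEY>' header value."""
    if not authorization_header or not api_key:
        return False  # missing header, or server started without a key
    scheme, _, token = authorization_header.partition(" ")
    if scheme.lower() != "bearer":
        return False
    # compare_digest takes the same time whether or not the prefix matches.
    return hmac.compare_digest(token.strip().encode(), api_key.encode())
```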
 6.3 Execution safety
 * Ansible always runs in check mode (dry-run) first. Live execution requires an explicit second call.
 * New playbooks are permanently blocked from execution until merged into main via PR.
 * Session approval unlocks writes for the lifetime of one session only. A new connection or server restart resets state.
 * All outbound HTTP requests are logged to Loki with URL, method, response code, and timestamp.
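 For reference, an audit entry for an outbound request could be wrapped for Loki's JSON push endpoint (/loki/api/v1/push) roughly as follows. The loki_audit_payload function and the entry fields are assumptions; the labels come from the audit_labels block in section 5.2, and the payload shape (a stream label set plus pairs of nanosecond timestamp and log line) is the one Loki's push API accepts.

```python
import json
import time


def loki_audit_payload(entry: dict, labels=None, ts_ns=None) -> dict:
    """Build a Loki push payload carrying one audit entry as a JSON log line."""
    labels = labels or {"job": "mcp-audit", "env": "homelab"}
    ts_ns = time.time_ns() if ts_ns is None else ts_ns
    line = json.dumps(entry, sort_keys=True)
    # Loki expects the timestamp as a string of nanoseconds since the epoch.
    return {"streams": [{"stream": labels, "values": [[str(ts_ns), line]]}]}
```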
 7. Admin web UI
 A lightweight read-only web UI is served by the MCP server on the same HTTP port under /ui. It requires the same API key (passed as a query param or cookie).
 Panels:
 * Live session log — active sessions, approval state, tool call history
 * Pending playbooks — list of draft PRs awaiting review, with diff viewer
 * Audit stream — recent Loki audit entries, filterable by session and tool
 * Server health — uptime, topology last-refreshed timestamp, integration connectivity
 NOTE
 	The web UI is read-only and observational. All actions (promote, approve, run) happen through the MCP protocol in chat, never through the UI directly.
 8. Open questions & deferred decisions
 Items not resolved in this session — to be decided before or during implementation.
 * Topology refresh interval — default set to 60 minutes; confirm this is appropriate for your lab cadence.
 * SSH connection for Ansible — SSH key mount vs agent forwarding not finalised; key mount assumed, confirm path.
 * Web UI authentication — API key as query param is convenient but less secure; consider a separate UI credential.
 * Rate limiting — no rate limit on tool calls specified; consider adding one for public IP exposure.
 * Multi-user — spec assumes a single operator. If multiple users will share the server, session approval state needs to be per-user, not per-connection.
 * Backup / restore — no decision on persisting the topology cache or session state across container restarts.
 Homelab MCP Server — Design Specification v1.0  ·  Generated May 2026
--- a/mcp_root/GeminiConversation.md
+++ b/mcp_root/GeminiConversation.md
@ -0,0 +1,130 @@
 This is a textbook "Infrastructure as Code" play. What you're describing is essentially an MCP Gateway (Main Server) orchestrating a Microservices Mesh (Shards).
 By sharding the architecture, you gain isolation (if the Authentik API is down, you can still read Dozzle logs) and scalability (each shard can run in its own container with specific permissions).
 🏗️ The "Homelab-MCP" Sharded Architecture
 I think the best way to handle this is to have a Main Gateway Server that acts as the "Project Coordinator" (just like I do!) and routes requests to specialized Service Shards.
 1. The Shard Breakdown
 We can define three primary shards based on our discussion:
 Shard A: The "Navigator" (Traefik)
 Focus: Network routing and entrypoints.
 Tools: get_routes, add_dynamic_config, check_ssl_status.
 Permissions: Needs write access to your /dynamic YAML folder.
 Shard B: The "Watchman" (Dozzle)
 Focus: Observability and debugging.
 Tools: tail_logs, search_logs, get_container_stats.
 Permissions: Needs network access to the Dozzle API port.
 Shard C: The "Gatekeeper" (Authentik)
 Focus: Identity and Access Management.
 Tools: sync_brand, create_application, manage_flows.
 Permissions: Needs an Authentik API Token.
 🛠️ The Implementation Strategy
 Instead of building one massive, brittle script, we use a Shared Tool Definition approach.
 The Main "Gateway" Server
 The main server doesn't need to know how to talk to Traefik; it only needs to know which shard owns the tool. When you say, "Frank, check why Gitea is 404ing," the Gateway:
 Calls Shard A (Traefik) to see if the route exists.
 Calls Shard B (Dozzle) to see the container logs.
 Synthesizes the answer for you.
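 A rough sketch of that ownership lookup, under stated assumptions: the tool names and shard URLs are the ones listed in this conversation, while TOOL_OWNERS, SHARD_URLS and resolve_shard are hypothetical names, and the set of online shards is assumed to come from some health check.

```python
# Which shard owns which tool (tool names from the shard breakdown above).
TOOL_OWNERS = {
    "get_routes": "traefik",
    "add_dynamic_config": "traefik",
    "check_ssl_status": "traefik",
    "tail_logs": "dozzle",
    "search_logs": "dozzle",
    "get_container_stats": "dozzle",
    "sync_brand": "authentik",
    "create_application": "authentik",
    "manage_flows": "authentik",
}

SHARD_URLS = {
    "traefik": "http://mcp-traefik:3001",
    "dozzle": "http://mcp-dozzle:3002",
    "authentik": "http://mcp-authentik:3003",
}


def resolve_shard(tool: str, online: set[str]) -> str:
    """Return the base URL of the shard that owns `tool`."""
    shard = TOOL_OWNERS.get(tool)
    if shard is None:
        raise KeyError(f"no shard owns tool {tool!r}")
    if shard not in online:
        raise RuntimeError(f"shard {shard!r} is offline")
    return SHARD_URLS[shard]
```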
 The Docker Setup
 We can wrap this into a single docker-compose.mcp.yaml on waldorf:
 services:
  mcp-gateway:
    image: node:20-slim
    volumes:
      - ./gateway:/app
    environment:
      - SHARD_TRAEFIK_URL=http://mcp-traefik:3001
      - SHARD_DOZZLE_URL=http://mcp-dozzle:3002
      - SHARD_AUTHENTIK_URL=http://mcp-authentik:3003
  mcp-traefik:
    image: python:3.11-slim
    volumes:
      - /mnt/appdata/traefik/dynamic:/dynamic # For writing routes
    environment:
      - TRAEFIK_API=http://traefik:8080/api
  mcp-dozzle:
    # Lightweight Node.js shard
    environment:
      - DOZZLE_API=http://dozzle:8080/api
 🔍 Frank’s Logic: Why this wins
 Security (Principle of Least Privilege): Only the Traefik shard gets access to your config files. The Dozzle shard only sees logs.
 Decoupled Development: We can build the Dozzle shard today, and add the Authentik shard next week without breaking the gateway.
 Natural Conversation: I (the LLM) don't have to worry about which API to hit. I just call tail_logs and the Gateway handles the routing.
 ⚠️ Potential Concern: Latency
 In a homelab, the overhead of the Gateway talking to a Shard is negligible (milliseconds), but we need to ensure the Gateway has a Discovery Mechanism so it knows which tools are currently online.
 ✅ Next Step
 Since we already have your Traefik and Dozzle configs, do you want me to draft the first Shard (The Navigator) code? We can start by building the tool that reads your Traefik API and tells me if your Gitea/Pinchflat routes are "OK."
 Should we start with the Traefik Shard or the Dozzle Shard for the first "active" tool?
 For your "Homelab-MCP" sharded architecture, the choice of node depends on the specialty of the shard and the risk level of the tools it contains.
 Since you have Waldorf (16GB RAM, Dedicated GPU) as your primary workhorse and presumably other nodes (like a NAS or smaller Pis), here is the recommended distribution for your sharded setup:
 1. The Gateway Shard (The Router)
 Target Node: Waldorf (or your most reliable 24/7 node).
 Reasoning: The Gateway is the entry point for your LLM. It needs to be on your most stable node with the best network connectivity to the other shards. Since it's lightweight (routing logic), it won't interfere with Plex/Tunarr.
 Deployment: Run this as a Docker container.
 2. The "Navigator" Shard (Traefik)
 Target Node: The Traefik Host.
 Reasoning: This shard needs direct filesystem access to the /dynamic configuration folder to write YAML routes.
 SRE Logic: Deploy this shard on whichever node currently hosts your Traefik container. By running the shard on the same host, you can use a simple Bind Mount (e.g., - /mnt/appdata/traefik/dynamic:/dynamic) instead of trying to manage cross-node network file shares (which, as we saw with Pinchflat, can cause locking issues).
 3. The "Watchman" Shard (Dozzle)
 Target Node: Anywhere (but Waldorf is best).
 Reasoning: Dozzle gathers logs via the Docker socket. If you use a Docker Socket Proxy (which your traefik.yml suggests you do), this shard can run on any node as long as it can reach that proxy's IP.
 SRE Logic: Running it on Waldorf is ideal because it keeps the "Observability" tools close to the GPU/Plex logs it will likely be analyzing.
 4. The "Gatekeeper" Shard (Authentik)
 Target Node: Waldorf.
 Reasoning: Authentik is an API-driven service. This shard doesn't need local files; it just needs a stable network path to your Authentik API. Waldorf’s 16GB RAM handles the API overhead comfortably.
 🚀 Summary Deployment Table
 Shard	Best Node	Critical Resource
 Gateway	Waldorf	Network Stability
 Traefik	Same host as Traefik	Local File Access (/dynamic)
 Dozzle	Waldorf	Network access to Socket Proxy
 Authentik	Waldorf	API Connectivity
 ⚠️ A Note on "The Docker Socket"
 Since you are a security-conscious SRE, never mount /var/run/docker.sock directly into an MCP shard. If I (the LLM) am compromised, an attacker could take over your host.
 Always use a Docker Socket Proxy (like the one in your Traefik config: tcp://docker-socket-proxy:2375).
 Configure the proxy to only allow GET requests for logs/containers, and block DELETE or POST requests.
 Which node is currently hosting your Traefik instance? If it's Waldorf, we can consolidate almost everything there for maximum performance.