homelab/documentation/homelab-mcp-spec.txt
Nathan 8ec4b7da8f Add Frank v6 Integration Addendum and Base MCP Server Specification
- Introduced the "Homelab MCP Server — Frank v6 Integration Addendum" document detailing the enhancements brought by Frank v6, including personality layers, specialty domains, and reasoning techniques.
- Updated the base MCP server specification to include comprehensive design decisions, tool catalog, architecture, and security model.
- Added new tools and enriched behaviors for DevOps, Data Analysis, and Prompt Engineering specialties.
- Established a structured command registry for easy access to functionalities across specialties.
- Documented configuration additions for Frank's personality, specialties, reasoning techniques, and knowledge base integration.
2026-05-12 23:02:59 -04:00


Homelab MCP Server
Full Design Specification
Version 1.0 · May 2026
SCOPE
A Python MCP server that acts as the LLM's point of entry into a self-hosted homelab — providing infrastructure management, observability, network awareness, and internet access, with a semi-automatic approval model and full audit trail.
1. Overview
This document captures all design decisions for a Model Context Protocol (MCP) server purpose-built for homelab management. It is the authoritative spec from which implementation begins. Every decision recorded here was made explicitly during the design session.
The server exposes five capability domains to an LLM client:
* Shell & local environment access
* Infrastructure deployment (Ansible + Docker)
* Observability & guided troubleshooting
* Network topology awareness
* Internet access with audit logging
A semi-automatic approval gate governs all write operations. Reads are unrestricted. The server is containerised, reverse-proxied, and ships its audit stream to the same Loki instance used for homelab logs.
2. Decisions summary
Complete record of every decision made during the design session.
2.1 Foundation
Decision        Value
Transport       Both stdio (Claude Desktop) and SSE/HTTP (networked daemon)
Hosting         Docker container; always-on, volume-mounted
Access methods  Local LAN and public IP (behind reverse proxy)
Language        Python
2.2 Security
Decision              Value
Authentication        Static API key (bearer token in Authorization header)
TLS                   Terminated at reverse proxy (Nginx / Caddy / Traefik); server binds plaintext internally
Approval granularity  Per session: approve writes once; free for the remainder of that session
Approval channel      Inline in chat: Claude surfaces pending action, user replies yes/no
2.3 Infrastructure
Decision                Value
Ansible source          Git repository; server clones/pulls on demand
Ansible connection      SSH key (mounted into container)
Playbook write policy   LLM may draft new playbooks; drafts go to review queue; never auto-executed
Review queue mechanism  Git PR / branch: draft pushed to review branch, promoted by human merge
Docker Compose scope    All nodes, local and remote, via SSH + Docker context
Hard limits             Always dry-run Ansible before live execution; never auto-run unreviewed playbooks
2.4 Observability
Decision           Value
Log source         Loki (central aggregator); single query interface
Remote log method  All nodes ship to Loki; server queries Loki API only
Health metrics     systemd service status, Docker container state, CPU/memory/disk, open ports, ICMP ping, HTTP endpoint probes, SMART disk health
Troubleshoot mode  Guided: server runs diagnostic sequences, presents findings and suggested fixes; human decides action
2.5 Network topology
Decision           Value
Source of truth    Static YAML base enriched by Ansible fact-gathering on refresh
Refresh schedule   Scheduled interval (configurable, default 1 hour)
Fields captured    Node IPs, subnets/VLANs, services per node, node roles, OS & hardware, open ports, service dependencies
Session injection  Summary injected into every session context automatically; full topology available on demand
2.6 Internet access
Decision               Value
Permitted uses         Package/image version checks, documentation fetching, CVE lookups, outbound webhooks, Git pulls
Restriction policy     Open but logged: any URL permitted; every outbound request recorded to audit log
Built-in integrations  GitHub/Gitea, Docker Hub/GHCR, Cloudflare (DNS + tunnel), Ntfy (push notifications)
2.7 Server operations
Decision               Value
Audit log destination  Shipped to Loki; same instance as homelab logs, labelled job=mcp-audit
Config source          Env vars for secrets; YAML config file (volume-mounted) for everything else
Admin interface        Lightweight web UI: session log, pending approvals, server status
Playbook review queue  Git PR: LLM pushes draft to a review branch; promotion = merging the PR
3. Tool catalogue
All tools exposed to the LLM via MCP. Read-only tools execute freely. Write tools require active session approval before first use.
3.1 Shell & environment
Tool name       Parameters            Auth required
run_command     cmd, node?, timeout?  Write (executes)
read_file       path, node?           Read (free)
write_file      path, content, node?  Write (modifies FS)
list_directory  path, node?           Read (free)
get_env         key?, node?           Read (free)
3.2 Infrastructure — Ansible
Tool name          Parameters                     Auth required
list_playbooks     (none)                         Read (free)
get_playbook       name                           Read (free)
dry_run_playbook   name, inventory?, extra_vars?  Read (free; check mode)
run_playbook       name, inventory?, extra_vars?  Write (deploys)
draft_playbook     name, content                  Write (creates PR)
list_inventory     (none)                         Read (free)
refresh_inventory  (none)                         Write (pulls git)
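dry_run_playbook and run_playbook take the same parameters, so one way to keep them consistent is a shared command builder that defaults to check mode. The sketch below is an assumption about how that could look, not the implementation: the function name playbook_command and the choice to pass extra_vars as a JSON string are illustrative. The ansible-playbook flags used (-i, --extra-vars, --check) are standard CLI options.

```python
import json


def playbook_command(name, inventory=None, extra_vars=None, check=True) -> list:
    """Build an ansible-playbook argv; check mode is on unless explicitly disabled."""
    argv = ["ansible-playbook", name]
    if inventory:
        argv += ["-i", inventory]
    if extra_vars:
        # --extra-vars accepts a JSON document as well as key=value pairs.
        argv += ["--extra-vars", json.dumps(extra_vars, sort_keys=True)]
    if check:
        # Dry run: Ansible reports what would change without applying it.
        argv.append("--check")
    return argv


# dry_run_playbook would use the default; run_playbook would pass check=False.
print(playbook_command("site.yml", inventory="hosts.ini"))
```

Defaulting to check=True means a live run has to be requested explicitly, which matches the "always dry-run first" hard limit in section 2.3.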
3.3 Infrastructure — Docker
Tool name           Parameters                Auth required
list_stacks         node?                     Read (free)
get_stack_status    stack, node?              Read (free)
compose_up          stack, node?              Write (deploys)
compose_down        stack, node?              Write (stops)
compose_pull        stack, node?              Write (pulls images)
get_container_logs  container, node?, lines?  Read (free)
list_images         node?                     Read (free)
3.4 Observability
Tool name           Parameters                   Auth required
query_logs          logql, start?, end?, limit?  Read (free)
get_service_status  service, node?               Read (free)
get_resource_usage  node?                        Read (free)
check_port          host, port                   Read (free)
ping_host           host, count?                 Read (free)
probe_http          url, expected_status?        Read (free)
get_smart_health    device, node?                Read (free)
run_diagnostic      service, node?               Read (free)
3.5 Network topology
Tool name             Parameters  Auth required
get_topology_summary  (none)      Read (free; always in context)
get_topology_full     (none)      Read (free)
find_service          name        Read (free)
find_node             name_or_ip  Read (free)
list_nodes            role?       Read (free)
refresh_topology      (none)      Write (runs Ansible facts)
3.6 Internet
Tool name              Parameters                     Auth required
http_fetch             url, method?, headers?, body?  Read (logged)
dns_lookup             hostname, type?                Read (logged)
check_package_version  package, ecosystem             Read (logged)
cve_lookup             cve_id or package              Read (logged)
git_pull               repo_url, path?                Write (clones/pulls)
send_notification      message, title?, priority?     Write (Ntfy)
cloudflare_dns         action, zone, record?          Write (DNS change)
3.7 Session & approval
Tool name               Parameters      Auth required
get_session_info        (none)          Read (free)
approve_writes          (none)          Activates write approval for session
revoke_writes           (none)          Revokes write approval for session
get_audit_log           limit?, since?  Read (free)
list_pending_playbooks  (none)          Read (free)
promote_playbook        name            Write (merges PR)
4. Architecture
4.1 Container layout
The server runs as a single Docker container on the control node. Volume mounts provide access to SSH keys, the YAML config, and the topology file. All secrets are injected as environment variables; the container itself contains no credentials at build time.
    mcp-server/
      Dockerfile
      docker-compose.yml
      config/
        config.yaml      # non-secret configuration
        topology.yaml    # static topology base
      src/
        server.py        # MCP entry point (stdio + SSE)
        tools/           # one module per domain
        approval.py      # session approval gate
        topology.py      # topology loader + refresher
        audit.py         # Loki shipper
        integrations/    # GitHub, Docker Hub, Cloudflare, Ntfy
4.2 Transport modes
The server supports both transport modes simultaneously:
* stdio — invoked directly by Claude Desktop via the MCP stdio protocol. The process is spawned per session. Suitable for local desktop use.
* SSE/HTTP — the server also binds an HTTP listener (default port 8765) serving the MCP SSE transport. The reverse proxy terminates TLS and forwards to this port. Any MCP-compatible client can connect with a valid API key.
4.3 Approval flow
Write operations follow a strict gate:
1. LLM calls a write tool.
2. Server checks session approval state. If not approved, returns a structured pending response to the LLM.
3. LLM surfaces the pending action inline in chat with a clear description of what will be executed.
4. User replies yes/no. On yes, the LLM calls approve_writes, unlocking writes for the session.
5. The LLM retries the original write tool, which now executes. Subsequent writes in this session run without further prompts.
6. Every action (approved or denied) is written to the audit log and shipped to Loki.
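The gate described in the steps above can be sketched as a small state machine. This is a minimal illustration under stated assumptions, not the server's actual code: the Session class, the call_tool function, the in-memory audit list (standing in for the Loki shipper), and the shape of the pending response are all hypothetical, and WRITE_TOOLS lists only a few of the write tools from section 3.

```python
from dataclasses import dataclass, field

# Hypothetical subset of the write tools from the catalogue (section 3).
WRITE_TOOLS = {"run_command", "write_file", "run_playbook", "compose_up"}


@dataclass
class Session:
    """Per-session approval state; a new session starts unapproved."""
    session_id: str
    writes_approved: bool = False
    audit: list = field(default_factory=list)  # stand-in for the Loki shipper


def call_tool(session: Session, tool: str, args: dict) -> dict:
    """Gate a tool call on the session approval state and audit the outcome."""
    if tool == "approve_writes":
        session.writes_approved = True
        result = {"status": "ok", "writes_approved": True}
    elif tool == "revoke_writes":
        session.writes_approved = False
        result = {"status": "ok", "writes_approved": False}
    elif tool in WRITE_TOOLS and not session.writes_approved:
        # Structured pending response that the LLM can surface in chat.
        result = {"status": "pending_approval", "tool": tool, "args": args}
    else:
        # A real server would dispatch to the tool implementation here.
        result = {"status": "executed", "tool": tool}
    # Every call is recorded, whether it ran or was held back.
    session.audit.append({"session": session.session_id, "tool": tool,
                          "status": result["status"]})
    return result


session = Session("demo")
print(call_tool(session, "run_command", {"cmd": "uptime"})["status"])
call_tool(session, "approve_writes", {})
print(call_tool(session, "run_command", {"cmd": "uptime"})["status"])
```

Because the approval flag lives on the session object, a new connection or a server restart starts from the unapproved state, as section 6.3 requires.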
4.4 Topology lifecycle
On startup the server loads topology.yaml as the static base. A background scheduler (default: every 60 minutes) runs Ansible fact gathering against the inventory and merges the result into the live topology object. The refresh interval is configurable in config.yaml.
At session start a compact topology summary (node count, subnet list, roles, service index) is prepended to the system prompt. The full topology is available via get_topology_full at any time.
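As an illustration of the merge and summary steps, the sketch below overlays gathered facts on the static base and derives the compact summary. The topology schema used here (a top-level "nodes" mapping with "role", "subnet", and "services" keys) is an assumption made for the example; the spec does not define the layout of topology.yaml.

```python
import copy


def merge_facts(static: dict, facts: dict) -> dict:
    """Overlay gathered facts on the static base without mutating the base."""
    live = copy.deepcopy(static)
    for name, node_facts in facts.items():
        # Nodes missing from the static base are added; known nodes are enriched.
        live.setdefault("nodes", {}).setdefault(name, {}).update(node_facts)
    return live


def summarize(topology: dict) -> dict:
    """Compact summary: node count, subnet list, roles, service index."""
    nodes = topology.get("nodes", {})
    service_index = {}
    for name, node in nodes.items():
        for service in node.get("services", []):
            service_index.setdefault(service, []).append(name)
    return {
        "node_count": len(nodes),
        "subnets": sorted({n["subnet"] for n in nodes.values() if "subnet" in n}),
        "roles": sorted({n["role"] for n in nodes.values() if "role" in n}),
        "services": service_index,
    }


static = {"nodes": {"node1": {"role": "control", "subnet": "192.168.1.0/24",
                              "services": ["loki"]}}}
facts = {"node1": {"os": "Debian 12"}}
print(summarize(merge_facts(static, facts)))
```

Keeping the static base untouched means each scheduled refresh can start again from topology.yaml rather than accumulating stale facts.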
4.5 Playbook lifecycle
When the LLM drafts a new playbook:
* The draft is written to a local staging area inside the container.
* The server commits it to a dedicated review branch on the configured Git remote and opens a pull request.
* The draft is listed in list_pending_playbooks but is blocked from execution.
* When the human merges the PR, the playbook enters the live playbook library.
* promote_playbook can also be called directly to fast-track a merge from within the chat.
HARD LIMIT
run_playbook rejects any playbook whose name does not exist on the merged main branch. Playbooks that exist only on a draft branch cannot be executed, regardless of session approval state.
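A minimal sketch of that check follows. The names PlaybookBlocked and check_runnable are hypothetical, and the two sets of playbook names are assumed to be supplied by the caller (for example, from a listing of the main branch and of the review branches in the Ansible repository).

```python
class PlaybookBlocked(Exception):
    """Raised when a playbook is not present on the merged main branch."""


def check_runnable(name: str, main_playbooks: set, draft_playbooks: set) -> None:
    """Enforce the hard limit: only playbooks merged into main may run.

    Session approval state is deliberately not a parameter, so it cannot
    override the check.
    """
    if name in main_playbooks:
        return
    if name in draft_playbooks:
        raise PlaybookBlocked(f"{name!r} is a draft awaiting review; merge its PR first")
    raise PlaybookBlocked(f"{name!r} does not exist on main")


check_runnable("site.yml", {"site.yml"}, {"new-service.yml"})  # passes silently
try:
    check_runnable("new-service.yml", {"site.yml"}, {"new-service.yml"})
except PlaybookBlocked as exc:
    print(exc)
```

Running this check inside run_playbook itself, rather than in the approval gate, keeps the limit in force even when writes are approved for the session.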
5. Configuration reference
5.1 Environment variables (secrets)
Variable                   Status    Purpose
MCP_API_KEY                Required  Bearer token for network clients
LOKI_URL                   Required  Loki push/query base URL
LOKI_USER / LOKI_PASSWORD  Optional  Basic auth if Loki requires it
GIT_TOKEN                  Required  Token for Ansible repo + PR creation
CLOUDFLARE_API_TOKEN       Optional  Cloudflare integration
NTFY_URL / NTFY_TOKEN      Optional  Ntfy push notification endpoint
DOCKERHUB_TOKEN            Optional  Authenticated image version checks
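A startup check along the following lines would surface a missing required variable before the server accepts connections. The function name load_secrets and the fail-fast behaviour are assumptions for illustration; the variable names are the ones listed above.

```python
import os

REQUIRED = ("MCP_API_KEY", "LOKI_URL", "GIT_TOKEN")
OPTIONAL = ("LOKI_USER", "LOKI_PASSWORD", "CLOUDFLARE_API_TOKEN",
            "NTFY_URL", "NTFY_TOKEN", "DOCKERHUB_TOKEN")


def load_secrets(environ=None) -> dict:
    """Return the secrets that are set; fail fast if a required one is missing."""
    env = os.environ if environ is None else environ
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError("missing required environment variables: " + ", ".join(missing))
    # Optional variables are included only when set to a non-empty value.
    return {name: env[name] for name in REQUIRED + OPTIONAL if env.get(name)}


example = {"MCP_API_KEY": "k", "LOKI_URL": "http://loki:3100", "GIT_TOKEN": "t"}
print(sorted(load_secrets(example)))
```

Passing a mapping instead of reading os.environ directly makes the check easy to exercise in tests.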
5.2 config.yaml structure
    server:
      http_port: 8765
      log_level: info
    topology:
      static_file: /config/topology.yaml
      refresh_interval_minutes: 60
    ansible:
      repo_url: git@github.com:you/homelab-playbooks.git
      local_path: /data/playbooks
      ssh_key_path: /secrets/id_ed25519
      review_branch_prefix: mcp-draft/
    docker:
      compose_dir: /data/stacks
      remote_contexts:            # name: ssh://user@host
        node2: ssh://admin@192.168.1.12
        node3: ssh://admin@192.168.1.13
    loki:
      audit_labels:
        job: mcp-audit
        env: homelab
    integrations:
      github_base_url: https://api.github.com   # or your Gitea URL
      dockerhub_registry: https://hub.docker.com
      cloudflare_zone_id: ~                     # optional
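To show how the docker section might be consumed, the sketch below builds the command line for the compose_up, compose_down, and compose_pull tools. Several things here are assumptions rather than decisions from the spec: the function name compose_command, the one-directory-per-stack layout under compose_dir, and the premise that a Docker context named after each remote_contexts key has already been created (for example with docker context create).

```python
def compose_command(config: dict, stack: str, action: str, node=None) -> list:
    """Build the argv for a Compose action, optionally on a remote context."""
    allowed = {"up": ["up", "-d"], "down": ["down"], "pull": ["pull"]}
    if action not in allowed:
        raise ValueError(f"unsupported action: {action}")
    docker_cfg = config["docker"]
    argv = ["docker"]
    if node is not None:
        if node not in docker_cfg["remote_contexts"]:
            raise ValueError(f"unknown node: {node}")
        # Assumes a Docker context with this name already exists.
        argv += ["--context", node]
    # Assumed layout: one directory per stack under compose_dir.
    compose_file = f"{docker_cfg['compose_dir']}/{stack}/docker-compose.yml"
    return argv + ["compose", "-f", compose_file] + allowed[action]


config = {"docker": {"compose_dir": "/data/stacks",
                     "remote_contexts": {"node2": "ssh://admin@192.168.1.12"}}}
print(compose_command(config, "monitoring", "up", node="node2"))
```

Returning an argument list rather than a shell string avoids quoting problems when stack names come from LLM-supplied parameters.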
6. Security model
6.1 Network perimeter
* TLS is terminated at the reverse proxy. The server never handles raw TLS.
* The server should bind to 127.0.0.1 or a Docker internal network only — the proxy is the sole external listener.
* LAN clients reach the server through the proxy on the internal network, not by connecting to the server port directly.
* Public IP access is also routed through the proxy; the same API key is required regardless of source.
6.2 Authentication
* All SSE/HTTP connections must supply Authorization: Bearer <MCP_API_KEY>.
* stdio connections are inherently local and bypass key auth.
* Invalid or missing keys return 401 immediately with no information leakage.
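The header check can be sketched as below. The function name is_authorized is hypothetical, and the surrounding HTTP framework is left out; the caller is assumed to answer 401 with a generic body whenever this returns False.

```python
import hmac


def is_authorized(headers: dict, api_key: str) -> bool:
    """Check an Authorization header against the configured API key."""
    value = headers.get("Authorization", "")
    scheme, _, token = value.partition(" ")
    if scheme.lower() != "bearer" or not token:
        return False
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(token.encode(), api_key.encode())


print(is_authorized({"Authorization": "Bearer s3cret"}, "s3cret"))
print(is_authorized({"Authorization": "Bearer wrong"}, "s3cret"))
```

Returning a bare boolean keeps the failure reason (missing header, wrong scheme, wrong key) out of the response, in line with the no-information-leakage requirement.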
6.3 Execution safety
* Ansible always runs in check mode (dry-run) first. Live execution requires an explicit second call.
* New playbooks are blocked from execution until merged into main via PR.
* Session approval unlocks writes for the lifetime of one session only. A new connection or server restart resets state.
* All outbound HTTP requests are logged to Loki with URL, method, response code, and timestamp.
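An audit entry for an outbound request could be packaged for Loki as sketched below. The body follows the JSON shape of Loki's push API (streams, each with a label set and timestamp/line pairs); the function name audit_payload and the field names inside the event are assumptions for illustration.

```python
import json
import time


def audit_payload(labels: dict, event: dict, ts_ns=None) -> dict:
    """Build a Loki push body carrying one audit event as a JSON log line."""
    if ts_ns is None:
        ts_ns = time.time_ns()
    return {
        "streams": [{
            "stream": dict(labels),
            # Loki expects [timestamp in nanoseconds as a string, log line] pairs.
            "values": [[str(ts_ns), json.dumps(event, sort_keys=True)]],
        }]
    }


payload = audit_payload(
    {"job": "mcp-audit", "env": "homelab"},
    {"tool": "http_fetch", "url": "https://example.com", "method": "GET", "status": 200},
)
print(json.dumps(payload, indent=2))
```

The labels come from loki.audit_labels in config.yaml. In this sketch the resulting body would be sent as application/json to the push endpoint under LOKI_URL (/loki/api/v1/push in Loki's HTTP API).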
7. Admin web UI
A lightweight read-only web UI is served by the MCP server on the same HTTP port under /ui. It requires the same API key (passed as a query param or cookie).
Panels:
* Live session log — active sessions, approval state, tool call history
* Pending playbooks — list of draft PRs awaiting review, with diff viewer
* Audit stream — recent Loki audit entries, filterable by session and tool
* Server health — uptime, topology last-refreshed timestamp, integration connectivity
NOTE
The web UI is read-only and observational. All actions (promote, approve, run) happen through the MCP protocol in chat, never through the UI directly.
8. Open questions & deferred decisions
Items not resolved in this session — to be decided before or during implementation.
* Topology refresh interval — default set to 60 minutes; confirm this is appropriate for your lab cadence.
* SSH connection for Ansible — SSH key mount vs agent forwarding not finalised; key mount assumed, confirm path.
* Web UI authentication — API key as query param is convenient but less secure; consider a separate UI credential.
* Rate limiting — no rate limit on tool calls specified; consider adding one for public IP exposure.
* Multi-user — spec assumes a single operator. If multiple users will share the server, session approval state needs to be per-user, not per-connection.
* Backup / restore — no decision on persisting the topology cache or session state across container restarts.