- Introduced the "Homelab MCP Server — Frank v6 Integration Addendum" document detailing the enhancements brought by Frank v6, including personality layers, specialty domains, and reasoning techniques. - Updated the base MCP server specification to include comprehensive design decisions, tool catalog, architecture, and security model. - Added new tools and enriched behaviors for DevOps, Data Analysis, and Prompt Engineering specialties. - Established a structured command registry for easy access to functionalities across specialties. - Documented configuration additions for Frank's personality, specialties, reasoning techniques, and knowledge base integration.
Homelab MCP Server
Full Design Specification
Version 1.0 · May 2026

SCOPE
A Python MCP server that acts as the LLM's point of entry into a self-hosted homelab, providing infrastructure management, observability, network awareness, and internet access, with a semi-automatic approval model and full audit trail.

1. Overview

This document captures all design decisions for a Model Context Protocol (MCP) server purpose-built for homelab management. It is the authoritative spec from which implementation begins. Every decision recorded here was made explicitly during the design session.

The server exposes five capability domains to an LLM client:

* Shell & local environment access
* Infrastructure deployment (Ansible + Docker)
* Observability & guided troubleshooting
* Network topology awareness
* Internet access with audit logging

A semi-automatic approval gate governs all write operations. Reads need no approval, although internet reads are still logged. The server is containerised, reverse-proxied, and ships its audit stream to the same Loki instance used for homelab logs.

2. Decisions summary

Complete record of every decision made during the design session.

2.1 Foundation

Decision        Value
--------------  -----------------------------------------------------------
Transport       Both stdio (Claude Desktop) and SSE/HTTP (networked daemon)
Hosting         Docker container, always-on, volume-mounted
Access methods  Local LAN and public IP (behind reverse proxy)
Language        Python

2.2 Security

Decision              Value
--------------------  ----------------------------------------------------------------------------------------
Authentication        Static API key, sent as a bearer token in the Authorization header
TLS                   Terminated at reverse proxy (Nginx / Caddy / Traefik); server binds plaintext internally
Approval granularity  Per session: approve writes once; free for the remainder of that session
Approval channel      Inline in chat: Claude surfaces pending action, user replies yes/no

2.3 Infrastructure

Decision                Value
----------------------  ---------------------------------------------------------------------------------
Ansible source          Git repository; server clones/pulls on demand
Ansible connection      SSH key (mounted into container)
Playbook write policy   LLM may draft new playbooks; drafts go to review queue; never auto-executed
Review queue mechanism  Git PR / branch: draft pushed to review branch, promoted by human merge
Docker Compose scope    All nodes, local and remote, via SSH + Docker context
Hard limits             Always dry-run Ansible before live execution; never auto-run unreviewed playbooks

2.4 Observability

Decision           Value
-----------------  ------------------------------------------------------------------------------------------------------------------------
Log source         Loki (central aggregator); single query interface
Remote log method  All nodes ship to Loki; server queries Loki API only
Health metrics     systemd service status, Docker container state, CPU/memory/disk, open ports, ICMP ping, HTTP endpoint probes, SMART disk health
Troubleshoot mode  Guided: server runs diagnostic sequences, presents findings and suggested fixes; human decides action

2.5 Network topology

Decision           Value
-----------------  -------------------------------------------------------------------------------------------------------
Source of truth    Static YAML base enriched by Ansible fact-gathering on refresh
Refresh schedule   Scheduled interval (configurable, default 1 hour)
Fields captured    Node IPs, subnets/VLANs, services per node, node roles, OS & hardware, open ports, service dependencies
Session injection  Summary injected into every session context automatically; full topology available on demand

2.6 Internet access

Decision               Value
---------------------  -----------------------------------------------------------------------------------------------
Permitted uses         Package/image version checks, documentation fetching, CVE lookups, outbound webhooks, Git pulls
Restriction policy     Open but logged: any URL permitted; every outbound request recorded to audit log
Built-in integrations  GitHub/Gitea, Docker Hub/GHCR, Cloudflare (DNS + tunnel), Ntfy (push notifications)

2.7 Server operations

Decision               Value
---------------------  ---------------------------------------------------------------------------
Audit log destination  Shipped to Loki (same instance as homelab logs), labelled job=mcp-audit
Config source          Env vars for secrets; YAML config file (volume-mounted) for everything else
Admin interface        Lightweight web UI: session log, pending approvals, server status
Playbook review queue  Git PR: LLM pushes draft to a review branch; promotion = merging the PR

3. Tool catalogue

All tools exposed to the LLM via MCP. Read-only tools execute freely. Write tools require active session approval before first use.

3.1 Shell & environment

Tool name       Parameters            Auth required
--------------  --------------------  -------------------
run_command     cmd, node?, timeout?  Write (executes)
read_file       path, node?           Read (free)
write_file      path, content, node?  Write (modifies FS)
list_directory  path, node?           Read (free)
get_env         key?, node?           Read (free)

3.2 Infrastructure: Ansible

Tool name          Parameters                     Auth required
-----------------  -----------------------------  ------------------------
list_playbooks     (none)                         Read (free)
get_playbook       name                           Read (free)
dry_run_playbook   name, inventory?, extra_vars?  Read (free, check mode)
run_playbook       name, inventory?, extra_vars?  Write (deploys)
draft_playbook     name, content                  Write (creates PR)
list_inventory     (none)                         Read (free)
refresh_inventory  (none)                         Write (pulls git)

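The difference between dry_run_playbook and run_playbook can be sketched as a single command builder. This is a minimal illustration, not the server's implementation: the function name build_ansible_command and the choice to pass extra_vars as a JSON string are assumptions. The ansible-playbook flags used (-i, --extra-vars, --check, --diff) are standard.

```python
import json
from typing import Optional


def build_ansible_command(
    playbook_path: str,
    check: bool,
    inventory: Optional[str] = None,
    extra_vars: Optional[dict] = None,
) -> list:
    """Return an argv list for ansible-playbook (hypothetical helper).

    check=True corresponds to dry_run_playbook; check=False to run_playbook.
    """
    argv = ["ansible-playbook", playbook_path]
    if inventory:
        argv += ["-i", inventory]
    if extra_vars:
        # ansible-playbook accepts extra vars as a JSON string.
        argv += ["--extra-vars", json.dumps(extra_vars, sort_keys=True)]
    if check:
        # --check makes no changes; --diff shows what would change.
        argv += ["--check", "--diff"]
    return argv
```

Returning an argv list rather than a shell string lets the caller hand it to subprocess.run without shell quoting concerns.
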
3.3 Infrastructure: Docker

Tool name           Parameters                Auth required
------------------  ------------------------  --------------------
list_stacks         node?                     Read (free)
get_stack_status    stack, node?              Read (free)
compose_up          stack, node?              Write (deploys)
compose_down        stack, node?              Write (stops)
compose_pull        stack, node?              Write (pulls images)
get_container_logs  container, node?, lines?  Read (free)
list_images         node?                     Read (free)

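The optional node parameter on the compose tools maps naturally onto Docker contexts. The sketch below shows one way to build the command line. It assumes that each stack lives at <compose_dir>/<stack>/docker-compose.yml and that a Docker context named after the node already exists; neither is stated in the spec. The function name is hypothetical.

```python
from typing import Optional

# Map of tool action to docker compose sub-arguments.
_ACTIONS = {"up": ["up", "-d"], "down": ["down"], "pull": ["pull"]}


def build_compose_command(
    action: str,
    stack: str,
    compose_dir: str = "/data/stacks",
    node: Optional[str] = None,
) -> list:
    """Return an argv list for `docker compose` (hypothetical helper)."""
    if action not in _ACTIONS:
        raise ValueError(f"unsupported compose action: {action}")
    argv = ["docker"]
    if node:
        # --context is a global docker flag that selects the remote endpoint.
        argv += ["--context", node]
    # Assumed layout: one directory per stack under compose_dir.
    argv += ["compose", "-f", f"{compose_dir}/{stack}/docker-compose.yml"]
    return argv + _ACTIONS[action]
```

With node omitted, the command runs against the local Docker daemon.
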
3.4 Observability

Tool name           Parameters                   Auth required
------------------  ---------------------------  -------------
query_logs          logql, start?, end?, limit?  Read (free)
get_service_status  service, node?               Read (free)
get_resource_usage  node?                        Read (free)
check_port          host, port                   Read (free)
ping_host           host, count?                 Read (free)
probe_http          url, expected_status?        Read (free)
get_smart_health    device, node?                Read (free)
run_diagnostic      service, node?               Read (free)

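As an illustration of query_logs, the sketch below builds a request URL for Loki's query_range endpoint from the tool's parameters. The one-hour default window and the default limit of 100 are assumptions, since the spec does not define defaults. The function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional
from urllib.parse import urlencode


def _to_ns(dt: datetime) -> int:
    """Convert an aware datetime to Unix nanoseconds using integer math."""
    return int(dt.timestamp()) * 10**9 + dt.microsecond * 1000


def build_query_range_url(
    loki_url: str,
    logql: str,
    start: Optional[datetime] = None,
    end: Optional[datetime] = None,
    limit: int = 100,
) -> str:
    """Build a GET URL for /loki/api/v1/query_range (hypothetical helper)."""
    end = end or datetime.now(timezone.utc)
    start = start or end - timedelta(hours=1)  # assumed default window
    params = {
        "query": logql,
        "start": _to_ns(start),
        "end": _to_ns(end),
        "limit": limit,
    }
    return f"{loki_url.rstrip('/')}/loki/api/v1/query_range?{urlencode(params)}"
```

The same URL builder could serve get_audit_log by fixing the query to the job=mcp-audit label.
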
3.5 Network topology

Tool name             Parameters  Auth required
--------------------  ----------  ------------------------------
get_topology_summary  (none)      Read (free; always in context)
get_topology_full     (none)      Read (free)
find_service          name        Read (free)
find_node             name_or_ip  Read (free)
list_nodes            role?       Read (free)
refresh_topology      (none)      Write (runs Ansible facts)

3.6 Internet

Tool name              Parameters                     Auth required
---------------------  -----------------------------  --------------------
http_fetch             url, method?, headers?, body?  Read (logged)
dns_lookup             hostname, type?                Read (logged)
check_package_version  package, ecosystem             Read (logged)
cve_lookup             cve_id or package              Read (logged)
git_pull               repo_url, path?                Write (clones/pulls)
send_notification      message, title?, priority?     Write (Ntfy)
cloudflare_dns         action, zone, record?          Write (DNS change)

3.7 Session & approval

Tool name               Parameters      Auth required
----------------------  --------------  ------------------------------------
get_session_info        (none)          Read (free)
approve_writes          (none)          Activates write approval for session
revoke_writes           (none)          Revokes write approval for session
get_audit_log           limit?, since?  Read (free)
list_pending_playbooks  (none)          Read (free)
promote_playbook        name            Write (merges PR)

4. Architecture

4.1 Container layout

The server runs as a single Docker container on the control node. Volume mounts provide access to SSH keys, the YAML config, and the topology file. All other secrets are injected as environment variables; no credentials are baked into the image at build time.

mcp-server/
    Dockerfile
    docker-compose.yml
    config/
        config.yaml       # non-secret configuration
        topology.yaml     # static topology base
    src/
        server.py         # MCP entry point (stdio + SSE)
        tools/            # one module per domain
        approval.py       # session approval gate
        topology.py       # topology loader + refresher
        audit.py          # Loki shipper
        integrations/     # GitHub, Docker Hub, Cloudflare, Ntfy

4.2 Transport modes

The server supports both transport modes simultaneously:

* stdio: invoked directly by Claude Desktop via the MCP stdio protocol. The process is spawned per session. Suitable for local desktop use.
* SSE/HTTP: the server also binds an HTTP listener (default port 8765) serving the MCP SSE transport. The reverse proxy terminates TLS and forwards to this port. Any MCP-compatible client can connect with a valid API key.

4.3 Approval flow

Write operations follow a strict gate:

1. LLM calls a write tool.
2. Server checks session approval state. If not approved, returns a structured pending response to the LLM.
3. LLM surfaces the pending action inline in chat with a clear description of what will be executed.
4. User replies yes/no. On yes, the approve_writes tool is called, unlocking writes for the session.
5. The original write tool is retried and executed. All subsequent writes in this session are free.
6. Every action (approved or denied) is written to the audit log and shipped to Loki.

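The approval flow above can be sketched as a small gate in front of every tool call. This is a minimal in-memory illustration under stated assumptions: the Session class, the call_tool dispatcher, the shape of the pending response, and the audit entries are all hypothetical, and WRITE_TOOLS lists only a subset of the write tools from the catalogue.

```python
from dataclasses import dataclass, field

# Subset of the write tools in the catalogue, for illustration only.
WRITE_TOOLS = {"run_command", "write_file", "run_playbook", "compose_up"}


@dataclass
class Session:
    session_id: str
    writes_approved: bool = False
    audit: list = field(default_factory=list)  # stand-in for the Loki shipper


def approve_writes(session: Session) -> None:
    """Step 4: the user said yes, so unlock writes for this session."""
    session.writes_approved = True
    session.audit.append({"tool": "approve_writes", "outcome": "approved"})


def call_tool(session: Session, tool: str, handler, **params) -> dict:
    """Steps 1, 2 and 5: gate write tools on session approval."""
    if tool in WRITE_TOOLS and not session.writes_approved:
        # Step 2: structured pending response for the LLM to surface in chat.
        session.audit.append({"tool": tool, "outcome": "pending"})
        return {"status": "pending_approval", "tool": tool, "params": params}
    result = handler(**params)
    session.audit.append({"tool": tool, "outcome": "executed"})
    return {"status": "ok", "result": result}
```

A read tool passes straight through the gate. A write tool returns the pending response until approve_writes has been called for that session, after which the retried call executes.
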
4.4 Topology lifecycle

On startup the server loads topology.yaml as the static base. A background scheduler (default: every 60 minutes) runs an Ansible facts gather against the inventory and merges the result into the live topology object. The refresh interval is configurable in config.yaml.

At session start a compact topology summary (node count, subnet list, roles, service index) is prepended to the system prompt. The full topology is available via get_topology_full at any time.

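A minimal sketch of the merge and summary steps follows. The topology schema used here (a dict keyed by node name with role, subnet, and services fields) is an assumption, as the spec does not define the layout of topology.yaml. The sketch also assumes that gathered facts override static values on conflict, which is a design choice the spec leaves open.

```python
def merge_topology(static_base: dict, gathered: dict) -> dict:
    """Overlay gathered facts on the static base without mutating either."""
    merged = {name: dict(attrs) for name, attrs in static_base.items()}
    for name, facts in gathered.items():
        # Assumed policy: gathered facts win over static values.
        merged.setdefault(name, {}).update(facts)
    return merged


def summarize_topology(topology: dict) -> dict:
    """Build the compact summary: node count, subnets, roles, service index."""
    service_index: dict = {}
    for name, attrs in topology.items():
        for service in attrs.get("services", []):
            service_index.setdefault(service, []).append(name)
    return {
        "node_count": len(topology),
        "subnets": sorted({a["subnet"] for a in topology.values() if "subnet" in a}),
        "roles": sorted({a["role"] for a in topology.values() if "role" in a}),
        "services": {s: sorted(n) for s, n in sorted(service_index.items())},
    }
```

Keeping the summary to these four fields keeps the per-session context small; anything else stays behind get_topology_full.
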
4.5 Playbook lifecycle

When the LLM drafts a new playbook:

* The draft is written to a local staging area inside the container.
* The server commits it to a dedicated review branch on the configured Git remote and opens a pull request.
* The draft is listed in list_pending_playbooks but is blocked from execution.
* When the human merges the PR, the playbook enters the live playbook library.
* promote_playbook can also be called directly to fast-track a merge from within the chat.

HARD LIMIT
run_playbook will reject any playbook whose name does not exist in the merged main branch. Playbooks that exist only on a draft branch are blocked from execution regardless of session approval state.

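The hard limit, together with the dry-run-first rule, can be expressed as a guard that run_playbook calls before doing anything else. This is a sketch under stated assumptions: the exception type, the function name, and the idea of tracking which playbooks have had a dry run are hypothetical. One possible source for merged_playbooks is the file listing of the main branch, for example from `git ls-tree -r --name-only origin/main`.

```python
class PlaybookBlocked(Exception):
    """Raised when a playbook must not be executed (hypothetical type)."""


def assert_runnable(name: str, merged_playbooks: set, dry_run_passed: set) -> None:
    """Enforce the hard limits before a live run.

    merged_playbooks: names present on the merged main branch.
    dry_run_passed:   names that already had a check-mode run (assumed
                      to be tracked by the server).
    """
    if name not in merged_playbooks:
        # Applies regardless of session approval state.
        raise PlaybookBlocked(f"{name} is not in the merged main branch")
    if name not in dry_run_passed:
        raise PlaybookBlocked(f"{name} has not been dry-run yet")
```

Because the guard takes no session argument, session approval cannot influence its outcome.
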
5. Configuration reference

5.1 Environment variables (secrets)

Variable                   Required  Purpose
-------------------------  --------  ------------------------------------
MCP_API_KEY                Required  Bearer token for network clients
LOKI_URL                   Required  Loki push/query base URL
LOKI_USER / LOKI_PASSWORD  Optional  Basic auth if Loki requires it
GIT_TOKEN                  Required  Token for Ansible repo + PR creation
CLOUDFLARE_API_TOKEN       Optional  Cloudflare integration
NTFY_URL / NTFY_TOKEN      Optional  Ntfy push notification endpoint
DOCKERHUB_TOKEN            Optional  Authenticated image version checks

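A startup check for the variables above might look like the following sketch. The function name load_secrets and the fail-fast behaviour are assumptions; the variable names and their required/optional status come from the table.

```python
import os
from typing import Mapping, Optional

REQUIRED = ("MCP_API_KEY", "LOKI_URL", "GIT_TOKEN")
OPTIONAL = (
    "LOKI_USER", "LOKI_PASSWORD", "CLOUDFLARE_API_TOKEN",
    "NTFY_URL", "NTFY_TOKEN", "DOCKERHUB_TOKEN",
)


def load_secrets(env: Optional[Mapping[str, str]] = None) -> dict:
    """Collect secrets from the environment, failing fast if any are missing."""
    env = os.environ if env is None else env
    missing = [key for key in REQUIRED if not env.get(key)]
    if missing:
        raise RuntimeError(
            "Missing required environment variables: " + ", ".join(missing)
        )
    secrets = {key: env[key] for key in REQUIRED}
    # Optional values are included only when set and non-empty.
    secrets.update({key: env[key] for key in OPTIONAL if env.get(key)})
    return secrets
```

Accepting the environment as a parameter keeps the function easy to test without touching os.environ.
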
5.2 config.yaml structure

server:
  http_port: 8765
  log_level: info

topology:
  static_file: /config/topology.yaml
  refresh_interval_minutes: 60

ansible:
  repo_url: git@github.com:you/homelab-playbooks.git
  local_path: /data/playbooks
  ssh_key_path: /secrets/id_ed25519
  review_branch_prefix: mcp-draft/

docker:
  compose_dir: /data/stacks
  remote_contexts:                            # name: ssh://user@host
    node2: ssh://admin@192.168.1.12
    node3: ssh://admin@192.168.1.13

loki:
  audit_labels:
    job: mcp-audit
    env: homelab

integrations:
  github_base_url: https://api.github.com    # or your Gitea URL
  dockerhub_registry: https://hub.docker.com
  cloudflare_zone_id: ~                       # optional

6. Security model

6.1 Network perimeter

* TLS is terminated at the reverse proxy. The server never handles raw TLS.
* The server should bind to 127.0.0.1 or a Docker internal network only; the proxy is the sole external listener.
* LAN clients reach the server through the proxy on the internal network.
* Public IP access is also routed through the proxy; the same API key is required regardless of source.

6.2 Authentication

* All SSE/HTTP connections must supply Authorization: Bearer <MCP_API_KEY>.
* stdio connections are inherently local and bypass key auth.
* Invalid or missing keys return 401 immediately with no information leakage.

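The bearer check can be sketched in a few lines. The function name is hypothetical, and how it is wired into the HTTP framework is left open. The sketch uses hmac.compare_digest so that the comparison does not leak timing information about the key.

```python
import hmac
from typing import Optional


def is_authorized(authorization_header: Optional[str], api_key: str) -> bool:
    """Return True only for a well-formed `Bearer <key>` header that matches."""
    if not authorization_header:
        return False
    scheme, _, token = authorization_header.partition(" ")
    if scheme.lower() != "bearer" or not token:
        return False
    # Constant-time comparison of the presented token and the configured key.
    return hmac.compare_digest(token.strip().encode(), api_key.encode())
```

A caller would respond with a bare 401 whenever this returns False, without saying whether the header was missing, malformed, or wrong.
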
6.3 Execution safety

* Ansible always runs in check mode (dry-run) first. Live execution requires an explicit second call.
* New playbooks are blocked from execution until merged into main via PR.
* Session approval unlocks writes for the lifetime of one session only. A new connection or server restart resets state.
* All outbound HTTP requests are logged to Loki with URL, method, response code, and timestamp.

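An audit entry for an outbound request could be shipped to Loki with a payload like the one built below. The payload shape follows Loki's JSON push format (POST to <LOKI_URL>/loki/api/v1/push). The event field names and the function name are assumptions; the labels match audit_labels in config.yaml.

```python
import json
import time
from typing import Optional


def build_audit_push(event: dict, labels: dict, ts_ns: Optional[int] = None) -> dict:
    """Build a Loki push payload carrying one audit event as a JSON log line."""
    ts_ns = time.time_ns() if ts_ns is None else ts_ns
    return {
        "streams": [
            {
                "stream": labels,
                # Loki expects [<unix nanoseconds as string>, <log line>] pairs.
                "values": [[str(ts_ns), json.dumps(event, sort_keys=True)]],
            }
        ]
    }


# Example event for an http_fetch call (field names are illustrative).
example = build_audit_push(
    {"tool": "http_fetch", "url": "https://example.com", "method": "GET", "status": 200},
    {"job": "mcp-audit", "env": "homelab"},
)
```

Serialising the event as a single JSON line keeps it filterable from LogQL with the json parser.
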
7. Admin web UI

A lightweight read-only web UI is served by the MCP server on the same HTTP port under /ui. It requires the same API key (passed as a query param or cookie).

Panels:
* Live session log: active sessions, approval state, tool call history
* Pending playbooks: list of draft PRs awaiting review, with diff viewer
* Audit stream: recent Loki audit entries, filterable by session and tool
* Server health: uptime, topology last-refreshed timestamp, integration connectivity

NOTE
The web UI is read-only and observational. All actions (promote, approve, run) happen through the MCP protocol in chat, never through the UI directly.

8. Open questions & deferred decisions

Items not resolved in this session, to be decided before or during implementation.

* Topology refresh interval: default set to 60 minutes; confirm this is appropriate for your lab cadence.
* SSH connection for Ansible: SSH key mount vs agent forwarding not finalised; key mount assumed, confirm path.
* Web UI authentication: API key as query param is convenient but less secure; consider a separate UI credential.
* Rate limiting: no rate limit on tool calls specified; consider adding one for public IP exposure.
* Multi-user: spec assumes a single operator. If multiple users will share the server, session approval state needs to be per-user, not per-connection.
* Backup / restore: no decision on persisting the topology cache or session state across container restarts.

Homelab MCP Server: Design Specification v1.0 · Generated May 2026