
Homelab MCP Server
Full Design Specification
Version 1.0  ·  May 2026
SCOPE
	A Python MCP server that acts as the LLM's point of entry into a self-hosted homelab — providing infrastructure management, observability, network awareness, and internet access, with a semi-automatic approval model and full audit trail.


1. Overview
This document captures all design decisions for a Model Context Protocol (MCP) server purpose-built for homelab management. It is the authoritative spec from which implementation begins. Every decision recorded here was made explicitly during the design session.


The server exposes five capability domains to an LLM client:
* Shell & local environment access
* Infrastructure deployment (Ansible + Docker)
* Observability & guided troubleshooting
* Network topology awareness
* Internet access with audit logging


A semi-automatic approval gate governs all write operations. Reads are unrestricted. The server is containerised, reverse-proxied, and ships its audit stream to the same Loki instance used for homelab logs.


2. Decisions summary
Complete record of every decision made during the design session.


2.1 Foundation
Decision	Value
Transport	Both stdio (Claude Desktop) and SSE/HTTP (networked daemon)
Hosting	Docker container (always-on, volume-mounted)
Access methods	Local LAN and public IP (behind reverse proxy)
Language	Python


2.2 Security
Decision	Value
Authentication	Static API key, sent as a bearer token in the Authorization header
TLS	Terminated at reverse proxy (Nginx / Caddy / Traefik); server binds plaintext internally
Approval granularity	Per session: approve writes once; writes are free for the remainder of that session
Approval channel	Inline in chat: Claude surfaces the pending action, user replies yes/no


2.3 Infrastructure
Decision	Value
Ansible source	Git repository; server clones/pulls on demand
Ansible connection	SSH key (mounted into container)
Playbook write policy	LLM may draft new playbooks; drafts go to review queue; never auto-executed
Review queue mechanism	Git PR / branch: draft pushed to review branch, promoted by human merge
Docker Compose scope	All nodes, local and remote, via SSH + Docker context
Hard limits	Always dry-run Ansible before live execution; never auto-run unreviewed playbooks


2.4 Observability
Decision	Value
Log source	Loki (central aggregator) as the single query interface
Remote log method	All nodes ship to Loki; server queries Loki API only
Health metrics	systemd service status, Docker container state, CPU/memory/disk, open ports, ICMP ping, HTTP endpoint probes, SMART disk health
Troubleshoot mode	Guided: server runs diagnostic sequences, presents findings and suggested fixes; human decides action


2.5 Network topology
Decision	Value
Source of truth	Static YAML base enriched by Ansible fact-gathering on refresh
Refresh schedule	Scheduled interval (configurable, default 1 hour)
Fields captured	Node IPs, subnets/VLANs, services per node, node roles, OS & hardware, open ports, service dependencies
Session injection	Summary injected into every session context automatically; full topology available on demand


2.6 Internet access
Decision	Value
Permitted uses	Package/image version checks, documentation fetching, CVE lookups, outbound webhooks, Git pulls
Restriction policy	Open but logged: any URL permitted; every outbound request recorded to audit log
Built-in integrations	GitHub/Gitea, Docker Hub/GHCR, Cloudflare (DNS + tunnel), Ntfy (push notifications)


2.7 Server operations
Decision	Value
Audit log destination	Shipped to Loki, same instance as homelab logs, labelled job=mcp-audit
Config source	Env vars for secrets; YAML config file (volume-mounted) for everything else
Admin interface	Lightweight web UI: session log, pending approvals, server status
Playbook review queue	Git PR: LLM pushes draft to a review branch; promotion = merging the PR


3. Tool catalogue
All tools exposed to the LLM via MCP. Read-only tools execute freely. Write tools require active session approval before first use.


3.1 Shell & environment
Tool name	Parameters	Auth required
run_command	cmd, node?, timeout?	Write (executes)
read_file	path, node?	Read (free)
write_file	path, content, node?	Write (modifies FS)
list_directory	path, node?	Read (free)
get_env	key?, node?	Read (free)


3.2 Infrastructure — Ansible
Tool name	Parameters	Auth required
list_playbooks	(none)	Read (free)
get_playbook	name	Read (free)
dry_run_playbook	name, inventory?, extra_vars?	Read (free, check mode)
run_playbook	name, inventory?, extra_vars?	Write (deploys)
draft_playbook	name, content	Write (creates PR)
list_inventory	(none)	Read (free)
refresh_inventory	(none)	Write (pulls git)


3.3 Infrastructure — Docker
Tool name	Parameters	Auth required
list_stacks	node?	Read (free)
get_stack_status	stack, node?	Read (free)
compose_up	stack, node?	Write (deploys)
compose_down	stack, node?	Write (stops)
compose_pull	stack, node?	Write (pulls images)
get_container_logs	container, node?, lines?	Read (free)
list_images	node?	Read (free)


3.4 Observability
Tool name	Parameters	Auth required
query_logs	logql, start?, end?, limit?	Read (free)
get_service_status	service, node?	Read (free)
get_resource_usage	node?	Read (free)
check_port	host, port	Read (free)
ping_host	host, count?	Read (free)
probe_http	url, expected_status?	Read (free)
get_smart_health	device, node?	Read (free)
run_diagnostic	service, node?	Read (free)


3.5 Network topology
Tool name	Parameters	Auth required
get_topology_summary	(none)	Read (free, always in context)
get_topology_full	(none)	Read (free)
find_service	name	Read (free)
find_node	name_or_ip	Read (free)
list_nodes	role?	Read (free)
refresh_topology	(none)	Write (runs Ansible facts)


3.6 Internet
Tool name	Parameters	Auth required
http_fetch	url, method?, headers?, body?	Read (logged)
dns_lookup	hostname, type?	Read (logged)
check_package_version	package, ecosystem	Read (logged)
cve_lookup	cve_id or package	Read (logged)
git_pull	repo_url, path?	Write (clones/pulls)
send_notification	message, title?, priority?	Write (Ntfy)
cloudflare_dns	action, zone, record?	Write (DNS change)


3.7 Session & approval
Tool name	Parameters	Auth required
get_session_info	(none)	Read (free)
approve_writes	(none)	Activates write approval for session
revoke_writes	(none)	Revokes write approval for session
get_audit_log	limit?, since?	Read (free)
list_pending_playbooks	(none)	Read (free)
promote_playbook	name	Write (merges PR)


4. Architecture
4.1 Container layout
The server runs as a single Docker container on the control node. Volume mounts provide access to SSH keys, the YAML config, and the topology file. All other secrets are injected as environment variables; the image contains no credentials at build time.


mcp-server/
  Dockerfile
  docker-compose.yml
  config/
    config.yaml          # non-secret configuration
    topology.yaml        # static topology base
  src/
    server.py            # MCP entry point (stdio + SSE)
    tools/               # one module per domain
    approval.py          # session approval gate
    topology.py          # topology loader + refresher
    audit.py             # Loki shipper
    integrations/        # GitHub, Docker Hub, Cloudflare, Ntfy


4.2 Transport modes
The server supports both transport modes simultaneously:


* stdio — invoked directly by Claude Desktop via the MCP stdio protocol. The process is spawned per session. Suitable for local desktop use.
* SSE/HTTP — the server also binds an HTTP listener (default port 8765) serving the MCP SSE transport. The reverse proxy terminates TLS and forwards to this port. Any MCP-compatible client can connect with a valid API key.


4.3 Approval flow
Write operations follow a strict gate:


1. LLM calls a write tool.
2. Server checks session approval state. If not approved, returns a structured pending response to the LLM.
3. LLM surfaces the pending action inline in chat with a clear description of what will be executed.
4. User replies yes/no. On yes, the approve_writes tool is called, unlocking writes for the session.
5. The original write tool is retried and executed. All subsequent writes in this session are free.
6. Every action (approved or denied) is written to the audit log and shipped to Loki.
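
The gate in steps 1 to 6 could be sketched roughly as follows. This is a minimal in-memory illustration, not the planned approval.py: the ApprovalGate class, the "pending_approval" status string, and the shortened WRITE_TOOLS set are assumptions made for the example, and the audit list stands in for the Loki shipper.

```python
from dataclasses import dataclass, field

# Hypothetical subset of the write tools listed in section 3.
WRITE_TOOLS = {"run_command", "write_file", "run_playbook", "compose_up"}


@dataclass
class ApprovalGate:
    """Tracks which sessions have approved writes (in memory only)."""
    approved: set = field(default_factory=set)
    audit: list = field(default_factory=list)

    def approve_writes(self, session_id: str) -> None:
        self.approved.add(session_id)

    def revoke_writes(self, session_id: str) -> None:
        self.approved.discard(session_id)

    def check(self, session_id: str, tool: str) -> dict:
        """Return an 'ok' or 'pending_approval' response and record the attempt."""
        allowed = tool not in WRITE_TOOLS or session_id in self.approved
        status = "ok" if allowed else "pending_approval"
        # Every attempt is recorded, whether it was allowed or held back.
        self.audit.append({"session": session_id, "tool": tool, "status": status})
        return {"status": status, "tool": tool}
```

Because approval is keyed by session ID, approving one session leaves every other session gated, and a fresh ApprovalGate after a restart starts with no approvals.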


4.4 Topology lifecycle
On startup the server loads topology.yaml as the static base. A background scheduler (default: every 60 minutes) runs an Ansible facts gather against the inventory and merges the result into the live topology object. The refresh interval is configurable in config.yaml.
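
One possible shape for the merge step is sketched below. The topology schema is not defined in this spec, so the top-level "nodes" mapping, the per-node fact dictionaries, and the merge_topology name are all assumptions for illustration.

```python
import copy


def merge_topology(static_base: dict, gathered: dict) -> dict:
    """Overlay gathered per-node facts on the static base without mutating it."""
    live = copy.deepcopy(static_base)
    nodes = live.setdefault("nodes", {})
    for name, facts in gathered.items():
        # Nodes missing from the static file are added; known nodes are updated.
        nodes.setdefault(name, {}).update(facts)
    return live
```

Copying the base first means a failed or partial refresh never corrupts the data loaded from topology.yaml; gathered facts win on conflicting keys in this sketch, which is a design choice the spec leaves open.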


At session start a compact topology summary (node count, subnet list, roles, service index) is prepended to the system prompt. The full topology is available via get_topology_full at any time.
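
As a rough illustration, the compact summary could be derived from the live topology like this. The field names "subnet", "role", and "services" and the summarize_topology function are assumed for the example; the real schema may differ.

```python
def summarize_topology(topology: dict) -> dict:
    """Reduce a full topology to node count, subnets, roles, and a service index."""
    nodes = topology.get("nodes", {})
    services = {}
    for name, info in nodes.items():
        for svc in info.get("services", []):
            services.setdefault(svc, []).append(name)
    return {
        "node_count": len(nodes),
        "subnets": sorted({n["subnet"] for n in nodes.values() if "subnet" in n}),
        "roles": sorted({n["role"] for n in nodes.values() if "role" in n}),
        # Maps each service name to the nodes that host it.
        "service_index": {svc: sorted(hosts) for svc, hosts in sorted(services.items())},
    }
```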


4.5 Playbook lifecycle
When the LLM drafts a new playbook:


* The draft is written to a local staging area inside the container.
* The server commits it to a dedicated review branch on the configured Git remote and opens a pull request.
* The draft is listed in list_pending_playbooks but is blocked from execution.
* When the human merges the PR, the playbook enters the live playbook library.
* promote_playbook can also be called directly to fast-track a merge from within the chat.


HARD LIMIT
	run_playbook will reject any playbook whose name does not exist in the merged main branch. Draft-branch playbooks stay blocked from execution until merged, regardless of session approval state.
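
A sketch of how this limit might be enforced is shown below. It assumes the server can already produce the set of playbook names present on main (for example from its local clone); assert_runnable and PlaybookNotMergedError are hypothetical names.

```python
class PlaybookNotMergedError(Exception):
    """Raised when a playbook is not present on the merged main branch."""


def assert_runnable(name: str, merged_playbooks: set) -> None:
    """Reject any playbook that is not on main.

    merged_playbooks is assumed to hold names found on the main branch only.
    Session approval is deliberately not a parameter, so it cannot override
    this check.
    """
    if name not in merged_playbooks:
        raise PlaybookNotMergedError(f"{name!r} is not on the main branch")
```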


5. Configuration reference
5.1 Environment variables (secrets)
Variable	Required	Purpose
MCP_API_KEY	Required	Bearer token for network clients
LOKI_URL	Required	Loki push/query base URL
LOKI_USER / LOKI_PASSWORD	Optional	Basic auth if Loki requires it
GIT_TOKEN	Required	Token for Ansible repo + PR creation
CLOUDFLARE_API_TOKEN	Optional	Cloudflare integration
NTFY_URL / NTFY_TOKEN	Optional	Ntfy push notification endpoint
DOCKERHUB_TOKEN	Optional	Authenticated image version checks
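
Startup validation of these variables could look roughly like the sketch below. The load_secrets function is a hypothetical helper; the variable names and their required/optional status are taken from the table above.

```python
import os

REQUIRED = ("MCP_API_KEY", "LOKI_URL", "GIT_TOKEN")
OPTIONAL = ("LOKI_USER", "LOKI_PASSWORD", "CLOUDFLARE_API_TOKEN",
            "NTFY_URL", "NTFY_TOKEN", "DOCKERHUB_TOKEN")


def load_secrets(environ=os.environ) -> dict:
    """Collect secrets from the environment, failing fast on missing required ones."""
    missing = [key for key in REQUIRED if not environ.get(key)]
    if missing:
        raise RuntimeError("missing required environment variables: " + ", ".join(missing))
    secrets = {key: environ[key] for key in REQUIRED}
    # Optional integrations are simply None when their variable is unset.
    secrets.update({key: environ.get(key) for key in OPTIONAL})
    return secrets
```

Failing at startup rather than at first use keeps a misconfigured container from appearing healthy until a tool call needs the missing token.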


5.2 config.yaml structure
server:
  http_port: 8765
  log_level: info


topology:
  static_file: /config/topology.yaml
  refresh_interval_minutes: 60


ansible:
  repo_url: git@github.com:you/homelab-playbooks.git
  local_path: /data/playbooks
  ssh_key_path: /secrets/id_ed25519
  review_branch_prefix: mcp-draft/


docker:
  compose_dir: /data/stacks
  remote_contexts:   # name: ssh://user@host
    node2: ssh://admin@192.168.1.12
    node3: ssh://admin@192.168.1.13


loki:
  audit_labels:
    job: mcp-audit
    env: homelab


integrations:
  github_base_url: https://api.github.com   # or your Gitea URL
  dockerhub_registry: https://hub.docker.com
  cloudflare_zone_id: ~   # optional
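

To show how loki.audit_labels might be used, the sketch below builds a request body for Loki's push endpoint (POST /loki/api/v1/push), which takes labelled streams of [timestamp, line] pairs. The build_loki_push helper and the event fields are assumptions; only the labels come from the config above.

```python
import json
import time


def build_loki_push(labels: dict, event: dict, ts_ns=None) -> dict:
    """Build a JSON body for Loki's POST /loki/api/v1/push endpoint."""
    ts_ns = time.time_ns() if ts_ns is None else ts_ns
    line = json.dumps(event, sort_keys=True)
    # Loki expects the timestamp as a string of Unix nanoseconds.
    return {"streams": [{"stream": dict(labels), "values": [[str(ts_ns), line]]}]}
```

Serialising the event as a JSON log line keeps the label set small (job and env only) while leaving fields such as session and tool queryable with LogQL's json parser.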


6. Security model
6.1 Network perimeter
* TLS is terminated at the reverse proxy. The server never handles raw TLS.
* The server should bind to 127.0.0.1 or a Docker internal network only — the proxy is the sole external listener.
* LAN clients reach the server through the proxy on the internal network, not by connecting to the server port directly.
* Public IP access is also routed through the proxy; the same API key is required regardless of source.


6.2 Authentication
* All SSE/HTTP connections must supply Authorization: Bearer <MCP_API_KEY>.
* stdio connections are inherently local and bypass key auth.
* Invalid or missing keys return 401 immediately with no information leakage.
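
The header check for SSE/HTTP connections might be sketched as follows. The is_authorized name is hypothetical, and how the header reaches this function depends on the HTTP framework chosen, which this spec does not fix.

```python
import hmac


def is_authorized(authorization_header, api_key: str) -> bool:
    """Check an Authorization header value against the configured API key."""
    if not authorization_header or not api_key:
        return False
    scheme, _, token = authorization_header.partition(" ")
    if scheme.lower() != "bearer":
        return False
    # compare_digest avoids leaking the match position through timing.
    return hmac.compare_digest(token.strip().encode(), api_key.encode())
```

Returning a plain boolean lets the caller answer every failure with the same bare 401, so a client cannot tell a missing header from a wrong key.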


6.3 Execution safety
* Ansible always runs in check mode (dry-run) first. Live execution requires an explicit second call.
* New playbooks are blocked from execution until merged into main via PR; session approval cannot override this.
* Session approval unlocks writes for the lifetime of one session only. A new connection or server restart resets state.
* All outbound HTTP requests are logged to Loki with URL, method, response code, and timestamp.
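
The dry-run-first rule could be enforced along these lines. ansible-playbook's --check, -i, and --extra-vars flags are real; the ansible_command helper and the DryRunTracker class are illustrative assumptions, and tracking dry runs per playbook name is only one possible reading of "explicit second call".

```python
import json


def ansible_command(playbook, inventory=None, extra_vars=None, check=True):
    """Build an ansible-playbook argv list; check=True adds --check (dry run)."""
    argv = ["ansible-playbook", playbook]
    if inventory:
        argv += ["-i", inventory]
    if extra_vars:
        # --extra-vars accepts a JSON string as well as key=value pairs.
        argv += ["--extra-vars", json.dumps(extra_vars, sort_keys=True)]
    if check:
        argv.append("--check")
    return argv


class DryRunTracker:
    """Remembers which playbooks have been dry-run in this session."""

    def __init__(self):
        self._checked = set()

    def record_dry_run(self, playbook):
        self._checked.add(playbook)

    def live_command(self, playbook, **kwargs):
        """Return the live argv only if a dry run was recorded first."""
        if playbook not in self._checked:
            raise PermissionError(f"{playbook} has not been dry-run yet")
        return ansible_command(playbook, check=False, **kwargs)
```

Building an argv list rather than a shell string avoids quoting problems when extra_vars contains spaces or quotes.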


7. Admin web UI
A lightweight read-only web UI is served by the MCP server on the same HTTP port under /ui. It requires the same API key (passed as a query param or cookie).


Panels:
* Live session log — active sessions, approval state, tool call history
* Pending playbooks — list of draft PRs awaiting review, with diff viewer
* Audit stream — recent Loki audit entries, filterable by session and tool
* Server health — uptime, topology last-refreshed timestamp, integration connectivity


NOTE
	The web UI is read-only and observational. All actions (promote, approve, run) happen through the MCP protocol in chat, never through the UI directly.


8. Open questions & deferred decisions
Items not resolved in this session — to be decided before or during implementation.


* Topology refresh interval — default set to 60 minutes; confirm this is appropriate for your lab cadence.
* SSH connection for Ansible — SSH key mount vs agent forwarding not finalised; key mount assumed, confirm path.
* Web UI authentication — API key as query param is convenient but less secure; consider a separate UI credential.
* Rate limiting — no rate limit on tool calls specified; consider adding one for public IP exposure.
* Multi-user — spec assumes a single operator. If multiple users will share the server, session approval state needs to be per-user, not per-connection.
* Backup / restore — no decision on persisting the topology cache or session state across container restarts.

