Homelab MCP Server
Full Design Specification
Version 1.0 · May 2026

SCOPE
A Python MCP server that acts as the LLM's point of entry into a self-hosted homelab. It provides infrastructure management, observability, network awareness, and internet access, with a semi-automatic approval model and a full audit trail.

1. Overview

This document captures all design decisions for a Model Context Protocol (MCP) server purpose-built for homelab management. It is the authoritative spec from which implementation begins. Every decision recorded here was made explicitly during the design session.

The server exposes five capability domains to an LLM client:

* Shell & local environment access
* Infrastructure deployment (Ansible + Docker)
* Observability & guided troubleshooting
* Network topology awareness
* Internet access with audit logging

A semi-automatic approval gate governs all write operations. Reads are unrestricted. The server is containerised, reverse-proxied, and ships its audit stream to the same Loki instance used for homelab logs.

2. Decisions summary

Complete record of every decision made during the design session.

2.1 Foundation

| Decision | Value |
|---|---|
| Transport | Both stdio (Claude Desktop) and SSE/HTTP (networked daemon) |
| Hosting | Docker container, always-on, volume-mounted |
| Access methods | Local LAN and public IP (behind reverse proxy) |
| Language | Python |

2.2 Security

| Decision | Value |
|---|---|
| Authentication | Static API key, sent as a bearer token in the Authorization header |
| TLS | Terminated at reverse proxy (Nginx / Caddy / Traefik); server binds plaintext internally |
| Approval granularity | Per session: approve writes once; free for the remainder of that session |
| Approval channel | Inline in chat: Claude surfaces pending action, user replies yes/no |

2.3 Infrastructure

| Decision | Value |
|---|---|
| Ansible source | Git repository; server clones/pulls on demand |
| Ansible connection | SSH key (mounted into container) |
| Playbook write policy | LLM may draft new playbooks; drafts go to review queue; never auto-executed |
| Review queue mechanism | Git PR / branch: draft pushed to review branch, promoted by human merge |
| Docker Compose scope | All nodes: local and remote via SSH + Docker context |
| Hard limits | Always dry-run Ansible before live execution; never auto-run unreviewed playbooks |

2.4 Observability

| Decision | Value |
|---|---|
| Log source | Loki (central aggregator); single query interface |
| Remote log method | All nodes ship to Loki; server queries Loki API only |
| Health metrics | systemd service status, Docker container state, CPU/memory/disk, open ports, ICMP ping, HTTP endpoint probes, SMART disk health |
| Troubleshoot mode | Guided: server runs diagnostic sequences, presents findings and suggested fixes; human decides action |

2.5 Network topology

| Decision | Value |
|---|---|
| Source of truth | Static YAML base enriched by Ansible fact-gathering on refresh |
| Refresh schedule | Scheduled interval (configurable, default 1 hour) |
| Fields captured | Node IPs, subnets/VLANs, services per node, node roles, OS & hardware, open ports, service dependencies |
| Session injection | Summary injected into every session context automatically; full topology available on demand |

2.6 Internet access

| Decision | Value |
|---|---|
| Permitted uses | Package/image version checks, documentation fetching, CVE lookups, outbound webhooks, Git pulls |
| Restriction policy | Open but logged: any URL permitted; every outbound request recorded to audit log |
| Built-in integrations | GitHub/Gitea, Docker Hub/GHCR, Cloudflare (DNS + tunnel), Ntfy (push notifications) |

2.7 Server operations

| Decision | Value |
|---|---|
| Audit log destination | Shipped to Loki (same instance as homelab logs), labelled job=mcp-audit |
| Config source | Env vars for secrets; YAML config file (volume-mounted) for everything else |
| Admin interface | Lightweight web UI: session log, pending approvals, server status |
| Playbook review queue | Git PR: LLM pushes draft to a review branch; promotion = merging the PR |

3. Tool catalogue

All tools exposed to the LLM via MCP. Read-only tools execute freely. Write tools require active session approval before first use.

3.1 Shell & environment

| Tool name | Parameters | Auth required |
|---|---|---|
| run_command | cmd, node?, timeout? | Write (executes) |
| read_file | path, node? | Read (free) |
| write_file | path, content, node? | Write (modifies FS) |
| list_directory | path, node? | Read (free) |
| get_env | key?, node? | Read (free) |

3.2 Infrastructure: Ansible

| Tool name | Parameters | Auth required |
|---|---|---|
| list_playbooks | (none) | Read (free) |
| get_playbook | name | Read (free) |
| dry_run_playbook | name, inventory?, extra_vars? | Read (free, check mode) |
| run_playbook | name, inventory?, extra_vars? | Write (deploys) |
| draft_playbook | name, content | Write (creates PR) |
| list_inventory | (none) | Read (free) |
| refresh_inventory | (none) | Write (pulls git) |

3.3 Infrastructure: Docker

| Tool name | Parameters | Auth required |
|---|---|---|
| list_stacks | node? | Read (free) |
| get_stack_status | stack, node? | Read (free) |
| compose_up | stack, node? | Write (deploys) |
| compose_down | stack, node? | Write (stops) |
| compose_pull | stack, node? | Write (pulls images) |
| get_container_logs | container, node?, lines? | Read (free) |
| list_images | node? | Read (free) |

3.4 Observability

| Tool name | Parameters | Auth required |
|---|---|---|
| query_logs | logql, start?, end?, limit? | Read (free) |
| get_service_status | service, node? | Read (free) |
| get_resource_usage | node? | Read (free) |
| check_port | host, port | Read (free) |
| ping_host | host, count? | Read (free) |
| probe_http | url, expected_status? | Read (free) |
| get_smart_health | device, node? | Read (free) |
| run_diagnostic | service, node? | Read (free) |

3.5 Network topology

| Tool name | Parameters | Auth required |
|---|---|---|
| get_topology_summary | (none) | Read (free, always in context) |
| get_topology_full | (none) | Read (free) |
| find_service | name | Read (free) |
| find_node | name_or_ip | Read (free) |
| list_nodes | role? | Read (free) |
| refresh_topology | (none) | Write (runs Ansible facts) |

3.6 Internet

| Tool name | Parameters | Auth required |
|---|---|---|
| http_fetch | url, method?, headers?, body? | Read (logged) |
| dns_lookup | hostname, type? | Read (logged) |
| check_package_version | package, ecosystem | Read (logged) |
| cve_lookup | cve_id or package | Read (logged) |
| git_pull | repo_url, path? | Write (clones/pulls) |
| send_notification | message, title?, priority? | Write (Ntfy) |
| cloudflare_dns | action, zone, record? | Write (DNS change) |

3.7 Session & approval

| Tool name | Parameters | Auth required |
|---|---|---|
| get_session_info | (none) | Read (free) |
| approve_writes | (none) | Activates write approval for session |
| revoke_writes | (none) | Revokes write approval for session |
| get_audit_log | limit?, since? | Read (free) |
| list_pending_playbooks | (none) | Read (free) |
| promote_playbook | name | Write (merges PR) |

4. Architecture

4.1 Container layout

The server runs as a single Docker container on the control node. Volume mounts provide access to SSH keys, the YAML config, and the topology file. All secrets are injected as environment variables; the container itself contains no credentials at build time.
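
As a minimal sketch of the environment-variable injection described above, the server could read its secrets once at startup and fail fast when a required one is absent. The variable names come from section 5.1; the `Secrets` dataclass and the `load_secrets` helper are hypothetical names used only for illustration, and only a subset of the optional variables is shown.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# Variables marked "Required" in section 5.1; all others are optional.
REQUIRED = ("MCP_API_KEY", "LOKI_URL", "GIT_TOKEN")


@dataclass(frozen=True)
class Secrets:
    """Hypothetical container for secrets read from the environment."""
    mcp_api_key: str
    loki_url: str
    git_token: str
    cloudflare_api_token: Optional[str] = None
    ntfy_url: Optional[str] = None


def load_secrets(env: Mapping[str, str] = os.environ) -> Secrets:
    """Build Secrets from env, raising if a required variable is unset or empty."""
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError(
            "missing required environment variables: " + ", ".join(missing)
        )
    return Secrets(
        mcp_api_key=env["MCP_API_KEY"],
        loki_url=env["LOKI_URL"],
        git_token=env["GIT_TOKEN"],
        cloudflare_api_token=env.get("CLOUDFLARE_API_TOKEN"),
        ntfy_url=env.get("NTFY_URL"),
    )
```

Accepting the environment as a parameter keeps the helper testable without touching the real process environment; whether the server should refuse to start or merely disable an integration when an optional token is missing is not decided by this spec.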

    mcp-server/
      Dockerfile
      docker-compose.yml
      config/
        config.yaml       # non-secret configuration
        topology.yaml     # static topology base
      src/
        server.py         # MCP entry point (stdio + SSE)
        tools/            # one module per domain
        approval.py       # session approval gate
        topology.py       # topology loader + refresher
        audit.py          # Loki shipper
        integrations/     # GitHub, Docker Hub, Cloudflare, Ntfy

4.2 Transport modes

The server supports both transport modes simultaneously:

* stdio: invoked directly by Claude Desktop via the MCP stdio protocol. The process is spawned per session. Suitable for local desktop use.
* SSE/HTTP: the server also binds an HTTP listener (default port 8765) serving the MCP SSE transport. The reverse proxy terminates TLS and forwards to this port. Any MCP-compatible client can connect with a valid API key.

4.3 Approval flow

Write operations follow a strict gate:

1. LLM calls a write tool.
2. Server checks session approval state. If not approved, returns a structured pending response to the LLM.
3. LLM surfaces the pending action inline in chat with a clear description of what will be executed.
4. User replies yes/no. On yes, the approve_writes tool is called, unlocking writes for the session.
5. The original write tool is retried and executed. All subsequent writes in this session are free.
6. Every action (approved or denied) is written to the audit log and shipped to Loki.

4.4 Topology lifecycle

On startup the server loads topology.yaml as the static base. A background scheduler (default: every 60 minutes) runs an Ansible facts gather against the inventory and merges the result into the live topology object. The refresh interval is configurable in config.yaml.

At session start a compact topology summary (node count, subnet list, roles, service index) is prepended to the system prompt. The full topology is available via get_topology_full at any time.

4.5 Playbook lifecycle

When the LLM drafts a new playbook:

* The draft is written to a local staging area inside the container.
* The server commits it to a dedicated review branch on the configured Git remote and opens a pull request.
* The draft is listed in list_pending_playbooks but is blocked from execution.
* When the human merges the PR, the playbook enters the live playbook library.
* promote_playbook can also be called directly to fast-track a merge from within the chat.

HARD LIMIT
run_playbook will reject any playbook whose name does not exist in the merged main branch. Playbooks that exist only on a draft branch are blocked from execution regardless of session approval state.

5. Configuration reference

5.1 Environment variables (secrets)

| Variable | Required | Purpose |
|---|---|---|
| MCP_API_KEY | Required | Bearer token for network clients |
| LOKI_URL | Required | Loki push/query base URL |
| LOKI_USER / LOKI_PASSWORD | Optional | Basic auth if Loki requires it |
| GIT_TOKEN | Required | Token for Ansible repo + PR creation |
| CLOUDFLARE_API_TOKEN | Optional | Cloudflare integration |
| NTFY_URL / NTFY_TOKEN | Optional | Ntfy push notification endpoint |
| DOCKERHUB_TOKEN | Optional | Authenticated image version checks |

5.2 config.yaml structure

    server:
      http_port: 8765
      log_level: info
    topology:
      static_file: /config/topology.yaml
      refresh_interval_minutes: 60
    ansible:
      repo_url: git@github.com:you/homelab-playbooks.git
      local_path: /data/playbooks
      ssh_key_path: /secrets/id_ed25519
      review_branch_prefix: mcp-draft/
    docker:
      compose_dir: /data/stacks
      remote_contexts:  # name: ssh://user@host
        node2: ssh://admin@192.168.1.12
        node3: ssh://admin@192.168.1.13
    loki:
      audit_labels:
        job: mcp-audit
        env: homelab
    integrations:
      github_base_url: https://api.github.com  # or your Gitea URL
      dockerhub_registry: https://hub.docker.com
      cloudflare_zone_id: ~  # optional

6. Security model

6.1 Network perimeter

* TLS is terminated at the reverse proxy. The server never handles raw TLS.
* The server should bind to 127.0.0.1 or a Docker internal network only; the proxy is the sole external listener.
* LAN clients reach the server through the proxy on the internal network.
* Public IP access is also routed through the proxy; the same API key is required regardless of source.

6.2 Authentication

* All SSE/HTTP connections must supply an Authorization header of the form Bearer followed by the API key.
* stdio connections are inherently local and bypass key auth.
* Invalid or missing keys return 401 immediately with no information leakage.

6.3 Execution safety

* Ansible always runs in check mode (dry-run) first. Live execution requires an explicit second call.
* New playbooks are blocked from execution until merged into main via PR.
* Session approval unlocks writes for the lifetime of one session only. A new connection or server restart resets state.
* All outbound HTTP requests are logged to Loki with URL, method, response code, and timestamp.

7. Admin web UI

A lightweight read-only web UI is served by the MCP server on the same HTTP port under /ui. It requires the same API key (passed as a query param or cookie).

Panels:

* Live session log: active sessions, approval state, tool call history
* Pending playbooks: list of draft PRs awaiting review, with diff viewer
* Audit stream: recent Loki audit entries, filterable by session and tool
* Server health: uptime, topology last-refreshed timestamp, integration connectivity

NOTE
The web UI is read-only and observational. All actions (promote, approve, run) happen through the MCP protocol in chat, never through the UI directly.

8. Open questions & deferred decisions

Items not resolved in this session, to be decided before or during implementation.

* Topology refresh interval: default set to 60 minutes; confirm this is appropriate for your lab cadence.
* SSH connection for Ansible: SSH key mount vs agent forwarding not finalised; key mount assumed, confirm path.
* Web UI authentication: API key as query param is convenient but less secure; consider a separate UI credential.
* Rate limiting: no rate limit on tool calls specified; consider adding one for public IP exposure.
* Multi-user: spec assumes a single operator. If multiple users will share the server, session approval state needs to be per-user, not per-connection.
* Backup / restore: no decision on persisting the topology cache or session state across container restarts.
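
To make the session-state questions above concrete, here is a minimal in-memory sketch of the approval gate from section 4.3. It is an illustration under stated assumptions, not the implementation: the `ApprovalGate` class, its `call_write_tool` method, and the response dictionaries are hypothetical, and a plain list stands in for the audit stream that the spec ships to Loki. Only the tool names approve_writes and revoke_writes come from the tool catalogue.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Set


@dataclass
class ApprovalGate:
    """In-memory write-approval state keyed by session id (hypothetical design)."""
    approved: Set[str] = field(default_factory=set)
    # Stand-in for the audit log; the spec ships these entries to Loki.
    audit: List[Dict[str, Any]] = field(default_factory=list)

    def approve_writes(self, session_id: str) -> None:
        """Unlock write tools for this session."""
        self.approved.add(session_id)

    def revoke_writes(self, session_id: str) -> None:
        """Lock write tools again for this session."""
        self.approved.discard(session_id)

    def call_write_tool(self, session_id: str, tool: str,
                        action: Callable[[], Any]) -> Dict[str, Any]:
        """Run the action if the session is approved, else return a pending response."""
        if session_id not in self.approved:
            self.audit.append(
                {"session": session_id, "tool": tool, "outcome": "pending"}
            )
            return {"status": "pending_approval", "tool": tool}
        result = action()
        self.audit.append(
            {"session": session_id, "tool": tool, "outcome": "executed"}
        )
        return {"status": "ok", "result": result}


# Usage: the first call is gated, the retry after approval executes.
gate = ApprovalGate()
first = gate.call_write_tool("s1", "compose_up", lambda: "started")
gate.approve_writes("s1")
retry = gate.call_write_tool("s1", "compose_up", lambda: "started")
# first == {"status": "pending_approval", "tool": "compose_up"}
# retry == {"status": "ok", "result": "started"}
```

Because the state is held in memory and keyed by an opaque id, it resets on restart as section 6.3 requires, and the key could later become a user id if the multi-user question is resolved that way.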