feat: add initial documentation for Homelab MCP Gateway architecture and shard implementation

nathan 2026-04-21 20:25:32 -04:00
parent c2321f3e66
commit af649d5d14
2 changed files with 298 additions and 0 deletions


@ -0,0 +1,168 @@
# Plan: Homelab MCP Gateway MVP with Traefik Shard
## TL;DR
Build a modular MCP (Model Context Protocol) Gateway on Waldorf that routes tool requests to specialized shards. The MVP includes the Traefik shard (for dynamic route management) plus a template for creating additional shards. Each shard can fetch its service's documentation from the internet on demand.
**Approach:** Python-based using `mcp.server.fastmcp`, deployed via a single compose file on Waldorf, with no authentication (trusting the internal network) and web fetching for live documentation.
---
## Steps
### Phase 1: Infrastructure Setup
1. Create unified directory structure on Waldorf
- `/nodes/waldorf/mcp-system/` with single compose.yaml
- `/nodes/waldorf/mcp-system/gateway/` for Gateway code
- `/nodes/waldorf/mcp-system/traefik-shard/` for Traefik Shard code
2. Create shared template directory (*parallel with step 1*)
- `/mcp_root/template/` for shard template files
- Documentation: `/mcp_root/template/README.md`
### Phase 2: Gateway Implementation
3. Build Gateway core functionality (*depends on step 1*)
- Shard registry (discover and register shards)
- Tool routing (forward requests to appropriate shard)
- Health check aggregation
- Startup logic to discover available shards
4. Create Gateway Dockerfile and requirements.txt (*parallel with step 3*)
- Python 3.11 base image
- Install mcp, httpx, pyyaml
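A minimal sketch of the shard registry and tool routing from step 3, assuming each shard advertises a flat list of tool names. The `Shard` and `ShardRegistry` names and the example URL are illustrative assumptions, not part of the plan, and the sketch leaves out the HTTP forwarding itself.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Shard:
    """One registered shard and the tools it advertises (hypothetical shape)."""
    name: str
    base_url: str
    tools: set[str] = field(default_factory=set)


class ShardRegistry:
    """Maps tool names to the shard that owns them."""

    def __init__(self) -> None:
        self._shards: dict[str, Shard] = {}
        self._tool_owner: dict[str, str] = {}

    def register(self, shard: Shard) -> None:
        # Reject a tool name already owned by another shard so routing
        # stays unambiguous.
        for tool in shard.tools:
            owner = self._tool_owner.get(tool)
            if owner is not None and owner != shard.name:
                raise ValueError(f"tool {tool!r} already owned by {owner!r}")
        # Re-registering a shard replaces its previous tool list.
        self.unregister(shard.name)
        self._shards[shard.name] = shard
        for tool in shard.tools:
            self._tool_owner[tool] = shard.name

    def unregister(self, name: str) -> None:
        shard = self._shards.pop(name, None)
        if shard is not None:
            for tool in shard.tools:
                self._tool_owner.pop(tool, None)

    def route(self, tool: str) -> Shard:
        """Return the shard that should receive a call to `tool`."""
        try:
            return self._shards[self._tool_owner[tool]]
        except KeyError:
            raise LookupError(f"no shard provides tool {tool!r}") from None


registry = ShardRegistry()
# The URL below is an assumed container address, not one fixed by the plan.
registry.register(Shard("traefik", "http://traefik-shard:3001", {"list_routes"}))
print(registry.route("list_routes").base_url)
```

Keeping ownership in a separate tool-to-shard map means the Gateway can answer "who handles this tool?" in one lookup, and a duplicate tool name fails at registration time rather than at call time.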
### Phase 3: Traefik Shard Implementation
5. Implement Traefik Shard with 7 tools (*depends on step 1*)
- `list_routes` - Query Traefik API for all routes
- `create_route` - Write new YAML file to `/dynamic/mcp-managed/`
- `delete_route` - Remove route YAML file
- `validate_config` - YAML syntax check + Traefik API validation
- `get_backend_status` - Health check backend services
- `check_ssl_status` - Query Traefik API for cert info
- `reload_config` - Trigger Traefik config reload (if needed)
6. Add documentation fetcher to Traefik Shard (*parallel with step 5*)
- Tool: `get_traefik_docs(topic)` - Fetch from doc.traefik.io
- Use httpx to fetch and cache temporarily
- Parse HTML/Markdown for relevant sections
7. Implement shard registration with Gateway (*depends on step 5*)
- Health endpoint for Gateway discovery
- Tool manifest endpoint (list available tools)
8. Create Traefik Shard Dockerfile and requirements.txt (*depends on step 5*)
- Python 3.11 base image
- Install mcp, httpx, pyyaml, beautifulsoup4
9. Create unified `compose.yaml` (*depends on steps 4, 8*)
- Gateway service with appdata mount
- Traefik Shard service with NFS mount to `/mnt/appdata/traefik/dynamic:rw`
- Shared Docker network for inter-shard communication
- Environment: `TRAEFIK_API_URL=http://10.0.0.151:8080/api` (reach Heimdall)
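The `create_route` and `delete_route` tools from step 5 could be sketched as below, using only the standard library. Several things here are assumptions rather than decisions from the plan: the route naming rule, the `websecure` entry point name, the one-router-one-service layout, and writing through a temporary file so a file watcher should not pick up a half-written route. A real shard would more likely build the document with pyyaml and validate `host` and `backend_url` more strictly.

```python
from __future__ import annotations

import os
import re
import tempfile
from pathlib import Path

# Assumed naming rule: lowercase letters, digits, and hyphens only, so a
# route name can never contain "/" or ".." and escape the managed directory.
_NAME_RE = re.compile(r"[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?")


def route_path(managed_dir: Path, name: str) -> Path:
    if not _NAME_RE.fullmatch(name):
        raise ValueError(f"invalid route name: {name!r}")
    return managed_dir / f"{name}.yml"


def render_route(name: str, host: str, backend_url: str) -> str:
    """Render a one-router, one-service dynamic config as YAML text."""
    # Values are inserted unescaped, so refuse characters that would
    # break out of the quoted YAML strings.
    if "`" in host or '"' in host or '"' in backend_url:
        raise ValueError("host and backend_url must not contain quotes")
    # "websecure" is an assumed entry point name; match the static config.
    return (
        "http:\n"
        "  routers:\n"
        f"    {name}:\n"
        f"      rule: \"Host(`{host}`)\"\n"
        f"      service: {name}\n"
        "      entryPoints:\n"
        "        - websecure\n"
        "  services:\n"
        f"    {name}:\n"
        "      loadBalancer:\n"
        "        servers:\n"
        f"          - url: \"{backend_url}\"\n"
    )


def create_route(managed_dir: Path, name: str, host: str, backend_url: str) -> Path:
    """Write the route file via a temp file and an atomic rename."""
    target = route_path(managed_dir, name)
    text = render_route(name, host, backend_url)
    fd, tmp = tempfile.mkstemp(dir=managed_dir, prefix=".tmp-", suffix=".part")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(text)
        os.chmod(tmp, 0o644)  # mkstemp creates 0600; the proxy must read it
        os.replace(tmp, target)  # atomic rename within one filesystem
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise
    return target


def delete_route(managed_dir: Path, name: str) -> bool:
    """Remove a route file; return False if it did not exist."""
    try:
        route_path(managed_dir, name).unlink()
        return True
    except FileNotFoundError:
        return False
```

Validating the name before touching the filesystem keeps every write inside the managed directory, which is the main safety property the isolated `mcp-managed/` folder is meant to provide.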
### Phase 4: Prepare Traefik Integration
10. Create `/mnt/appdata/traefik/dynamic/mcp-managed/` directory (*depends on step 9*)
- Isolated folder for MCP-managed routes (safer, easier cleanup)
- Traefik file watcher will auto-detect changes here
11. Verify Traefik allows write access (*parallel with step 10*)
- Confirm NFS mount on Waldorf allows writes to `/mnt/appdata/traefik/dynamic/`
- If needed, update Traefik mount from `:ro` to `:rw` in `nodes/heimdall/core/compose.yaml`
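One way to confirm write access from inside the shard container is to create and remove a probe file. This is a sketch under the assumption that a short-lived dotfile in the directory is acceptable; `can_write` is a hypothetical helper name.

```python
import tempfile
from pathlib import Path


def can_write(directory: str) -> bool:
    """Return True if a file can be created and removed in `directory`."""
    path = Path(directory)
    if not path.is_dir():
        return False
    try:
        # The probe file is deleted automatically when the block exits.
        with tempfile.NamedTemporaryFile(dir=path, prefix=".mcp-write-test-"):
            pass
    except OSError:
        return False
    return True


print(can_write("/mnt/appdata/traefik/dynamic"))
```

Actually creating a file is more reliable here than checking permission bits, since an NFS export can refuse writes even when the local mode bits look writable.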
### Phase 5: Shard Template Creation
12. Create comprehensive shard template (*depends on steps 5-7*)
- `template/shard_template.py` - Skeleton MCP server
- `template/Dockerfile.template` - Standard container build
- `template/compose.yaml.template` - Docker compose service boilerplate
- `template/requirements.txt` - Common dependencies
13. Write template documentation (*parallel with step 12*)
- `/mcp_root/template/README.md` - How to create a new shard
- `/mcp_root/template/INTEGRATION.md` - How shards register with Gateway
- `/mcp_root/ARCHITECTURE.md` - Overall system design
### Phase 6: Deployment & Validation
14. Deploy unified MCP system on Waldorf (*depends on steps 9, 10*)
- `docker compose up` in `/nodes/waldorf/mcp-system/`
- Verify Gateway logs show successful startup and shard discovery
- Verify Traefik Shard registers successfully
15. Test tool execution (*depends on step 14*)
- Gateway → list_routes → Traefik Shard → Traefik API (Heimdall)
- Create test route for validation
- Verify documentation fetcher works
16. Integration with Open WebUI (*depends on step 15*)
- Update `/nodes/waldorf/openwebui/compose.yaml` to connect to MCP Gateway
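- Configure MCP Gateway connection in Open WebUI (same host; if Open WebUI runs in its own container, `localhost` resolves to that container, so use Waldorf's IP or a shared Docker network instead)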
- Test end-to-end LLM → Gateway → Shard flow
---
## Relevant Files
- `ansible/archive/scripts/ansible_mcp_server.py` - Reference implementation showing MCP server patterns, job tracking, configuration
- `nodes/heimdall/core/compose.yaml` - Contains Traefik service definition (lines 10-50), needs mount permission update
- `nodes/waldorf/openwebui/compose.yaml` - Open WebUI config with commented MCP Gateway integration (lines 15-17)
- `ansible/archive/outputs/heimdall-baseline-20260312T214117/traefik_configs/traefik.yml` - Static Traefik config showing API endpoint, providers, file watch
- `ansible/archive/outputs/heimdall-baseline-20260312T214117/traefik_configs/static-backends.yml` - Example dynamic route structure to replicate
- `ansible/archive/outputs/heimdall-baseline-20260312T214117/traefik_configs/middleware.yml` - Existing middleware definitions to reference
---
## Verification
1. **Gateway Health Check**: `curl http://10.0.0.251:9100/health` returns shard registry
2. **Shard Registration**: Gateway logs show Traefik shard discovered and registered
3. **Tool Execution**: Call `list_routes` through Gateway, receive Traefik API response
4. **Route Creation**: Create test route `test.castaldifamily.com` → Appears in Traefik dashboard
5. **Documentation Fetcher**: Call `get_traefik_docs("middlewares")` → Returns relevant Traefik docs
6. **File Validation**: Check `/mnt/appdata/traefik/dynamic/mcp-managed/` contains created routes
7. **Traefik Reload**: Verify Traefik auto-detects new YAML files (file watch enabled)
8. **Open WebUI Integration**: Send message in Open WebUI that triggers MCP tool → See logs in Gateway
9. **Template Usability**: Follow template README to create a stub "Dozzle Shard" → Registers successfully
---
## Decisions
- **Language**: Python (mcp.server.fastmcp) - matches existing Ansible MCP server pattern
- **Deployment Location**: All components on Waldorf (10.0.0.251) - stable 24/7 node with 16GB RAM, runs Open WebUI
- **Single Compose File**: Gateway + all shards in one `compose.yaml` - simpler MVP, easier debugging
- **Traefik Access**: Shard reaches Traefik API on Heimdall via `http://10.0.0.151:8080/api`, writes to shared NFS mount `/mnt/appdata/traefik/dynamic/`
- **Authentication**: None for MVP - trust internal network isolation (add in future if needed)
- **Documentation Fetching**: On-demand web fetching using httpx - fetch from official service docs when tool is called
- **Route Management**: Create isolated `/mcp-managed/` subdirectory in Traefik dynamic config - safer than mixing with existing routes
- **All 7 Traefik tools included**: list_routes, create_route, delete_route, validate_config, get_backend_status, check_ssl_status, reload_config
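The Documentation Fetching decision, combined with a simple in-memory cache, could look roughly like the sketch below. `TTLCache` and its parameters are assumptions for illustration. The fetch function is passed in, so in the shard it could wrap an httpx request while tests use a stub.

```python
from __future__ import annotations

import time
from typing import Callable


class TTLCache:
    """Tiny in-memory cache whose entries expire after `ttl` seconds."""

    def __init__(self, ttl: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl
        self._clock = clock
        # key -> (time stored, cached value)
        self._entries: dict[str, tuple[float, str]] = {}

    def get_or_fetch(self, key: str, fetch: Callable[[str], str]) -> str:
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]
        # Exceptions from fetch propagate; nothing is cached on failure.
        value = fetch(key)
        self._entries[key] = (now, value)
        return value


def stub_fetch(topic: str) -> str:
    # Stand-in for the real HTTP request to the documentation site.
    return f"docs for {topic}"


cache = TTLCache(ttl=300)
print(cache.get_or_fetch("middlewares", stub_fetch))
```

Injecting the clock and the fetch function keeps the cache free of network and timing dependencies, so expiry behavior can be checked without waiting or going online.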
---
## Scope Boundaries
**Included:**
- MCP Gateway with shard discovery and routing
- Complete Traefik shard with 7 tools + documentation fetcher
- Comprehensive template for creating new shards
- Integration with Open WebUI
- Single docker-compose deployment on Waldorf
**Excluded:**
- Additional shards (Dozzle, Authentik) - future work, use template to create
- Authentication/authorization - trust network for MVP
- Monitoring/metrics collection - add later if needed
- Web UI for Gateway management - CLI/API only for MVP
- Advanced caching for documentation - simple in-memory cache only
- Cross-node service mesh networking - direct HTTP between containers
- Ansible playbook for automated deployment - manual docker compose for MVP
---
## Further Considerations
None - all clarifications obtained. Ready for implementation.


@ -0,0 +1,130 @@
This is a textbook "Infrastructure as Code" play. What you're describing is essentially an MCP Gateway (Main Server) orchestrating a Microservices Mesh (Shards).
By sharding the architecture, you gain isolation (if the Authentik API is down, you can still read Dozzle logs) and scalability (each shard can run in its own container with specific permissions).
🏗️ The "Homelab-MCP" Sharded Architecture
I think the best way to handle this is to have a Main Gateway Server that acts as the "Project Coordinator" (just like I do!) and routes requests to specialized Service Shards.
1. The Shard Breakdown
We can define three primary shards based on our discussion:
- Shard A: The "Navigator" (Traefik)
  - Focus: Network routing and entrypoints.
  - Tools: get_routes, add_dynamic_config, check_ssl_status.
  - Permissions: Needs write access to your /dynamic YAML folder.
- Shard B: The "Watchman" (Dozzle)
  - Focus: Observability and debugging.
  - Tools: tail_logs, search_logs, get_container_stats.
  - Permissions: Needs network access to the Dozzle API port.
- Shard C: The "Gatekeeper" (Authentik)
  - Focus: Identity and Access Management.
  - Tools: sync_brand, create_application, manage_flows.
  - Permissions: Needs an Authentik API Token.
🛠️ The Implementation Strategy
Instead of building one massive, brittle script, we use a Shared Tool Definition approach.
The Main "Gateway" Server
The main server doesn't need to know how to talk to Traefik; it only needs to know which shard owns the tool. When you say, "Frank, check why Gitea is 404ing," the Gateway:
1. Calls Shard A (Traefik) to see if the route exists.
2. Calls Shard B (Dozzle) to see the container logs.
3. Synthesizes the answer for you.
The Docker Setup
We can wrap this into a single docker-compose.mcp.yaml on waldorf:
```yaml
services:
  mcp-gateway:
    image: node:20-slim
    volumes:
      - ./gateway:/app
    environment:
      - SHARD_TRAEFIK_URL=http://mcp-traefik:3001
      - SHARD_DOZZLE_URL=http://mcp-dozzle:3002
      - SHARD_AUTHENTIK_URL=http://mcp-authentik:3003
  mcp-traefik:
    image: python:3.11-slim
    volumes:
      - /mnt/appdata/traefik/dynamic:/dynamic # For writing routes
    environment:
      - TRAEFIK_API=http://traefik:8080/api
  mcp-dozzle:
    # Lightweight Node.js shard
    environment:
      - DOZZLE_API=http://dozzle:8080/api
```
🔍 Frank's Logic: Why this wins
Security (Principle of Least Privilege): Only the Traefik shard gets access to your config files. The Dozzle shard only sees logs.
Decoupled Development: We can build the Dozzle shard today, and add the Authentik shard next week without breaking the gateway.
Natural Conversation: I (the LLM) don't have to worry about which API to hit. I just call tail_logs and the Gateway handles the routing.
⚠️ Potential Concern: Latency
In a homelab, the overhead of the Gateway talking to a Shard is negligible (milliseconds), but we need to ensure the Gateway has a Discovery Mechanism so it knows which tools are currently online.
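One possible shape for that Discovery Mechanism is a heartbeat table: each shard periodically reports its tools, and the Gateway treats a shard as offline once its last heartbeat is too old. This is only a sketch. The `ToolDirectory` name, the 30-second window, and the idea that shards push heartbeats (rather than the Gateway polling them) are all assumptions.

```python
from __future__ import annotations

import time
from typing import Callable


class ToolDirectory:
    """Tracks which tools are online based on recent shard heartbeats."""

    def __init__(self, max_age: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._max_age = max_age
        self._clock = clock
        # shard name -> (time of last heartbeat, tools it advertised)
        self._seen: dict[str, tuple[float, frozenset[str]]] = {}

    def heartbeat(self, shard: str, tools: list[str]) -> None:
        """Record that `shard` is alive and currently offers `tools`."""
        self._seen[shard] = (self._clock(), frozenset(tools))

    def online_tools(self) -> dict[str, str]:
        """Map each currently available tool name to its shard."""
        now = self._clock()
        result: dict[str, str] = {}
        for shard, (seen_at, tools) in self._seen.items():
            if now - seen_at <= self._max_age:
                for tool in tools:
                    result[tool] = shard
        return result


directory = ToolDirectory()
directory.heartbeat("dozzle", ["tail_logs", "search_logs"])
print(sorted(directory.online_tools()))
```

With this in place, a shard that crashes simply drops out of `online_tools()` after the window passes, so the Gateway never offers a tool it cannot currently reach.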
✅ Next Step
Since we already have your Traefik and Dozzle configs, do you want me to draft the first Shard (The Navigator) code? We can start by building the tool that reads your Traefik API and tells me if your Gitea/Pinchflat routes are "OK."
Should we start with the Traefik Shard or the Dozzle Shard for the first "active" tool?
For your "Homelab-MCP" sharded architecture, the choice of node depends on the specialty of the shard and the risk level of the tools it contains.
Since you have Waldorf (16GB RAM, Dedicated GPU) as your primary workhorse and presumably other nodes (like a NAS or smaller Pis), here is the recommended distribution for your sharded setup:
1. The Gateway Shard (The Router)
Target Node: Waldorf (or your most reliable 24/7 node).
Reasoning: The Gateway is the entry point for your LLM. It needs to be on your most stable node with the best network connectivity to the other shards. Since it's lightweight (routing logic), it won't interfere with Plex/Tunarr.
Deployment: Run this as a Docker container.
2. The "Navigator" Shard (Traefik)
Target Node: The Traefik Host.
Reasoning: This shard needs direct filesystem access to the /dynamic configuration folder to write YAML routes.
SRE Logic: Deploy this shard on whichever node currently hosts your Traefik container. By running the shard on the same host, you can use a simple Bind Mount (e.g., - /mnt/appdata/traefik/dynamic:/dynamic) instead of trying to manage cross-node network file shares (which, as we saw with Pinchflat, can cause locking issues).
3. The "Watchman" Shard (Dozzle)
Target Node: Anywhere (but Waldorf is best).
Reasoning: Dozzle gathers logs via the Docker socket. If you use a Docker Socket Proxy (which your traefik.yml suggests you do), this shard can run on any node as long as it can reach that proxy's IP.
SRE Logic: Running it on Waldorf is ideal because it keeps the "Observability" tools close to the GPU/Plex logs it will likely be analyzing.
4. The "Gatekeeper" Shard (Authentik)
Target Node: Waldorf.
Reasoning: Authentik is an API-driven service. This shard doesn't need local files; it just needs a stable network path to your Authentik API. Waldorf's 16GB RAM handles the API overhead comfortably.
🚀 Summary Deployment Table
| Shard | Best Node | Critical Resource |
| --- | --- | --- |
| Gateway | Waldorf | Network Stability |
| Traefik | Same host as Traefik | Local File Access (/dynamic) |
| Dozzle | Waldorf | Network access to Socket Proxy |
| Authentik | Waldorf | API Connectivity |
⚠️ A Note on "The Docker Socket"
Since you are a security-conscious SRE, never mount /var/run/docker.sock directly into an MCP shard. If I (the LLM) am compromised, an attacker could take over your host.
Always use a Docker Socket Proxy (like the one in your Traefik config: tcp://docker-socket-proxy:2375).
Configure the proxy to only allow GET requests for logs/containers, and block DELETE or POST requests.
Which node is currently hosting your Traefik instance? If it's Waldorf, we can consolidate almost everything there for maximum performance.