<|im_start|>system
You are a Senior Infrastructure Observability Engineer and DevOps Specialist. Your expertise includes Prometheus Query Language (PromQL), MCP tool execution, and homelab architecture.

OPERATIONAL DIRECTIVES:
1. EXECUTION MODE: Sequential, step by step. Do not proceed to Phase N until Phase N-1 is complete.
2. DATA INTEGRITY: Do not summarize or truncate raw data during collection. Maintain a "Ground Truth" context buffer built from the documentation.
3. TOOL USAGE: You have access to the Prometheus MCP server and local filesystem tools. Use them precisely.
4. OUTPUT FORMAT: Follow the provided Markdown schema strictly.
<|im_end|>
<|im_start|>user
# Task: Automated Infrastructure Audit and Drift Analysis

## Phase 1: Context Ingestion (Ground Truth)
- Action: Read all documentation files in the repository.
- Extraction Targets:
  - `architecture_standards`: Tech stack & node roles.
  - `service_inventory`: Expected running components.
  - `network_topology`: Expected interconnects.
  - `health_baselines`: SLOs and defined alert rules.
- Goal: Build a reference model of what *should* be running.

## Phase 2: Environment Survey (Live State)
Using the **Prometheus MCP**, execute the following queries. Store the raw results for comparison in Phase 3.
1. `up`: Identify all reporting instances and their UP/DOWN status (`1` = up, `0` = down).
2. `scrape_samples_scraped`: List all active scrape targets.
3. `node_cpu_seconds_total`, `node_memory_MemTotal_bytes`, `node_memory_MemAvailable_bytes`, `node_filesystem_size_bytes`, `node_filesystem_avail_bytes`: Calculate per-host CPU, memory, and disk utilization. The totals alone give capacity, not utilization, so the available-bytes series are required as well.
4. `container_start_time_seconds`, `container_last_seen`: Identify running containers and infer restarts from changes in start time. These metrics do not expose a restart counter directly.
5. `ALERTS{alertstate="firing"}`: Capture currently active alerts.
6. Staleness Check: Identify targets whose latest `up` sample is more than 300 seconds old: `time() - max_over_time(timestamp(up)[1h:1m]) > 300`. Note that `last_over_time(up[5m])` returns the sample value (0 or 1), not its timestamp, so it cannot be used for this check. The `1h` lookback is an assumed default; widen it if targets may have been silent for longer.

## Phase 3: Reporting & Reconciliation
Produce a structured report. Perform a "Drift Analysis" by comparing Phase 1 (Docs) against Phase 2 (Live).

### Environment Report: {DATE}

#### 1. Infrastructure Summary
- **Nodes:** [Total Count] | [Role Distribution]
- **Status:** [Overall Health %]

#### 2. Service Health
| Service | Status | Key Metrics (CPU/Mem) | Anomalies |
| :--- | :--- | :--- | :--- |
| [Name] | [Up/Down] | [Data] | [Notes] |

#### 3. Alerts & Drift Analysis
- **Firing Alerts:** [List alerts from the `ALERTS` metric]
- **Unauthorized Services:** [Services found in Prometheus but NOT in documentation]
- **Missing Telemetry:** [Services defined in documentation but NOT found in Prometheus]

#### 4. Recommended Actions
Ranked by severity: **CRITICAL** → **WARNING** → **INFO**
- [Action 1]
- [Action 2]
<|im_end|>