diff --git a/nexus-mcp/RESILIENCE.md b/nexus-mcp/RESILIENCE.md new file mode 100644 index 0000000..e7411b6 --- /dev/null +++ b/nexus-mcp/RESILIENCE.md @@ -0,0 +1,450 @@ +# Enterprise System Resilience Feature + +## Overview + +This document describes the enterprise system resilience feature that resolves **CRITICAL #1** from the code health report: "No Resilience When Enterprise Systems Fail." + +**Problem:** Your HTTP clients crashed on any API failure. If Workday went down during a weekly drift audit, the ENTIRE audit failed—even though AD and Entra data were still accessible. + +**Solution:** Automatic retry logic with exponential backoff, circuit breaker pattern, and graceful degradation allow drift audits to continue with partial data when some systems are unavailable. + +--- + +## Features + +### 1. Automatic Retry Logic + +All HTTP clients automatically retry transient failures with exponential backoff: + +- **Max Attempts:** 3 (configurable) +- **Backoff Strategy:** 2s → 4s → 8s exponential delay +- **Retries On:** 5xx errors, timeouts, connection errors +- **Does NOT Retry:** 4xx errors (client errors like 404 are instant failures) + +**Example: Transient Failure** +``` +Attempt 1: 503 Service Unavailable → wait 2s +Attempt 2: 503 Service Unavailable → wait 4s +Attempt 3: Data returned → success ✓ +``` + +### 2. 
Circuit Breaker Pattern + +Prevents hammering a failing service by "opening the circuit": + +- **Threshold:** 5 consecutive failures trigger the circuit to open +- **Open State (60s):** Subsequent requests fail instantly with `CircuitBreakerOpenError` (no time wasted waiting on timeouts) +- **Half-Open State (testing):** After the 60s timeout, one test request is allowed +- **Closed State (recovery):** If the test request succeeds, the circuit closes and normal operation resumes; if it fails, the circuit reopens + +**Example: Sustained Failure** +``` +Requests 1-5: Each makes up to 3 attempts (network errors) +Request 6: Circuit is open, fails fast (no retry) +Request 7: Circuit still open, fails fast (<100ms) +After 60s: Circuit half-open, test request sent +Test success: Circuit closes, normal retries resume +``` + +### 3. Graceful Degradation in Audit Tools + +Audit tools wrap each system call separately, so if one system fails, the audit continues with the systems that are still available: + +**audit_user_drift() Example:** +```python +# Before: Any failure crashed the entire audit +# After: Wraps each system separately +try: + wd_data = await _get_wd().get("/staffing/v6/workers", ...) + systems_available.append("Workday") +except Exception as e: + systems_failed.append("Workday") + logger.warning(f"Workday unavailable: {e}") + +# Continue with AD and Entra even if Workday failed... +``` + +**Response Example:** +```json +{ + "email": "john.doe@wheels.com", + "systems_checked": ["Workday", "ActiveDirectory", "Entra"], + "systems_available": ["ActiveDirectory", "Entra"], + "systems_failed": ["Workday"], + "workday_found": false, + "ad_found": true, + "entra_found": true, + "discrepancy_count": 1, + "discrepancies": [ + { + "field": "job_title", + "system_a": "ActiveDirectory", + "value_a": "Senior Engineer", + "system_b": "Entra", + "value_b": "Engineer", + "severity": "medium" + } + ] +} +``` + +### 4. 
Proactive Health Monitoring + +**New Tool: `check_system_health()`** + +Pings all enterprise systems and returns availability + response times: + +```json +{ + "timestamp": "2026-04-13T14:30:00Z", + "systems": { + "Workday": {"available": true, "response_time_ms": 245}, + "ActiveDirectory": {"available": true, "response_time_ms": 150}, + "Entra": {"available": true, "response_time_ms": 320}, + "Lansweeper": {"available": false, "error": "TimeoutException..."}, + "Intune": {"available": true, "response_time_ms": 280}, + "Helix": {"available": true, "response_time_ms": 410} + }, + "summary": { + "total_systems": 6, + "available_systems": 5, + "unavailable_systems": 1, + "availability_percentage": 83 + } +} +``` + +**Use Case:** Run this before bulk audits to decide whether to proceed or wait. + +--- + +## Implementation Details + +### Modified Files + +| File | Change | +|------|--------| +| `pyproject.toml` | Added `tenacity>=8.2.0` dependency | +| `lib/resilience.py` | **NEW** — Retry decorator, circuit breaker, 404 handler | +| `lib/workday_client.py` | Applied `@resilient_http_call` to `get()`, `raas()` | +| `lib/entra_client.py` | Applied `@resilient_http_call` to `get()`, `get_all_pages()` | +| `lib/helix_client.py` | Applied `@resilient_http_call` to `get()`, `post()` | +| `lib/intune_client.py` | Applied `@resilient_http_call` to `get()` | +| `lib/lansweeper_client.py` | Applied `@resilient_http_call` to `gql()` | +| `lib/fedex_client.py` | Applied `@resilient_http_call` to `post()` | +| `src/shards/audit.py` | Graceful degradation in `audit_user_drift()`, `audit_device_drift()`, new `check_system_health()` tool | +| `tests/test_resilience.py` | **NEW** — 12 comprehensive unit tests | + +### Decorators + +#### @resilient_http_call + +Applies retry logic and circuit breaker to async HTTP functions: + +```python +from resilience import resilient_http_call + +@resilient_http_call(service_name="Workday", max_attempts=3) +async def get(self, path: str) -> dict: + 
 resp = await self._http.get(url) + resp.raise_for_status() + return resp.json() +``` + +**Parameters:** +- `service_name` (str): Service identifier for logging and circuit breaker tracking +- `max_attempts` (int): Maximum retry attempts (default: 3) +- `enable_circuit_breaker` (bool): Whether to use circuit breaker (default: True) + +#### @handle_404_gracefully + +Converts 404 errors to `None` instead of raising: + +```python +from resilience import handle_404_gracefully, resilient_http_call + +@handle_404_gracefully +@resilient_http_call(service_name="Entra") +async def get_user(self, user_id: str) -> dict | None: + resp = await self._http.get(f"/users/{user_id}") + resp.raise_for_status() + return resp.json() + +# `client` is a hypothetical instance of the class that defines get_user +result = await client.get_user("nonexistent-id") # Returns None instead of raising +``` + +--- + +## Testing + +### Run All Tests + +```bash +cd nexus-mcp +pytest tests/test_resilience.py -v +``` + +**Expected Output:** +``` +tests/test_resilience.py::TestCircuitBreaker::test_circuit_closed_to_open_after_threshold_failures PASSED +tests/test_resilience.py::TestCircuitBreaker::test_circuit_half_open_to_closed_on_success PASSED +tests/test_resilience.py::TestCircuitBreaker::test_circuit_half_open_to_open_on_failure PASSED +tests/test_resilience.py::TestCircuitBreaker::test_circuit_resets_on_success PASSED +tests/test_resilience.py::TestResilientHttpCall::test_retries_on_timeout_exception PASSED +tests/test_resilience.py::TestResilientHttpCall::test_retries_on_5xx_errors PASSED +tests/test_resilience.py::TestResilientHttpCall::test_no_retry_on_4xx_errors PASSED +tests/test_resilience.py::TestResilientHttpCall::test_exhausts_retries_and_raises PASSED +tests/test_resilience.py::TestHandle404Gracefully::test_converts_404_to_none PASSED +tests/test_resilience.py::TestHandle404Gracefully::test_does_not_convert_other_errors PASSED +tests/test_resilience.py::TestHandle404Gracefully::test_returns_normal_result_on_success PASSED 
+tests/test_resilience.py::TestCircuitBreakerIntegration::test_circuit_breaker_opens_after_failures PASSED + +======================== 12 passed in 12.40s ======================== +``` + +### Manual Testing + +#### Test 1: Graceful Degradation + +**Setup:** +1. Edit `.env` — temporarily invalidate one credential (e.g., `WORKDAY_CLIENT_ID=invalid`) +2. Ensure `USE_MOCK=false` (live mode) + +**Run:** +```bash +python src/main.py +# In MCP client: +audit_user_drift(email="test@example.com") +``` + +**Expected Result:** +```json +{ + "systems_available": ["ActiveDirectory", "Entra"], + "systems_failed": ["Workday"], + "discrepancy_count": 1 +} +``` + +**Verification:** +- ✅ No crash +- ✅ Audit continues with available systems +- ✅ Drift comparison runs for AD ↔ Entra + +#### Test 2: Circuit Breaker + +**Setup:** +1. Simulate sustained Workday outage (disable service or firewall block) +2. Credentials valid but service unreachable + +**Run:** +```bash +python src/main.py +# In MCP client: +audit_bulk_user_drift(emails=["user1@example.com", "user2@example.com", ..., "user10@example.com"]) +``` + +**Expected Logs:** +``` +[audit_user_drift] Workday: Attempt 1/3 (retry on transient error) +[audit_user_drift] Workday: Attempt 2/3 (retry on transient error) +[audit_user_drift] Workday: Attempt 3/3 (retry on transient error) +[resilience] [Workday] Circuit CLOSED → OPEN (5 consecutive failures) +[audit_user_drift] Workday: CircuitBreakerOpenError (fast-fail) +``` + +**Verification:** +- ✅ First 5 requests retry 3 times each +- ✅ Subsequent requests fail instantly (< 100ms) +- ✅ Logs show circuit state transitions + +#### Test 3: Retry on Transient Failure + +**Setup:** +1. Valid credentials +2. 
Introduce 1-second network delay (via proxy or `tc` on Linux) + +**Run:** +```bash +python src/main.py +# In MCP client: +audit_user_drift(email="test@example.com") +``` + +**Expected Result:** +- ✅ Tool succeeds (after retries) +- ✅ Response includes full drift data +- ✅ Logs show "Retry attempt 1/3", "Retry attempt 2/3" + +#### Test 4: Health Check + +**Run:** +```bash +python src/main.py +# In MCP client: +check_system_health() +``` + +**Expected Result:** +```json +{ + "summary": { + "total_systems": 6, + "available_systems": 6, + "availability_percentage": 100 + }, + "systems": { + "Workday": {"available": true, "response_time_ms": ...}, + ... + } +} +``` + +**Decision Logic:** +- If `availability_percentage >= 80`: Safe to run bulk audits +- If `availability_percentage < 80`: Postpone or expect partial results + +--- + +## Deployment + +### Prerequisites + +```bash +# Navigate to nexus-mcp +cd nexus-mcp + +# Install dependencies (including tenacity) +pip install -e . +``` + +### Verify Installation + +```bash +python -c "from resilience import resilient_http_call; print('✓ Installed')" +``` + +### Run in Production + +**With credential-based authentication:** +```bash +USE_MOCK=false python src/main.py +``` + +**With mock data (testing):** +```bash +USE_MOCK=true python src/main.py +``` + +### Monitoring + +Watch logs for: +- `[resilience]` messages — retry events, circuit breaker state changes +- `CircuitBreakerOpenError` — indicates sustained service outage +- Retry counts — indicates transient network issues + +**Example Alert Rules:** +- If `"CircuitBreakerOpenError found in logs"` → Investigate service +- If `"Retry attempt 2/3" repeated > 10 times in 5 minutes` → Network degradation +- If `"Circuit.*OPEN"` → Service outage (escalate to on-call) + +--- + +## Troubleshooting + +### Symptom: "CircuitBreakerOpenError: Workday circuit breaker is OPEN" + +**Cause:** 5 consecutive Workday failures within the monitoring window. + +**Solution:** +1. 
Check Workday status (https://status.workday.com) +2. Verify credentials in `.env` — test manually with `curl` or Postman +3. Check network connectivity — can you reach `api.myworkday.com`? +4. Wait 60 seconds for circuit to enter half-open state and test recovery +5. Monitor logs for `"Circuit HALF_OPEN → CLOSED"` indicating recovery + +### Symptom: Audit returns empty `systems_available` list + +**Cause:** All systems are down or credentials are invalid. + +**Solution:** +1. Run `check_system_health()` to identify which system is down +2. For downed systems: + - Check system status pages + - Verify network connectivity + - Wait for service to recover +3. For credential issues: + - Verify `.env` has valid credentials + - Test credentials manually via API (e.g., `curl` for Workday OAuth) + - Regenerate tokens/credentials if expired + +### Symptom: Slow response times even on successful requests + +**Observe:** Use `check_system_health()` to identify slow systems. + +**Solution:** +- If `response_time_ms > 5000`: System is under load, expect slower audits +- Network latency → Consider running audits during low-traffic windows +- Consider increasing timeouts if system is reliably slow but functional + +### Symptom: Excessively verbose retry logs + +**Cause:** Transient network issues causing multiple retries. 
+ +**Solution:** +- Expected during network instability +- Monitor for patterns (e.g., always fails at certain time) +- Use `check_system_health()` to confirm system is reachable +- If persistent, investigate network (firewall, ISP, proxy issues) + +--- + +## Configuration + +### Retry Policy + +**Currently Hard-Coded:** +- Max attempts: 3 +- Backoff: exponential (2s, 4s, 8s) + +**To Customize:** +Edit retry decorator in [lib/resilience.py](lib/resilience.py): + +```python +@resilient_http_call(service_name="Workday", max_attempts=5) # ← Change here +``` + +### Circuit Breaker Threshold + +**Currently Hard-Coded:** +- Failure threshold: 5 consecutive failures +- Timeout before half-open: 60 seconds + +**To Customize:** +Edit [lib/resilience.py](lib/resilience.py): + +```python +breaker = CircuitBreaker("Workday", failure_threshold=10, timeout_seconds=120) +``` + +--- + +## Future Enhancements + +1. **Configurable Retry Policy** — Move retry/backoff settings to `.env` or config file +2. **Metrics & Observability** — Track retry counts, circuit breaker events in audit logs +3. **Token Expiration Handling** — Cache token expiry times and refresh proactively (CRITICAL #2) +4. **PowerShell Command Injection Fix** — Use parameterized queries to prevent AD injection attacks (CRITICAL #3) +5. **Database Fallback** — Cache drift results locally for offline resilience +6. **Rate Limiting** — Implement exponential backoff to respect API rate limits + +--- + +## References + +- **Code Health Report:** `documentation/reports/code-health-report-2026-04-13.md` +- **Tenacity Docs:** https://tenacity.readthedocs.io/ +- **Feature Branch:** `feat/add-enterprise-resilience` +- **Commits:** + - `6337182` — Initial implementation + - `eb8b14b` — Fix retry logic and datetime deprecation
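
## Appendix: Circuit Breaker State Machine (Sketch)

The closed, open, and half-open transitions described under "Circuit Breaker Pattern" can be sketched roughly as follows. This is a minimal illustration and not the code in `lib/resilience.py`: the class name `SketchCircuitBreaker`, its methods (`before_call`, `record_success`, `record_failure`), and the injectable `clock` parameter are all assumptions made for the example. Only the `failure_threshold` and `timeout_seconds` names, the defaults of 5 failures and 60 seconds, and the `CircuitBreakerOpenError` name come from this document.

```python
import time
from enum import Enum


class CircuitState(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class CircuitBreakerOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class SketchCircuitBreaker:
    """Hypothetical stand-in for illustration; the real class may differ."""

    def __init__(self, name, failure_threshold=5, timeout_seconds=60.0,
                 clock=time.monotonic):
        self.name = name
        self.failure_threshold = failure_threshold
        self.timeout_seconds = timeout_seconds
        self._clock = clock          # injectable so tests need not sleep
        self._failures = 0           # consecutive failures seen so far
        self._opened_at = None       # clock reading when the circuit opened
        self.state = CircuitState.CLOSED

    def before_call(self):
        """Call before each request; raises if the circuit is still open."""
        if self.state is CircuitState.OPEN:
            if self._clock() - self._opened_at >= self.timeout_seconds:
                # Timeout elapsed: let a test request through.
                # (This sketch does not limit half-open to a single caller.)
                self.state = CircuitState.HALF_OPEN
            else:
                raise CircuitBreakerOpenError(
                    f"{self.name} circuit breaker is OPEN")

    def record_success(self):
        """A success resets the failure count and closes the circuit."""
        self._failures = 0
        self.state = CircuitState.CLOSED

    def record_failure(self):
        """Open on a failed test request or once the threshold is reached."""
        self._failures += 1
        if (self.state is CircuitState.HALF_OPEN
                or self._failures >= self.failure_threshold):
            self.state = CircuitState.OPEN
            self._opened_at = self._clock()
```

A caller would invoke `before_call()` ahead of each request and then report the outcome with `record_success()` or `record_failure()`. Passing a fake `clock` makes the 60-second window testable without real waiting, which is one plausible reason to inject it rather than call `time.monotonic()` directly.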