nexus-mcp/documentation/RESILIENCE.md

# Enterprise System Resilience Feature

## Overview

This document describes the enterprise system resilience feature that resolves **CRITICAL #1** from the code health report: "No Resilience When Enterprise Systems Fail."

**Problem:** The HTTP clients crashed on any API failure. If Workday went down during a weekly drift audit, the entire audit failed, even though AD and Entra data were still accessible.

**Solution:** Automatic retry logic with exponential backoff, circuit breaker pattern, and graceful degradation allow drift audits to continue with partial data when some systems are unavailable.

---

## Features

### 1. Automatic Retry Logic

All HTTP clients automatically retry transient failures with exponential backoff:

- **Max Attempts:** 3 (configurable)
- **Backoff Strategy:** Exponential delay that doubles from 2s (2s → 4s → 8s); with the default 3 attempts, only the 2s and 4s waits occur
- **Retries On:** 5xx errors, timeouts, connection errors
- **Does NOT Retry:** 4xx errors (client errors like 404 are instant failures)

**Example: Transient Failure**
```
Attempt 1: 503 Service Unavailable → wait 2s
Attempt 2: 503 Service Unavailable → wait 4s
Attempt 3: Data returned → success ✓
```
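The retry rules above can be sketched with the standard library alone. This is an illustrative sketch, not the project's implementation (which depends on `tenacity`): `retry_with_backoff`, `TransientError`, and `ClientError` are hypothetical names, and the injectable `sleep` exists only so the delays can be observed without waiting.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical stand-in for a 5xx response, timeout, or connection error."""


class ClientError(Exception):
    """Hypothetical stand-in for a 4xx response; never retried."""


async def retry_with_backoff(
    call: Callable[[], Awaitable[T]],
    max_attempts: int = 3,
    base_delay: float = 2.0,
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
) -> T:
    """Retry `call` on TransientError, doubling the wait after each failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            return await call()
        except TransientError:
            if attempt == max_attempts:
                raise  # attempts exhausted: surface the last error
            await sleep(base_delay * 2 ** (attempt - 1))  # 2s, then 4s, ...
    raise ValueError("max_attempts must be at least 1")
```

Because `ClientError` is not caught, a 4xx-style failure propagates on the first attempt without any delay.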

### 2. Circuit Breaker Pattern

Prevents hammering a failing service by "opening the circuit":

- **Threshold:** 5 consecutive failures triggers the circuit to open
- **Open State (60s):** Subsequent requests fail instantly with `CircuitBreakerOpenError`, so no time is wasted waiting on timeouts
- **Half-Open State (testing):** After the 60s timeout, one test request is allowed
- **Closed State (recovery):** If the test succeeds, the circuit closes and normal operation resumes; if it fails, the circuit reopens

**Example: Sustained Failure**
```
Requests 1-5:  Each retries 3 times (network errors)
Request 6:     Circuit is open, fails immediately (no retry)
Request 7:     Circuit still open, fails fast (<100ms)
After 60s:     Circuit half-open, test request sent
Test success:  Circuit closes, normal retries resume
```
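The state machine described above can be sketched as a small class. The names `CircuitBreaker`, `CircuitBreakerOpenError`, `failure_threshold`, and `timeout_seconds` appear in this document, but the body below is an assumption about how such a class might work, not a copy of `lib/resilience.py`; the `clock` parameter and the `before_call`/`record_*` methods are hypothetical.

```python
import time
from typing import Callable, Optional


class CircuitBreakerOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, name: str, failure_threshold: int = 5,
                 timeout_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.name = name
        self.failure_threshold = failure_threshold
        self.timeout_seconds = timeout_seconds
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None
        self._probe_in_flight = False

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "CLOSED"
        if self._clock() - self._opened_at >= self.timeout_seconds:
            return "HALF_OPEN"
        return "OPEN"

    def before_call(self) -> None:
        """Reject the call while open; let one probe through when half-open."""
        state = self.state
        if state == "OPEN" or (state == "HALF_OPEN" and self._probe_in_flight):
            raise CircuitBreakerOpenError(f"{self.name} circuit breaker is OPEN")
        if state == "HALF_OPEN":
            self._probe_in_flight = True

    def record_success(self) -> None:
        # Any success closes the circuit and resets the failure count.
        self._failures = 0
        self._opened_at = None
        self._probe_in_flight = False

    def record_failure(self) -> None:
        self._failures += 1
        self._probe_in_flight = False
        # Open at the threshold, or re-open after a failed half-open probe.
        if self._opened_at is not None or self._failures >= self.failure_threshold:
            self._opened_at = self._clock()
```

A caller would invoke `before_call()` ahead of each request and then `record_success()` or `record_failure()` depending on the outcome. Using a monotonic clock keeps the 60s timeout immune to wall-clock adjustments.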

### 3. Graceful Degradation in Audit Tools

Audit tools wrap each system call separately, so if one system fails, the audit continues with available systems:

**audit_user_drift() Example:**
```python
# Before: any failure crashed the entire audit.
# After: each system call is wrapped separately.
systems_available, systems_failed = [], []

try:
    wd_data = await _get_wd().get("/staffing/v6/workers", ...)
    systems_available.append("Workday")
except Exception as e:
    wd_data = None  # lets later code check for missing Workday data
    systems_failed.append("Workday")
    logger.warning("Workday unavailable: %s", e)

# Continue with AD and Entra even if Workday failed...
```

**Response Example:**
```json
{
  "email": "john.doe@wheels.com",
  "systems_checked": ["Workday", "ActiveDirectory", "Entra"],
  "systems_available": ["ActiveDirectory", "Entra"],
  "systems_failed": ["Workday"],
  "workday_found": false,
  "ad_found": true,
  "entra_found": true,
  "discrepancy_count": 1,
  "discrepancies": [
    {
      "field": "job_title",
      "system_a": "ActiveDirectory",
      "value_a": "Senior Engineer",
      "system_b": "Entra",
      "value_b": "Engineer",
      "severity": "medium"
    }
  ]
}
```
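The per-system isolation can also be expressed generically. The sketch below is a concurrent variant that uses `asyncio.gather(..., return_exceptions=True)` rather than the sequential `try`/`except` blocks shown above; `collect_from_systems` and the `fetchers` mapping are hypothetical names introduced for illustration.

```python
import asyncio
from typing import Any, Awaitable, Callable


async def collect_from_systems(
    fetchers: dict[str, Callable[[], Awaitable[Any]]],
) -> dict[str, Any]:
    """Query every system independently; one failure never aborts the others."""
    names = list(fetchers)
    results = await asyncio.gather(
        *(fetchers[name]() for name in names), return_exceptions=True
    )
    data: dict[str, Any] = {}
    failed: list[str] = []
    for name, result in zip(names, results):
        if isinstance(result, Exception):
            failed.append(name)  # record the outage and keep going
        else:
            data[name] = result
    return {
        "systems_checked": names,
        "systems_available": list(data),
        "systems_failed": failed,
        "data": data,
    }
```

The returned keys mirror `systems_checked`, `systems_available`, and `systems_failed` from the response example, so a drift comparison can run over whatever ended up in `data`.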

### 4. Proactive Health Monitoring

**New Tool: `check_system_health()`**

Pings all enterprise systems and returns availability and response times:

```json
{
  "timestamp": "2026-04-13T14:30:00Z",
  "systems": {
    "Workday": {"available": true, "response_time_ms": 245},
    "ActiveDirectory": {"available": true, "response_time_ms": 150},
    "Entra": {"available": true, "response_time_ms": 320},
    "Lansweeper": {"available": false, "error": "TimeoutException..."},
    "Intune": {"available": true, "response_time_ms": 280},
    "Helix": {"available": true, "response_time_ms": 410}
  },
  "summary": {
    "total_systems": 6,
    "available_systems": 5,
    "unavailable_systems": 1,
    "availability_percentage": 83
  }
}
```

**Use Case:** Run this before bulk audits to decide whether to proceed or wait.
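A health check of this shape could be assembled as follows. This is a sketch under assumptions: `probe`, `summarize`, and `check_health` are hypothetical helpers, each `ping` is any zero-argument coroutine function that raises on failure, and rounding the percentage to the nearest integer is inferred from the `83` in the sample output.

```python
import asyncio
import time
from typing import Any, Awaitable, Callable


async def probe(ping: Callable[[], Awaitable[Any]]) -> dict[str, Any]:
    """Time one ping and report availability in the shape shown above."""
    start = time.perf_counter()
    try:
        await ping()
    except Exception as exc:
        return {"available": False, "error": f"{type(exc).__name__}: {exc}"}
    elapsed_ms = round((time.perf_counter() - start) * 1000)
    return {"available": True, "response_time_ms": elapsed_ms}


def summarize(systems: dict[str, dict[str, Any]]) -> dict[str, int]:
    """Build the summary block from the per-system results."""
    total = len(systems)
    available = sum(1 for status in systems.values() if status["available"])
    return {
        "total_systems": total,
        "available_systems": available,
        "unavailable_systems": total - available,
        "availability_percentage": round(100 * available / total) if total else 0,
    }


async def check_health(
    pings: dict[str, Callable[[], Awaitable[Any]]],
) -> dict[str, Any]:
    """Probe all systems concurrently and attach the summary."""
    names = list(pings)
    results = await asyncio.gather(*(probe(pings[name]) for name in names))
    systems = dict(zip(names, results))
    return {"systems": systems, "summary": summarize(systems)}
```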

---

## Implementation Details

### Modified Files

| File | Change |
|------|--------|
| `pyproject.toml` | Added `tenacity>=8.2.0` dependency |
| `lib/resilience.py` | **NEW** — Retry decorator, circuit breaker, 404 handler |
| `lib/workday_client.py` | Applied `@resilient_http_call` to `get()`, `raas()` |
| `lib/entra_client.py` | Applied `@resilient_http_call` to `get()`, `get_all_pages()` |
| `lib/helix_client.py` | Applied `@resilient_http_call` to `get()`, `post()` |
| `lib/intune_client.py` | Applied `@resilient_http_call` to `get()` |
| `lib/lansweeper_client.py` | Applied `@resilient_http_call` to `gql()` |
| `lib/fedex_client.py` | Applied `@resilient_http_call` to `post()` |
| `src/shards/audit.py` | Graceful degradation in `audit_user_drift()`, `audit_device_drift()`, new `check_system_health()` tool |
| `tests/test_resilience.py` | **NEW** — 12 comprehensive unit tests |

### Decorators

#### @resilient_http_call

Applies retry logic and circuit breaker to async HTTP functions:

```python
from resilience import resilient_http_call

@resilient_http_call(service_name="Workday", max_attempts=3)
async def get(self, path: str) -> dict:
    resp = await self._http.get(path)
    resp.raise_for_status()
    return resp.json()
```

**Parameters:**
- `service_name` (str): Service identifier for logging and circuit breaker tracking
- `max_attempts` (int): Maximum number of attempts, including the first call (default: 3)
- `enable_circuit_breaker` (bool): Whether to use circuit breaker (default: True)

#### @handle_404_gracefully

Converts 404 errors to `None` instead of raising:

```python
from resilience import handle_404_gracefully

# Defined as a method on the client class, so `self._http` is available
@handle_404_gracefully
@resilient_http_call(service_name="Entra")
async def get_user(self, user_id: str) -> dict | None:
    resp = await self._http.get(f"/users/{user_id}")
    resp.raise_for_status()
    return resp.json()

# `client` is an instance of that class
result = await client.get_user("nonexistent-id")  # Returns None instead of raising
```
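For readers curious how a decorator like this can work, here is a minimal sketch. It is not the code in `lib/resilience.py`: `convert_404_to_none` and `FakeHTTPStatusError` are hypothetical stand-ins, and a real version would catch the HTTP client's own status exception instead.

```python
import asyncio
import functools
from typing import Any, Awaitable, Callable


class FakeHTTPStatusError(Exception):
    """Hypothetical stand-in for an HTTP client's status error."""

    def __init__(self, status_code: int) -> None:
        super().__init__(f"HTTP {status_code}")
        self.status_code = status_code


def convert_404_to_none(
    func: Callable[..., Awaitable[Any]],
) -> Callable[..., Awaitable[Any]]:
    """Return None for 404 responses; re-raise every other error."""

    @functools.wraps(func)
    async def wrapper(*args: Any, **kwargs: Any) -> Any:
        try:
            return await func(*args, **kwargs)
        except FakeHTTPStatusError as exc:
            if exc.status_code == 404:
                return None  # "not found" is an answer, not a failure
            raise

    return wrapper
```

Placing this decorator outermost, as in the example above, means the retry layer sees the 404 first (and does not retry it, since it is a 4xx error) before it is converted to `None`.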

---

## Testing

### Run All Tests

```bash
cd nexus-mcp
pytest tests/test_resilience.py -v
```

**Expected Output:**
```
tests/test_resilience.py::TestCircuitBreaker::test_circuit_closed_to_open_after_threshold_failures PASSED
tests/test_resilience.py::TestCircuitBreaker::test_circuit_half_open_to_closed_on_success PASSED
tests/test_resilience.py::TestCircuitBreaker::test_circuit_half_open_to_open_on_failure PASSED
tests/test_resilience.py::TestCircuitBreaker::test_circuit_resets_on_success PASSED
tests/test_resilience.py::TestResilientHttpCall::test_retries_on_timeout_exception PASSED
tests/test_resilience.py::TestResilientHttpCall::test_retries_on_5xx_errors PASSED
tests/test_resilience.py::TestResilientHttpCall::test_no_retry_on_4xx_errors PASSED
tests/test_resilience.py::TestResilientHttpCall::test_exhausts_retries_and_raises PASSED
tests/test_resilience.py::TestHandle404Gracefully::test_converts_404_to_none PASSED
tests/test_resilience.py::TestHandle404Gracefully::test_does_not_convert_other_errors PASSED
tests/test_resilience.py::TestHandle404Gracefully::test_returns_normal_result_on_success PASSED
tests/test_resilience.py::TestCircuitBreakerIntegration::test_circuit_breaker_opens_after_failures PASSED

======================== 12 passed in 12.40s ========================
```

### Manual Testing

#### Test 1: Graceful Degradation

**Setup:**
1. Edit `.env` — temporarily invalidate one credential (e.g., `WORKDAY_CLIENT_ID=invalid`)
2. Ensure `USE_MOCK=false` (live mode)

**Run:**
```bash
python src/main.py
# In MCP client:
audit_user_drift(email="test@example.com")
```

**Expected Result:**
```json
{
  "systems_available": ["ActiveDirectory", "Entra"],
  "systems_failed": ["Workday"],
  "discrepancy_count": 1
}
```

**Verification:**
- ✅ No crash
- ✅ Audit continues with available systems
- ✅ Drift comparison runs for AD ↔ Entra

#### Test 2: Circuit Breaker

**Setup:**
1. Simulate sustained Workday outage (disable service or firewall block)
2. Credentials valid but service unreachable

**Run:**
```bash
python src/main.py
# In MCP client:
audit_bulk_user_drift(emails=["user1@example.com", "user2@example.com", ..., "user10@example.com"])
```

**Expected Logs:**
```
[audit_user_drift] Workday: Attempt 1/3 (retry on transient error)
[audit_user_drift] Workday: Attempt 2/3 (retry on transient error)
[audit_user_drift] Workday: Attempt 3/3 (retry on transient error)
[resilience] [Workday] Circuit CLOSED → OPEN (5 consecutive failures)
[audit_user_drift] Workday: CircuitBreakerOpenError (fast-fail)
```

**Verification:**
- ✅ First 5 requests retry 3 times each
- ✅ Subsequent requests fail instantly (< 100ms)
- ✅ Logs show circuit state transitions

#### Test 3: Retry on Transient Failure

**Setup:**
1. Valid credentials
2. Introduce 1-second network delay (via proxy or `tc` on Linux)

**Run:**
```bash
python src/main.py
# In MCP client:
audit_user_drift(email="test@example.com")
```

**Expected Result:**
- ✅ Tool succeeds (after retries)
- ✅ Response includes full drift data
- ✅ Logs show "Retry attempt 1/3", "Retry attempt 2/3"

#### Test 4: Health Check

**Run:**
```bash
python src/main.py
# In MCP client:
check_system_health()
```

**Expected Result:**
```json
{
  "summary": {
    "total_systems": 6,
    "available_systems": 6,
    "availability_percentage": 100
  },
  "systems": {
    "Workday": {"available": true, "response_time_ms": ...},
    ...
  }
}
```

**Decision Logic:**
- If `availability_percentage >= 80`: Safe to run bulk audits
- If `availability_percentage < 80`: Postpone or expect partial results
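The decision rule can be applied directly to the health check payload. `bulk_audit_decision` is a hypothetical helper, and the 80 percent threshold is taken from the list above:

```python
def bulk_audit_decision(health: dict, threshold: int = 80) -> str:
    """Return "proceed" when enough systems are up, otherwise "postpone"."""
    percentage = health["summary"]["availability_percentage"]
    return "proceed" if percentage >= threshold else "postpone"
```

With the sample payload from the health monitoring section (5 of 6 systems available, 83 percent), this returns `"proceed"`, although the audit would still report Lansweeper under `systems_failed`.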

---

## Deployment

### Prerequisites

```bash
# Navigate to nexus-mcp
cd nexus-mcp

# Install dependencies (including tenacity)
pip install -e .
```

### Verify Installation

```bash
python -c "from resilience import resilient_http_call; print('✓ Installed')"
```

### Run in Production

**With credential-based authentication:**
```bash
USE_MOCK=false python src/main.py
```

**With mock data (testing):**
```bash
USE_MOCK=true python src/main.py
```

### Monitoring

Watch logs for:
- `[resilience]` messages — retry events, circuit breaker state changes
- `CircuitBreakerOpenError` — indicates sustained service outage
- Retry counts — indicates transient network issues

**Example Alert Rules:**
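- If `CircuitBreakerOpenError` appears in the logs → Investigate the service
- If `Retry attempt 2/3` appears more than 10 times in 5 minutes → Suspect network degradation
- If a log line matches `Circuit.*→ OPEN` → Service outage (escalate to on-call); including the arrow keeps recovery lines such as `Circuit HALF_OPEN → CLOSED` from matching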
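These rules could be prototyped over parsed log entries before wiring them into a real alerting system. The sketch assumes entries arrive as `(timestamp, message)` tuples and that messages look like the samples in this document; `evaluate_alerts` and the alert names are hypothetical. The circuit pattern requires `→ OPEN` so that a recovery line such as `Circuit HALF_OPEN → CLOSED` does not trigger an outage alert.

```python
import re
from datetime import datetime, timedelta

# Matches transitions into OPEN, e.g. "Circuit CLOSED → OPEN (...)".
CIRCUIT_OPENED = re.compile(r"Circuit .*→ OPEN\b")
RETRY_MARKER = "Retry attempt 2/3"
WINDOW = timedelta(minutes=5)


def evaluate_alerts(entries: list[tuple[datetime, str]]) -> set[str]:
    """Return the names of alert rules triggered by the given log entries."""
    alerts: set[str] = set()
    retry_times: list[datetime] = []
    for ts, message in sorted(entries):
        if "CircuitBreakerOpenError" in message:
            alerts.add("investigate_service")
        if CIRCUIT_OPENED.search(message):
            alerts.add("service_outage")
        if RETRY_MARKER in message:
            retry_times.append(ts)
            # Keep only retries inside the trailing 5-minute window.
            retry_times = [t for t in retry_times if ts - t <= WINDOW]
            if len(retry_times) > 10:
                alerts.add("network_degradation")
    return alerts
```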

---

## Troubleshooting

### Symptom: "CircuitBreakerOpenError: Workday circuit breaker is OPEN"

**Cause:** 5 consecutive Workday failures opened the circuit breaker.

**Solution:**
1. Check Workday status (https://status.workday.com)
2. Verify credentials in `.env` — test manually with `curl` or Postman
3. Check network connectivity — can you reach `api.myworkday.com`?
4. Wait 60 seconds for circuit to enter half-open state and test recovery
5. Monitor logs for `"Circuit HALF_OPEN → CLOSED"` indicating recovery

### Symptom: Audit returns empty `systems_available` list

**Cause:** All systems are down or credentials are invalid.

**Solution:**
1. Run `check_system_health()` to identify which system is down
2. For downed systems:
   - Check system status pages
   - Verify network connectivity
   - Wait for service to recover
3. For credential issues:
   - Verify `.env` has valid credentials
   - Test credentials manually via API (e.g., `curl` for Workday OAuth)
   - Regenerate tokens/credentials if expired

### Symptom: Slow response times even on successful requests

**Observe:** Use `check_system_health()` to identify slow systems.

**Solution:**
- If `response_time_ms > 5000`: the system is under load; expect slower audits
- If network latency is the cause: consider running audits during low-traffic windows
- If a system is reliably slow but functional: consider increasing timeouts
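Slow systems can be picked out of the health check payload with a small filter. `slow_systems` is a hypothetical helper; the 5000 ms threshold comes from the list above, and unavailable systems are skipped because they report no response time.

```python
def slow_systems(health: dict, threshold_ms: int = 5000) -> list[str]:
    """List available systems whose response time exceeds the threshold."""
    return [
        name
        for name, status in health["systems"].items()
        if status.get("available")
        and status.get("response_time_ms", 0) > threshold_ms
    ]
```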

### Symptom: Excessively verbose retry logs

**Cause:** Transient network issues causing multiple retries.

**Solution:**
- Expected during network instability
- Monitor for patterns (e.g., always fails at certain time)
- Use `check_system_health()` to confirm system is reachable
- If persistent, investigate network (firewall, ISP, proxy issues)

---

## Configuration

### Retry Policy

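**Defaults (set in code, not in `.env`):**
- Max attempts: 3
- Backoff: exponential (2s, 4s, 8s)

**To Customize:**
Pass `max_attempts` where the decorator is applied in a client module (e.g., `lib/workday_client.py`); the retry decorator itself lives in [lib/resilience.py](lib/resilience.py):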

```python
@resilient_http_call(service_name="Workday", max_attempts=5)  # ← Change here
```

### Circuit Breaker Threshold

**Currently Hard-Coded:**
- Failure threshold: 5 consecutive failures
- Timeout before half-open: 60 seconds

**To Customize:**
Edit [lib/resilience.py](lib/resilience.py):

```python
breaker = CircuitBreaker("Workday", failure_threshold=10, timeout_seconds=120)
```

---

## Future Enhancements

1. **Configurable Retry Policy** — Move retry/backoff settings to `.env` or config file
2. **Metrics & Observability** — Track retry counts, circuit breaker events in audit logs
3. **Token Expiration Handling** — Cache token expiry times and refresh proactively (CRITICAL #2)
4. **PowerShell Command Injection Fix** — Use parameterized queries to prevent AD injection attacks (CRITICAL #3)
5. **Database Fallback** — Cache drift results locally for offline resilience
6. **Rate Limiting** — Implement exponential backoff to respect API rate limits
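As one possible shape for enhancement 1, settings could be read from the environment with the current values as defaults. Nothing below exists in the codebase today: `ResilienceSettings`, `load_settings`, and the `RESILIENCE_*` variable names are all hypothetical.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class ResilienceSettings:
    max_attempts: int = 3        # current retry default
    failure_threshold: int = 5   # current circuit breaker threshold
    timeout_seconds: int = 60    # current wait before half-open


def load_settings(env: Mapping[str, str] = os.environ) -> ResilienceSettings:
    """Read hypothetical RESILIENCE_* variables, falling back to the defaults."""
    defaults = ResilienceSettings()
    return ResilienceSettings(
        max_attempts=int(env.get("RESILIENCE_MAX_ATTEMPTS", defaults.max_attempts)),
        failure_threshold=int(
            env.get("RESILIENCE_FAILURE_THRESHOLD", defaults.failure_threshold)
        ),
        timeout_seconds=int(
            env.get("RESILIENCE_TIMEOUT_SECONDS", defaults.timeout_seconds)
        ),
    )
```

Passing the mapping as a parameter keeps the loader testable without touching the real process environment.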

---

## References

- **Code Health Report:** `documentation/reports/code-health-report-2026-04-13.md`
- **Tenacity Docs:** https://tenacity.readthedocs.io/
- **Feature Branch:** `feat/add-enterprise-resilience`
- **Commits:**
  - `6337182` — Initial implementation
  - `eb8b14b` — Fix retry logic and datetime deprecation