nexus-mcp/documentation/RESILIENCE.md
nathan e6e4810e47 feat(docs): update tool inventory and add resilience documentation
- Updated Nexus MCP Tool Inventory with new NEXUS references and improved tool descriptions.
- Added comprehensive README.md for Nexus MCP, detailing architecture, folder structure, and tool references.
- Introduced RESILIENCE.md to document the new enterprise system resilience features, including automatic retry logic and circuit breaker patterns.
- Created TEST_VALIDATION_REPORT.md summarizing test results and server capabilities post-rebuild.
- Established a canonical work item register (nexus-work-item-register.md) to track NEXUS-XXX work items and their statuses.
- Updated scripts to reflect changes in work item references from WIS to NEXUS.
2026-04-14 14:53:02 -04:00

# Enterprise System Resilience Feature
## Overview
This document describes the enterprise system resilience feature that resolves **CRITICAL #1** from the code health report: "No Resilience When Enterprise Systems Fail."
**Problem:** The HTTP clients crashed on any API failure. If Workday went down during a weekly drift audit, the entire audit failed, even though AD and Entra data were still accessible.
**Solution:** Automatic retry logic with exponential backoff, circuit breaker pattern, and graceful degradation allow drift audits to continue with partial data when some systems are unavailable.
---
## Features
### 1. Automatic Retry Logic
All HTTP clients automatically retry transient failures with exponential backoff:
- **Max Attempts:** 3 (configurable)
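- **Backoff Strategy:** Exponential delay that doubles after each failed attempt (2s → 4s → 8s); with the default of 3 attempts, only the 2s and 4s waits occur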
- **Retries On:** 5xx errors, timeouts, connection errors
- **Does NOT Retry:** 4xx errors (client errors like 404 are instant failures)
**Example: Transient Failure**
```
Attempt 1: 503 Service Unavailable → wait 2s
Attempt 2: 503 Service Unavailable → wait 4s
Attempt 3: Data returned → success ✓
```
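The retry sequence above can be illustrated with a small, stdlib-only sketch. This is not the code in `lib/resilience.py` (which, per the dependency list, relies on `tenacity`); `retry_with_backoff`, `TransientError`, and the injectable `sleep` parameter are hypothetical names used only for illustration.

```python
import asyncio


class TransientError(Exception):
    """Stand-in for a 5xx response, timeout, or connection error (hypothetical)."""


async def retry_with_backoff(call, max_attempts=3, base_delay=2.0, sleep=asyncio.sleep):
    """Call `call()` up to max_attempts times, doubling the wait after each failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            return await call()
        except TransientError:
            if attempt == max_attempts:
                raise  # retries exhausted: surface the last error
            await sleep(base_delay * 2 ** (attempt - 1))  # 2s, then 4s, ...


async def demo():
    """Simulate two 503 responses followed by a success, recording the waits."""
    waits = []
    responses = iter([TransientError("503"), TransientError("503"), {"ok": True}])

    async def flaky():
        item = next(responses)
        if isinstance(item, Exception):
            raise item
        return item

    async def fake_sleep(seconds):
        waits.append(seconds)  # record the delay instead of actually sleeping

    result = await retry_with_backoff(flaky, sleep=fake_sleep)
    return result, waits
```

Running `asyncio.run(demo())` returns the successful payload together with the two recorded waits (2 and 4 seconds), mirroring the three-attempt example above.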
### 2. Circuit Breaker Pattern
Prevents hammering a failing service by "opening the circuit":
- **Threshold:** 5 consecutive failures trigger the circuit to open
- **Open State (60s):** Subsequent requests fail instantly with `CircuitBreakerOpenError` (no time wasted waiting for a timeout)
- **Half-Open State (testing):** After the 60s timeout, one test request is allowed
- **Closed State (recovery):** If the test request succeeds, the circuit closes and normal operation resumes; if it fails, the circuit reopens
**Example: Sustained Failure**
```
Requests 1-5: Each retries 3 times (network errors), then the circuit opens
Request 6: Circuit open, fails immediately (no retry)
Request 7: Circuit still open, fails fast (<100ms)
After 60s: Circuit half-open, test request sent
Test success: Circuit closes, normal retries resume
```
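The state transitions above can be sketched as a small state machine. The constructor arguments match the `CircuitBreaker(...)` call shown in the Configuration section; the method names (`before_call`, `record_success`, `record_failure`) and the injectable `clock` are assumptions made for this illustration, not the actual interface of `lib/resilience.py`. The sketch is also not concurrency-safe.

```python
import time


class CircuitBreakerOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, name, failure_threshold=5, timeout_seconds=60, clock=time.monotonic):
        self.name = name
        self.failure_threshold = failure_threshold
        self.timeout_seconds = timeout_seconds
        self._clock = clock  # injectable so tests do not need to wait 60s
        self.state = "CLOSED"
        self._failures = 0
        self._opened_at = 0.0

    def before_call(self):
        """Reject the call while open; allow a probe once the timeout has elapsed."""
        if self.state == "OPEN":
            if self._clock() - self._opened_at < self.timeout_seconds:
                raise CircuitBreakerOpenError(f"{self.name} circuit breaker is OPEN")
            self.state = "HALF_OPEN"

    def record_success(self):
        """Any success resets the failure count and closes the circuit."""
        self._failures = 0
        self.state = "CLOSED"

    def record_failure(self):
        """Open on a failed probe, or once the consecutive-failure threshold is hit."""
        self._failures += 1
        if self.state == "HALF_OPEN" or self._failures >= self.failure_threshold:
            self.state = "OPEN"
            self._opened_at = self._clock()
```

A retry wrapper would call `before_call()` ahead of each request and then `record_success()` or `record_failure()` depending on the outcome.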
### 3. Graceful Degradation in Audit Tools
Audit tools wrap each system call separately, so if one system fails, the audit continues with available systems:
**audit_user_drift() Example:**
```python
# Before: Any failure crashed the entire audit
# After: Wraps each system separately
try:
    wd_data = await _get_wd().get("/staffing/v6/workers", ...)
    systems_available.append("Workday")
except Exception as e:
    systems_failed.append("Workday")
    logger.warning(f"Workday unavailable: {e}")
# Continue with AD and Entra even if Workday failed...
```
**Response Example:**
```json
{
  "email": "john.doe@wheels.com",
  "systems_checked": ["Workday", "ActiveDirectory", "Entra"],
  "systems_available": ["ActiveDirectory", "Entra"],
  "systems_failed": ["Workday"],
  "workday_found": false,
  "ad_found": true,
  "entra_found": true,
  "discrepancy_count": 1,
  "discrepancies": [
    {
      "field": "job_title",
      "system_a": "ActiveDirectory",
      "value_a": "Senior Engineer",
      "system_b": "Entra",
      "value_b": "Engineer",
      "severity": "medium"
    }
  ]
}
```
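The per-system isolation shown above can also be expressed generically. The following is a sketch only: it swaps the sequential `try`/`except` blocks for `asyncio.gather(..., return_exceptions=True)`, and `collect_from_systems` and the fetcher functions in `demo` are hypothetical names, not part of `src/shards/audit.py`.

```python
import asyncio
import logging

logger = logging.getLogger("audit")


async def collect_from_systems(fetchers):
    """Query each system independently; one failure never aborts the others.

    `fetchers` maps a system name to a zero-argument coroutine function.
    """
    names = list(fetchers)
    # return_exceptions=True turns a raised error into an ordinary result value
    results = await asyncio.gather(*(fetchers[n]() for n in names), return_exceptions=True)
    report = {"systems_checked": names, "systems_available": [], "systems_failed": [], "data": {}}
    for name, result in zip(names, results):
        if isinstance(result, Exception):
            report["systems_failed"].append(name)
            logger.warning("%s unavailable: %s", name, result)
        else:
            report["systems_available"].append(name)
            report["data"][name] = result
    return report


async def demo():
    """Workday fails; ActiveDirectory and Entra still return data."""

    async def workday():
        raise ConnectionError("Workday is down")

    async def active_directory():
        return {"job_title": "Senior Engineer"}

    async def entra():
        return {"job_title": "Engineer"}

    return await collect_from_systems(
        {"Workday": workday, "ActiveDirectory": active_directory, "Entra": entra}
    )
```

With this shape, the drift comparison can run over whatever ends up in `report["data"]`, which is how the response above still reports an AD ↔ Entra discrepancy while Workday is listed under `systems_failed`.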
### 4. Proactive Health Monitoring
**New Tool: `check_system_health()`**
Pings all enterprise systems and returns availability + response times:
```json
{
  "timestamp": "2026-04-13T14:30:00Z",
  "systems": {
    "Workday": {"available": true, "response_time_ms": 245},
    "ActiveDirectory": {"available": true, "response_time_ms": 150},
    "Entra": {"available": true, "response_time_ms": 320},
    "Lansweeper": {"available": false, "error": "TimeoutException..."},
    "Intune": {"available": true, "response_time_ms": 280},
    "Helix": {"available": true, "response_time_ms": 410}
  },
  "summary": {
    "total_systems": 6,
    "available_systems": 5,
    "unavailable_systems": 1,
    "availability_percentage": 83
  }
}
```
**Use Case:** Run this before bulk audits to decide whether to proceed or wait.
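As an illustration of how the `summary` block relates to the per-system results, here is a minimal sketch. `summarize_health` is a hypothetical helper, and truncating the percentage to an integer is an assumption; the actual tool may round differently.

```python
def summarize_health(systems):
    """Build the `summary` block from per-system availability results."""
    total = len(systems)
    available = sum(1 for status in systems.values() if status.get("available"))
    # Assumption: the percentage is truncated to an integer (5 of 6 gives 83)
    percentage = int(100 * available / total) if total else 0
    return {
        "total_systems": total,
        "available_systems": available,
        "unavailable_systems": total - available,
        "availability_percentage": percentage,
    }


# Per-system results taken from the example response above
systems = {
    "Workday": {"available": True, "response_time_ms": 245},
    "ActiveDirectory": {"available": True, "response_time_ms": 150},
    "Entra": {"available": True, "response_time_ms": 320},
    "Lansweeper": {"available": False, "error": "TimeoutException..."},
    "Intune": {"available": True, "response_time_ms": 280},
    "Helix": {"available": True, "response_time_ms": 410},
}
summary = summarize_health(systems)
```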
---
## Implementation Details
### Modified Files
| File | Change |
|------|--------|
| `pyproject.toml` | Added `tenacity>=8.2.0` dependency |
| `lib/resilience.py` | **NEW** — Retry decorator, circuit breaker, 404 handler |
| `lib/workday_client.py` | Applied `@resilient_http_call` to `get()`, `raas()` |
| `lib/entra_client.py` | Applied `@resilient_http_call` to `get()`, `get_all_pages()` |
| `lib/helix_client.py` | Applied `@resilient_http_call` to `get()`, `post()` |
| `lib/intune_client.py` | Applied `@resilient_http_call` to `get()` |
| `lib/lansweeper_client.py` | Applied `@resilient_http_call` to `gql()` |
| `lib/fedex_client.py` | Applied `@resilient_http_call` to `post()` |
| `src/shards/audit.py` | Graceful degradation in `audit_user_drift()`, `audit_device_drift()`, new `check_system_health()` tool |
| `tests/test_resilience.py` | **NEW** — 12 comprehensive unit tests |
### Decorators
#### @resilient_http_call
Applies retry logic and circuit breaker to async HTTP functions:
```python
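from resilience import resilient_http_call

@resilient_http_call(service_name="Workday", max_attempts=3)
async def get(self, path: str) -> dict:
    # Method of a client class; self._http is the underlying async HTTP client
    resp = await self._http.get(path)
    resp.raise_for_status()  # 5xx responses raise here and trigger a retry
    return resp.json()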
```
**Parameters:**
- `service_name` (str): Service identifier for logging and circuit breaker tracking
- `max_attempts` (int): Maximum retry attempts (default: 3)
- `enable_circuit_breaker` (bool): Whether to use circuit breaker (default: True)
#### @handle_404_gracefully
Converts 404 errors to `None` instead of raising:
```python
from resilience import handle_404_gracefully, resilient_http_call

@handle_404_gracefully
@resilient_http_call(service_name="Entra")
async def get_user(self, user_id: str) -> dict | None:
    resp = await self._http.get(f"/users/{user_id}")
    resp.raise_for_status()
    return resp.json()

# `client` is an instance of the class that defines get_user
result = await client.get_user("nonexistent-id")  # Returns None instead of raising
```
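For readers curious how such a decorator might work, here is a self-contained sketch. `HTTPStatusError` below is a local stand-in for the HTTP client's status exception, and the decorator body is an assumption about the approach, not a copy of `lib/resilience.py`.

```python
import asyncio
import functools


class HTTPStatusError(Exception):
    """Stand-in for an HTTP client's status error (hypothetical, for illustration)."""

    def __init__(self, status_code):
        super().__init__(f"HTTP {status_code}")
        self.status_code = status_code


def handle_404_gracefully(func):
    """Return None when the wrapped coroutine raises a 404; re-raise everything else."""

    @functools.wraps(func)
    async def wrapper(*args, **kwargs):
        try:
            return await func(*args, **kwargs)
        except HTTPStatusError as exc:
            if exc.status_code == 404:
                return None  # "not found" is an expected outcome, not a failure
            raise

    return wrapper


@handle_404_gracefully
async def get_user(user_id):
    """Fake lookup: one ID is missing, one triggers a server error."""
    if user_id == "nonexistent-id":
        raise HTTPStatusError(404)
    if user_id == "broken":
        raise HTTPStatusError(500)
    return {"id": user_id}
```

Placing `@handle_404_gracefully` outermost, as in the example above, means the 404 first passes through the retry layer (which does not retry 4xx errors) and is only then converted to `None`.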
---
## Testing
### Run All Tests
```bash
cd nexus-mcp
pytest tests/test_resilience.py -v
```
**Expected Output:**
```
tests/test_resilience.py::TestCircuitBreaker::test_circuit_closed_to_open_after_threshold_failures PASSED
tests/test_resilience.py::TestCircuitBreaker::test_circuit_half_open_to_closed_on_success PASSED
tests/test_resilience.py::TestCircuitBreaker::test_circuit_half_open_to_open_on_failure PASSED
tests/test_resilience.py::TestCircuitBreaker::test_circuit_resets_on_success PASSED
tests/test_resilience.py::TestResilientHttpCall::test_retries_on_timeout_exception PASSED
tests/test_resilience.py::TestResilientHttpCall::test_retries_on_5xx_errors PASSED
tests/test_resilience.py::TestResilientHttpCall::test_no_retry_on_4xx_errors PASSED
tests/test_resilience.py::TestResilientHttpCall::test_exhausts_retries_and_raises PASSED
tests/test_resilience.py::TestHandle404Gracefully::test_converts_404_to_none PASSED
tests/test_resilience.py::TestHandle404Gracefully::test_does_not_convert_other_errors PASSED
tests/test_resilience.py::TestHandle404Gracefully::test_returns_normal_result_on_success PASSED
tests/test_resilience.py::TestCircuitBreakerIntegration::test_circuit_breaker_opens_after_failures PASSED
======================== 12 passed in 12.40s ========================
```
### Manual Testing
#### Test 1: Graceful Degradation
**Setup:**
1. Edit `.env` — temporarily invalidate one credential (e.g., `WORKDAY_CLIENT_ID=invalid`)
2. Ensure `USE_MOCK=false` (live mode)
**Run:**
```bash
python src/main.py
# In MCP client:
audit_user_drift(email="test@example.com")
```
**Expected Result:**
```json
{
  "systems_available": ["ActiveDirectory", "Entra"],
  "systems_failed": ["Workday"],
  "discrepancy_count": 1
}
```
**Verification:**
- ✅ No crash
- ✅ Audit continues with available systems
- ✅ Drift comparison runs for AD ↔ Entra
#### Test 2: Circuit Breaker
**Setup:**
1. Simulate sustained Workday outage (disable service or firewall block)
2. Credentials valid but service unreachable
**Run:**
```bash
python src/main.py
# In MCP client:
audit_bulk_user_drift(emails=["user1@example.com", "user2@example.com", ..., "user10@example.com"])
```
**Expected Logs:**
```
[audit_user_drift] Workday: Attempt 1/3 (retry on transient error)
[audit_user_drift] Workday: Attempt 2/3 (retry on transient error)
[audit_user_drift] Workday: Attempt 3/3 (retry on transient error)
[resilience] [Workday] Circuit CLOSED → OPEN (5 consecutive failures)
[audit_user_drift] Workday: CircuitBreakerOpenError (fast-fail)
```
**Verification:**
- ✅ First 5 requests retry 3 times each
- ✅ Subsequent requests fail instantly (< 100ms)
- ✅ Logs show circuit state transitions
#### Test 3: Retry on Transient Failure
**Setup:**
1. Valid credentials
2. Introduce 1-second network delay (via proxy or `tc` on Linux)
**Run:**
```bash
python src/main.py
# In MCP client:
audit_user_drift(email="test@example.com")
```
**Expected Result:**
- Tool succeeds (after retries)
- Response includes full drift data
- Logs show "Retry attempt 1/3", "Retry attempt 2/3"
#### Test 4: Health Check
**Run:**
```bash
python src/main.py
# In MCP client:
check_system_health()
```
**Expected Result:**
```json
{
  "summary": {
    "total_systems": 6,
    "available_systems": 6,
    "availability_percentage": 100
  },
  "systems": {
    "Workday": {"available": true, "response_time_ms": ...},
    ...
  }
}
```
**Decision Logic:**
- If `availability_percentage >= 80`: Safe to run bulk audits
- If `availability_percentage < 80`: Postpone or expect partial results
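The decision logic can be applied programmatically to a `check_system_health()` response. This is a sketch under assumptions: `bulk_audit_decision` is a hypothetical helper, and the input is assumed to have the `summary` and `systems` keys shown in the expected result.

```python
def bulk_audit_decision(health, threshold=80):
    """Return the suggested action and the names of unavailable systems."""
    percentage = health["summary"]["availability_percentage"]
    down = sorted(
        name for name, status in health["systems"].items() if not status["available"]
    )
    action = "proceed" if percentage >= threshold else "postpone"
    return action, down


# Example input shaped like a check_system_health() response
health = {
    "summary": {"availability_percentage": 83},
    "systems": {
        "Workday": {"available": True},
        "Lansweeper": {"available": False},
    },
}
action, down = bulk_audit_decision(health)
```

Returning the list of unavailable systems alongside the action makes it easy to warn that results for those systems will be partial even when the audit proceeds.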
---
## Deployment
### Prerequisites
```bash
# Navigate to nexus-mcp
cd nexus-mcp
# Install dependencies (including tenacity)
pip install -e .
```
### Verify Installation
```bash
python -c "from resilience import resilient_http_call; print('✓ Installed')"
```
### Run in Production
**With credential-based authentication:**
```bash
USE_MOCK=false python src/main.py
```
**With mock data (testing):**
```bash
USE_MOCK=true python src/main.py
```
### Monitoring
Watch logs for:
- `[resilience]` messages: retry events and circuit breaker state changes
- `CircuitBreakerOpenError`: indicates a sustained service outage
- Retry counts: frequent retries indicate transient network issues
**Example Alert Rules:**
- If `CircuitBreakerOpenError` is found in logs → Investigate service
- If `"Retry attempt 2/3"` is repeated > 10 times in 5 minutes → Network degradation
- If a log line matches `Circuit.*OPEN` → Service outage (escalate to on-call)
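These rules can be prototyped with a short log scanner before wiring them into a real alerting system. The sketch below is illustrative only: `scan_logs` and the alert names are hypothetical, and it assumes the caller has already restricted `lines` to the last 5 minutes of logs.

```python
import re
from collections import Counter

# Patterns taken from the alert rules above
RULES = {
    "investigate_service": re.compile(r"CircuitBreakerOpenError"),
    "service_outage": re.compile(r"Circuit.*OPEN"),
    "retry_pressure": re.compile(r"Retry attempt 2/3"),
}


def scan_logs(lines, retry_threshold=10):
    """Return the alerts triggered by one window of log lines."""
    counts = Counter()
    for line in lines:
        for name, pattern in RULES.items():
            if pattern.search(line):
                counts[name] += 1
    alerts = []
    if counts["investigate_service"]:
        alerts.append("investigate_service")
    if counts["service_outage"]:
        alerts.append("service_outage")
    if counts["retry_pressure"] > retry_threshold:  # strictly more than 10 repeats
        alerts.append("network_degradation")
    return alerts
```

Note that `Circuit.*OPEN` also matches recovery messages such as `Circuit HALF_OPEN → CLOSED`, so a production rule may need a stricter pattern.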
---
## Troubleshooting
### Symptom: "CircuitBreakerOpenError: Workday circuit breaker is OPEN"
**Cause:** 5 consecutive Workday failures reached the circuit breaker threshold and opened the circuit.
**Solution:**
1. Check Workday status (https://status.workday.com)
2. Verify credentials in `.env`; test manually with `curl` or Postman
3. Check network connectivity: can you reach `api.myworkday.com`?
4. Wait 60 seconds for the circuit to enter the half-open state and test recovery
5. Monitor logs for `"Circuit HALF_OPEN → CLOSED"`, which indicates recovery
### Symptom: Audit returns empty `systems_available` list
**Cause:** All systems are down or credentials are invalid.
**Solution:**
1. Run `check_system_health()` to identify which system is down
2. For downed systems:
   - Check system status pages
   - Verify network connectivity
   - Wait for service to recover
3. For credential issues:
   - Verify `.env` has valid credentials
   - Test credentials manually via API (e.g., `curl` for Workday OAuth)
   - Regenerate tokens/credentials if expired
### Symptom: Slow response times even on successful requests
**Observe:** Use `check_system_health()` to identify slow systems.
**Solution:**
- If `response_time_ms > 5000`: System is under load, expect slower audits
- If network latency is the cause: consider running audits during low-traffic windows
- Consider increasing timeouts if system is reliably slow but functional
### Symptom: Excessively verbose retry logs
**Cause:** Transient network issues causing multiple retries.
**Solution:**
- Expected during network instability
- Monitor for patterns (e.g., always fails at certain time)
- Use `check_system_health()` to confirm system is reachable
- If persistent, investigate network (firewall, ISP, proxy issues)
---
## Configuration
### Retry Policy
**Currently Hard-Coded:**
- Max attempts: 3
- Backoff: exponential (2s, 4s, 8s)
**To Customize:**
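Change `max_attempts` where the decorator is applied (for example, in `lib/workday_client.py`); the decorator itself is defined in [lib/resilience.py](lib/resilience.py):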
```python
@resilient_http_call(service_name="Workday", max_attempts=5) # ← Change here
```
### Circuit Breaker Threshold
**Currently Hard-Coded:**
- Failure threshold: 5 consecutive failures
- Timeout before half-open: 60 seconds
**To Customize:**
Edit [lib/resilience.py](lib/resilience.py):
```python
breaker = CircuitBreaker("Workday", failure_threshold=10, timeout_seconds=120)
```
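Until these values move into configuration, one possible approach is to read overrides from environment variables and fall back to the current defaults. This is a sketch only: the variable names (`RESILIENCE_MAX_ATTEMPTS` and so on), `ResilienceSettings`, and `load_settings` are hypothetical and do not exist in the codebase.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class ResilienceSettings:
    """Defaults mirror the hard-coded values documented above."""

    max_attempts: int = 3
    failure_threshold: int = 5
    timeout_seconds: int = 60


def load_settings(env=os.environ):
    """Read overrides from the environment, falling back to the defaults."""
    defaults = ResilienceSettings()
    return ResilienceSettings(
        max_attempts=int(env.get("RESILIENCE_MAX_ATTEMPTS", defaults.max_attempts)),
        failure_threshold=int(env.get("RESILIENCE_FAILURE_THRESHOLD", defaults.failure_threshold)),
        timeout_seconds=int(env.get("RESILIENCE_TIMEOUT_SECONDS", defaults.timeout_seconds)),
    )
```

The loaded values could then be passed to `@resilient_http_call(max_attempts=...)` and `CircuitBreaker(..., failure_threshold=..., timeout_seconds=...)` instead of literals.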
---
## Future Enhancements
1. **Configurable Retry Policy:** Move retry/backoff settings to `.env` or a config file
2. **Metrics & Observability:** Track retry counts and circuit breaker events in audit logs
3. **Token Expiration Handling:** Cache token expiry times and refresh proactively (CRITICAL #2)
4. **PowerShell Command Injection Fix:** Use parameterized queries to prevent AD injection attacks (CRITICAL #3)
5. **Database Fallback:** Cache drift results locally for offline resilience
6. **Rate Limiting:** Implement exponential backoff to respect API rate limits
---
## References
- **Code Health Report:** `documentation/reports/code-health-report-2026-04-13.md`
- **Tenacity Docs:** https://tenacity.readthedocs.io/
- **Feature Branch:** `feat/add-enterprise-resilience`
- **Commits:**
  - `6337182`: Initial implementation
  - `eb8b14b`: Fix retry logic and datetime deprecation