# Enterprise System Resilience Feature

## Overview

This document describes the enterprise system resilience feature that resolves **CRITICAL #1** from the code health report: "No Resilience When Enterprise Systems Fail."

**Problem:** The HTTP clients crashed on any API failure. If Workday went down during a weekly drift audit, the entire audit failed, even though AD and Entra data were still accessible.

**Solution:** Automatic retry logic with exponential backoff, a circuit breaker pattern, and graceful degradation allow drift audits to continue with partial data when some systems are unavailable.

---

## Features

### 1. Automatic Retry Logic

All HTTP clients automatically retry transient failures with exponential backoff:

- **Max Attempts:** 3 (the default; set per decorator via `max_attempts`)
- **Backoff Strategy:** exponential delay of 2s → 4s → 8s between attempts (with the default of 3 attempts, only the 2s and 4s waits occur)
- **Retries On:** 5xx errors, timeouts, connection errors
- **Does NOT Retry:** 4xx errors (client errors such as 404 fail immediately)

**Example: Transient Failure**
```
Attempt 1: 503 Service Unavailable → wait 2s
Attempt 2: 503 Service Unavailable → wait 4s
Attempt 3: Data returned → success ✓
```

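The retry behaviour described above can be sketched in a few lines of plain `asyncio`. This is only an illustration, not the code in `lib/resilience.py` (which builds on `tenacity`): `TransientError` and `retry_with_backoff` are hypothetical names, and the injectable `sleep` parameter is an assumption added so the delays can be observed without waiting.

```python
import asyncio


class TransientError(Exception):
    """Hypothetical stand-in for a 5xx response, timeout, or connection error."""


async def retry_with_backoff(call, max_attempts=3, base_delay=2.0, sleep=asyncio.sleep):
    """Await call(); on TransientError wait 2s, 4s, 8s, ... and try again.

    Any other exception (for example a 4xx client error) is not caught here,
    so it propagates immediately without a retry.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return await call()
        except TransientError:
            if attempt == max_attempts:
                raise  # retries exhausted: surface the last error
            # Delay doubles after every failed attempt: 2s, then 4s, then 8s.
            await sleep(base_delay * 2 ** (attempt - 1))
```

With the default of three attempts the helper sleeps at most twice, which matches the transient-failure trace shown earlier.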
### 2. Circuit Breaker Pattern

Prevents hammering a failing service by "opening the circuit":

- **Threshold:** 5 consecutive failures trigger the circuit to open
- **Open State (60s):** Subsequent requests fail instantly with `CircuitBreakerOpenError` (no time wasted waiting on timeouts)
- **Half-Open State (testing):** After the 60s timeout, one test request is allowed
- **Closed State (recovery):** If the test request succeeds, the circuit closes and normal operation resumes

**Example: Sustained Failure**
```
Requests 1-5: Each retries 3 times (network errors); circuit opens after the 5th failure
Request 6: Circuit open, fails immediately (no retry)
Request 7: Circuit still open, fails fast (<100ms)
After 60s: Circuit half-open, test request sent
Test success: Circuit closes, normal retries resume
```

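The state machine above can be sketched as a small class. The constructor arguments mirror the `CircuitBreaker("Workday", failure_threshold=..., timeout_seconds=...)` call shown in the Configuration section, but the method names (`before_call`, `record_success`, `record_failure`) and the injectable `clock` are assumptions made for illustration; the real class in `lib/resilience.py` may differ. As a simplification, this sketch does not limit the half-open state to a single in-flight test request.

```python
import time


class CircuitBreakerOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, name, failure_threshold=5, timeout_seconds=60, clock=time.monotonic):
        self.name = name
        self.failure_threshold = failure_threshold
        self.timeout_seconds = timeout_seconds
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    @property
    def state(self):
        if self._opened_at is None:
            return "CLOSED"
        if self._clock() - self._opened_at >= self.timeout_seconds:
            return "HALF_OPEN"  # timeout elapsed: a test request may pass
        return "OPEN"

    def before_call(self):
        """Fail fast while the circuit is open."""
        if self.state == "OPEN":
            raise CircuitBreakerOpenError(f"{self.name} circuit breaker is OPEN")

    def record_success(self):
        """A success closes the circuit and resets the failure count."""
        self._failures = 0
        self._opened_at = None

    def record_failure(self):
        """Open on the Nth consecutive failure, or re-open after a failed test."""
        self._failures += 1
        if self.state == "HALF_OPEN" or self._failures >= self.failure_threshold:
            self._opened_at = self._clock()
```

A caller would invoke `before_call()` ahead of each request and report the outcome with `record_success()` or `record_failure()` afterwards.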
### 3. Graceful Degradation in Audit Tools

Audit tools wrap each system call separately, so if one system fails, the audit continues with the systems that are still available:

**audit_user_drift() Example:**
```python
# Before: any failure crashed the entire audit
# After: each system call is wrapped separately
systems_available, systems_failed = [], []

try:
    wd_data = await _get_wd().get("/staffing/v6/workers", ...)
    systems_available.append("Workday")
except Exception as e:
    # Record the failure and keep going instead of aborting the audit
    systems_failed.append("Workday")
    logger.warning(f"Workday unavailable: {e}")

# Continue with AD and Entra even if Workday failed...
```

**Response Example:**
```json
{
  "email": "john.doe@wheels.com",
  "systems_checked": ["Workday", "ActiveDirectory", "Entra"],
  "systems_available": ["ActiveDirectory", "Entra"],
  "systems_failed": ["Workday"],
  "workday_found": false,
  "ad_found": true,
  "entra_found": true,
  "discrepancy_count": 1,
  "discrepancies": [
    {
      "field": "job_title",
      "system_a": "ActiveDirectory",
      "value_a": "Senior Engineer",
      "system_b": "Entra",
      "value_b": "Engineer",
      "severity": "medium"
    }
  ]
}
```

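The same idea generalises to any number of systems. The sketch below is an assumption-laden variant rather than the code in `src/shards/audit.py`: it uses `asyncio.gather(..., return_exceptions=True)` instead of sequential `try`/`except` blocks, and `collect_from_systems` and the `fetchers` mapping are hypothetical names.

```python
import asyncio


async def collect_from_systems(fetchers):
    """Call every fetcher concurrently; one failure must not abort the others.

    `fetchers` maps a system name to a zero-argument coroutine function.
    """
    names = list(fetchers)
    # return_exceptions=True hands back raised exceptions as results
    # instead of cancelling the remaining calls.
    results = await asyncio.gather(
        *(fetchers[name]() for name in names), return_exceptions=True
    )
    data, failed = {}, []
    for name, result in zip(names, results):
        if isinstance(result, Exception):
            failed.append(name)
        else:
            data[name] = result
    return {
        "systems_checked": names,
        "systems_available": list(data),
        "systems_failed": failed,
        "data": data,
    }
```

The returned dictionary reuses the `systems_checked`, `systems_available`, and `systems_failed` keys from the response example; the `data` key is an addition for this sketch.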
### 4. Proactive Health Monitoring

**New Tool: `check_system_health()`**

Pings all enterprise systems and returns availability and response times:

```json
{
  "timestamp": "2026-04-13T14:30:00Z",
  "systems": {
    "Workday": {"available": true, "response_time_ms": 245},
    "ActiveDirectory": {"available": true, "response_time_ms": 150},
    "Entra": {"available": true, "response_time_ms": 320},
    "Lansweeper": {"available": false, "error": "TimeoutException..."},
    "Intune": {"available": true, "response_time_ms": 280},
    "Helix": {"available": true, "response_time_ms": 410}
  },
  "summary": {
    "total_systems": 6,
    "available_systems": 5,
    "unavailable_systems": 1,
    "availability_percentage": 83
  }
}
```

**Use Case:** Run this before bulk audits to decide whether to proceed or wait.

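To make the `summary` block concrete, here is a minimal sketch of how it could be derived from the per-system results. `summarize_health` is a hypothetical helper, and rounding the percentage to the nearest integer is an assumption (5 of 6 systems gives 83 either by rounding or by truncation, so the example response does not settle which one the tool uses).

```python
def summarize_health(systems):
    """Build the summary block from a {name: {"available": bool, ...}} mapping."""
    total = len(systems)
    available = sum(1 for status in systems.values() if status.get("available"))
    return {
        "total_systems": total,
        "available_systems": available,
        "unavailable_systems": total - available,
        # Guard against an empty mapping to avoid dividing by zero.
        "availability_percentage": round(100 * available / total) if total else 0,
    }
```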
---

## Implementation Details

### Modified Files

| File | Change |
|------|--------|
| `pyproject.toml` | Added `tenacity>=8.2.0` dependency |
| `lib/resilience.py` | **NEW**: retry decorator, circuit breaker, 404 handler |
| `lib/workday_client.py` | Applied `@resilient_http_call` to `get()`, `raas()` |
| `lib/entra_client.py` | Applied `@resilient_http_call` to `get()`, `get_all_pages()` |
| `lib/helix_client.py` | Applied `@resilient_http_call` to `get()`, `post()` |
| `lib/intune_client.py` | Applied `@resilient_http_call` to `get()` |
| `lib/lansweeper_client.py` | Applied `@resilient_http_call` to `gql()` |
| `lib/fedex_client.py` | Applied `@resilient_http_call` to `post()` |
| `src/shards/audit.py` | Graceful degradation in `audit_user_drift()`, `audit_device_drift()`, new `check_system_health()` tool |
| `tests/test_resilience.py` | **NEW**: 12 comprehensive unit tests |

### Decorators

#### @resilient_http_call

Applies retry logic and circuit breaker to async HTTP functions:

```python
from resilience import resilient_http_call

# Excerpt: a method of the client class.
# Assumes self._http is an async HTTP client configured with the service's base URL.
@resilient_http_call(service_name="Workday", max_attempts=3)
async def get(self, path: str) -> dict:
    resp = await self._http.get(path)
    resp.raise_for_status()  # 5xx is retried by the decorator; 4xx is raised immediately
    return resp.json()
```

**Parameters:**
- `service_name` (str): Service identifier for logging and circuit breaker tracking
- `max_attempts` (int): Maximum number of attempts, including the first call (default: 3)
- `enable_circuit_breaker` (bool): Whether to use the circuit breaker (default: True)

#### @handle_404_gracefully

Converts 404 errors to `None` instead of raising:

```python
from resilience import handle_404_gracefully, resilient_http_call

# Excerpt: a method of the client class.
@handle_404_gracefully
@resilient_http_call(service_name="Entra")
async def get_user(self, user_id: str) -> dict | None:
    resp = await self._http.get(f"/users/{user_id}")
    resp.raise_for_status()
    return resp.json()

# `client` stands for an instance of the class that defines get_user
result = await client.get_user("nonexistent-id")  # Returns None instead of raising
```

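For readers curious how such a decorator can work, the following is a minimal sketch under stated assumptions. `HTTPStatusError` here is a hypothetical stand-in exception carrying a `status_code` attribute; the real decorator in `lib/resilience.py` presumably inspects whatever status error the project's HTTP client raises.

```python
import asyncio
import functools


class HTTPStatusError(Exception):
    """Hypothetical stand-in for an HTTP client's status error."""

    def __init__(self, status_code):
        super().__init__(f"HTTP {status_code}")
        self.status_code = status_code


def handle_404_gracefully(func):
    """Wrap an async function so that a 404 yields None; other errors still raise."""

    @functools.wraps(func)
    async def wrapper(*args, **kwargs):
        try:
            return await func(*args, **kwargs)
        except HTTPStatusError as exc:
            if exc.status_code == 404:
                return None  # "not found" is an expected outcome, not a failure
            raise

    return wrapper
```

Placing this decorator outermost, as in the example above, means the 404 passes through the retry layer untouched (4xx errors are not retried) and is converted to `None` only at the end.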
---

## Testing

### Run All Tests

```bash
cd nexus-mcp
pytest tests/test_resilience.py -v
```

**Expected Output:**
```
tests/test_resilience.py::TestCircuitBreaker::test_circuit_closed_to_open_after_threshold_failures PASSED
tests/test_resilience.py::TestCircuitBreaker::test_circuit_half_open_to_closed_on_success PASSED
tests/test_resilience.py::TestCircuitBreaker::test_circuit_half_open_to_open_on_failure PASSED
tests/test_resilience.py::TestCircuitBreaker::test_circuit_resets_on_success PASSED
tests/test_resilience.py::TestResilientHttpCall::test_retries_on_timeout_exception PASSED
tests/test_resilience.py::TestResilientHttpCall::test_retries_on_5xx_errors PASSED
tests/test_resilience.py::TestResilientHttpCall::test_no_retry_on_4xx_errors PASSED
tests/test_resilience.py::TestResilientHttpCall::test_exhausts_retries_and_raises PASSED
tests/test_resilience.py::TestHandle404Gracefully::test_converts_404_to_none PASSED
tests/test_resilience.py::TestHandle404Gracefully::test_does_not_convert_other_errors PASSED
tests/test_resilience.py::TestHandle404Gracefully::test_returns_normal_result_on_success PASSED
tests/test_resilience.py::TestCircuitBreakerIntegration::test_circuit_breaker_opens_after_failures PASSED

======================== 12 passed in 12.40s ========================
```

### Manual Testing

#### Test 1: Graceful Degradation

**Setup:**
1. Edit `.env` and temporarily invalidate one credential (e.g., `WORKDAY_CLIENT_ID=invalid`)
2. Ensure `USE_MOCK=false` (live mode)

**Run:**
```bash
python src/main.py
# In MCP client:
audit_user_drift(email="test@example.com")
```

**Expected Result:**
```json
{
  "systems_available": ["ActiveDirectory", "Entra"],
  "systems_failed": ["Workday"],
  "discrepancy_count": 1
}
```

**Verification:**
- ✅ No crash
- ✅ Audit continues with available systems
- ✅ Drift comparison runs for AD ↔ Entra

#### Test 2: Circuit Breaker

**Setup:**
1. Simulate a sustained Workday outage (disable the service or block it at the firewall)
2. Keep the credentials valid, so the service is unreachable rather than unauthorized

**Run:**
```bash
python src/main.py
# In MCP client:
audit_bulk_user_drift(emails=["user1@example.com", "user2@example.com", ..., "user10@example.com"])
```

**Expected Logs:**
```
[audit_user_drift] Workday: Attempt 1/3 (retry on transient error)
[audit_user_drift] Workday: Attempt 2/3 (retry on transient error)
[audit_user_drift] Workday: Attempt 3/3 (retry on transient error)
[resilience] [Workday] Circuit CLOSED → OPEN (5 consecutive failures)
[audit_user_drift] Workday: CircuitBreakerOpenError (fast-fail)
```

**Verification:**
- ✅ First 5 requests retry 3 times each
- ✅ Subsequent requests fail instantly (< 100ms)
- ✅ Logs show circuit state transitions

#### Test 3: Retry on Transient Failure

**Setup:**
1. Use valid credentials
2. Introduce a 1-second network delay (via a proxy or `tc` on Linux)

**Run:**
```bash
python src/main.py
# In MCP client:
audit_user_drift(email="test@example.com")
```

**Expected Result:**
- ✅ Tool succeeds (after retries)
- ✅ Response includes full drift data
- ✅ Logs show "Retry attempt 1/3", "Retry attempt 2/3"

#### Test 4: Health Check

**Run:**
```bash
python src/main.py
# In MCP client:
check_system_health()
```

**Expected Result:**
```json
{
  "summary": {
    "total_systems": 6,
    "available_systems": 6,
    "availability_percentage": 100
  },
  "systems": {
    "Workday": {"available": true, "response_time_ms": ...},
    ...
  }
}
```

**Decision Logic:**
- If `availability_percentage >= 80`: Safe to run bulk audits
- If `availability_percentage < 80`: Postpone or expect partial results

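The decision logic above is easy to automate. The helper below is a hypothetical sketch (the name `should_run_bulk_audit` and its return shape are assumptions); it reads the `summary` and `systems` keys of a `check_system_health()` response as shown in this document.

```python
def should_run_bulk_audit(health, threshold=80):
    """Return (proceed, unavailable_systems) for a check_system_health() result."""
    percentage = health["summary"]["availability_percentage"]
    # List the systems that are down so the caller can report them.
    unavailable = [
        name
        for name, status in health.get("systems", {}).items()
        if not status.get("available")
    ]
    return percentage >= threshold, unavailable
```

Returning the unavailable systems alongside the verdict lets a caller that proceeds at, say, 83% still warn that results for those systems will be partial.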
---

## Deployment

### Prerequisites

```bash
# Navigate to nexus-mcp
cd nexus-mcp

# Install dependencies (including tenacity)
pip install -e .
```

### Verify Installation

```bash
python -c "from resilience import resilient_http_call; print('✓ Installed')"
```

### Run in Production

**With credential-based authentication:**
```bash
USE_MOCK=false python src/main.py
```

**With mock data (testing):**
```bash
USE_MOCK=true python src/main.py
```

### Monitoring

Watch logs for:
- `[resilience]` messages: retry events and circuit breaker state changes
- `CircuitBreakerOpenError`: indicates a sustained service outage
- Retry counts: indicate transient network issues

**Example Alert Rules:**
- If `CircuitBreakerOpenError` appears in logs → Investigate service
- If `Retry attempt 2/3` is repeated more than 10 times in 5 minutes → Network degradation
- If a log line matches `Circuit.*OPEN` → Service outage (escalate to on-call)

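As an illustration only, the alert rules above could be evaluated over parsed log entries roughly as follows. The function name, the alert labels, and the `(timestamp, message)` input shape are all assumptions; how timestamps are extracted from the real log format is left out.

```python
import re
from collections import deque
from datetime import datetime, timedelta

# Pattern taken from the alert rules above.
CIRCUIT_OPEN = re.compile(r"Circuit.*OPEN")


def evaluate_alerts(entries, retry_limit=10, window=timedelta(minutes=5)):
    """Return the set of alert labels triggered by (timestamp, message) entries."""
    alerts = set()
    recent_retries = deque()  # timestamps of retries inside the sliding window
    for ts, message in entries:
        if "CircuitBreakerOpenError" in message:
            alerts.add("investigate_service")
        if CIRCUIT_OPEN.search(message):
            alerts.add("service_outage")
        if "Retry attempt 2/3" in message:
            recent_retries.append(ts)
            # Drop retries that fell out of the 5-minute window.
            while ts - recent_retries[0] > window:
                recent_retries.popleft()
            if len(recent_retries) > retry_limit:
                alerts.add("network_degradation")
    return alerts
```

Note that `Circuit.*OPEN` also matches lines containing `HALF_OPEN`, so a production rule may want a stricter pattern such as one anchored on `→ OPEN`.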
---

## Troubleshooting

### Symptom: "CircuitBreakerOpenError: Workday circuit breaker is OPEN"

**Cause:** 5 consecutive Workday failures, which is the circuit breaker threshold.

**Solution:**
1. Check Workday status (https://status.workday.com)
2. Verify credentials in `.env` and test them manually with `curl` or Postman
3. Check network connectivity: can you reach `api.myworkday.com`?
4. Wait 60 seconds for the circuit to enter the half-open state and test recovery
5. Monitor logs for `"Circuit HALF_OPEN → CLOSED"`, which indicates recovery

### Symptom: Audit returns empty `systems_available` list

**Cause:** All systems are down or credentials are invalid.

**Solution:**
1. Run `check_system_health()` to identify which systems are down
2. For downed systems:
   - Check system status pages
   - Verify network connectivity
   - Wait for service to recover
3. For credential issues:
   - Verify `.env` has valid credentials
   - Test credentials manually via API (e.g., `curl` for Workday OAuth)
   - Regenerate tokens/credentials if expired

### Symptom: Slow response times even on successful requests

**Observe:** Use `check_system_health()` to identify slow systems.

**Solution:**
- If `response_time_ms > 5000`: The system is under load; expect slower audits
- If network latency is the cause, consider running audits during low-traffic windows
- Consider increasing timeouts if a system is reliably slow but functional

### Symptom: Excessively verbose retry logs

**Cause:** Transient network issues causing multiple retries.

**Solution:**
- This is expected during network instability
- Monitor for patterns (e.g., failures that always occur at a certain time)
- Use `check_system_health()` to confirm the system is reachable
- If persistent, investigate the network (firewall, ISP, proxy issues)

---

## Configuration

### Retry Policy

**Currently Hard-Coded:**
- Max attempts: 3 (default)
- Backoff: exponential (2s, 4s, 8s)

**To Customize:**
Pass a different `max_attempts` where the decorator is applied (e.g., in `lib/workday_client.py`); the decorator itself is defined in [lib/resilience.py](lib/resilience.py):

```python
@resilient_http_call(service_name="Workday", max_attempts=5)  # ← Change here
```

### Circuit Breaker Threshold

**Currently Hard-Coded:**
- Failure threshold: 5 consecutive failures
- Timeout before half-open: 60 seconds

**To Customize:**
Edit [lib/resilience.py](lib/resilience.py):

```python
breaker = CircuitBreaker("Workday", failure_threshold=10, timeout_seconds=120)
```

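Until these values become configurable, one possible direction is to read them from the environment and fall back to the current defaults. The sketch below is purely illustrative: the `ResilienceSettings` class, the `load_settings` function, and the `RESILIENCE_*` variable names are hypothetical and do not exist in the codebase.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class ResilienceSettings:
    max_attempts: int = 3        # current hard-coded retry default
    failure_threshold: int = 5   # consecutive failures before the circuit opens
    timeout_seconds: int = 60    # wait before the half-open test request


def load_settings(env=os.environ):
    """Read hypothetical RESILIENCE_* variables, keeping defaults when unset."""
    defaults = ResilienceSettings()
    return ResilienceSettings(
        max_attempts=int(env.get("RESILIENCE_MAX_ATTEMPTS", defaults.max_attempts)),
        failure_threshold=int(
            env.get("RESILIENCE_FAILURE_THRESHOLD", defaults.failure_threshold)
        ),
        timeout_seconds=int(
            env.get("RESILIENCE_TIMEOUT_SECONDS", defaults.timeout_seconds)
        ),
    )
```

The resulting values could then be passed to `@resilient_http_call(max_attempts=...)` and `CircuitBreaker(..., failure_threshold=..., timeout_seconds=...)`.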
---

## Future Enhancements

1. **Configurable Retry Policy**: Move retry/backoff settings to `.env` or a config file
2. **Metrics & Observability**: Track retry counts and circuit breaker events in audit logs
3. **Token Expiration Handling**: Cache token expiry times and refresh proactively (CRITICAL #2)
4. **PowerShell Command Injection Fix**: Use parameterized queries to prevent AD injection attacks (CRITICAL #3)
5. **Database Fallback**: Cache drift results locally for offline resilience
6. **Rate Limiting**: Implement exponential backoff to respect API rate limits

---

## References

- **Code Health Report:** `documentation/reports/code-health-report-2026-04-13.md`
- **Tenacity Docs:** https://tenacity.readthedocs.io/
- **Feature Branch:** `feat/add-enterprise-resilience`
- **Commits:**
  - `6337182`: Initial implementation
  - `eb8b14b`: Fix retry logic and datetime deprecation