Enterprise System Resilience Feature
Overview
This document describes the enterprise system resilience feature that resolves CRITICAL #1 from the code health report: "No Resilience When Enterprise Systems Fail."
Problem: The HTTP clients crashed on any API failure. If Workday went down during a weekly drift audit, the entire audit failed, even though AD and Entra data were still accessible.
Solution: Automatic retry logic with exponential backoff, circuit breaker pattern, and graceful degradation allow drift audits to continue with partial data when some systems are unavailable.
Features
1. Automatic Retry Logic
All HTTP clients automatically retry transient failures with exponential backoff:
- Max Attempts: 3 (configurable)
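- Backoff Strategy: Exponential delay that starts at 2s and doubles after each failed attempt (2s → 4s → 8s); with the default 3 attempts, only the 2s and 4s waits occur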
- Retries On: 5xx errors, timeouts, connection errors
- Does NOT Retry: 4xx errors (client errors like 404 are instant failures)
Example: Transient Failure
Attempt 1: 503 Service Unavailable → wait 2s
Attempt 2: 503 Service Unavailable → wait 4s
Attempt 3: Data returned → success ✓
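The retry behaviour above can be sketched in plain `asyncio`. This is a minimal illustration, not the code in lib/resilience.py (which the document says is built on tenacity). The names `call_with_retry`, `TransientError`, `ClientError`, and the injectable `sleep` parameter are assumptions made for the example.

```python
import asyncio


class TransientError(Exception):
    """Stand-in for a 5xx response, timeout, or connection error."""


class ClientError(Exception):
    """Stand-in for a 4xx response, which is never retried."""


async def call_with_retry(func, max_attempts=3, base_delay=2.0, sleep=asyncio.sleep):
    """Call func(), retrying transient failures with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return await func()
        except TransientError:
            if attempt == max_attempts:
                raise  # retries exhausted: surface the last error
            # Delay doubles each time: 2s after attempt 1, 4s after attempt 2.
            await sleep(base_delay * 2 ** (attempt - 1))
        # ClientError is not caught here, so it propagates immediately.


async def demo():
    """Replay the transient-failure example: 503, 503, then data."""
    responses = iter([TransientError("503"), TransientError("503"), {"ok": True}])
    waits = []

    async def flaky():
        item = next(responses)
        if isinstance(item, Exception):
            raise item
        return item

    async def fake_sleep(seconds):
        waits.append(seconds)  # record the delay instead of really waiting

    result = await call_with_retry(flaky, sleep=fake_sleep)
    return result, waits


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Passing the sleep function in as a parameter keeps the sketch testable without waiting for real delays.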
2. Circuit Breaker Pattern
Prevents hammering a failing service by "opening the circuit":
- Threshold: 5 consecutive failures triggers the circuit to open
- Open State (60s): Subsequent requests fail instantly with `CircuitBreakerOpenError` (no timeout waste)
- Half-Open State (testing): After 60s timeout, one test request allowed
- Closed State (recovery): If test succeeds, circuit closes and normal operation resumes
Example: Sustained Failure
Requests 1-5: Each retries 3 times (network errors)
Request 6: Circuit is open, request fails immediately (no retry)
Request 7: Circuit still open, fails fast (<100ms)
After 60s: Circuit half-open, test request sent
Test success: Circuit closes, normal retries resume
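The state transitions above can be sketched as a small state machine. The constructor arguments mirror the `CircuitBreaker("Workday", failure_threshold=..., timeout_seconds=...)` call shown in the Configuration section. The method names (`before_call`, `record_success`, `record_failure`) and the `clock` parameter are hypothetical, and this sketch does not limit the half-open state to a single concurrent probe.

```python
import time


class CircuitBreakerOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, name, failure_threshold=5, timeout_seconds=60, clock=time.monotonic):
        self.name = name
        self.failure_threshold = failure_threshold
        self.timeout_seconds = timeout_seconds
        self._clock = clock  # injectable so the demo can control time
        self.state = "CLOSED"
        self._failures = 0
        self._opened_at = None

    def before_call(self):
        """Reject while open; after the timeout, go half-open and let the call through."""
        if self.state == "OPEN":
            if self._clock() - self._opened_at >= self.timeout_seconds:
                self.state = "HALF_OPEN"
            else:
                raise CircuitBreakerOpenError(f"{self.name} circuit breaker is OPEN")

    def record_success(self):
        """Any success resets the failure count and closes the circuit."""
        self._failures = 0
        self.state = "CLOSED"

    def record_failure(self):
        """Open on a failed probe or once the consecutive-failure threshold is hit."""
        self._failures += 1
        if self.state == "HALF_OPEN" or self._failures >= self.failure_threshold:
            self.state = "OPEN"
            self._opened_at = self._clock()


def demo():
    now = [0.0]  # fake clock, in seconds
    breaker = CircuitBreaker("Workday", clock=lambda: now[0])
    seen = []
    for _ in range(5):  # five consecutive failures
        breaker.before_call()
        breaker.record_failure()
    seen.append(breaker.state)
    try:
        breaker.before_call()  # sixth request is rejected without a network call
    except CircuitBreakerOpenError:
        seen.append("fast-fail")
    now[0] = 60.0  # timeout elapsed
    breaker.before_call()
    seen.append(breaker.state)
    breaker.record_success()
    seen.append(breaker.state)
    return seen


if __name__ == "__main__":
    print(demo())
```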
3. Graceful Degradation in Audit Tools
Audit tools wrap each system call separately, so if one system fails, the audit continues with available systems:
audit_user_drift() Example:
# Before: Any failure crashed the entire audit
# After: Wraps each system separately
try:
    wd_data = await _get_wd().get("/staffing/v6/workers", ...)
    systems_available.append("Workday")
except Exception as e:
    systems_failed.append("Workday")
    logger.warning(f"Workday unavailable: {e}")
# Continue with AD and Entra even if Workday failed...
Response Example:
{
  "email": "john.doe@wheels.com",
  "systems_checked": ["Workday", "ActiveDirectory", "Entra"],
  "systems_available": ["ActiveDirectory", "Entra"],
  "systems_failed": ["Workday"],
  "workday_found": false,
  "ad_found": true,
  "entra_found": true,
  "discrepancy_count": 1,
  "discrepancies": [
    {
      "field": "job_title",
      "system_a": "ActiveDirectory",
      "value_a": "Senior Engineer",
      "system_b": "Entra",
      "value_b": "Engineer",
      "severity": "medium"
    }
  ]
}
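The per-system wrapping generalises to a loop over all systems. The sketch below is one possible shape, assuming each system is represented by a zero-argument coroutine function. `collect_from_systems` and the three stub fetchers are hypothetical names, not functions from src/shards/audit.py.

```python
import asyncio
import logging

logger = logging.getLogger("audit")


async def collect_from_systems(fetchers):
    """Query each system independently; one failure does not abort the others.

    fetchers maps a system name to a zero-argument coroutine function.
    """
    data, available, failed = {}, [], []
    for name, fetch in fetchers.items():
        try:
            data[name] = await fetch()
            available.append(name)
        except Exception as exc:  # broad on purpose: any failure degrades the result
            failed.append(name)
            logger.warning("%s unavailable: %s", name, exc)
    return {
        "systems_checked": list(fetchers),
        "systems_available": available,
        "systems_failed": failed,
        "data": data,
    }


async def demo():
    async def workday():
        raise ConnectionError("Workday is down")

    async def active_directory():
        return {"job_title": "Senior Engineer"}

    async def entra():
        return {"job_title": "Engineer"}

    return await collect_from_systems(
        {"Workday": workday, "ActiveDirectory": active_directory, "Entra": entra}
    )


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

The drift comparison can then run over whatever appears in `systems_available`, which is how the AD ↔ Entra discrepancy in the response above is still reported while Workday is down.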
4. Proactive Health Monitoring
New Tool: check_system_health()
Pings all enterprise systems and returns availability + response times:
{
  "timestamp": "2026-04-13T14:30:00Z",
  "systems": {
    "Workday": {"available": true, "response_time_ms": 245},
    "ActiveDirectory": {"available": true, "response_time_ms": 150},
    "Entra": {"available": true, "response_time_ms": 320},
    "Lansweeper": {"available": false, "error": "TimeoutException..."},
    "Intune": {"available": true, "response_time_ms": 280},
    "Helix": {"available": true, "response_time_ms": 410}
  },
  "summary": {
    "total_systems": 6,
    "available_systems": 5,
    "unavailable_systems": 1,
    "availability_percentage": 83
  }
}
Use Case: Run this before bulk audits to decide whether to proceed or wait.
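The `summary` block can be derived from the per-system entries. The sketch below shows one way to do that. `summarize_health` is a hypothetical helper, and truncating the percentage to a whole number is an assumption that happens to reproduce the 83 shown above (5 of 6 systems).

```python
def summarize_health(systems):
    """Build the summary block from per-system {"available": bool, ...} entries."""
    total = len(systems)
    available = sum(1 for status in systems.values() if status["available"])
    return {
        "total_systems": total,
        "available_systems": available,
        "unavailable_systems": total - available,
        # Whole-number percentage, truncated; 0 when no systems are configured.
        "availability_percentage": available * 100 // total if total else 0,
    }


systems = {
    "Workday": {"available": True, "response_time_ms": 245},
    "ActiveDirectory": {"available": True, "response_time_ms": 150},
    "Entra": {"available": True, "response_time_ms": 320},
    "Lansweeper": {"available": False, "error": "TimeoutException..."},
    "Intune": {"available": True, "response_time_ms": 280},
    "Helix": {"available": True, "response_time_ms": 410},
}

if __name__ == "__main__":
    print(summarize_health(systems))
```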
Implementation Details
Modified Files
| File | Change |
|---|---|
| `pyproject.toml` | Added `tenacity>=8.2.0` dependency |
| `lib/resilience.py` | NEW: retry decorator, circuit breaker, 404 handler |
| `lib/workday_client.py` | Applied `@resilient_http_call` to `get()`, `raas()` |
| `lib/entra_client.py` | Applied `@resilient_http_call` to `get()`, `get_all_pages()` |
| `lib/helix_client.py` | Applied `@resilient_http_call` to `get()`, `post()` |
| `lib/intune_client.py` | Applied `@resilient_http_call` to `get()` |
| `lib/lansweeper_client.py` | Applied `@resilient_http_call` to `gql()` |
| `lib/fedex_client.py` | Applied `@resilient_http_call` to `post()` |
| `src/shards/audit.py` | Graceful degradation in `audit_user_drift()`, `audit_device_drift()`; new `check_system_health()` tool |
| `tests/test_resilience.py` | NEW: 12 comprehensive unit tests |
Decorators
@resilient_http_call
Applies retry logic and circuit breaker to async HTTP functions:
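from resilience import resilient_http_call

@resilient_http_call(service_name="Workday", max_attempts=3)
async def get(self, path: str) -> dict:
    # Method on the client class; the request uses the path argument
    resp = await self._http.get(path)
    resp.raise_for_status()  # raises on 4xx/5xx so the decorator sees the failure
    return resp.json()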
Parameters:
- `service_name` (str): Service identifier for logging and circuit breaker tracking
- `max_attempts` (int): Maximum retry attempts (default: 3)
- `enable_circuit_breaker` (bool): Whether to use circuit breaker (default: True)
@handle_404_gracefully
Converts 404 errors to None instead of raising:
from resilience import handle_404_gracefully, resilient_http_call

@handle_404_gracefully
@resilient_http_call(service_name="Entra")
async def get_user(self, user_id: str) -> dict | None:
    # Method on the client class, so it takes self
    resp = await self._http.get(f"/users/{user_id}")
    resp.raise_for_status()
    return resp.json()

# client is an instance of the class that defines get_user
result = await client.get_user("nonexistent-id")  # Returns None instead of raising
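A decorator with this behaviour could look like the sketch below. It is not the code in lib/resilience.py. `HTTPStatusError` here is a hypothetical stand-in for whatever status exception the HTTP client raises, and the demo `get_user` is a stub that fails on purpose.

```python
import asyncio
import functools


class HTTPStatusError(Exception):
    """Hypothetical stand-in for the HTTP client's status error."""

    def __init__(self, status_code):
        super().__init__(f"HTTP {status_code}")
        self.status_code = status_code


def handle_404_gracefully(func):
    """Return None when the wrapped coroutine raises a 404; re-raise anything else."""

    @functools.wraps(func)
    async def wrapper(*args, **kwargs):
        try:
            return await func(*args, **kwargs)
        except HTTPStatusError as exc:
            if exc.status_code == 404:
                return None
            raise

    return wrapper


@handle_404_gracefully
async def get_user(user_id):
    # Stub lookup: one id is "missing", one simulates a server error.
    if user_id == "nonexistent-id":
        raise HTTPStatusError(404)
    if user_id == "broken":
        raise HTTPStatusError(500)
    return {"id": user_id}


if __name__ == "__main__":
    print(asyncio.run(get_user("nonexistent-id")))
    print(asyncio.run(get_user("abc")))
```

Placing this decorator outermost, as in the example above, means the 404 is turned into `None` only after the inner retry layer has already declined to retry it.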
Testing
Run All Tests
cd nexus-mcp
pytest tests/test_resilience.py -v
Expected Output:
tests/test_resilience.py::TestCircuitBreaker::test_circuit_closed_to_open_after_threshold_failures PASSED
tests/test_resilience.py::TestCircuitBreaker::test_circuit_half_open_to_closed_on_success PASSED
tests/test_resilience.py::TestCircuitBreaker::test_circuit_half_open_to_open_on_failure PASSED
tests/test_resilience.py::TestCircuitBreaker::test_circuit_resets_on_success PASSED
tests/test_resilience.py::TestResilientHttpCall::test_retries_on_timeout_exception PASSED
tests/test_resilience.py::TestResilientHttpCall::test_retries_on_5xx_errors PASSED
tests/test_resilience.py::TestResilientHttpCall::test_no_retry_on_4xx_errors PASSED
tests/test_resilience.py::TestResilientHttpCall::test_exhausts_retries_and_raises PASSED
tests/test_resilience.py::TestHandle404Gracefully::test_converts_404_to_none PASSED
tests/test_resilience.py::TestHandle404Gracefully::test_does_not_convert_other_errors PASSED
tests/test_resilience.py::TestHandle404Gracefully::test_returns_normal_result_on_success PASSED
tests/test_resilience.py::TestCircuitBreakerIntegration::test_circuit_breaker_opens_after_failures PASSED
======================== 12 passed in 12.40s ========================
Manual Testing
Test 1: Graceful Degradation
Setup:
- Edit `.env`: temporarily invalidate one credential (e.g., `WORKDAY_CLIENT_ID=invalid`)
- Ensure `USE_MOCK=false` (live mode)
Run:
python src/main.py
# In MCP client:
audit_user_drift(email="test@example.com")
Expected Result:
{
  "systems_available": ["ActiveDirectory", "Entra"],
  "systems_failed": ["Workday"],
  "discrepancy_count": 1
}
Verification:
- ✅ No crash
- ✅ Audit continues with available systems
- ✅ Drift comparison runs for AD ↔ Entra
Test 2: Circuit Breaker
Setup:
- Simulate sustained Workday outage (disable service or firewall block)
- Credentials valid but service unreachable
Run:
python src/main.py
# In MCP client:
audit_bulk_user_drift(emails=["user1@example.com", "user2@example.com", ..., "user10@example.com"])
Expected Logs:
[audit_user_drift] Workday: Attempt 1/3 (retry on transient error)
[audit_user_drift] Workday: Attempt 2/3 (retry on transient error)
[audit_user_drift] Workday: Attempt 3/3 (retry on transient error)
[resilience] [Workday] Circuit CLOSED → OPEN (5 consecutive failures)
[audit_user_drift] Workday: CircuitBreakerOpenError (fast-fail)
Verification:
- ✅ First 5 requests retry 3 times each
- ✅ Subsequent requests fail instantly (< 100ms)
- ✅ Logs show circuit state transitions
Test 3: Retry on Transient Failure
Setup:
- Valid credentials
- Introduce 1-second network delay (via proxy or `tc` on Linux)
Run:
python src/main.py
# In MCP client:
audit_user_drift(email="test@example.com")
Expected Result:
- ✅ Tool succeeds (after retries)
- ✅ Response includes full drift data
- ✅ Logs show "Retry attempt 1/3", "Retry attempt 2/3"
Test 4: Health Check
Run:
python src/main.py
# In MCP client:
check_system_health()
Expected Result:
{
  "summary": {
    "total_systems": 6,
    "available_systems": 6,
    "availability_percentage": 100
  },
  "systems": {
    "Workday": {"available": true, "response_time_ms": ...},
    ...
  }
}
Decision Logic:
- If `availability_percentage >= 80`: Safe to run bulk audits
- If `availability_percentage < 80`: Postpone or expect partial results
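The decision logic can be applied programmatically before a bulk run. The sketch below assumes a response shaped like the `check_system_health()` output shown earlier. `bulk_audit_decision` and its return keys are hypothetical.

```python
def bulk_audit_decision(health, threshold=80):
    """Apply the 80% rule to a check_system_health()-style response."""
    percentage = health["summary"]["availability_percentage"]
    unavailable = sorted(
        name for name, status in health["systems"].items() if not status["available"]
    )
    return {
        "proceed": percentage >= threshold,
        # Systems that will show up in systems_failed if the audit runs anyway.
        "unavailable": unavailable,
    }


health = {
    "systems": {
        "Workday": {"available": True, "response_time_ms": 245},
        "Lansweeper": {"available": False, "error": "TimeoutException..."},
    },
    "summary": {"availability_percentage": 50},
}

if __name__ == "__main__":
    print(bulk_audit_decision(health))
```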
Deployment
Prerequisites
# Navigate to nexus-mcp
cd nexus-mcp
# Install dependencies (including tenacity)
pip install -e .
Verify Installation
python -c "from resilience import resilient_http_call; print('✓ Installed')"
Run in Production
With credential-based authentication:
USE_MOCK=false python src/main.py
With mock data (testing):
USE_MOCK=true python src/main.py
Monitoring
Watch logs for:
- `[resilience]` messages: retry events, circuit breaker state changes
- `CircuitBreakerOpenError`: indicates sustained service outage
- Retry counts: indicate transient network issues
Example Alert Rules:
- If `"CircuitBreakerOpenError found in logs"` → Investigate service
- If `"Retry attempt 2/3" repeated > 10 times in 5 minutes` → Network degradation
- If `"Circuit.*OPEN"` → Service outage (escalate to on-call)
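The three alert rules can be prototyped against parsed log entries before they are wired into a monitoring tool. This sketch assumes entries arrive as `(timestamp, message)` pairs in chronological order. `evaluate_alerts` and the alert labels are hypothetical.

```python
import re
from datetime import datetime, timedelta

CIRCUIT_OPEN = re.compile(r"Circuit.*OPEN")
RETRY_2_OF_3 = "Retry attempt 2/3"


def evaluate_alerts(entries, window=timedelta(minutes=5), retry_limit=10):
    """Return the set of alert labels triggered by (datetime, message) entries."""
    alerts = set()
    retry_times = []
    for timestamp, message in entries:
        if "CircuitBreakerOpenError" in message:
            alerts.add("investigate-service")
        if CIRCUIT_OPEN.search(message):
            alerts.add("service-outage")
        if RETRY_2_OF_3 in message:
            retry_times.append(timestamp)
            # Keep only the retries inside the sliding window ending now.
            retry_times = [t for t in retry_times if timestamp - t <= window]
            if len(retry_times) > retry_limit:
                alerts.add("network-degradation")
    return alerts


if __name__ == "__main__":
    start = datetime(2026, 4, 13, 14, 30)
    burst = [(start + timedelta(seconds=10 * i), RETRY_2_OF_3) for i in range(11)]
    print(evaluate_alerts(burst))
```

Note that the pattern `Circuit.*OPEN`, as written, also matches `HALF_OPEN` transitions such as the recovery message, so a stricter pattern may be needed to avoid escalating on recovery.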
Troubleshooting
Symptom: "CircuitBreakerOpenError: Workday circuit breaker is OPEN"
Cause: 5 consecutive Workday failures reached the circuit breaker threshold.
Solution:
- Check Workday status (https://status.workday.com)
- Verify credentials in `.env`; test manually with `curl` or Postman
- Check network connectivity: can you reach `api.myworkday.com`?
- Wait 60 seconds for circuit to enter half-open state and test recovery
- Monitor logs for `"Circuit HALF_OPEN → CLOSED"` indicating recovery
Symptom: Audit returns empty systems_available list
Cause: All systems are down or credentials are invalid.
Solution:
- Run `check_system_health()` to identify which system is down
- For downed systems:
  - Check system status pages
  - Verify network connectivity
  - Wait for service to recover
- For credential issues:
  - Verify `.env` has valid credentials
  - Test credentials manually via API (e.g., `curl` for Workday OAuth)
  - Regenerate tokens/credentials if expired
Symptom: Slow response times even on successful requests
Observe: Use check_system_health() to identify slow systems.
Solution:
- If `response_time_ms > 5000`: System is under load, expect slower audits
- Network latency → Consider running audits during low-traffic windows
- Consider increasing timeouts if system is reliably slow but functional
Symptom: Excessively verbose retry logs
Cause: Transient network issues causing multiple retries.
Solution:
- Expected during network instability
- Monitor for patterns (e.g., always fails at certain time)
- Use `check_system_health()` to confirm system is reachable
- If persistent, investigate network (firewall, ISP, proxy issues)
Configuration
Retry Policy
Currently Hard-Coded:
- Max attempts: 3
- Backoff: exponential (2s, 4s, 8s)
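To Customize: Change max_attempts where the decorator is applied (for example in lib/workday_client.py); the retry decorator itself, including the backoff delays, lives in lib/resilience.py: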
@resilient_http_call(service_name="Workday", max_attempts=5) # ← Change here
Circuit Breaker Threshold
Currently Hard-Coded:
- Failure threshold: 5 consecutive failures
- Timeout before half-open: 60 seconds
To Customize: Edit lib/resilience.py:
breaker = CircuitBreaker("Workday", failure_threshold=10, timeout_seconds=120)
Future Enhancements
- Configurable Retry Policy: Move retry/backoff settings to `.env` or config file
- Metrics & Observability: Track retry counts, circuit breaker events in audit logs
- Token Expiration Handling: Cache token expiry times and refresh proactively (CRITICAL #2)
- PowerShell Command Injection Fix: Use parameterized queries to prevent AD injection attacks (CRITICAL #3)
- Database Fallback: Cache drift results locally for offline resilience
- Rate Limiting: Implement exponential backoff to respect API rate limits
References
- Code Health Report: `documentation/reports/code-health-report-2026-04-13.md`
- Tenacity Docs: https://tenacity.readthedocs.io/
- Feature Branch: `feat/add-enterprise-resilience`
- Commits:
  - `6337182`: Initial implementation
  - `eb8b14b`: Fix retry logic and datetime deprecation