nathan f040ac2d07 feat(docs): add comprehensive documentation for Nexus work item register, read-only verification, resilience features, and test validation report
- Created `nexus-work-item-register.md` to establish a canonical registry for NEXUS-XXX work items, including shard assignments and a full work item backlog.
- Added `READ_ONLY_VERIFICATION.md` detailing the results of a security audit confirming zero write capabilities across integrated systems.
- Introduced `RESILIENCE.md` outlining the new enterprise system resilience feature, including automatic retry logic, circuit breaker pattern, and graceful degradation strategies.
- Developed `TEST_VALIDATION_REPORT.md` summarizing the successful rebuild of the Nexus MCP server with full audit shard functionality and comprehensive test results.
2026-04-14 15:00:15 -04:00

Enterprise System Resilience Feature

Overview

This document describes the enterprise system resilience feature that resolves CRITICAL #1 from the code health report: "No Resilience When Enterprise Systems Fail."

Problem: The HTTP clients crashed on any API failure. If Workday went down during a weekly drift audit, the entire audit failed, even though AD and Entra data were still accessible.

Solution: Automatic retry logic with exponential backoff, circuit breaker pattern, and graceful degradation allow drift audits to continue with partial data when some systems are unavailable.


Features

1. Automatic Retry Logic

All HTTP clients automatically retry transient failures with exponential backoff:

  • Max Attempts: 3 (configurable)
  • Backoff Strategy: exponential delay of 2s → 4s → 8s (with the default 3 attempts, only the 2s and 4s waits occur)
  • Retries On: 5xx errors, timeouts, connection errors
  • Does NOT Retry: 4xx errors (client errors like 404 are instant failures)

Example: Transient Failure

Attempt 1: 503 Service Unavailable → wait 2s
Attempt 2: 503 Service Unavailable → wait 4s
Attempt 3: Data returned → success ✓
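
The retry behaviour described above can be sketched with the standard library alone. This is a minimal illustration, not the code in lib/resilience.py (which, per pyproject.toml, builds on tenacity). TransientError, retry_with_backoff, and the injectable sleep parameter are hypothetical names introduced here so the example runs without real waiting.

```python
import asyncio


class TransientError(Exception):
    """Hypothetical stand-in for a 5xx response, timeout, or connection error."""


async def retry_with_backoff(call, max_attempts=3, base_delay=2.0, sleep=asyncio.sleep):
    """Await call() up to max_attempts times, doubling the delay after each transient failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            return await call()
        except TransientError:
            if attempt == max_attempts:
                raise  # retries exhausted: surface the last error
            await sleep(base_delay * 2 ** (attempt - 1))  # 2s, then 4s, ...
        # Any other exception (e.g. a 4xx) propagates immediately, without a retry.


async def demo():
    """Replay the example above: two 503 responses, then data."""
    outcomes = iter([TransientError("503"), TransientError("503"), {"ok": True}])
    delays = []

    async def flaky_call():
        outcome = next(outcomes)
        if isinstance(outcome, Exception):
            raise outcome
        return outcome

    async def fake_sleep(seconds):
        delays.append(seconds)  # record the delay instead of actually waiting

    result = await retry_with_backoff(flaky_call, sleep=fake_sleep)
    return result, delays
```

Running asyncio.run(demo()) returns the data together with the recorded delays, which makes it easy to see that three attempts involve only two waits.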

2. Circuit Breaker Pattern

Prevents hammering a failing service by "opening the circuit":

  • Threshold: 5 consecutive failures triggers the circuit to open
  • Open State (60s): Subsequent requests fail instantly with CircuitBreakerOpenError (no time wasted waiting on timeouts)
  • Half-Open State (testing): After the 60s timeout, one test request is allowed
  • Closed State (recovery): If the test request succeeds, the circuit closes and normal operation resumes

Example: Sustained Failure

Requests 1-5:  Each fails after 3 attempts (network errors); the 5th failure opens the circuit
Request 6:     Circuit is open, rejected immediately (no retry)
Request 7:     Circuit still open, fails fast (<100ms)
After 60s:     Circuit half-open, test request sent
Test success:  Circuit closes, normal retries resume
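
The state transitions above can be sketched as a small state machine. The constructor arguments mirror the CircuitBreaker(...) call shown under Configuration; the method names (before_call, record_success, record_failure) and the injectable clock are assumptions made for illustration, and the real class in lib/resilience.py may differ. As a simplification, this sketch does not limit the half-open state to a single in-flight test request.

```python
import time


class CircuitBreakerOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, name, failure_threshold=5, timeout_seconds=60, clock=time.monotonic):
        self.name = name
        self.failure_threshold = failure_threshold
        self.timeout_seconds = timeout_seconds
        self._clock = clock          # injectable so tests need not wait 60s
        self._failures = 0           # consecutive failures seen so far
        self._opened_at = None       # None means the circuit is closed

    @property
    def state(self):
        if self._opened_at is None:
            return "CLOSED"
        if self._clock() - self._opened_at >= self.timeout_seconds:
            return "HALF_OPEN"
        return "OPEN"

    def before_call(self):
        """Fail fast while the circuit is open; allow calls when closed or half-open."""
        if self.state == "OPEN":
            raise CircuitBreakerOpenError(f"{self.name} circuit breaker is OPEN")

    def record_success(self):
        """A success resets the failure count and closes the circuit."""
        self._failures = 0
        self._opened_at = None

    def record_failure(self):
        """Open the circuit at the threshold, or re-open it if the half-open test failed."""
        self._failures += 1
        if self.state == "HALF_OPEN" or self._failures >= self.failure_threshold:
            self._opened_at = self._clock()
```

A caller would invoke before_call() ahead of each request and then report the outcome with record_success() or record_failure().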

3. Graceful Degradation in Audit Tools

Audit tools wrap each system call separately, so if one system fails, the audit continues with available systems:

audit_user_drift() Example:

# Before: Any failure crashed the entire audit
# After: Wraps each system separately
try:
    wd_data = await _get_wd().get("/staffing/v6/workers", ...)
    systems_available.append("Workday")
except Exception as e:
    systems_failed.append("Workday")
    logger.warning(f"Workday unavailable: {e}")

# Continue with AD and Entra even if Workday failed...

Response Example:

{
  "email": "john.doe@wheels.com",
  "systems_checked": ["Workday", "ActiveDirectory", "Entra"],
  "systems_available": ["ActiveDirectory", "Entra"],
  "systems_failed": ["Workday"],
  "workday_found": false,
  "ad_found": true,
  "entra_found": true,
  "discrepancy_count": 1,
  "discrepancies": [
    {
      "field": "job_title",
      "system_a": "ActiveDirectory",
      "value_a": "Senior Engineer",
      "system_b": "Entra",
      "value_b": "Engineer",
      "severity": "medium"
    }
  ]
}
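
The per-system wrapping can be generalised into one loop. The sketch below is illustrative only: collect_from_systems and the three toy fetchers are hypothetical, and the "data" key is not part of the real response shown above.

```python
import asyncio


async def collect_from_systems(fetchers):
    """Call each system's fetcher independently and keep whatever succeeds."""
    data, available, failed = {}, [], []
    for name, fetch in fetchers.items():
        try:
            data[name] = await fetch()
            available.append(name)
        except Exception as exc:  # one failing system must not abort the audit
            failed.append(name)
            print(f"{name} unavailable: {exc}")
    return {
        "systems_checked": list(fetchers),
        "systems_available": available,
        "systems_failed": failed,
        "data": data,  # hypothetical key, for illustration only
    }


async def demo():
    """Simulate a Workday outage while AD and Entra respond normally."""

    async def workday():
        raise ConnectionError("connection refused")

    async def active_directory():
        return {"job_title": "Senior Engineer"}

    async def entra():
        return {"job_title": "Engineer"}

    return await collect_from_systems(
        {"Workday": workday, "ActiveDirectory": active_directory, "Entra": entra}
    )
```

The drift comparison can then run over whichever systems appear in systems_available, which is what allows the AD ↔ Entra discrepancy to be reported even though Workday failed.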

4. Proactive Health Monitoring

New Tool: check_system_health()

Pings all enterprise systems and returns availability + response times:

{
  "timestamp": "2026-04-13T14:30:00Z",
  "systems": {
    "Workday": {"available": true, "response_time_ms": 245},
    "ActiveDirectory": {"available": true, "response_time_ms": 150},
    "Entra": {"available": true, "response_time_ms": 320},
    "Lansweeper": {"available": false, "error": "TimeoutException..."},
    "Intune": {"available": true, "response_time_ms": 280},
    "Helix": {"available": true, "response_time_ms": 410}
  },
  "summary": {
    "total_systems": 6,
    "available_systems": 5,
    "unavailable_systems": 1,
    "availability_percentage": 83
  }
}

Use Case: Run this before bulk audits to decide whether to proceed or wait.
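
For reference, the summary block can be derived from the per-system results as sketched below. summarize_health is a hypothetical helper, and rounding to the nearest whole percent is an assumption that happens to match the sample above (5 of 6 systems gives 83).

```python
def summarize_health(systems):
    """Build the summary block from a mapping of system name to status dict."""
    total = len(systems)
    available = sum(1 for status in systems.values() if status.get("available"))
    return {
        "total_systems": total,
        "available_systems": available,
        "unavailable_systems": total - available,
        # Assumption: whole-number percentage, rounded to nearest
        "availability_percentage": round(100 * available / total) if total else 0,
    }


sample = {
    "Workday": {"available": True, "response_time_ms": 245},
    "ActiveDirectory": {"available": True, "response_time_ms": 150},
    "Entra": {"available": True, "response_time_ms": 320},
    "Lansweeper": {"available": False, "error": "TimeoutException..."},
    "Intune": {"available": True, "response_time_ms": 280},
    "Helix": {"available": True, "response_time_ms": 410},
}
summary = summarize_health(sample)
```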


Implementation Details

Modified Files

File                      Change
------------------------  ----------------------------------------------------
pyproject.toml            Added tenacity>=8.2.0 dependency
lib/resilience.py         NEW: Retry decorator, circuit breaker, 404 handler
lib/workday_client.py     Applied @resilient_http_call to get(), raas()
lib/entra_client.py       Applied @resilient_http_call to get(), get_all_pages()
lib/helix_client.py       Applied @resilient_http_call to get(), post()
lib/intune_client.py      Applied @resilient_http_call to get()
lib/lansweeper_client.py  Applied @resilient_http_call to gql()
lib/fedex_client.py       Applied @resilient_http_call to post()
src/shards/audit.py       Graceful degradation in audit_user_drift(), audit_device_drift(); new check_system_health() tool
tests/test_resilience.py  NEW: 12 comprehensive unit tests

Decorators

@resilient_http_call

Applies retry logic and circuit breaker to async HTTP functions:

from resilience import resilient_http_call

@resilient_http_call(service_name="Workday", max_attempts=3)
async def get(self, path: str) -> dict:
    resp = await self._http.get(path)
    resp.raise_for_status()
    return resp.json()

Parameters:

  • service_name (str): Service identifier for logging and circuit breaker tracking
  • max_attempts (int): Maximum retry attempts (default: 3)
  • enable_circuit_breaker (bool): Whether to use circuit breaker (default: True)

@handle_404_gracefully

Converts 404 errors to None instead of raising:

from resilience import handle_404_gracefully

@handle_404_gracefully
@resilient_http_call(service_name="Entra")
async def get_user(self, user_id: str) -> dict | None:
    resp = await self._http.get(f"/users/{user_id}")
    resp.raise_for_status()
    return resp.json()

# `client` is an instance of the class that defines get_user
result = await client.get_user("nonexistent-id")  # Returns None instead of raising
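
A decorator with this behaviour could look roughly like the sketch below. It is not the code from lib/resilience.py. HTTPStatusError here is a hypothetical stand-in for the HTTP client's own status exception, used so the example runs without third-party packages.

```python
import asyncio
import functools


class HTTPStatusError(Exception):
    """Hypothetical stand-in for the HTTP client's status error."""

    def __init__(self, status_code):
        super().__init__(f"HTTP {status_code}")
        self.status_code = status_code


def handle_404_gracefully(func):
    """Return None when the wrapped coroutine raises a 404; re-raise anything else."""

    @functools.wraps(func)
    async def wrapper(*args, **kwargs):
        try:
            return await func(*args, **kwargs)
        except HTTPStatusError as exc:
            if exc.status_code == 404:
                return None  # "not found" is treated as a normal, empty result
            raise  # other status errors still propagate

    return wrapper


@handle_404_gracefully
async def get_user(user_id):
    """Toy lookup: only 'alice' exists."""
    if user_id != "alice":
        raise HTTPStatusError(404)
    return {"id": user_id}
```

Because a 404 is a 4xx error, the retry layer does not retry it, so placing this decorator outermost means a missing record resolves to None after a single request.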

Testing

Run All Tests

cd nexus-mcp
pytest tests/test_resilience.py -v

Expected Output:

tests/test_resilience.py::TestCircuitBreaker::test_circuit_closed_to_open_after_threshold_failures PASSED
tests/test_resilience.py::TestCircuitBreaker::test_circuit_half_open_to_closed_on_success PASSED
tests/test_resilience.py::TestCircuitBreaker::test_circuit_half_open_to_open_on_failure PASSED
tests/test_resilience.py::TestCircuitBreaker::test_circuit_resets_on_success PASSED
tests/test_resilience.py::TestResilientHttpCall::test_retries_on_timeout_exception PASSED
tests/test_resilience.py::TestResilientHttpCall::test_retries_on_5xx_errors PASSED
tests/test_resilience.py::TestResilientHttpCall::test_no_retry_on_4xx_errors PASSED
tests/test_resilience.py::TestResilientHttpCall::test_exhausts_retries_and_raises PASSED
tests/test_resilience.py::TestHandle404Gracefully::test_converts_404_to_none PASSED
tests/test_resilience.py::TestHandle404Gracefully::test_does_not_convert_other_errors PASSED
tests/test_resilience.py::TestHandle404Gracefully::test_returns_normal_result_on_success PASSED
tests/test_resilience.py::TestCircuitBreakerIntegration::test_circuit_breaker_opens_after_failures PASSED

======================== 12 passed in 12.40s ========================

Manual Testing

Test 1: Graceful Degradation

Setup:

  1. Edit .env — temporarily invalidate one credential (e.g., WORKDAY_CLIENT_ID=invalid)
  2. Ensure USE_MOCK=false (live mode)

Run:

python src/main.py
# In MCP client:
audit_user_drift(email="test@example.com")

Expected Result:

{
  "systems_available": ["ActiveDirectory", "Entra"],
  "systems_failed": ["Workday"],
  "discrepancy_count": 1
}

Verification:

  • No crash
  • Audit continues with available systems
  • Drift comparison runs for AD ↔ Entra

Test 2: Circuit Breaker

Setup:

  1. Simulate sustained Workday outage (disable service or firewall block)
  2. Credentials valid but service unreachable

Run:

python src/main.py
# In MCP client:
audit_bulk_user_drift(emails=["user1@example.com", "user2@example.com", ..., "user10@example.com"])

Expected Logs:

[audit_user_drift] Workday: Attempt 1/3 (retry on transient error)
[audit_user_drift] Workday: Attempt 2/3 (retry on transient error)
[audit_user_drift] Workday: Attempt 3/3 (retry on transient error)
[resilience] [Workday] Circuit CLOSED → OPEN (5 consecutive failures)
[audit_user_drift] Workday: CircuitBreakerOpenError (fast-fail)

Verification:

  • First 5 requests make 3 attempts each
  • Subsequent requests fail instantly (< 100ms)
  • Logs show circuit state transitions

Test 3: Retry on Transient Failure

Setup:

  1. Valid credentials
  2. Introduce 1-second network delay (via proxy or tc on Linux)

Run:

python src/main.py
# In MCP client:
audit_user_drift(email="test@example.com")

Expected Result:

  • Tool succeeds (after retries)
  • Response includes full drift data
  • Logs show "Retry attempt 1/3", "Retry attempt 2/3"

Test 4: Health Check

Run:

python src/main.py
# In MCP client:
check_system_health()

Expected Result:

{
  "summary": {
    "total_systems": 6,
    "available_systems": 6,
    "availability_percentage": 100
  },
  "systems": {
    "Workday": {"available": true, "response_time_ms": ...},
    ...
  }
}

Decision Logic:

  • If availability_percentage >= 80: Safe to run bulk audits
  • If availability_percentage < 80: Postpone or expect partial results
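
The decision logic can be wrapped in a small gate that a scheduler might call before a bulk audit. bulk_audit_decision is a hypothetical helper; it assumes the response shape shown for check_system_health() and uses the 80% threshold stated above.

```python
def bulk_audit_decision(health, threshold=80):
    """Decide whether to run a bulk audit and list the systems that are down."""
    pct = health["summary"]["availability_percentage"]
    down = sorted(
        name for name, status in health["systems"].items() if not status["available"]
    )
    return {"proceed": pct >= threshold, "unavailable": down}


health = {
    "summary": {"availability_percentage": 83},
    "systems": {
        "Workday": {"available": True, "response_time_ms": 245},
        "Lansweeper": {"available": False, "error": "TimeoutException..."},
    },
}
decision = bulk_audit_decision(health)
```

Returning the unavailable systems alongside the flag lets the caller log which parts of the audit will be partial when it proceeds anyway.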

Deployment

Prerequisites

# Navigate to nexus-mcp
cd nexus-mcp

# Install dependencies (including tenacity)
pip install -e .

Verify Installation

python -c "from resilience import resilient_http_call; print('✓ Installed')"

Run in Production

With credential-based authentication:

USE_MOCK=false python src/main.py

With mock data (testing):

USE_MOCK=true python src/main.py

Monitoring

Watch logs for:

  • [resilience] messages — retry events, circuit breaker state changes
  • CircuitBreakerOpenError — indicates sustained service outage
  • Retry counts — indicates transient network issues

Example Alert Rules:

  • If "CircuitBreakerOpenError" found in logs → Investigate service
  • If "Retry attempt 2/3" repeated > 10 times in 5 minutes → Network degradation
  • If "Circuit.*OPEN" → Service outage (escalate to on-call)
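
As a rough sketch, the rules above could be evaluated over a batch of log lines as follows. evaluate_alerts and the alert names are hypothetical, and the lines are assumed to come from a single 5-minute window already. The literal pattern "Circuit.*OPEN" would also match the recovery message "Circuit HALF_OPEN → CLOSED", so this sketch assumes a tighter pattern that only matches transitions into OPEN.

```python
import re

# Assumption: match only transitions *into* OPEN, e.g. "Circuit CLOSED → OPEN"
CIRCUIT_OPENED = re.compile(r"Circuit \w+ → OPEN\b")


def evaluate_alerts(log_lines, retry_limit=10):
    """Return the set of alert names triggered by one window of log lines."""
    alerts = set()
    second_retries = 0
    for line in log_lines:
        if "CircuitBreakerOpenError" in line:
            alerts.add("investigate-service")
        if CIRCUIT_OPENED.search(line):
            alerts.add("service-outage")
        if "Retry attempt 2/3" in line:
            second_retries += 1
    if second_retries > retry_limit:
        alerts.add("network-degradation")
    return alerts
```

In practice these rules would more likely live in the log platform's own alerting configuration; the function only makes the matching logic explicit.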

Troubleshooting

Symptom: "CircuitBreakerOpenError: Workday circuit breaker is OPEN"

Cause: 5 consecutive Workday failures reached the circuit breaker threshold.

Solution:

  1. Check Workday status (https://status.workday.com)
  2. Verify credentials in .env — test manually with curl or Postman
  3. Check network connectivity — can you reach api.myworkday.com?
  4. Wait 60 seconds for circuit to enter half-open state and test recovery
  5. Monitor logs for "Circuit HALF_OPEN → CLOSED" indicating recovery

Symptom: Audit returns empty systems_available list

Cause: All systems are down or credentials are invalid.

Solution:

  1. Run check_system_health() to identify which system is down
  2. For downed systems:
    • Check system status pages
    • Verify network connectivity
    • Wait for service to recover
  3. For credential issues:
    • Verify .env has valid credentials
    • Test credentials manually via API (e.g., curl for Workday OAuth)
    • Regenerate tokens/credentials if expired

Symptom: Slow response times even on successful requests

Observe: Use check_system_health() to identify slow systems.

Solution:

  • If response_time_ms > 5000: System is under load, expect slower audits
  • Network latency → Consider running audits during low-traffic windows
  • Consider increasing timeouts if system is reliably slow but functional

Symptom: Excessively verbose retry logs

Cause: Transient network issues causing multiple retries.

Solution:

  • Expected during network instability
  • Monitor for patterns (e.g., always fails at certain time)
  • Use check_system_health() to confirm system is reachable
  • If persistent, investigate network (firewall, ISP, proxy issues)

Configuration

Retry Policy

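Defaults (set in code, not yet in .env):

  • Max attempts: 3
  • Backoff: exponential (2s, 4s, 8s)

To Customize: Pass max_attempts where the decorator is applied (for example, in lib/workday_client.py). The backoff delays are defined in the retry decorator in lib/resilience.py.

@resilient_http_call(service_name="Workday", max_attempts=5)  # ← Change here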

Circuit Breaker Threshold

Currently Hard-Coded:

  • Failure threshold: 5 consecutive failures
  • Timeout before half-open: 60 seconds

To Customize: Edit lib/resilience.py:

breaker = CircuitBreaker("Workday", failure_threshold=10, timeout_seconds=120)

Future Enhancements

  1. Configurable Retry Policy — Move retry/backoff settings to .env or config file
  2. Metrics & Observability — Track retry counts, circuit breaker events in audit logs
  3. Token Expiration Handling — Cache token expiry times and refresh proactively (CRITICAL #2)
  4. PowerShell Command Injection Fix — Use parameterized queries to prevent AD injection attacks (CRITICAL #3)
  5. Database Fallback — Cache drift results locally for offline resilience
  6. Rate Limiting — Implement exponential backoff to respect API rate limits
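
As an illustration of enhancement 1, settings could be read from the environment with the current values as defaults. Nothing below exists in the codebase: load_resilience_settings and the NEXUS_* variable names are hypothetical.

```python
import os


def load_resilience_settings(env=os.environ):
    """Read retry and circuit breaker settings, falling back to today's defaults."""
    return {
        # Hypothetical variable names; defaults match the documented values
        "max_attempts": int(env.get("NEXUS_RETRY_MAX_ATTEMPTS", 3)),
        "failure_threshold": int(env.get("NEXUS_CB_FAILURE_THRESHOLD", 5)),
        "timeout_seconds": int(env.get("NEXUS_CB_TIMEOUT_SECONDS", 60)),
    }


# Passing a plain dict instead of os.environ keeps the function easy to test.
settings = load_resilience_settings({"NEXUS_RETRY_MAX_ATTEMPTS": "5"})
```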

References

  • Code Health Report: documentation/reports/code-health-report-2026-04-13.md
  • Tenacity Docs: https://tenacity.readthedocs.io/
  • Feature Branch: feat/add-enterprise-resilience
  • Commits:
    • 6337182 — Initial implementation
    • eb8b14b — Fix retry logic and datetime deprecation