diff --git a/nexus-mcp/RESILIENCE.md b/nexus-mcp/RESILIENCE.md
new file mode 100644
index 0000000..e7411b6
--- /dev/null
+++ b/nexus-mcp/RESILIENCE.md
@@ -0,0 +1,450 @@
+# Enterprise System Resilience Feature
+
+## Overview
+
+This document describes the enterprise system resilience feature that resolves **CRITICAL #1** from the code health report: "No Resilience When Enterprise Systems Fail."
+
+**Problem:** The HTTP clients raised on any API failure. If Workday went down during a weekly drift audit, the entire audit failed, even though AD and Entra data were still accessible.
+
+**Solution:** Automatic retry logic with exponential backoff, circuit breaker pattern, and graceful degradation allow drift audits to continue with partial data when some systems are unavailable.
+
+---
+
+## Features
+
+### 1. Automatic Retry Logic
+
+All HTTP clients automatically retry transient failures with exponential backoff:
+
+- **Max Attempts:** 3 (configurable)
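+- **Backoff Strategy:** Exponential delay that doubles after each failed attempt (2s → 4s → 8s). With the default of 3 attempts only the 2s and 4s waits occur; the 8s wait applies when `max_attempts` is 4 or more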
+- **Retries On:** 5xx errors, timeouts, connection errors
+- **Does NOT Retry:** 4xx errors (client errors like 404 are instant failures)
+
+**Example: Transient Failure**
+```
+Attempt 1: 503 Service Unavailable → wait 2s
+Attempt 2: 503 Service Unavailable → wait 4s
+Attempt 3: Data returned → success ✓
+```
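+
+The retry behavior above can be sketched in plain `asyncio`. This is a minimal illustration, not the code in `lib/resilience.py` (which builds on `tenacity`); `TransientError`, `retry_with_backoff`, and `demo` are hypothetical names introduced only for this example.
+
+```python
+import asyncio
+
+
+class TransientError(Exception):
+    """Hypothetical stand-in for a 5xx response, timeout, or connection error."""
+
+
+async def retry_with_backoff(call, max_attempts=3, base_delay=2.0, sleep=asyncio.sleep):
+    """Await call() up to max_attempts times, doubling the wait after each failure."""
+    for attempt in range(1, max_attempts + 1):
+        try:
+            return await call()
+        except TransientError:
+            if attempt == max_attempts:
+                raise  # retries exhausted: surface the last error
+            await sleep(base_delay * 2 ** (attempt - 1))  # 2s, then 4s
+
+
+def demo():
+    """Replay the example above: two 503s, then success. Sleeps are recorded, not waited."""
+    waits = []
+    responses = iter([503, 503, 200])
+
+    async def flaky_call():
+        if next(responses) >= 500:
+            raise TransientError("503 Service Unavailable")
+        return "data"
+
+    async def fake_sleep(seconds):
+        waits.append(seconds)
+
+    result = asyncio.run(retry_with_backoff(flaky_call, sleep=fake_sleep))
+    return result, waits
+```
+
+Calling `demo()` returns the payload together with the two recorded waits, which shows why a third delay never happens when `max_attempts` is 3.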
+
+### 2. Circuit Breaker Pattern
+
+Prevents hammering a failing service by "opening the circuit":
+
+- **Threshold:** 5 consecutive failures trigger the circuit to open
+- **Open State (60s):** Subsequent requests fail instantly with `CircuitBreakerOpenError`, so no time is spent waiting on timeouts
+- **Half-Open State (testing):** After the 60s timeout elapses, one test request is allowed through
+- **Closed State (recovery):** If the test request succeeds, the circuit closes and normal operation resumes; if it fails, the circuit reopens
+
+**Example: Sustained Failure**
+```
+Requests 1-5:  Each is attempted 3 times and fails (network errors); circuit opens after the 5th failure
+Request 6:     Circuit open, fails fast with CircuitBreakerOpenError (no retry)
+Request 7:     Circuit still open, fails fast (<100ms)
+After 60s:     Circuit half-open, test request sent
+Test success:  Circuit closes, normal retries resume
+```
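+
+The state transitions above can be sketched as a small state machine. The class name, `failure_threshold`, `timeout_seconds`, and `CircuitBreakerOpenError` follow this document; the methods `before_call`, `record_success`, `record_failure`, and the injectable `clock` are assumptions made for illustration and may not match `lib/resilience.py`.
+
+```python
+import time
+
+
+class CircuitBreakerOpenError(Exception):
+    """Raised when a call is rejected because the circuit is open."""
+
+
+class CircuitBreaker:
+    def __init__(self, name, failure_threshold=5, timeout_seconds=60, clock=time.monotonic):
+        self.name = name
+        self.failure_threshold = failure_threshold
+        self.timeout_seconds = timeout_seconds
+        self._clock = clock  # injectable so the sketch can be exercised without waiting
+        self.state = "CLOSED"
+        self._failures = 0
+        self._opened_at = 0.0
+
+    def before_call(self):
+        """Reject the call while OPEN; move to HALF_OPEN once the timeout has elapsed."""
+        if self.state == "OPEN":
+            if self._clock() - self._opened_at < self.timeout_seconds:
+                raise CircuitBreakerOpenError(f"{self.name} circuit breaker is OPEN")
+            self.state = "HALF_OPEN"  # simplification: concurrent test requests are not limited
+
+    def record_success(self):
+        self._failures = 0
+        self.state = "CLOSED"
+
+    def record_failure(self):
+        self._failures += 1
+        # A failed test request, or reaching the threshold, (re)opens the circuit.
+        if self.state == "HALF_OPEN" or self._failures >= self.failure_threshold:
+            self.state = "OPEN"
+            self._opened_at = self._clock()
+
+
+def demo():
+    """Walk through the sustained-failure example with a fake clock."""
+    now = [0.0]
+    breaker = CircuitBreaker("Workday", clock=lambda: now[0])
+    for _ in range(5):
+        breaker.before_call()
+        breaker.record_failure()
+    states = [breaker.state]
+    try:
+        breaker.before_call()
+    except CircuitBreakerOpenError:
+        states.append("FAST_FAIL")
+    now[0] = 61.0  # pretend the 60s timeout has passed
+    breaker.before_call()
+    states.append(breaker.state)
+    breaker.record_success()
+    states.append(breaker.state)
+    return states
+```
+
+Here each `record_failure()` stands for one request whose retries were exhausted, which is how the example above counts failures.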
+
+### 3. Graceful Degradation in Audit Tools
+
+Audit tools wrap each system call separately, so if one system fails, the audit continues with available systems:
+
+**audit_user_drift() Example:**
+```python
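+# Before: any failure crashed the entire audit.
+# After: each system call is wrapped separately.
+systems_available, systems_failed = [], []
+wd_data = None  # stays None if Workday is unreachable
+try: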
+    wd_data = await _get_wd().get("/staffing/v6/workers", ...)
+    systems_available.append("Workday")
+except Exception as e:
+    systems_failed.append("Workday")
+    logger.warning(f"Workday unavailable: {e}")
+
+# Continue with AD and Entra even if Workday failed...
+```
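+
+The same pattern generalizes to any number of systems. The sketch below is self-contained and uses stand-in coroutines rather than the real clients; `collect_from_systems` and `demo` are hypothetical names, not functions in `src/shards/audit.py`.
+
+```python
+import asyncio
+import logging
+
+logger = logging.getLogger("audit_sketch")
+
+
+async def collect_from_systems(fetchers):
+    """Call each fetcher separately so one failure does not abort the others.
+
+    fetchers maps a system name to a zero-argument coroutine function.
+    Returns (data, systems_available, systems_failed).
+    """
+    data, available, failed = {}, [], []
+    for name, fetch in fetchers.items():
+        try:
+            data[name] = await fetch()
+            available.append(name)
+        except Exception as exc:  # broad on purpose: any failure marks the system as failed
+            failed.append(name)
+            logger.warning("%s unavailable: %s", name, exc)
+    return data, available, failed
+
+
+def demo():
+    """Simulate a Workday outage while AD and Entra respond normally."""
+
+    async def workday():
+        raise ConnectionError("Workday is down")
+
+    async def active_directory():
+        return {"job_title": "Senior Engineer"}
+
+    async def entra():
+        return {"job_title": "Engineer"}
+
+    fetchers = {"Workday": workday, "ActiveDirectory": active_directory, "Entra": entra}
+    return asyncio.run(collect_from_systems(fetchers))
+```
+
+The drift comparison can then run over whatever ended up in `data`, and the two lists map directly onto `systems_available` and `systems_failed` in the response.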
+
+**Response Example:**
+```json
+{
+  "email": "john.doe@wheels.com",
+  "systems_checked": ["Workday", "ActiveDirectory", "Entra"],
+  "systems_available": ["ActiveDirectory", "Entra"],
+  "systems_failed": ["Workday"],
+  "workday_found": false,
+  "ad_found": true,
+  "entra_found": true,
+  "discrepancy_count": 1,
+  "discrepancies": [
+    {
+      "field": "job_title",
+      "system_a": "ActiveDirectory",
+      "value_a": "Senior Engineer",
+      "system_b": "Entra",
+      "value_b": "Engineer",
+      "severity": "medium"
+    }
+  ]
+}
+```
+
+### 4. Proactive Health Monitoring
+
+**New Tool: `check_system_health()`**
+
+Pings all enterprise systems and returns availability + response times:
+
+```json
+{
+  "timestamp": "2026-04-13T14:30:00Z",
+  "systems": {
+    "Workday": {"available": true, "response_time_ms": 245},
+    "ActiveDirectory": {"available": true, "response_time_ms": 150},
+    "Entra": {"available": true, "response_time_ms": 320},
+    "Lansweeper": {"available": false, "error": "TimeoutException..."},
+    "Intune": {"available": true, "response_time_ms": 280},
+    "Helix": {"available": true, "response_time_ms": 410}
+  },
+  "summary": {
+    "total_systems": 6,
+    "available_systems": 5,
+    "unavailable_systems": 1,
+    "availability_percentage": 83
+  }
+}
+```
+
+**Use Case:** Run this before bulk audits to decide whether to proceed or wait.
+
+---
+
+## Implementation Details
+
+### Modified Files
+
+| File | Change |
+|------|--------|
+| `pyproject.toml` | Added `tenacity>=8.2.0` dependency |
+| `lib/resilience.py` | **NEW** — Retry decorator, circuit breaker, 404 handler |
+| `lib/workday_client.py` | Applied `@resilient_http_call` to `get()`, `raas()` |
+| `lib/entra_client.py` | Applied `@resilient_http_call` to `get()`, `get_all_pages()` |
+| `lib/helix_client.py` | Applied `@resilient_http_call` to `get()`, `post()` |
+| `lib/intune_client.py` | Applied `@resilient_http_call` to `get()` |
+| `lib/lansweeper_client.py` | Applied `@resilient_http_call` to `gql()` |
+| `lib/fedex_client.py` | Applied `@resilient_http_call` to `post()` |
+| `src/shards/audit.py` | Graceful degradation in `audit_user_drift()`, `audit_device_drift()`, new `check_system_health()` tool |
+| `tests/test_resilience.py` | **NEW** — 12 comprehensive unit tests |
+
+### Decorators
+
+#### @resilient_http_call
+
+Applies retry logic and circuit breaker to async HTTP functions:
+
+```python
+from resilience import resilient_http_call
+
+@resilient_http_call(service_name="Workday", max_attempts=3)
+async def get(self, path: str) -> dict:
+    resp = await self._http.get(path)
+    resp.raise_for_status()
+    return resp.json()
+```
+
+**Parameters:**
+- `service_name` (str): Service identifier for logging and circuit breaker tracking
+- `max_attempts` (int): Maximum retry attempts (default: 3)
+- `enable_circuit_breaker` (bool): Whether to use circuit breaker (default: True)
+
+#### @handle_404_gracefully
+
+Converts 404 errors to `None` instead of raising:
+
+```python
+from resilience import handle_404_gracefully
+
+@handle_404_gracefully
+@resilient_http_call(service_name="Entra")
+async def get_user(self, user_id: str) -> dict | None:
+    resp = await self._http.get(f"/users/{user_id}")
+    resp.raise_for_status()
+    return resp.json()
+
+# `client` is an instance of the class that defines get_user()
+result = await client.get_user("nonexistent-id")  # Returns None instead of raising
+```
+
+---
+
+## Testing
+
+### Run All Tests
+
+```bash
+cd nexus-mcp
+pytest tests/test_resilience.py -v
+```
+
+**Expected Output:**
+```
+tests/test_resilience.py::TestCircuitBreaker::test_circuit_closed_to_open_after_threshold_failures PASSED
+tests/test_resilience.py::TestCircuitBreaker::test_circuit_half_open_to_closed_on_success PASSED
+tests/test_resilience.py::TestCircuitBreaker::test_circuit_half_open_to_open_on_failure PASSED
+tests/test_resilience.py::TestCircuitBreaker::test_circuit_resets_on_success PASSED
+tests/test_resilience.py::TestResilientHttpCall::test_retries_on_timeout_exception PASSED
+tests/test_resilience.py::TestResilientHttpCall::test_retries_on_5xx_errors PASSED
+tests/test_resilience.py::TestResilientHttpCall::test_no_retry_on_4xx_errors PASSED
+tests/test_resilience.py::TestResilientHttpCall::test_exhausts_retries_and_raises PASSED
+tests/test_resilience.py::TestHandle404Gracefully::test_converts_404_to_none PASSED
+tests/test_resilience.py::TestHandle404Gracefully::test_does_not_convert_other_errors PASSED
+tests/test_resilience.py::TestHandle404Gracefully::test_returns_normal_result_on_success PASSED
+tests/test_resilience.py::TestCircuitBreakerIntegration::test_circuit_breaker_opens_after_failures PASSED
+
+======================== 12 passed in 12.40s ========================
+```
+
+### Manual Testing
+
+#### Test 1: Graceful Degradation
+
+**Setup:**
+1. Edit `.env` — temporarily invalidate one credential (e.g., `WORKDAY_CLIENT_ID=invalid`)
+2. Ensure `USE_MOCK=false` (live mode)
+
+**Run:**
+```bash
+python src/main.py
+# In MCP client:
+audit_user_drift(email="test@example.com")
+```
+
+**Expected Result:**
+```json
+{
+  "systems_available": ["ActiveDirectory", "Entra"],
+  "systems_failed": ["Workday"],
+  "discrepancy_count": 1
+}
+```
+
+**Verification:**
+- ✅ No crash
+- ✅ Audit continues with available systems
+- ✅ Drift comparison runs for AD ↔ Entra
+
+#### Test 2: Circuit Breaker
+
+**Setup:**
+1. Simulate sustained Workday outage (disable service or firewall block)
+2. Credentials valid but service unreachable
+
+**Run:**
+```bash
+python src/main.py
+# In MCP client:
+audit_bulk_user_drift(emails=["user1@example.com", "user2@example.com", ..., "user10@example.com"])
+```
+
+**Expected Logs:**
+```
+[audit_user_drift] Workday: Attempt 1/3 (retry on transient error)
+[audit_user_drift] Workday: Attempt 2/3 (retry on transient error)
+[audit_user_drift] Workday: Attempt 3/3 (retry on transient error)
+[resilience] [Workday] Circuit CLOSED → OPEN (5 consecutive failures)
+[audit_user_drift] Workday: CircuitBreakerOpenError (fast-fail)
+```
+
+**Verification:**
+- ✅ Each of the first 5 requests is attempted 3 times
+- ✅ Subsequent requests fail instantly (< 100ms)
+- ✅ Logs show circuit state transitions
+
+#### Test 3: Retry on Transient Failure
+
+**Setup:**
+1. Valid credentials
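+2. Introduce a network delay that exceeds the client timeout on some requests (for example a 1-second delay via a proxy or `tc` on Linux, if the timeout is shorter than that)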
+
+**Run:**
+```bash
+python src/main.py
+# In MCP client:
+audit_user_drift(email="test@example.com")
+```
+
+**Expected Result:**
+- ✅ Tool succeeds (after retries)
+- ✅ Response includes full drift data
+- ✅ Logs show "Retry attempt 1/3", "Retry attempt 2/3"
+
+#### Test 4: Health Check
+
+**Run:**
+```bash
+python src/main.py
+# In MCP client:
+check_system_health()
+```
+
+**Expected Result:**
+```json
+{
+  "summary": {
+    "total_systems": 6,
+    "available_systems": 6,
+    "availability_percentage": 100
+  },
+  "systems": {
+    "Workday": {"available": true, "response_time_ms": ...},
+    ...
+  }
+}
+```
+
+**Decision Logic:**
+- If `availability_percentage >= 80`: Safe to run bulk audits
+- If `availability_percentage < 80`: Postpone or expect partial results
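+
+The decision logic above can be expressed as a small helper that consumes the `systems` mapping returned by `check_system_health()`. This is a sketch under assumptions: `summarize_health` and `should_run_bulk_audit` are hypothetical names, and rounding to the nearest integer is a guess at how the tool computes `availability_percentage`.
+
+```python
+def summarize_health(systems):
+    """Build a summary dict from a {"name": {"available": bool, ...}} mapping."""
+    total = len(systems)
+    available = sum(1 for status in systems.values() if status.get("available"))
+    percentage = round(100 * available / total) if total else 0
+    return {
+        "total_systems": total,
+        "available_systems": available,
+        "unavailable_systems": total - available,
+        "availability_percentage": percentage,
+    }
+
+
+def should_run_bulk_audit(summary, threshold=80):
+    """Apply the rule above: proceed only at or above the threshold."""
+    return summary["availability_percentage"] >= threshold
+
+
+# Example input shaped like the health check response shown earlier.
+example_systems = {
+    "Workday": {"available": True, "response_time_ms": 245},
+    "ActiveDirectory": {"available": True, "response_time_ms": 150},
+    "Entra": {"available": True, "response_time_ms": 320},
+    "Lansweeper": {"available": False, "error": "TimeoutException..."},
+    "Intune": {"available": True, "response_time_ms": 280},
+    "Helix": {"available": True, "response_time_ms": 410},
+}
+example_summary = summarize_health(example_systems)
+```
+
+With one of six systems down, the summary reports 83% availability, which is above the 80% threshold, so a bulk audit would proceed with partial results for that system.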
+
+---
+
+## Deployment
+
+### Prerequisites
+
+```bash
+# Navigate to nexus-mcp
+cd nexus-mcp
+
+# Install dependencies (including tenacity)
+pip install -e .
+```
+
+### Verify Installation
+
+```bash
+python -c "from resilience import resilient_http_call; print('✓ Installed')"
+```
+
+### Run in Production
+
+**With credential-based authentication:**
+```bash
+USE_MOCK=false python src/main.py
+```
+
+**With mock data (testing):**
+```bash
+USE_MOCK=true python src/main.py
+```
+
+### Monitoring
+
+Watch logs for:
+- `[resilience]` messages — retry events, circuit breaker state changes
+- `CircuitBreakerOpenError` — indicates sustained service outage
+- Retry counts — indicates transient network issues
+
+**Example Alert Rules:**
+- If `CircuitBreakerOpenError` appears in logs → Investigate the affected service
+- If `"Retry attempt 2/3"` appears more than 10 times in 5 minutes → Network degradation
+- If a log line contains `→ OPEN` (this excludes the `HALF_OPEN → CLOSED` recovery message) → Service outage (escalate to on-call)
+
+---
+
+## Troubleshooting
+
+### Symptom: "CircuitBreakerOpenError: Workday circuit breaker is OPEN"
+
+**Cause:** 5 consecutive Workday failures (the default circuit breaker threshold).
+
+**Solution:**
+1. Check Workday status (https://status.workday.com)
+2. Verify credentials in `.env` — test manually with `curl` or Postman
+3. Check network connectivity — can you reach `api.myworkday.com`?
+4. Wait 60 seconds for circuit to enter half-open state and test recovery
+5. Monitor logs for `"Circuit HALF_OPEN → CLOSED"` indicating recovery
+
+### Symptom: Audit returns empty `systems_available` list
+
+**Cause:** All systems are down or credentials are invalid.
+
+**Solution:**
+1. Run `check_system_health()` to identify which systems are down
+2. For downed systems:
+   - Check system status pages
+   - Verify network connectivity
+   - Wait for service to recover
+3. For credential issues:
+   - Verify `.env` has valid credentials
+   - Test credentials manually via API (e.g., `curl` for Workday OAuth)
+   - Regenerate tokens/credentials if expired
+
+### Symptom: Slow response times even on successful requests
+
+**Observe:** Use `check_system_health()` to identify slow systems.
+
+**Solution:**
+- If `response_time_ms > 5000`: System is under load, expect slower audits
+- Network latency → Consider running audits during low-traffic windows
+- Consider increasing timeouts if system is reliably slow but functional
+
+### Symptom: Excessively verbose retry logs
+
+**Cause:** Transient network issues causing multiple retries.
+
+**Solution:**
+- Expected during network instability
+- Monitor for patterns (e.g., always fails at certain time)
+- Use `check_system_health()` to confirm system is reachable
+- If persistent, investigate network (firewall, ISP, proxy issues)
+
+---
+
+## Configuration
+
+### Retry Policy
+
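+**Current Defaults:**
+- Max attempts: 3 (the `max_attempts` argument of `@resilient_http_call`)
+- Backoff: exponential (2s, 4s, 8s), hard-coded in the retry decorator in [lib/resilience.py](lib/resilience.py)
+
+**To Customize:**
+Pass a different `max_attempts` where the decorator is applied in a client module (for example `lib/workday_client.py`):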
+
+```python
+@resilient_http_call(service_name="Workday", max_attempts=5)  # ← Change here
+```
+
+### Circuit Breaker Threshold
+
+**Currently Hard-Coded:**
+- Failure threshold: 5 consecutive failures
+- Timeout before half-open: 60 seconds
+
+**To Customize:**
+Edit [lib/resilience.py](lib/resilience.py):
+
+```python
+breaker = CircuitBreaker("Workday", failure_threshold=10, timeout_seconds=120)
+```
+
+---
+
+## Future Enhancements
+
+1. **Configurable Retry Policy** — Move retry/backoff settings to `.env` or config file
+2. **Metrics & Observability** — Track retry counts, circuit breaker events in audit logs
+3. **Token Expiration Handling** — Cache token expiry times and refresh proactively (CRITICAL #2)
+4. **PowerShell Command Injection Fix** — Use parameterized queries to prevent AD injection attacks (CRITICAL #3)
+5. **Database Fallback** — Cache drift results locally for offline resilience
+6. **Rate Limiting** — Respect API rate limits explicitly; 4xx responses (including `429 Too Many Requests`) are currently not retried
+
+---
+
+## References
+
+- **Code Health Report:** `documentation/reports/code-health-report-2026-04-13.md`
+- **Tenacity Docs:** https://tenacity.readthedocs.io/
+- **Feature Branch:** `feat/add-enterprise-resilience`
+- **Commits:**
+  - `6337182` — Initial implementation
+  - `eb8b14b` — Fix retry logic and datetime deprecation