diff --git a/.planning/REQUIREMENTS.md b/.planning/REQUIREMENTS.md
index 0706cbd..d14e04e 100644
--- a/.planning/REQUIREMENTS.md
+++ b/.planning/REQUIREMENTS.md
@@ -10,9 +10,9 @@ Requirements for initial release. Each maps to roadmap phases.
 ### Service Health
 
 - [ ] **HLTH-01**: Admin can view live health status (healthy/degraded/down) for Document AI, Claude/OpenAI, Supabase, and Firebase Auth
-- [ ] **HLTH-02**: Each health probe makes a real authenticated API call, not just config checks
+- [x] **HLTH-02**: Each health probe makes a real authenticated API call, not just config checks
 - [ ] **HLTH-03**: Health probes run on a scheduled interval, separate from document processing
-- [ ] **HLTH-04**: Health probe results persist to Supabase (survive cold starts)
+- [x] **HLTH-04**: Health probe results persist to Supabase (survive cold starts)
 
 ### Alerting
 
@@ -77,9 +77,9 @@ Which phases cover which requirements. Updated during roadmap creation.
 |-------------|-------|--------|
 | INFR-01 | Phase 1 | Complete |
 | INFR-04 | Phase 1 | Complete |
-| HLTH-02 | Phase 2 | Pending |
+| HLTH-02 | Phase 2 | Complete |
 | HLTH-03 | Phase 2 | Pending |
-| HLTH-04 | Phase 2 | Pending |
+| HLTH-04 | Phase 2 | Complete |
 | ALRT-01 | Phase 2 | Pending |
 | ALRT-02 | Phase 2 | Pending |
 | ALRT-04 | Phase 2 | Pending |
diff --git a/.planning/ROADMAP.md b/.planning/ROADMAP.md
index 474a121..3bf271e 100644
--- a/.planning/ROADMAP.md
+++ b/.planning/ROADMAP.md
@@ -45,7 +45,7 @@ Plans:
 4. Alert recipient is read from configuration (environment variable or Supabase config row), not hardcoded in source
 5. Analytics events fire as fire-and-forget calls — a deliberately introduced 500ms Supabase delay does not increase processing pipeline duration
 6. A scheduled probe function and a weekly retention cleanup function exist as separate Firebase Cloud Function exports
-**Plans:** 4 plans
+**Plans:** 2/4 plans executed
 
 Plans:
 - [ ] 02-01-PLAN.md — Analytics migration + analyticsService (fire-and-forget)
@@ -83,6 +83,6 @@ Phases execute in numeric order: 1 → 2 → 3 → 4
 | Phase | Plans Complete | Status | Completed |
 |-------|----------------|--------|-----------|
 | 1. Data Foundation | 2/2 | Complete | 2026-02-24 |
-| 2. Backend Services | 0/4 | Not started | - |
+| 2. Backend Services | 2/4 | In Progress| |
 | 3. API Layer | 0/TBD | Not started | - |
 | 4. Frontend | 0/TBD | Not started | - |
diff --git a/.planning/STATE.md b/.planning/STATE.md
index ed70bc5..136eea9 100644
--- a/.planning/STATE.md
+++ b/.planning/STATE.md
@@ -5,33 +5,34 @@
 See: .planning/PROJECT.md (updated 2026-02-24)
 
 **Core value:** When something breaks — an API key expires, a service goes down, a credential needs reauthorization — the admin knows immediately and knows exactly what to fix.
-**Current focus:** Phase 1 — Data Foundation
+**Current focus:** Phase 2 — Backend Services
 
 ## Current Position
 
-Phase: 1 of 4 (Data Foundation)
+Phase: 2 of 4 (Backend Services)
 Plan: 2 of TBD in current phase
 Status: In progress
-Last activity: 2026-02-24 — Completed 01-02 (HealthCheckModel + AlertEventModel unit tests)
+Last activity: 2026-02-24 — Completed 02-02 (healthProbeService with 4 probers + 9 unit tests)
 
-Progress: [██░░░░░░░░] 20%
+Progress: [████░░░░░░] 40%
 
 ## Performance Metrics
 
 **Velocity:**
-- Total plans completed: 2
-- Average duration: ~17 min
-- Total execution time: ~0.57 hours
+- Total plans completed: 4
+- Average duration: ~18 min
+- Total execution time: ~1.2 hours
 
 **By Phase:**
 
 | Phase | Plans | Total | Avg/Plan |
 |-------|-------|-------|----------|
 | 01-data-foundation | 2 | ~34 min | ~17 min |
+| 02-backend-services | 2 | ~38 min | ~19 min |
 
 **Recent Trend:**
-- Last 5 plans: 01-01 (8 min), 01-02 (26 min)
-- Trend: —
+- Last 5 plans: 01-01 (8 min), 01-02 (26 min), 02-01 (20 min), 02-02 (18 min)
+- Trend: Stable ~18 min/plan
 
 *Updated after each plan completion*
 
@@ -53,6 +54,11 @@ Recent decisions affecting current work:
 - 01-02: Supabase mock uses chain.then (thenability) so both .single() and direct await patterns work from one mock
 - 01-02: makeSupabaseChain() factory per test — no shared mock state between tests
 - 01-02: vi.mock() factories must use only inline vi.fn() to avoid Vitest hoisting TDZ errors
+- 02-02: LLM probe uses claude-haiku-4-5 with max_tokens 5 (cheapest model, prevents expensive accidental probes)
+- 02-02: Supabase probe uses getPostgresPool().query('SELECT 1') not PostgREST (tests actual DB connectivity)
+- 02-02: Firebase Auth probe: verifyIdToken always throws; 'INVALID'/'Decoding'/'argument' in message = SDK alive = healthy
+- 02-02: Promise.allSettled for probe orchestration — all 4 probes run even if one throws outside its own try/catch
+- 02-02: Per-probe HealthCheckModel.create failure swallowed with logger.error — probe results still returned to caller
 
 ### Pending Todos
 
@@ -67,5 +73,5 @@ None yet.
 ## Session Continuity
 
 Last session: 2026-02-24
-Stopped at: Completed 01-02-PLAN.md — HealthCheckModel + AlertEventModel unit tests (33 tests passing)
+Stopped at: Completed 02-02-PLAN.md — healthProbeService with 4 probers + 9 unit tests (nodemailer installed)
 Resume file: None
diff --git a/.planning/phases/02-backend-services/02-02-SUMMARY.md b/.planning/phases/02-backend-services/02-02-SUMMARY.md
new file mode 100644
index 0000000..0dd6b87
--- /dev/null
+++ b/.planning/phases/02-backend-services/02-02-SUMMARY.md
@@ -0,0 +1,122 @@
+---
+phase: 02-backend-services
+plan: 02
+subsystem: infra
+tags: [health-probes, document-ai, anthropic, firebase-auth, postgres, vitest, nodemailer]
+
+# Dependency graph
+requires:
+  - phase: 01-data-foundation
+    provides: HealthCheckModel.create() for persistence
+  - phase: 02-backend-services
+    plan: 01
+    provides: Schema and model layer for service_health_checks table
+
+provides:
+  - healthProbeService with 4 real API probers (document_ai, llm_api, supabase, firebase_auth)
+  - ProbeResult interface exported for use by health endpoint
+  - runAllProbes orchestrator with fault-tolerant probe isolation
+  - nodemailer installed (needed by Plan 03 alert notifications)
+
+affects: [02-backend-services, 02-03-PLAN]
+
+# Tech tracking
+tech-stack:
+  added: [nodemailer@8.0.1, @types/nodemailer]
+  patterns:
+    - Promise.allSettled for fault-tolerant concurrent probe orchestration
+    - firebase-admin verifyIdToken probe distinguishes expected vs unexpected errors
+    - Direct PostgreSQL pool (getPostgresPool) for Supabase probe, not PostgREST
+    - LLM probe uses cheapest model (claude-haiku-4-5) with max_tokens 5
+
+key-files:
+  created:
+    - backend/src/services/healthProbeService.ts
+    - backend/src/__tests__/unit/healthProbeService.test.ts
+  modified:
+    - backend/package.json (nodemailer + @types/nodemailer added)
+
+key-decisions:
+  - "LLM probe uses claude-haiku-4-5 with max_tokens 5 (cheapest available, prevents expensive accidental probes)"
+  - "Supabase probe uses getPostgresPool().query('SELECT 1') not PostgREST client (bypasses caching/middleware)"
+  - "Firebase Auth probe uses verifyIdToken('invalid-token') — always throws, distinguished by error message content"
+  - "Promise.allSettled chosen over Promise.all to guarantee all probes run even if one throws outside try/catch"
+  - "HealthCheckModel.create failure per probe is swallowed with logger.error — probe results still returned to caller"
+
+patterns-established:
+  - "Probe pattern: record start time, try real API call, compute latency, return ProbeResult with status/latency_ms/error_message"
+  - "Firebase SDK probe: verifyIdToken always throws; 'argument'/'INVALID'/'Decoding' in message = SDK alive = healthy"
+  - "429 rate limit errors = degraded (not down) — service is alive but throttling"
+  - "vi.mock with inline vi.fn() in factory — no outer variable references (Vitest hoisting TDZ safe)"
+
+requirements-completed: [HLTH-02, HLTH-04]
+
+# Metrics
+duration: 18min
+completed: 2026-02-24
+---
+
+# Phase 02 Plan 02: Health Probe Service Summary
+
+**Four real authenticated API probers (Document AI, LLM claude-haiku-4-5, Supabase pg pool, Firebase Auth) with fault-tolerant orchestrator and Supabase persistence via HealthCheckModel**
+
+## Performance
+
+- **Duration:** 18 min
+- **Started:** 2026-02-24T14:05:00Z
+- **Completed:** 2026-02-24T14:23:55Z
+- **Tasks:** 2
+- **Files modified:** 4
+
+## Accomplishments
+
+- Created `healthProbeService.ts` with 4 individual probers each making real authenticated API calls
+- Implemented `runAllProbes` orchestrator using `Promise.allSettled` for fault isolation (one probe failure never blocks others)
+- Each probe result persisted to Supabase via `HealthCheckModel.create()` after completion
+- 9 unit tests covering all probers, fault tolerance, 429 degraded handling, Supabase pool verification, and Firebase error discrimination
+- Installed nodemailer (needed by Plan 03 alert notifications) to avoid package.json conflicts in parallel execution
+
+## Task Commits
+
+Each task was committed atomically:
+
+1. **Task 1: Install nodemailer and create healthProbeService** - `4129826` (feat)
+2. **Task 2: Create healthProbeService unit tests** - `a8ba884` (test)
+
+**Plan metadata:** (docs commit — created below)
+
+## Files Created/Modified
+
+- `backend/src/services/healthProbeService.ts` - Health probe orchestrator with ProbeResult interface and 4 individual probers
+- `backend/src/__tests__/unit/healthProbeService.test.ts` - 9 unit tests covering all probers and orchestrator
+- `backend/package.json` - nodemailer + @types/nodemailer added
+
+## Decisions Made
+
+- LLM probe uses `claude-haiku-4-5` with `max_tokens: 5` — cheapest Anthropic model prevents accidental expensive probe calls
+- Supabase probe uses `getPostgresPool().query('SELECT 1')` — bypasses PostgREST middleware/caching, tests actual DB connectivity
+- Firebase Auth probe strategy: `verifyIdToken('invalid-token-probe-check')` always throws; error message containing 'argument', 'INVALID', or 'Decoding' = SDK functioning = 'healthy'
+- `Promise.allSettled` over `Promise.all` — guarantees all 4 probes run even if one rejects outside its own try/catch
+- Per-probe persistence failure is swallowed (logger.error only) so probe results are still returned to caller
+
+## Deviations from Plan
+
+None - plan executed exactly as written.
+
+## Issues Encountered
+
+None — all probes compiled and tested cleanly on first implementation.
+
+## User Setup Required
+
+None - no external service configuration required beyond what's already in .env.
+
+## Next Phase Readiness
+
+- `healthProbeService.runAllProbes()` is ready to be called by the health scheduler (Plan 03)
+- `nodemailer` is installed and ready for Plan 03 alert notification service
+- `ProbeResult` interface exported and ready for use in health status API endpoints
+
+---
+*Phase: 02-backend-services*
+*Completed: 2026-02-24*
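
Note on the status rules recorded in 02-02-SUMMARY.md: the summary says an expected `verifyIdToken` failure counts as healthy and a 429 counts as degraded. The sketch below shows one way that mapping could be written in TypeScript. It is a minimal illustration and not the contents of `healthProbeService.ts`: `classifyFirebaseProbeError`, `classifyApiProbeError`, and `EXPECTED_FIREBASE_MARKERS` are hypothetical names, and treating every non-429 failure as down is an assumption.

```typescript
// Hypothetical sketch: names are illustrative, not taken from healthProbeService.ts.
type ProbeStatus = 'healthy' | 'degraded' | 'down';

// Substrings the summary lists as evidence that the Firebase Admin SDK
// received and rejected the deliberately invalid token.
const EXPECTED_FIREBASE_MARKERS: string[] = ['argument', 'INVALID', 'Decoding'];

// True when the message contains at least one of the given markers.
function containsAny(message: string, markers: string[]): boolean {
  return markers.some((marker) => message.indexOf(marker) !== -1);
}

// Firebase Auth probe: verifyIdToken on an invalid token always throws,
// so the thrown message decides the status.
function classifyFirebaseProbeError(message: string): ProbeStatus {
  return containsAny(message, EXPECTED_FIREBASE_MARKERS) ? 'healthy' : 'down';
}

// Generic API probe failure: HTTP 429 means the service is alive but throttling.
// Assumption: any other failure (or a missing status code) is reported as down.
function classifyApiProbeError(httpStatus: number | undefined): ProbeStatus {
  return httpStatus === 429 ? 'degraded' : 'down';
}
```

Keeping the classification in pure functions like these would let the unit tests cover the status rules without mocking any SDK; only the probers that make the real calls would need mocks.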