docs(02-backend-services): create phase plan

admin
2026-02-24 14:14:54 -05:00
parent fcb3987c56
commit 73f8d8271e
5 changed files with 739 additions and 2 deletions


@@ -45,7 +45,13 @@ Plans:
4. Alert recipient is read from configuration (environment variable or Supabase config row), not hardcoded in source
5. Analytics events fire as fire-and-forget calls — a deliberately introduced 500ms Supabase delay does not increase processing pipeline duration
6. A scheduled probe function and a weekly retention cleanup function exist as separate Firebase Cloud Function exports
**Plans:** 4 plans
Plans:
- [ ] 02-01-PLAN.md — Analytics migration + analyticsService (fire-and-forget)
- [ ] 02-02-PLAN.md — Health probe service (4 real API probers + orchestrator)
- [ ] 02-03-PLAN.md — Alert service (deduplication + email via nodemailer)
- [ ] 02-04-PLAN.md — Cloud Function exports (runHealthProbes + runRetentionCleanup)
### Phase 3: API Layer
**Goal**: Admin-authenticated HTTP endpoints expose health status, alerts, and processing analytics; existing service processors emit analytics instrumentation
@@ -77,6 +83,6 @@ Phases execute in numeric order: 1 → 2 → 3 → 4
| Phase | Plans Complete | Status | Completed |
|-------|----------------|--------|-----------|
| 1. Data Foundation | 2/2 | Complete | 2026-02-24 |
| 2. Backend Services | 0/4 | Not started | - |
| 3. API Layer | 0/TBD | Not started | - |
| 4. Frontend | 0/TBD | Not started | - |


@@ -0,0 +1,176 @@
---
phase: 02-backend-services
plan: 01
type: execute
wave: 1
depends_on: []
files_modified:
- backend/src/models/migrations/013_create_processing_events_table.sql
- backend/src/services/analyticsService.ts
- backend/src/__tests__/unit/analyticsService.test.ts
autonomous: true
requirements: [ANLY-01, ANLY-03]
must_haves:
truths:
- "recordProcessingEvent() writes to document_processing_events table via Supabase"
- "recordProcessingEvent() returns void (not Promise) so callers cannot accidentally await it"
- "A deliberate Supabase write failure logs an error but does not throw or reject"
- "deleteProcessingEventsOlderThan(30) removes rows older than 30 days"
artifacts:
- path: "backend/src/models/migrations/013_create_processing_events_table.sql"
provides: "document_processing_events table DDL with indexes and RLS"
contains: "CREATE TABLE IF NOT EXISTS document_processing_events"
- path: "backend/src/services/analyticsService.ts"
provides: "Fire-and-forget analytics event writer and retention delete"
exports: ["recordProcessingEvent", "deleteProcessingEventsOlderThan"]
- path: "backend/src/__tests__/unit/analyticsService.test.ts"
provides: "Unit tests for analyticsService"
min_lines: 50
key_links:
- from: "backend/src/services/analyticsService.ts"
to: "backend/src/config/supabase.ts"
via: "getSupabaseServiceClient() call"
pattern: "getSupabaseServiceClient"
- from: "backend/src/services/analyticsService.ts"
to: "document_processing_events table"
via: "void supabase.from('document_processing_events').insert(...)"
pattern: "void.*from\\('document_processing_events'\\)"
---
<objective>
Create the analytics migration and fire-and-forget analytics service for persisting document processing events to Supabase.
Purpose: ANLY-01 requires processing events to persist (not in-memory), and ANLY-03 requires instrumentation to be non-blocking. This plan creates the database table and the service that writes to it without blocking the processing pipeline.
Output: Migration 013 SQL file, analyticsService.ts with recordProcessingEvent() and deleteProcessingEventsOlderThan(), and unit tests.
</objective>
<execution_context>
@/home/jonathan/.claude/get-shit-done/workflows/execute-plan.md
@/home/jonathan/.claude/get-shit-done/templates/summary.md
</execution_context>
<context>
@.planning/PROJECT.md
@.planning/ROADMAP.md
@.planning/STATE.md
@.planning/phases/02-backend-services/02-RESEARCH.md
@.planning/phases/01-data-foundation/01-01-SUMMARY.md
@.planning/phases/01-data-foundation/01-02-SUMMARY.md
@backend/src/models/migrations/012_create_monitoring_tables.sql
@backend/src/config/supabase.ts
@backend/src/utils/logger.ts
</context>
<tasks>
<task type="auto">
<name>Task 1: Create analytics migration and analyticsService</name>
<files>
backend/src/models/migrations/013_create_processing_events_table.sql
backend/src/services/analyticsService.ts
</files>
<action>
**Migration 013:** Create `backend/src/models/migrations/013_create_processing_events_table.sql` following the exact pattern from migration 012. The table:
```sql
CREATE TABLE IF NOT EXISTS document_processing_events (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  document_id UUID NOT NULL,
  user_id UUID NOT NULL,
  event_type TEXT NOT NULL CHECK (event_type IN ('upload_started', 'processing_started', 'completed', 'failed')),
  duration_ms INTEGER,
  error_message TEXT,
  stage TEXT,
  created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP
);

CREATE INDEX IF NOT EXISTS idx_document_processing_events_created_at
  ON document_processing_events(created_at);
CREATE INDEX IF NOT EXISTS idx_document_processing_events_document_id
  ON document_processing_events(document_id);

ALTER TABLE document_processing_events ENABLE ROW LEVEL SECURITY;
```
**analyticsService.ts:** Create `backend/src/services/analyticsService.ts` with two exports:
1. `recordProcessingEvent(data: ProcessingEventData): void` — Return type MUST be `void` (not `Promise<void>`) to prevent accidental `await`. Inside, call `getSupabaseServiceClient()` (per-method, not module level), then `void supabase.from('document_processing_events').insert({...}).then(({ error }) => { if (error) logger.error(...) })`. Never throw, never reject.
2. `deleteProcessingEventsOlderThan(days: number): Promise<number>` — Compute cutoff date in JS (`new Date(Date.now() - days * 86400000).toISOString()`), then delete with `.lt('created_at', cutoff)`. Return the count of deleted rows. This follows the same pattern as `HealthCheckModel.deleteOlderThan()`.
Export the `ProcessingEventData` interface:
```typescript
export interface ProcessingEventData {
  document_id: string;
  user_id: string;
  event_type: 'upload_started' | 'processing_started' | 'completed' | 'failed';
  duration_ms?: number;
  error_message?: string;
  stage?: string;
}
```
Use Winston logger (`import { logger } from '../utils/logger'`). Use `getSupabaseServiceClient` from `'../config/supabase'`. Follow project naming conventions (camelCase file, named exports).
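The fire-and-forget shape described above can be sketched as follows. This is a minimal illustration, not the final service: the Supabase client is passed in as a parameter purely so the sketch is self-contained, whereas the real service obtains it via `getSupabaseServiceClient()` inside the function and logs through Winston rather than `console`.

```typescript
// Sketch only: SupabaseLike is a stand-in for the real Supabase client type.
interface InsertResult { error: { message: string } | null }
interface SupabaseLike {
  from(table: string): { insert(row: Record<string, unknown>): Promise<InsertResult> };
}

interface ProcessingEventData {
  document_id: string;
  user_id: string;
  event_type: 'upload_started' | 'processing_started' | 'completed' | 'failed';
  duration_ms?: number;
  error_message?: string;
  stage?: string;
}

function recordProcessingEvent(client: SupabaseLike, data: ProcessingEventData): void {
  // `void` discards the promise: the declared return type stays `void`,
  // so the processing pipeline never waits on the insert.
  void client
    .from('document_processing_events')
    .insert({ ...data, created_at: new Date().toISOString() })
    .then(({ error }) => {
      // A failed write is logged and swallowed; it never throws or rejects.
      if (error) console.error('analytics insert failed:', error.message);
    })
    .catch((err) => console.error('analytics insert rejected:', err));
}
```

Because the promise is discarded rather than returned, even a client whose insert rejects leaves the caller's control flow untouched.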
</action>
<verify>
<automated>cd /home/jonathan/Coding/cim_summary/backend && npx tsc --noEmit --pretty 2>&1 | head -30</automated>
<manual>Verify 013 migration file exists and analyticsService exports recordProcessingEvent and deleteProcessingEventsOlderThan</manual>
</verify>
<done>Migration 013 creates document_processing_events table with indexes and RLS. analyticsService.ts exports recordProcessingEvent (void return) and deleteProcessingEventsOlderThan (Promise&lt;number&gt;). TypeScript compiles.</done>
</task>
<task type="auto">
<name>Task 2: Create analyticsService unit tests</name>
<files>
backend/src/__tests__/unit/analyticsService.test.ts
</files>
<action>
Create `backend/src/__tests__/unit/analyticsService.test.ts` using the Vitest + Supabase mock pattern established in Phase 1 (01-02-SUMMARY.md).
Mock setup:
- `vi.mock('../../config/supabase')` with inline `vi.fn()` factory
- `vi.mock('../../utils/logger')` with inline `vi.fn()` factory
- Use `vi.mocked()` after import for typed access
- `makeSupabaseChain()` helper per test (fresh mock state)
Test cases for `recordProcessingEvent`:
1. **Calls Supabase insert with correct data** — verify `.from('document_processing_events').insert(...)` called with expected fields including `created_at`
2. **Return type is void (not a Promise)** — call `recordProcessingEvent(data)` and verify the return value is `undefined` (void), not a thenable
3. **Logs error on Supabase failure but does not throw** — mock the `.then` callback with `{ error: { message: 'test error' } }`, verify `logger.error` was called
4. **Handles optional fields (duration_ms, error_message, stage) as null** — pass data without optional fields, verify insert called with `null` for those columns
Test cases for `deleteProcessingEventsOlderThan`:
5. **Computes correct cutoff date and deletes** — mock Supabase delete chain, verify `.lt('created_at', ...)` called with ISO date string ~30 days ago
6. **Returns count of deleted rows** — mock response with `data: [{}, {}, {}]` (3 rows), verify returns 3
Use `beforeEach(() => vi.clearAllMocks())` for test isolation.
</action>
<verify>
<automated>cd /home/jonathan/Coding/cim_summary/backend && npx vitest run src/__tests__/unit/analyticsService.test.ts --reporter=verbose 2>&1</automated>
</verify>
<done>All analyticsService tests pass. recordProcessingEvent verified as fire-and-forget (void return, error-swallowing). deleteProcessingEventsOlderThan verified with correct date math and row count return.</done>
</task>
</tasks>
<verification>
1. `npx tsc --noEmit` passes with no errors from new files
2. `npx vitest run src/__tests__/unit/analyticsService.test.ts` — all tests pass
3. Migration 013 SQL is valid and follows 012 pattern
4. `recordProcessingEvent` return type is `void` (not `Promise<void>`)
</verification>
<success_criteria>
- Migration 013 creates document_processing_events table with id, document_id, user_id, event_type (CHECK constraint), duration_ms, error_message, stage, created_at
- Indexes on created_at and document_id exist
- RLS enabled on the table
- analyticsService.recordProcessingEvent() is fire-and-forget (void return, no throw)
- analyticsService.deleteProcessingEventsOlderThan() returns deleted row count
- All unit tests pass
</success_criteria>
<output>
After completion, create `.planning/phases/02-backend-services/02-01-SUMMARY.md`
</output>


@@ -0,0 +1,176 @@
---
phase: 02-backend-services
plan: 02
type: execute
wave: 1
depends_on: []
files_modified:
- backend/package.json
- backend/src/services/healthProbeService.ts
- backend/src/__tests__/unit/healthProbeService.test.ts
autonomous: true
requirements: [HLTH-02, HLTH-04]
must_haves:
truths:
- "Each probe makes a real authenticated API call (Document AI list processors, Anthropic minimal message, Supabase SELECT 1 via pg pool, Firebase Auth verifyIdToken)"
- "Each probe returns a structured ProbeResult with service_name, status, latency_ms, and optional error_message"
- "Probe results are persisted to Supabase via HealthCheckModel.create()"
- "A single probe failure does not prevent other probes from running"
- "LLM probe uses cheapest model (claude-haiku-4-5) with max_tokens 5"
- "Supabase probe uses getPostgresPool().query('SELECT 1'), not PostgREST client"
artifacts:
- path: "backend/src/services/healthProbeService.ts"
provides: "Health probe orchestrator with 4 individual probers"
exports: ["healthProbeService", "ProbeResult"]
- path: "backend/src/__tests__/unit/healthProbeService.test.ts"
provides: "Unit tests for all probes and orchestrator"
min_lines: 80
key_links:
- from: "backend/src/services/healthProbeService.ts"
to: "backend/src/models/HealthCheckModel.ts"
via: "HealthCheckModel.create() for persistence"
pattern: "HealthCheckModel\\.create"
- from: "backend/src/services/healthProbeService.ts"
to: "backend/src/config/supabase.ts"
via: "getPostgresPool() for Supabase probe"
pattern: "getPostgresPool"
---
<objective>
Create the health probe service with four real API probers (Document AI, LLM, Supabase, Firebase Auth) and an orchestrator that runs all probes and persists results.
Purpose: HLTH-02 requires real authenticated API calls (not config checks), and HLTH-04 requires results to persist to Supabase. This plan builds the probe logic and persistence layer.
Output: healthProbeService.ts with 4 probers + runAllProbes orchestrator, and unit tests. Also installs nodemailer (needed by Plan 03).
</objective>
<execution_context>
@/home/jonathan/.claude/get-shit-done/workflows/execute-plan.md
@/home/jonathan/.claude/get-shit-done/templates/summary.md
</execution_context>
<context>
@.planning/PROJECT.md
@.planning/ROADMAP.md
@.planning/STATE.md
@.planning/phases/02-backend-services/02-RESEARCH.md
@.planning/phases/01-data-foundation/01-01-SUMMARY.md
@backend/src/models/HealthCheckModel.ts
@backend/src/config/supabase.ts
@backend/src/services/documentAiProcessor.ts
@backend/src/services/llmService.ts
@backend/src/config/firebase.ts
</context>
<tasks>
<task type="auto">
<name>Task 1: Install nodemailer and create healthProbeService</name>
<files>
backend/package.json
backend/src/services/healthProbeService.ts
</files>
<action>
**Step 1: Install nodemailer** (needed by Plan 03, installing now to avoid package.json conflicts in parallel execution):
```bash
cd backend && npm install nodemailer && npm install --save-dev @types/nodemailer
```
**Step 2: Create healthProbeService.ts** with the following structure:
Export a `ProbeResult` interface:
```typescript
export interface ProbeResult {
  service_name: string;
  status: 'healthy' | 'degraded' | 'down';
  latency_ms: number;
  error_message?: string;
  probe_details?: Record<string, unknown>;
}
```
Create 4 individual probe functions (all private/unexported):
1. **probeDocumentAI()**: Import `DocumentProcessorServiceClient` from `@google-cloud/documentai`. Call `client.listProcessors({ parent: ... })` using the project ID from config. Latency > 2000ms = 'degraded'. Catch errors = 'down' with error_message.
2. **probeLLM()**: Import `Anthropic` from `@anthropic-ai/sdk`. Create client with `process.env.ANTHROPIC_API_KEY`. Call `client.messages.create({ model: 'claude-haiku-4-5', max_tokens: 5, messages: [{ role: 'user', content: 'Hi' }] })`. Use cheapest model (PITFALL B prevention). Latency > 5000ms = 'degraded'. 429 errors = 'degraded' (rate limit, not down). Other errors = 'down'.
3. **probeSupabase()**: Import `getPostgresPool` from `'../config/supabase'`. Call `pool.query('SELECT 1')`. Use direct PostgreSQL, NOT PostgREST (PITFALL C prevention). Latency > 2000ms = 'degraded'. Errors = 'down'.
4. **probeFirebaseAuth()**: Import `admin` from `firebase-admin` (or use the existing firebase config). Call `admin.auth().verifyIdToken('invalid-token-probe-check')`. This ALWAYS throws. If error message contains 'argument' or 'INVALID' = 'healthy' (SDK is alive). Other errors = 'down'.
Create `runAllProbes()` as the orchestrator:
- Wrap each probe in individual try/catch (PITFALL E: one probe failure must not stop others)
- For each ProbeResult, call `HealthCheckModel.create({ service_name, status, latency_ms, error_message, probe_details, checked_at: new Date().toISOString() })`
- Return array of all ProbeResults
- Log summary via Winston logger
Export as object: `export const healthProbeService = { runAllProbes }`.
Use Winston logger for all logging. Use `getSupabaseServiceClient()` per-method pattern for any Supabase calls (though probes use `getPostgresPool()` directly for the Supabase probe).
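The latency classification and fault isolation described above can be sketched like this. The probe callbacks here are placeholders (real probers would wrap the Document AI, Anthropic, pg pool, and Firebase Admin calls), and `degradedAboveMs` is an illustrative name for each probe's latency threshold:

```typescript
interface ProbeResult {
  service_name: string;
  status: 'healthy' | 'degraded' | 'down';
  latency_ms: number;
  error_message?: string;
}

// A probe is any async call plus its degraded-latency threshold.
interface Probe { name: string; run: () => Promise<void>; degradedAboveMs: number }

async function runProbe(p: Probe): Promise<ProbeResult> {
  const start = Date.now();
  try {
    await p.run();
    const latency_ms = Date.now() - start;
    return {
      service_name: p.name,
      status: latency_ms > p.degradedAboveMs ? 'degraded' : 'healthy',
      latency_ms,
    };
  } catch (err) {
    // PITFALL E: a throwing probe becomes a 'down' result instead of
    // propagating, so sibling probes are unaffected.
    return {
      service_name: p.name,
      status: 'down',
      latency_ms: Date.now() - start,
      error_message: err instanceof Error ? err.message : String(err),
    };
  }
}

async function runAllProbes(probes: Probe[]): Promise<ProbeResult[]> {
  // Each probe carries its own try/catch, so Promise.all cannot reject here.
  return Promise.all(probes.map(runProbe));
}
```

In the real service each result would also be passed to `HealthCheckModel.create()` before the array is returned.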
</action>
<verify>
<automated>cd /home/jonathan/Coding/cim_summary/backend && npx tsc --noEmit --pretty 2>&1 | head -30</automated>
<manual>Verify healthProbeService.ts exists with runAllProbes and ProbeResult exports</manual>
</verify>
<done>nodemailer installed. healthProbeService.ts exports ProbeResult interface and healthProbeService object with runAllProbes(). Four probes make real API calls. Each probe wrapped in try/catch. Results persisted via HealthCheckModel.create(). TypeScript compiles.</done>
</task>
<task type="auto">
<name>Task 2: Create healthProbeService unit tests</name>
<files>
backend/src/__tests__/unit/healthProbeService.test.ts
</files>
<action>
Create `backend/src/__tests__/unit/healthProbeService.test.ts` using the established Vitest mock pattern.
Mock all external dependencies:
- `vi.mock('../../models/HealthCheckModel')` — mock `create()` to resolve successfully
- `vi.mock('../../config/supabase')` — mock `getPostgresPool()` returning `{ query: vi.fn() }`
- `vi.mock('@google-cloud/documentai')` — mock `DocumentProcessorServiceClient` with `listProcessors` resolving
- `vi.mock('@anthropic-ai/sdk')` — mock `Anthropic` constructor, `messages.create` resolving
- `vi.mock('firebase-admin')` — mock `auth().verifyIdToken()` throwing expected error
- `vi.mock('../../utils/logger')` — mock logger
Test cases for `runAllProbes`:
1. **All probes healthy — returns 4 ProbeResults with status 'healthy'** — all mocks resolve quickly, verify 4 results returned with status 'healthy'
2. **Each result persisted via HealthCheckModel.create** — verify `HealthCheckModel.create` called 4 times with correct service_name values: 'document_ai', 'llm_api', 'supabase', 'firebase_auth'
3. **One probe throws — others still run** — make Document AI mock throw, verify 3 other probes still complete and all 4 HealthCheckModel.create calls happen (the failed probe creates a 'down' result)
4. **LLM probe 429 error returns 'degraded' not 'down'** — make Anthropic mock throw error with '429' in message, verify result status is 'degraded'
5. **Supabase probe uses getPostgresPool not getSupabaseServiceClient** — verify `getPostgresPool` was called (not getSupabaseServiceClient) during Supabase probe
6. **Firebase Auth probe — expected error = healthy** — mock verifyIdToken throwing 'Decoding Firebase ID token failed' (argument error), verify status is 'healthy'
7. **Firebase Auth probe — unexpected error = down** — mock verifyIdToken throwing network error, verify status is 'down'
8. **Latency measured correctly** — use `vi.useFakeTimers()` or verify `latency_ms` is a non-negative number
Use `beforeEach(() => vi.clearAllMocks())`.
</action>
<verify>
<automated>cd /home/jonathan/Coding/cim_summary/backend && npx vitest run src/__tests__/unit/healthProbeService.test.ts --reporter=verbose 2>&1</automated>
</verify>
<done>All healthProbeService tests pass. Probes verified as making real API calls (mocked). Orchestrator verified as fault-tolerant (one probe failure doesn't stop others). Results verified as persisted via HealthCheckModel.create(). Supabase probe uses getPostgresPool, not PostgREST.</done>
</task>
</tasks>
<verification>
1. `npm ls nodemailer` shows nodemailer installed
2. `npx tsc --noEmit` passes
3. `npx vitest run src/__tests__/unit/healthProbeService.test.ts` — all tests pass
4. healthProbeService.ts does NOT use getSupabaseServiceClient for the Supabase probe (uses getPostgresPool)
5. LLM probe uses 'claude-haiku-4-5' not an expensive model
</verification>
<success_criteria>
- nodemailer and @types/nodemailer installed in backend/package.json
- healthProbeService exports ProbeResult and healthProbeService.runAllProbes
- 4 probes: document_ai, llm_api, supabase, firebase_auth
- Each probe returns structured ProbeResult with status/latency_ms/error_message
- Probe results persisted via HealthCheckModel.create()
- Individual probe failures isolated (other probes still run)
- All unit tests pass
</success_criteria>
<output>
After completion, create `.planning/phases/02-backend-services/02-02-SUMMARY.md`
</output>


@@ -0,0 +1,182 @@
---
phase: 02-backend-services
plan: 03
type: execute
wave: 2
depends_on: [02-02]
files_modified:
- backend/src/services/alertService.ts
- backend/src/__tests__/unit/alertService.test.ts
autonomous: true
requirements: [ALRT-01, ALRT-02, ALRT-04]
must_haves:
truths:
- "An alert email is sent when a probe returns 'degraded' or 'down'"
- "A second probe failure within the cooldown period does NOT send a duplicate email"
- "Alert recipient is read from process.env.EMAIL_WEEKLY_RECIPIENT, never hardcoded"
- "Email failure does not throw or break the probe pipeline"
- "Nodemailer transporter is created inside the function call, not at module level (Firebase Secret timing)"
- "An alert_events row is created before sending the email"
artifacts:
- path: "backend/src/services/alertService.ts"
provides: "Alert deduplication, email sending, and alert event creation"
exports: ["alertService"]
- path: "backend/src/__tests__/unit/alertService.test.ts"
provides: "Unit tests for alert deduplication, email sending, recipient config"
min_lines: 80
key_links:
- from: "backend/src/services/alertService.ts"
to: "backend/src/models/AlertEventModel.ts"
via: "findRecentByService() for deduplication, create() for alert row"
pattern: "AlertEventModel\\.(findRecentByService|create)"
- from: "backend/src/services/alertService.ts"
to: "nodemailer"
via: "createTransport + sendMail for email delivery"
pattern: "nodemailer\\.createTransport"
- from: "backend/src/services/alertService.ts"
to: "process.env.EMAIL_WEEKLY_RECIPIENT"
via: "Config-based recipient (ALRT-04)"
pattern: "process\\.env\\.EMAIL_WEEKLY_RECIPIENT"
---
<objective>
Create the alert service with deduplication logic, SMTP email sending via nodemailer, and config-based recipient.
Purpose: ALRT-01 requires email alerts on service degradation/failure. ALRT-02 requires deduplication with cooldown. ALRT-04 requires the recipient to come from configuration, not hardcoded source code.
Output: alertService.ts with evaluateAndAlert() and sendAlertEmail(), and unit tests.
</objective>
<execution_context>
@/home/jonathan/.claude/get-shit-done/workflows/execute-plan.md
@/home/jonathan/.claude/get-shit-done/templates/summary.md
</execution_context>
<context>
@.planning/PROJECT.md
@.planning/ROADMAP.md
@.planning/STATE.md
@.planning/phases/02-backend-services/02-RESEARCH.md
@.planning/phases/01-data-foundation/01-01-SUMMARY.md
@.planning/phases/02-backend-services/02-02-PLAN.md
@backend/src/models/AlertEventModel.ts
@backend/src/index.ts
</context>
<tasks>
<task type="auto">
<name>Task 1: Create alertService with deduplication and email</name>
<files>
backend/src/services/alertService.ts
</files>
<action>
Create `backend/src/services/alertService.ts` with the following structure:
**Import the ProbeResult type** from `'./healthProbeService'` (created in Plan 02).
**Constants:**
- `ALERT_COOLDOWN_MINUTES = parseInt(process.env.ALERT_COOLDOWN_MINUTES ?? '60', 10)` — configurable cooldown window
**Private function `createTransporter()`:**
Create nodemailer transporter INSIDE function scope (not module level — PITFALL A: Firebase Secrets not available at module load). Read SMTP config from `process.env`:
- `host`: `process.env.EMAIL_HOST ?? 'smtp.gmail.com'`
- `port`: `parseInt(process.env.EMAIL_PORT ?? '587', 10)`
- `secure`: `process.env.EMAIL_SECURE === 'true'`
- `auth.user`: `process.env.EMAIL_USER`
- `auth.pass`: `process.env.EMAIL_PASS`
**Private function `sendAlertEmail(serviceName, alertType, message)`:**
- Read recipient from `process.env.EMAIL_WEEKLY_RECIPIENT` (ALRT-04: NEVER hardcode the email address)
- If no recipient configured, log warning and return (do not throw)
- Call `createTransporter()` then `transporter.sendMail({ from, to, subject, text, html })`
- Subject format: `[CIM Summary] Alert: ${serviceName} — ${alertType}`
- Wrap in try/catch — email failure logs error but does NOT throw (email failure must not break probe pipeline)
**Exported function `evaluateAndAlert(probeResults: ProbeResult[])`:**
For each ProbeResult where status is 'degraded' or 'down':
1. Map status to alert_type: 'down' -> 'service_down', 'degraded' -> 'service_degraded'
2. Call `AlertEventModel.findRecentByService(service_name, alert_type, ALERT_COOLDOWN_MINUTES)`
3. If recent alert exists within cooldown, log suppression and skip BOTH row creation AND email (PITFALL 3: prevent alert storms)
4. If no recent alert, create alert_events row via `AlertEventModel.create({ service_name, alert_type, message: error_message or status description })`
5. Then send email via `sendAlertEmail()`
Export as: `export const alertService = { evaluateAndAlert }`.
Use Winston logger for all logging. Use `import { AlertEventModel } from '../models/AlertEventModel'`.
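The cooldown gate at the heart of the deduplication can be sketched as below. An in-memory map stands in for `AlertEventModel.findRecentByService()`, and `shouldAlert` is an illustrative name rather than the planned API:

```typescript
// Sketch only: the real service checks alert_events rows in Supabase,
// not an in-memory map (a Cloud Function instance may be cold-started).
const lastAlertAt = new Map<string, number>();

function shouldAlert(
  serviceName: string,
  alertType: string,
  cooldownMinutes: number,
  now: number = Date.now(),
): boolean {
  const key = `${serviceName}:${alertType}`;
  const prev = lastAlertAt.get(key);
  if (prev !== undefined && now - prev < cooldownMinutes * 60_000) {
    // Within cooldown: suppress both the alert_events row and the email.
    return false;
  }
  lastAlertAt.set(key, now);
  return true;
}
```

Keying on `service_name:alert_type` means a service going from 'degraded' to 'down' still raises a fresh alert, while repeated failures of the same kind stay quiet until the cooldown elapses.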
</action>
<verify>
<automated>cd /home/jonathan/Coding/cim_summary/backend && npx tsc --noEmit --pretty 2>&1 | head -30</automated>
<manual>Verify alertService.ts exports alertService.evaluateAndAlert. Verify no hardcoded email addresses in source.</manual>
</verify>
<done>alertService.ts exports evaluateAndAlert(). Deduplication checks AlertEventModel.findRecentByService() before creating rows or sending email. Recipient read from process.env.EMAIL_WEEKLY_RECIPIENT. Transporter created lazily. Email failures caught and logged. TypeScript compiles.</done>
</task>
<task type="auto">
<name>Task 2: Create alertService unit tests</name>
<files>
backend/src/__tests__/unit/alertService.test.ts
</files>
<action>
Create `backend/src/__tests__/unit/alertService.test.ts` using the Vitest mock pattern.
Mock dependencies:
- `vi.mock('../../models/AlertEventModel')` — mock `findRecentByService` and `create`
- `vi.mock('nodemailer')` — mock `createTransport` returning `{ sendMail: vi.fn().mockResolvedValue({}) }`
- `vi.mock('../../utils/logger')` — mock logger
Create test ProbeResult fixtures:
- `healthyProbe: { service_name: 'supabase', status: 'healthy', latency_ms: 50 }`
- `downProbe: { service_name: 'document_ai', status: 'down', latency_ms: 0, error_message: 'Connection refused' }`
- `degradedProbe: { service_name: 'llm_api', status: 'degraded', latency_ms: 6000 }`
Test cases:
1. **Healthy probes — no alerts sent** — pass array of healthy ProbeResults, verify AlertEventModel.findRecentByService NOT called, sendMail NOT called
2. **Down probe — creates alert_events row and sends email** — pass downProbe, mock findRecentByService returning null (no recent alert), verify AlertEventModel.create called with service_name='document_ai' and alert_type='service_down', verify sendMail called
3. **Degraded probe — creates alert with type 'service_degraded'** — pass degradedProbe, mock findRecentByService returning null, verify AlertEventModel.create called with alert_type='service_degraded'
4. **Deduplication — suppresses within cooldown** — pass downProbe, mock findRecentByService returning an existing alert object (non-null), verify AlertEventModel.create NOT called, sendMail NOT called, logger.info called with 'suppress' in message
5. **Recipient from env — reads process.env.EMAIL_WEEKLY_RECIPIENT** — set `process.env.EMAIL_WEEKLY_RECIPIENT = 'test@example.com'`, pass downProbe with no recent alert, verify sendMail called with `to: 'test@example.com'`
6. **No recipient configured — skips email but still creates alert row** — delete process.env.EMAIL_WEEKLY_RECIPIENT, pass downProbe with no recent alert, verify AlertEventModel.create IS called, sendMail NOT called, logger.warn called
7. **Email failure — does not throw** — mock sendMail to reject, verify evaluateAndAlert does not throw, verify logger.error called
8. **Multiple probes — processes each independently** — pass [downProbe, degradedProbe, healthyProbe], verify findRecentByService called twice (for down and degraded, not for healthy)
Use `beforeEach(() => { vi.clearAllMocks(); process.env.EMAIL_WEEKLY_RECIPIENT = 'admin@test.com'; })`.
</action>
<verify>
<automated>cd /home/jonathan/Coding/cim_summary/backend && npx vitest run src/__tests__/unit/alertService.test.ts --reporter=verbose 2>&1</automated>
</verify>
<done>All alertService tests pass. Deduplication verified (suppresses within cooldown). Email sending verified with config-based recipient. Email failure verified as non-throwing. Multiple probe evaluation verified.</done>
</task>
</tasks>
<verification>
1. `npx tsc --noEmit` passes
2. `npx vitest run src/__tests__/unit/alertService.test.ts` — all tests pass
3. `grep -r 'jpressnell\|bluepoint' backend/src/services/alertService.ts` returns nothing (no hardcoded emails)
4. alertService reads recipient from `process.env.EMAIL_WEEKLY_RECIPIENT`
</verification>
<success_criteria>
- alertService exports evaluateAndAlert(probeResults)
- Deduplication uses AlertEventModel.findRecentByService with configurable cooldown
- Alert rows created via AlertEventModel.create before email send
- Suppressed alerts skip BOTH row creation AND email
- Recipient from process.env, never hardcoded
- Transporter created inside function, not at module level
- Email failures caught and logged, never thrown
- All unit tests pass
</success_criteria>
<output>
After completion, create `.planning/phases/02-backend-services/02-03-SUMMARY.md`
</output>


@@ -0,0 +1,197 @@
---
phase: 02-backend-services
plan: 04
type: execute
wave: 3
depends_on: [02-01, 02-02, 02-03]
files_modified:
- backend/src/index.ts
autonomous: true
requirements: [HLTH-03, INFR-03]
must_haves:
truths:
- "runHealthProbes Cloud Function export runs on 'every 5 minutes' schedule, completely separate from processDocumentJobs"
- "runRetentionCleanup Cloud Function export runs on 'every monday 02:00' schedule"
- "runHealthProbes calls healthProbeService.runAllProbes() and then alertService.evaluateAndAlert()"
- "runRetentionCleanup deletes from service_health_checks, alert_events, and document_processing_events older than 30 days"
- "Both exports list required Firebase secrets in their secrets array"
- "Both exports use dynamic import() pattern (same as processDocumentJobs)"
artifacts:
- path: "backend/src/index.ts"
provides: "Two new onSchedule Cloud Function exports"
exports: ["runHealthProbes", "runRetentionCleanup"]
key_links:
- from: "backend/src/index.ts (runHealthProbes)"
to: "backend/src/services/healthProbeService.ts"
via: "dynamic import('./services/healthProbeService')"
pattern: "import\\('./services/healthProbeService'\\)"
- from: "backend/src/index.ts (runHealthProbes)"
to: "backend/src/services/alertService.ts"
via: "dynamic import('./services/alertService')"
pattern: "import\\('./services/alertService'\\)"
- from: "backend/src/index.ts (runRetentionCleanup)"
to: "backend/src/models/HealthCheckModel.ts"
via: "dynamic import for deleteOlderThan(30)"
pattern: "HealthCheckModel\\.deleteOlderThan"
- from: "backend/src/index.ts (runRetentionCleanup)"
to: "backend/src/services/analyticsService.ts"
via: "dynamic import for deleteProcessingEventsOlderThan(30)"
pattern: "deleteProcessingEventsOlderThan"
---
<objective>
Add two new Firebase Cloud Function scheduled exports to index.ts: runHealthProbes (every 5 minutes) and runRetentionCleanup (weekly).
Purpose: HLTH-03 requires health probes to run on a schedule separate from document processing (PITFALL-2). INFR-03 requires 30-day rolling data retention cleanup on schedule.
Output: Two new onSchedule exports in backend/src/index.ts.
</objective>
<execution_context>
@/home/jonathan/.claude/get-shit-done/workflows/execute-plan.md
@/home/jonathan/.claude/get-shit-done/templates/summary.md
</execution_context>
<context>
@.planning/PROJECT.md
@.planning/ROADMAP.md
@.planning/STATE.md
@.planning/phases/02-backend-services/02-RESEARCH.md
@.planning/phases/02-backend-services/02-01-PLAN.md
@.planning/phases/02-backend-services/02-02-PLAN.md
@.planning/phases/02-backend-services/02-03-PLAN.md
@backend/src/index.ts
</context>
<tasks>
<task type="auto">
<name>Task 1: Add runHealthProbes scheduled Cloud Function export</name>
<files>
backend/src/index.ts
</files>
<action>
Add a new `onSchedule` export to `backend/src/index.ts` AFTER the existing `processDocumentJobs` export. Follow the exact same pattern as `processDocumentJobs`.
```typescript
// Health probe scheduler — separate from document processing (PITFALL-2, HLTH-03)
export const runHealthProbes = onSchedule({
schedule: 'every 5 minutes',
timeoutSeconds: 60,
memory: '256MiB',
retryCount: 0, // Probes should not retry — they run again in 5 minutes anyway
secrets: [
anthropicApiKey, // for LLM probe
openaiApiKey, // for OpenAI probe fallback
databaseUrl, // for Supabase probe
supabaseServiceKey,
supabaseAnonKey,
],
}, async (_event) => {
const { healthProbeService } = await import('./services/healthProbeService');
const { alertService } = await import('./services/alertService');
const results = await healthProbeService.runAllProbes();
await alertService.evaluateAndAlert(results);
logger.info('runHealthProbes: complete', {
probeCount: results.length,
statuses: results.map(r => ({ service: r.service_name, status: r.status })),
});
});
```
Key requirements:
- Use dynamic `import()` (not static import at top of file) — same pattern as processDocumentJobs
- List ALL secrets that probes need in the `secrets` array (Firebase Secrets must be explicitly listed per function)
- Use the existing `anthropicApiKey`, `openaiApiKey`, `databaseUrl`, `supabaseServiceKey`, `supabaseAnonKey` variables already defined via `defineSecret` at the top of index.ts
- Set `retryCount: 0` — probes run every 5 minutes, no need to retry failures
- First call `runAllProbes()` to measure and persist, then `evaluateAndAlert()` to check for alerts
- Log a summary with probe count and statuses
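For reference, the export above assumes imports and secret params along these lines already exist at the top of index.ts (a sketch only; the secret name strings are assumptions, so reuse the existing `defineSecret` declarations rather than redeclaring them):

```typescript
import { onSchedule } from 'firebase-functions/v2/scheduler';
import { defineSecret } from 'firebase-functions/params';
import * as logger from 'firebase-functions/logger';

// Secret names below are illustrative; match them to the real
// defineSecret calls already present in index.ts.
const anthropicApiKey = defineSecret('ANTHROPIC_API_KEY');
const openaiApiKey = defineSecret('OPENAI_API_KEY');
const databaseUrl = defineSecret('DATABASE_URL');
const supabaseServiceKey = defineSecret('SUPABASE_SERVICE_KEY');
const supabaseAnonKey = defineSecret('SUPABASE_ANON_KEY');
```

Listing a secret in a function's `secrets` array only grants access; the `defineSecret` param must also exist so the deploy pipeline knows which Secret Manager entries to bind.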
</action>
<verify>
<automated>cd /home/jonathan/Coding/cim_summary/backend && npx tsc --noEmit --pretty 2>&1 | head -30</automated>
<manual>Verify index.ts has `export const runHealthProbes` as a separate export from processDocumentJobs</manual>
</verify>
<done>runHealthProbes export added to index.ts. Runs every 5 minutes. Calls healthProbeService.runAllProbes() then alertService.evaluateAndAlert(). Uses dynamic imports. Lists all required secrets. TypeScript compiles.</done>
</task>
<task type="auto">
<name>Task 2: Add runRetentionCleanup scheduled Cloud Function export</name>
<files>
backend/src/index.ts
</files>
<action>
Add a second `onSchedule` export to `backend/src/index.ts` AFTER runHealthProbes.
```typescript
// Retention cleanup — weekly, separate from document processing (PITFALL-7, INFR-03)
export const runRetentionCleanup = onSchedule({
schedule: 'every monday 02:00',
timeoutSeconds: 120,
memory: '256MiB',
secrets: [databaseUrl, supabaseServiceKey, supabaseAnonKey],
}, async (_event) => {
const { HealthCheckModel } = await import('./models/HealthCheckModel');
const { AlertEventModel } = await import('./models/AlertEventModel');
const { deleteProcessingEventsOlderThan } = await import('./services/analyticsService');
const RETENTION_DAYS = 30;
const [hcCount, alertCount, eventCount] = await Promise.all([
HealthCheckModel.deleteOlderThan(RETENTION_DAYS),
AlertEventModel.deleteOlderThan(RETENTION_DAYS),
deleteProcessingEventsOlderThan(RETENTION_DAYS),
]);
logger.info('runRetentionCleanup: complete', {
retentionDays: RETENTION_DAYS,
deletedHealthChecks: hcCount,
deletedAlerts: alertCount,
deletedProcessingEvents: eventCount,
});
});
```
Key requirements:
- Use dynamic `import()` for all model and service imports
- Run all three deletes in parallel with `Promise.all()` (they touch different tables)
- Only include the secrets needed for Supabase access (no LLM keys needed for cleanup)
- Set `timeoutSeconds: 120` (cleanup may take longer than probes)
- The 30-day retention period is a constant, not configurable via env (matches INFR-03 spec)
- Only manage monitoring tables: service_health_checks, alert_events, document_processing_events. Do NOT delete from performance_metrics, session_events, or execution_events (those are agentic RAG tables, out of scope per research Open Question 4)
- Log the count of deleted rows from each table
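The cutoff each `deleteOlderThan` call compares against reduces to a pure computation (a sketch; the models are assumed to take a day count and derive this timestamp internally before issuing the delete):

```typescript
// Compute the ISO-8601 timestamp bounding the retention window.
// Rows stamped earlier than this are eligible for deletion.
function retentionCutoff(days: number, now: Date = new Date()): string {
  const MS_PER_DAY = 24 * 60 * 60 * 1000;
  return new Date(now.getTime() - days * MS_PER_DAY).toISOString();
}

// e.g. retentionCutoff(30) at any moment yields the instant exactly
// 30 days before now, suitable for a `< cutoff` SQL comparison.
```

Keeping the comparison strictly less-than means a row created exactly at the cutoff instant survives one more cycle, which is harmless for a weekly job.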
</action>
<verify>
<automated>cd /home/jonathan/Coding/cim_summary/backend && npx tsc --noEmit --pretty 2>&1 | head -30</automated>
<manual>Verify index.ts has `export const runRetentionCleanup` as a separate export. Verify it calls deleteOlderThan on all three tables.</manual>
</verify>
<done>runRetentionCleanup export added to index.ts. Runs weekly Monday 02:00. Deletes from service_health_checks, alert_events, and document_processing_events older than 30 days. Uses Promise.all for parallel execution. Logs deletion counts. TypeScript compiles.</done>
</task>
</tasks>
<verification>
1. `npx tsc --noEmit` passes
2. `grep 'export const runHealthProbes' backend/src/index.ts` returns a match
3. `grep 'export const runRetentionCleanup' backend/src/index.ts` returns a match
4. Both exports use `onSchedule` (not piggybacked on processDocumentJobs — PITFALL-2 compliance)
5. Both exports use dynamic `import()` pattern
6. Full test suite still passes: `npx vitest run --reporter=verbose`
</verification>
<success_criteria>
- runHealthProbes is a separate onSchedule export running every 5 minutes
- runRetentionCleanup is a separate onSchedule export running weekly Monday 02:00
- Both are completely decoupled from processDocumentJobs
- runHealthProbes calls runAllProbes() then evaluateAndAlert()
- runRetentionCleanup calls deleteOlderThan(30) on all three monitoring tables
- All required Firebase secrets listed in each function's secrets array
- TypeScript compiles with no errors
- Existing test suite passes with no regressions
</success_criteria>
<output>
After completion, create `.planning/phases/02-backend-services/02-04-SUMMARY.md`
</output>