diff --git a/.planning/phases/02-backend-services/02-RESEARCH.md b/.planning/phases/02-backend-services/02-RESEARCH.md new file mode 100644 index 0000000..6dded67 --- /dev/null +++ b/.planning/phases/02-backend-services/02-RESEARCH.md @@ -0,0 +1,632 @@ +# Phase 2: Backend Services - Research + +**Researched:** 2026-02-24 +**Domain:** Firebase Cloud Functions scheduling, health probes, email alerting (Nodemailer/SMTP), fire-and-forget analytics, alert deduplication, 30-day data retention +**Confidence:** HIGH + +--- + + +## Phase Requirements + +| ID | Description | Research Support | +|----|-------------|-----------------| +| HLTH-02 | Each health probe makes a real authenticated API call, not just config checks | Verified: existing `/monitoring/diagnostics` only checks initialization, not live connectivity; each probe must make a real call (Document AI list processors, Anthropic minimal message, Supabase SELECT 1, Firebase Auth verify-token attempt) | +| HLTH-03 | Health probes run on a scheduled interval, separate from document processing | Verified: `processDocumentJobs` export pattern in `index.ts` shows how to add a second named Cloud Function export; `onSchedule` from `firebase-functions/v2/scheduler` is the correct mechanism; PITFALL-2 mandates decoupling | +| HLTH-04 | Health probe results persist to Supabase and survive cold starts | Verified: `HealthCheckModel.create()` exists from Phase 1 with correct insert signature; `service_health_checks` table exists via migration 012; cold-start survival is automatic once persisted | +| ALRT-01 | Admin receives email alert when a service goes down or degrades | Verified: SMTP config already defined in `index.ts` (`emailHost`, `emailUser`, `emailPass`, `emailPort`, `emailSecure`); `nodemailer` is the correct library (no other email SDK installed; SMTP credentials are pre-configured); `nodemailer` is NOT yet in package.json — must be installed | +| ALRT-02 | Alert deduplication prevents repeat emails for the same 
ongoing issue (cooldown period) | Verified: `AlertEventModel.findRecentByService()` from Phase 1 exists and accepts `withinMinutes` — built exactly for this use case; check it before firing email and before creating new `alert_events` row | +| ALRT-04 | Alert recipient stored as configuration, not hardcoded | Verified: `EMAIL_WEEKLY_RECIPIENT` defineString already exists in `index.ts` with default `jpressnell@bluepointcapital.com`; alert service must read `process.env.EMAIL_WEEKLY_RECIPIENT` (or `process.env.ALERT_RECIPIENT`) — do NOT hardcode the string in service source | +| ANLY-01 | Document processing events persist to Supabase at write time (not in-memory only) | Verified: `uploadMonitoringService.ts` is in-memory only (confirmed PITFALL-1); a `document_processing_events` table is NOT yet in any migration — Phase 2 must add migration 013 for it; `jobProcessorService.ts` has instrumentation hooks (lines 329-390) to attach fire-and-forget writes | +| ANLY-03 | Analytics instrumentation is non-blocking (fire-and-forget, never delays processing pipeline) | Verified: PITFALL-6 documents the 14-min timeout risk; pattern is `void supabase.from(...).insert(...)` — no `await`; existing `jobProcessorService.ts` processes in ~10 minutes, so blocking even 200ms per checkpoint is risky | +| INFR-03 | 30-day rolling data retention cleanup runs on schedule | Verified: `HealthCheckModel.deleteOlderThan(30)` and `AlertEventModel.deleteOlderThan(30)` exist from Phase 1; a third call for `document_processing_events` needs to be added; must be a separate named Cloud Function export (PITFALL-7: separate from `processDocumentJobs`) | + + + +--- + +## Summary + +Phase 2 is a service-implementation phase. All database infrastructure (tables, models) was built in Phase 1. This phase builds six service classes and two new Firebase Cloud Function exports. 
The work falls into four groups: + +**Group 1 — Health Probes** (`healthProbeService.ts`): Four probers (Document AI, Anthropic/OpenAI LLM, Supabase, Firebase Auth) each making a real authenticated API call using the already-configured credentials. Results are written to Supabase via `HealthCheckModel.create()`. PITFALL-5 is the key risk: existing diagnostics only check initialization — new probes must make live API calls. + +**Group 2 — Alert Service** (`alertService.ts`): Reads health probe results, checks if an alert already exists within cooldown using `AlertEventModel.findRecentByService()`, creates an `alert_events` row if not, and sends email via `nodemailer` (SMTP credentials already defined as Firebase defineString/defineSecret). Alert recipient read from `process.env.EMAIL_WEEKLY_RECIPIENT` (or a new `ALERT_RECIPIENT` env var). + +**Group 3 — Analytics Collector** (`analyticsService.ts`): A `recordProcessingEvent()` function that writes to a new `document_processing_events` Supabase table using fire-and-forget (`void` not `await`). Requires migration 013. The `jobProcessorService.ts` already has the right instrumentation points (lines 329-390 track `processingTime` and `status`). + +**Group 4 — Schedulers** (new Cloud Function exports in `index.ts`): `runHealthProbes` (every 5 minutes, separate export) and `runRetentionCleanup` (weekly, separate export). Both must be completely decoupled from `processDocumentJobs`. + +**Primary recommendation:** Install `nodemailer` + `@types/nodemailer` first. Build services in dependency order: analytics migration → analyticsService → healthProbeService → alertService → schedulers. 
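The four groups share one orchestration shape: run every probe, convert a thrown probe into a `'down'` result instead of aborting the run, and collect each result for persistence. A minimal sketch of that shape — the `Probe` type, the orchestrator name, and the stub probes in the test are illustrative, not the real `healthProbeService` API, and the real service would persist each result via `HealthCheckModel.create()`:

```typescript
// Hypothetical orchestration sketch: mirrors the ProbeResult shape used later
// in Pattern 1, but with stubbed probes instead of real API calls.
interface ProbeResult {
  service_name: string;
  status: 'healthy' | 'degraded' | 'down';
  latency_ms: number;
  error_message?: string;
}

type Probe = () => Promise<ProbeResult>;

// Each probe is isolated: a throwing probe becomes a 'down' result and the
// remaining probes still run, so one outage never hides the others.
async function runAllProbes(probes: Record<string, Probe>): Promise<ProbeResult[]> {
  const results: ProbeResult[] = [];
  for (const [name, probe] of Object.entries(probes)) {
    const start = Date.now();
    try {
      results.push(await probe());
    } catch (err) {
      results.push({
        service_name: name,
        status: 'down',
        latency_ms: Date.now() - start,
        error_message: err instanceof Error ? err.message : String(err),
      });
    }
  }
  return results;
}
```

The sequential loop keeps the sketch simple; the real service could equally run probes concurrently, as long as each one stays individually wrapped.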
+ +--- + +## Standard Stack + +### Core +| Library | Version | Purpose | Why Standard | +|---------|---------|---------|--------------| +| `nodemailer` | ^6.9.x | SMTP email sending | SMTP config already pre-wired in `index.ts` (`emailHost`, `emailUser`, `emailPass`, `emailPort`, `emailSecure` via defineString/defineSecret); no other email library installed; Nodemailer is the standard Node.js SMTP library | +| `@supabase/supabase-js` | Already installed (2.53.0) | Writing health checks and analytics to Supabase | Already the only DB client; `HealthCheckModel` and `AlertEventModel` from Phase 1 wrap all writes | +| `firebase-admin` | Already installed (13.4.0) | Firebase Auth probe (verify-token endpoint) + `onSchedule` function exports | Already initialized via `config/firebase.ts` | +| `firebase-functions` | Already installed (7.0.5) | `onSchedule` v2 for scheduled Cloud Functions | Existing `processDocumentJobs` uses exact same pattern | +| `@google-cloud/documentai` | Already installed (9.3.0) | Document AI health probe (list processors call) | Already initialized in `documentAiProcessor.ts` | +| `@anthropic-ai/sdk` | Already installed (0.57.0) | LLM health probe (minimal token message) | Already initialized in `llmService.ts` | +| `openai` | Already installed (5.10.2) | OpenAI health probe fallback | Available when `LLM_PROVIDER=openai` | +| `pg` | Already installed (8.11.3) | Supabase health probe (direct SELECT 1 query) | Direct pool already available via `getPostgresPool()` in `config/supabase.ts` | +| Winston logger | Already installed (3.11.0) | All service logging | Project-wide convention; NEVER `console.log` | + +### New Packages Required +| Library | Version | Purpose | Installation | +|---------|---------|---------|-------------| +| `nodemailer` | ^6.9.x | SMTP email transport | `npm install nodemailer` | +| `@types/nodemailer` | ^6.4.x | TypeScript types | `npm install --save-dev @types/nodemailer` | + +### Alternatives Considered +| Instead of | 
Could Use | Tradeoff | +|------------|-----------|----------| +| `nodemailer` (SMTP) | Resend SDK | Resend is the STACK.md recommendation for new setups, but Gmail SMTP credentials are already fully configured in `index.ts` (`EMAIL_HOST`, `EMAIL_USER`, `EMAIL_PASS`, etc.) — switching to Resend requires new DNS records and a new API key; Nodemailer avoids all of that | +| Separate `onSchedule` export | `node-cron` inside existing function | PITFALL-2: probe scheduling inside `processDocumentJobs` creates availability coupling; Firebase Cloud Scheduler + separate export is the correct architecture | +| `getPostgresPool()` for Supabase health probe | Supabase PostgREST client | Direct PostgreSQL `SELECT 1` is a better health signal than PostgREST (tests TCP+auth rather than REST layer); `getPostgresPool()` already exists for this purpose | + +**Installation:** +```bash +cd backend +npm install nodemailer +npm install --save-dev @types/nodemailer +``` + +--- + +## Architecture Patterns + +### Recommended Project Structure + +New files slot into existing service layer: + +``` +backend/src/ +├── models/ +│ └── migrations/ +│ └── 013_create_processing_events_table.sql # NEW — analytics events table +├── services/ +│ ├── healthProbeService.ts # NEW — probe orchestrator + individual probers +│ ├── alertService.ts # NEW — deduplication + email + alert_events writer +│ └── analyticsService.ts # NEW — fire-and-forget event writer +├── index.ts # UPDATE — add runHealthProbes + runRetentionCleanup exports +└── __tests__/ + └── models/ # (Phase 1 tests already here) + └── unit/ + ├── healthProbeService.test.ts # NEW + ├── alertService.test.ts # NEW + └── analyticsService.test.ts # NEW +``` + +### Pattern 1: Real Health Probe (HLTH-02) + +**What:** Each probe makes a real authenticated API call. Returns a structured `ProbeResult` with `status`, `latency_ms`, and `error_message`. Probe then calls `HealthCheckModel.create()` to persist. 
**Key insight:** The probe itself has no alert logic — that lives in `alertService.ts`. The probe only measures and records. + +```typescript +// Source: derived from existing documentAiProcessor.ts and llmService.ts patterns +interface ProbeResult { + service_name: string; + status: 'healthy' | 'degraded' | 'down'; + latency_ms: number; + error_message?: string; + probe_details?: Record<string, unknown>; +} + +// Document AI probe — list processors is a cheap read that tests auth + API availability +async function probeDocumentAI(): Promise<ProbeResult> { + const start = Date.now(); + try { + const client = new DocumentProcessorServiceClient(); + await client.listProcessors({ parent: `projects/${projectId}/locations/us` }); + const latency_ms = Date.now() - start; + return { service_name: 'document_ai', status: latency_ms > 2000 ? 'degraded' : 'healthy', latency_ms }; + } catch (err) { + return { + service_name: 'document_ai', + status: 'down', + latency_ms: Date.now() - start, + error_message: err instanceof Error ? err.message : String(err), + }; + } +} + +// Supabase probe — direct PostgreSQL SELECT 1 via existing pg pool +async function probeSupabase(): Promise<ProbeResult> { + const start = Date.now(); + try { + const pool = getPostgresPool(); + await pool.query('SELECT 1'); + const latency_ms = Date.now() - start; + return { service_name: 'supabase', status: latency_ms > 2000 ? 'degraded' : 'healthy', latency_ms }; + } catch (err) { + return { + service_name: 'supabase', + status: 'down', + latency_ms: Date.now() - start, + error_message: err instanceof Error ? 
err.message : String(err), + }; + } +} + +// LLM probe — minimal API call (1-word message) to verify key validity +async function probeLLM(): Promise<ProbeResult> { + const start = Date.now(); + try { + // Use whichever provider is configured + const client = new Anthropic({ apiKey: process.env.ANTHROPIC_API_KEY }); + await client.messages.create({ + model: 'claude-haiku-4-5', + max_tokens: 5, + messages: [{ role: 'user', content: 'Hi' }], + }); + const latency_ms = Date.now() - start; + return { service_name: 'llm_api', status: latency_ms > 5000 ? 'degraded' : 'healthy', latency_ms }; + } catch (err) { + // 429 = degraded (rate limit), not 'down' + const is429 = err instanceof Error && err.message.includes('429'); + return { + service_name: 'llm_api', + status: is429 ? 'degraded' : 'down', + latency_ms: Date.now() - start, + error_message: err instanceof Error ? err.message : String(err), + }; + } +} + +// Firebase Auth probe — verify a known-invalid token; expect auth/argument-error, not network error +async function probeFirebaseAuth(): Promise<ProbeResult> { + const start = Date.now(); + try { + await admin.auth().verifyIdToken('invalid-token-probe-check'); + // Should never reach here — always throws + return { service_name: 'firebase_auth', status: 'healthy', latency_ms: Date.now() - start }; + } catch (err) { + const latency_ms = Date.now() - start; + const errMsg = err instanceof Error ? err.message : String(err); + // 'argument-error' or 'auth/argument-error' = SDK is alive and Auth is reachable + const isExpectedError = errMsg.includes('argument') || errMsg.includes('INVALID'); + return { + service_name: 'firebase_auth', + status: isExpectedError ? 'healthy' : 'down', + latency_ms, + error_message: isExpectedError ? undefined : errMsg, + }; + } +} +``` + +### Pattern 2: Alert Deduplication (ALRT-02) + +**What:** Before sending an email, check `AlertEventModel.findRecentByService()` for a matching alert within the cooldown window. If found, suppress. 
Uses `alert_events` table (already exists from Phase 1). + +**Important:** Deduplication check must happen before BOTH the `alert_events` row creation AND the email send — otherwise a suppressed email still creates a duplicate row. + +```typescript +// Source: AlertEventModel.findRecentByService() from Phase 1 (verified) +// Cooldown: 60 minutes (configurable via env var ALERT_COOLDOWN_MINUTES) +const ALERT_COOLDOWN_MINUTES = parseInt(process.env.ALERT_COOLDOWN_MINUTES ?? '60', 10); + +async function maybeSendAlert( + serviceName: string, + alertType: 'service_down' | 'service_degraded', + message: string +): Promise<void> { + // 1. Check deduplication window + const existing = await AlertEventModel.findRecentByService( + serviceName, + alertType, + ALERT_COOLDOWN_MINUTES + ); + + if (existing) { + logger.info('alertService: suppressing duplicate alert within cooldown', { + serviceName, alertType, existingAlertId: existing.id, cooldownMinutes: ALERT_COOLDOWN_MINUTES, + }); + return; // suppress: both row creation AND email + } + + // 2. Create alert_events row + await AlertEventModel.create({ service_name: serviceName, alert_type: alertType, message }); + + // 3. Send email + await sendAlertEmail(serviceName, alertType, message); +} +``` + +### Pattern 3: Email via SMTP (Nodemailer + existing Firebase config) (ALRT-01, ALRT-04) + +**What:** Nodemailer transporter created using Firebase `defineString`/`defineSecret` values already in `index.ts`. Alert recipient from `process.env.EMAIL_WEEKLY_RECIPIENT` (non-hardcoded, satisfies ALRT-04). + +**Key insight:** The SMTP credentials (`EMAIL_HOST`, `EMAIL_USER`, `EMAIL_PASS`, `EMAIL_PORT`, `EMAIL_SECURE`) are already defined as Firebase params in `index.ts`. The service reads them from `process.env` — Firebase makes `defineString` values available there automatically. 
```typescript +// Source: Firebase Functions v2 defineString/defineSecret pattern — verified in index.ts +import nodemailer from 'nodemailer'; +import { logger } from '../utils/logger'; + +function createTransporter() { + return nodemailer.createTransport({ + host: process.env.EMAIL_HOST ?? 'smtp.gmail.com', + port: parseInt(process.env.EMAIL_PORT ?? '587', 10), + secure: process.env.EMAIL_SECURE === 'true', + auth: { + user: process.env.EMAIL_USER, + pass: process.env.EMAIL_PASS, // Firebase Secret — available in process.env + }, + }); +} + +async function sendAlertEmail(serviceName: string, alertType: string, message: string): Promise<void> { + const recipient = process.env.EMAIL_WEEKLY_RECIPIENT; // ALRT-04: read from config, not hardcoded + if (!recipient) { + logger.warn('alertService.sendAlertEmail: no recipient configured, skipping email', { serviceName }); + return; + } + + const transporter = createTransporter(); + + try { + await transporter.sendMail({ + from: process.env.EMAIL_FROM ?? process.env.EMAIL_USER, + to: recipient, + subject: `[CIM Summary] Alert: ${serviceName} — ${alertType}`, + text: message, + html: `
<p><strong>${serviceName}</strong>: ${message}</p>
`, + }); + logger.info('alertService.sendAlertEmail: sent', { serviceName, alertType, recipient }); + } catch (err) { + logger.error('alertService.sendAlertEmail: failed', { + error: err instanceof Error ? err.message : String(err), + serviceName, alertType, + }); + // Do NOT re-throw — email failure should not break the probe run + } +} +``` + +### Pattern 4: Fire-and-Forget Analytics (ANLY-03) + +**What:** `analyticsService.recordProcessingEvent()` uses `void` (no `await`) so the Supabase write is completely detached from the processing pipeline. The function signature returns `void` to make it impossible to accidentally `await` it. + +**Critical rule:** The function MUST be called with `void` or not awaited anywhere it's used. TypeScript enforcing `void` return type ensures this. + +```typescript +// Source: PITFALL-6 pattern — fire-and-forget is mandatory +export interface ProcessingEventData { + document_id: string; + user_id: string; + event_type: 'upload_started' | 'processing_started' | 'completed' | 'failed'; + duration_ms?: number; + error_message?: string; + stage?: string; +} + +// Return type is void (not Promise) — cannot be awaited +export function recordProcessingEvent(data: ProcessingEventData): void { + const supabase = getSupabaseServiceClient(); + void supabase + .from('document_processing_events') + .insert({ + document_id: data.document_id, + user_id: data.user_id, + event_type: data.event_type, + duration_ms: data.duration_ms ?? null, + error_message: data.error_message ?? null, + stage: data.stage ?? null, + created_at: new Date().toISOString(), + }) + .then(({ error }) => { + if (error) { + // Never throw — log only (analytics failure must not affect processing) + logger.error('analyticsService.recordProcessingEvent: write failed', { + error: error.message, data, + }); + } + }); +} +``` + +### Pattern 5: Scheduled Cloud Function Export (HLTH-03, INFR-03) + +**What:** Two new `onSchedule` exports added to `index.ts`. 
Each is a separate named export, completely decoupled from `processDocumentJobs`. + +**Important:** New exports must include the same `secrets` array as `processDocumentJobs` (all needed Firebase Secrets must be explicitly listed). `defineString` values are auto-available but `defineSecret` values require explicit listing. + +```typescript +// Source: Existing processDocumentJobs pattern in index.ts (verified) +// Add AFTER processDocumentJobs export + +// Health probe scheduler — separate from document processing (PITFALL-2) +export const runHealthProbes = onSchedule({ + schedule: 'every 5 minutes', + timeoutSeconds: 60, + memory: '256MiB', + secrets: [ + anthropicApiKey, // for LLM probe + openaiApiKey, // for OpenAI probe fallback + databaseUrl, // for Supabase probe + supabaseServiceKey, + supabaseAnonKey, + ], +}, async (_event) => { + const { healthProbeService } = await import('./services/healthProbeService'); + await healthProbeService.runAllProbes(); +}); + +// Retention cleanup — weekly (PITFALL-7: separate from document processing scheduler) +export const runRetentionCleanup = onSchedule({ + schedule: 'every monday 02:00', + timeoutSeconds: 120, + memory: '256MiB', + secrets: [databaseUrl, supabaseServiceKey, supabaseAnonKey], +}, async (_event) => { + const { HealthCheckModel } = await import('./models/HealthCheckModel'); + const { AlertEventModel } = await import('./models/AlertEventModel'); + const { analyticsService } = await import('./services/analyticsService'); + + const [hcCount, alertCount, eventCount] = await Promise.all([ + HealthCheckModel.deleteOlderThan(30), + AlertEventModel.deleteOlderThan(30), + analyticsService.deleteProcessingEventsOlderThan(30), + ]); + + logger.info('runRetentionCleanup: complete', { hcCount, alertCount, eventCount }); +}); +``` + +### Pattern 6: Analytics Migration (ANLY-01) + +**What:** Migration `013_create_processing_events_table.sql` adds the `document_processing_events` table. 
Follows the migration 012 pattern exactly. + +```sql +-- Source: backend/src/models/migrations/012_create_monitoring_tables.sql (verified pattern) +CREATE TABLE IF NOT EXISTS document_processing_events ( + id UUID PRIMARY KEY DEFAULT gen_random_uuid(), + document_id UUID NOT NULL, + user_id UUID NOT NULL, + event_type TEXT NOT NULL CHECK (event_type IN ('upload_started', 'processing_started', 'completed', 'failed')), + duration_ms INTEGER, + error_message TEXT, + stage TEXT, + created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP +); + +CREATE INDEX IF NOT EXISTS idx_document_processing_events_created_at + ON document_processing_events(created_at); +CREATE INDEX IF NOT EXISTS idx_document_processing_events_document_id + ON document_processing_events(document_id); + +ALTER TABLE document_processing_events ENABLE ROW LEVEL SECURITY; +``` + +### Anti-Patterns to Avoid + +- **Probing config existence instead of live connectivity** (PITFALL-5): Any check of `if (process.env.ANTHROPIC_API_KEY)` is not a health probe. Must make a real API call. +- **Awaiting analytics writes** (PITFALL-6): `await analyticsService.recordProcessingEvent(...)` will block the processing pipeline. Must use `void analyticsService.recordProcessingEvent(...)` or the function must not return a Promise. +- **Piggybacking health probes on `processDocumentJobs`** (PITFALL-2): Health probes mixed into the document processing function create availability coupling. Must be a separate `onSchedule` export. +- **Hardcoding alert recipient** (PITFALL-8): Never write `to: 'jpressnell@bluepointcapital.com'` in source. Always `process.env.EMAIL_WEEKLY_RECIPIENT`. +- **Alert storms** (PITFALL-3): Sending email on every failed probe run is a mistake. Must check `AlertEventModel.findRecentByService()` with cooldown window before every send. 
+- **Creating the nodemailer transporter at module level**: The Firebase Secret `EMAIL_PASS` is only available inside a Cloud Function invocation (it's injected at runtime). Create the transporter inside each email call or on first use inside a function execution — not at module initialization time. + +--- + +## Don't Hand-Roll + +| Problem | Don't Build | Use Instead | Why | +|---------|-------------|-------------|-----| +| Alert deduplication state | Custom in-memory `Map` | `AlertEventModel.findRecentByService()` (already exists, Phase 1) | In-memory state resets on cold start (PITFALL-1); DB-backed deduplication survives restarts | +| SMTP transport | Custom HTTP calls to Gmail API | `nodemailer` with existing SMTP config | Gmail API requires OAuth flow; SMTP App Password already configured and working | +| Health check result storage | Custom logging or in-memory | `HealthCheckModel.create()` (already exists, Phase 1) | Already written, tested, and connected to the right table | +| Cron scheduling | `setInterval` inside function body | `onSchedule` Firebase Cloud Scheduler | `setInterval` does not work in serverless (instances spin up/down); Cloud Scheduler is the correct mechanism | +| Alert creation | Direct Supabase insert | `AlertEventModel.create()` (already exists, Phase 1) | Already written with input validation and error handling | + +**Key insight:** Phase 1 built the entire model layer specifically so Phase 2 only has to write service logic. Use every model method; don't bypass them. + +--- + +## Common Pitfalls + +### Pitfall A: Firebase Secret Unavailable at Module Load Time +**What goes wrong:** Nodemailer transporter created at module top level with `process.env.EMAIL_PASS` — at module load time (cold start initialization), the Firebase Secret hasn't been injected yet. `EMAIL_PASS` is `undefined`. All email attempts fail. 
+**Why it happens:** Firebase Functions v2 `defineSecret()` values are injected into `process.env` when the function invocation starts, not when the module is first imported. +**How to avoid:** Create the nodemailer transporter lazily — inside the function that sends email, not at module level. Alternatively, use a factory function called at send time. +**Warning signs:** `nodemailer` error "authentication failed" or "invalid credentials" on first cold start; works on warm invocations. + +### Pitfall B: LLM Probe Cost +**What goes wrong:** LLM health probe uses the same model as document processing (e.g., `claude-opus-4-1`). Running every 5 minutes costs ~$0.01 × 288 calls/day = ~$2.88/day just for probing. +**Why it happens:** Copy-pasting the model name from `llmService.ts`. +**How to avoid:** Use the cheapest available model for probes: `claude-haiku-4-5` (Anthropic) or `gpt-3.5-turbo` (OpenAI). The probe only needs to verify API key validity and reachability — response quality doesn't matter. Set `max_tokens: 5`. +**Warning signs:** Anthropic API bill spikes after deploying `runHealthProbes`. + +### Pitfall C: Supabase PostgREST vs Direct Postgres for Health Probe +**What goes wrong:** Using `getSupabaseServiceClient()` (PostgREST) for the Supabase health probe instead of `getPostgresPool()`. PostgREST adds an HTTP layer — if the Supabase API is overloaded but the DB is healthy, the probe returns "down" incorrectly. +**Why it happens:** PostgREST client is the default Supabase client used everywhere else. +**How to avoid:** Use `getPostgresPool().query('SELECT 1')` — this tests TCP connectivity to the database directly, which is the true health signal for data persistence operations. +**Warning signs:** Supabase probe reports "down" while the DB is healthy; health check latency fluctuates widely. 
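The lazy-initialization advice in Pitfall A (and the "on first use" option in the anti-patterns list) can be captured with a small memoizing factory. This is a sketch, not existing project code — the `lazy` helper is hypothetical, and the commented transporter usage assumes the Pattern 3 SMTP config:

```typescript
// Hypothetical helper: defers construction to the first call, which in a
// Cloud Function happens inside an invocation — i.e. after Firebase Secrets
// such as EMAIL_PASS have been injected into process.env.
function lazy<T>(factory: () => T): () => T {
  let value: T | undefined;
  return () => {
    if (value === undefined) {
      value = factory(); // built once at first use; reused on warm invocations
    }
    return value;
  };
}

// Usage shape for the transporter (construction elided — see Pattern 3):
// const getTransporter = lazy(() => nodemailer.createTransport({ /* SMTP config */ }));
// ...inside sendAlertEmail: await getTransporter().sendMail({ ... });
```

Either this memoized form or the create-per-send form shown in Pattern 3 is fine; the only hard rule is that `createTransport` must not run at module load time.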
### Pitfall D: Analytics Migration Naming Conflict +**What goes wrong:** Phase 2 creates `013_create_processing_events_table.sql` but another developer or future migration already used `013`. The migrator runs both or skips one. +**Why it happens:** Not verifying the highest current migration number. +**How to avoid:** Current highest is `012_create_monitoring_tables.sql` (created in Phase 1). Next migration MUST be `013_`. Confirmed safe. +**Warning signs:** Migration run shows "already applied" for `013_` without the table existing. + +### Pitfall E: Probe Errors Swallowed Silently +**What goes wrong:** A probe throws an uncaught exception. The `runHealthProbes` Cloud Function catches it at the top level and does nothing. No health check record is written. The admin dashboard shows no data. +**Why it happens:** Each individual probe can fail independently — if one throws, the others should still run. +**How to avoid:** Wrap each probe call in `try/catch` inside `healthProbeService.runAllProbes()`. A probe error should create a `status: 'down'` result with the error in `error_message`, then persist that to Supabase. The probe orchestrator must never throw; it must always complete all probes. +**Warning signs:** One service's health checks stop appearing in Supabase while others continue. + +### Pitfall F: `deleteOlderThan` Without Batching on Large Tables +**What goes wrong:** An unbatched `DELETE WHERE created_at < cutoff` could time out on a very large table. At Phase 2 scale it will not: after 30 days of probes running every 5 minutes, `service_health_checks` holds ~8,640 rows per probed service (288 runs/day × 30 days), roughly 35,000 rows across the four services. Even at 6 months (~50k rows per service) the table remains manageable with the `created_at` index. No batching needed at Phase 2 scale. +**Why it happens:** Concern about DB timeout on large deletes. +**How to avoid:** Index on `created_at` (exists from Phase 1 migration 012) makes the DELETE efficient. For Phase 2 scale, a single `DELETE` is correct. 
Only consider batching if the table grows to millions of rows. +**Warning signs:** N/A at Phase 2 scale. Log `deletedCount` for visibility. + +--- + +## Code Examples + +Verified patterns from codebase and official Node.js/Firebase docs: + +### Adding a Second Cloud Function Export (index.ts) +```typescript +// Source: Existing processDocumentJobs pattern in backend/src/index.ts (verified) +// New export follows the same onSchedule structure: +import { onSchedule } from 'firebase-functions/v2/scheduler'; + +export const runHealthProbes = onSchedule({ + schedule: 'every 5 minutes', + timeoutSeconds: 60, + memory: '256MiB', + retryCount: 0, // Probes should not retry — they run again in 5 minutes anyway + secrets: [anthropicApiKey, openaiApiKey, databaseUrl, supabaseServiceKey, supabaseAnonKey], +}, async (_event) => { + // Dynamic import (same pattern as processDocumentJobs) + const { healthProbeService } = await import('./services/healthProbeService'); + await healthProbeService.runAllProbes(); +}); +``` + +### HealthCheckModel.create() — Already Available (Phase 1) +```typescript +// Source: backend/src/models/HealthCheckModel.ts (verified, Phase 1) +await HealthCheckModel.create({ + service_name: 'document_ai', + status: 'healthy', + latency_ms: 234, + probe_details: { processor_count: 1 }, +}); +``` + +### AlertEventModel.findRecentByService() — Already Available (Phase 1) +```typescript +// Source: backend/src/models/AlertEventModel.ts (verified, Phase 1) +const recent = await AlertEventModel.findRecentByService( + 'document_ai', // service name + 'service_down', // alert type + 60 // within last 60 minutes +); +if (recent) { + // suppress — cooldown active +} +``` + +### Nodemailer SMTP — Using Existing Firebase Config +```typescript +// Source: Firebase defineString/defineSecret pattern verified in index.ts lines 220-225 +// process.env.EMAIL_HOST, EMAIL_USER, EMAIL_PASS, EMAIL_PORT, EMAIL_SECURE all available +import nodemailer from 'nodemailer'; + +async 
function sendEmail(to: string, subject: string, html: string): Promise<void> { + // Transporter created INSIDE function call (not at module level) — Firebase Secret timing + const transporter = nodemailer.createTransport({ + host: process.env.EMAIL_HOST ?? 'smtp.gmail.com', + port: parseInt(process.env.EMAIL_PORT ?? '587', 10), + secure: process.env.EMAIL_SECURE === 'true', + auth: { + user: process.env.EMAIL_USER, + pass: process.env.EMAIL_PASS, + }, + }); + await transporter.sendMail({ from: process.env.EMAIL_FROM, to, subject, html }); +} +``` + +### Fire-and-Forget Write Pattern +```typescript +// Source: PITFALL-6 prevention — void prevents awaiting +// This is the ONLY correct way to write fire-and-forget to Supabase + +// CORRECT — non-blocking: +void analyticsService.recordProcessingEvent({ document_id, user_id, event_type: 'completed', duration_ms }); + +// WRONG — blocks processing pipeline: +await analyticsService.recordProcessingEvent(...); // DO NOT DO THIS + +// ALSO WRONG — return type must be void, not Promise<void>: +async function recordProcessingEvent(...): Promise<void> { ... 
} // enables accidental await +``` + +--- + +## State of the Art + +| Old Approach | Current Approach | When Changed | Impact | +|--------------|------------------|--------------|--------| +| `uploadMonitoringService.ts` in-memory event store | Persistent `document_processing_events` Supabase table | Phase 2 introduces this | Analytics survives cold starts; 30-day history available | +| Configuration-only health check (`/monitoring/diagnostics`) | Live API call probers (`healthProbeService.ts`) | Phase 2 introduces this | Actually detects downed/revoked credentials | +| No email alerting | SMTP email via `nodemailer` + Firebase SMTP config | Phase 2 introduces this | Admin notified of outages | +| No scheduled probe function | `runHealthProbes` Cloud Function export | Phase 2 introduces this | Probes run independently of document processing | + +**Existing but unused:** The `performance_metrics` table (migration 010) is scoped to agentic RAG sessions (has a FK to `agentic_rag_sessions`). It is NOT suitable for general document processing analytics — use the new `document_processing_events` table instead. + +--- + +## Open Questions + +1. **Probe frequency for LLM (HLTH-03)** + - What we know: 5-minute probe interval is specified for `runHealthProbes`. An Anthropic probe every 5 minutes at min tokens costs ~$0.001/call × 288 = $0.29/day. Acceptable. + - What's unclear: Whether to probe BOTH Anthropic and OpenAI each run (depends on active provider) or always probe both. + - Recommendation: Probe the active LLM provider (from `process.env.LLM_PROVIDER`) plus always probe Supabase and Document AI. Probing inactive providers is useful for failover readiness but not required by HLTH-02. + +2. **Alert recipient variable name: `EMAIL_WEEKLY_RECIPIENT` vs `ALERT_RECIPIENT`** + - What we know: `EMAIL_WEEKLY_RECIPIENT` is already defined as a Firebase `defineString` in `index.ts`. It has the correct default value. 
+ - What's unclear: The name implies "weekly" which is misleading for health alerts. Should this be a separate `ALERT_RECIPIENT` env var? + - Recommendation: Reuse `EMAIL_WEEKLY_RECIPIENT` for alert recipient to avoid adding another Firebase param. Document that it's dual-purpose. If a separate `ALERT_RECIPIENT` is desired, add it as a new `defineString` in `index.ts` alongside the existing one. + +3. **`runHealthProbes` secrets list** + - What we know: `defineSecret()` values must be listed in each function's `secrets:` array to be available in `process.env` during that function's execution. + - What's unclear: The LLM probe needs `ANTHROPIC_API_KEY` or `OPENAI_API_KEY` depending on config. The Supabase probe needs `DATABASE_URL`, `SUPABASE_SERVICE_KEY`, `SUPABASE_ANON_KEY`. + - Recommendation: Include all potentially-needed secrets: `anthropicApiKey`, `openaiApiKey`, `databaseUrl`, `supabaseServiceKey`, `supabaseAnonKey`. Unused secrets don't cause issues; missing ones cause failures. + +4. **Should `runRetentionCleanup` also delete from `performance_metrics` / `session_events`?** + - What we know: `performance_metrics` (migration 010) tracks agentic RAG sessions. It has no 30-day retention requirement specified. + - What's unclear: INFR-03 says "30-day rolling data retention cleanup" — does this apply only to monitoring tables or all analytics tables? + - Recommendation: Phase 2 only manages tables introduced in the monitoring feature: `service_health_checks`, `alert_events`, `document_processing_events`. Leave `performance_metrics`, `session_events`, `execution_events` out of scope — INFR-03 is monitoring-specific. + +--- + +## Validation Architecture + +*(Nyquist validation not configured — `workflow.nyquist_validation` not present in `.planning/config.json`. 
This section is omitted.)* + +--- + +## Sources + +### Primary (HIGH confidence) +- `backend/src/models/HealthCheckModel.ts` — Verified: `create()`, `findLatestByService()`, `findAll()`, `deleteOlderThan()` signatures and behavior +- `backend/src/models/AlertEventModel.ts` — Verified: `create()`, `findActive()`, `findRecentByService()`, `deleteOlderThan()` signatures; deduplication method ready for Phase 2 +- `backend/src/index.ts` lines 208-265 — Verified: `defineSecret('EMAIL_PASS')`, `defineString('EMAIL_HOST')`, `defineString('EMAIL_USER')`, `defineString('EMAIL_PORT')`, `defineString('EMAIL_SECURE')`, `defineString('EMAIL_WEEKLY_RECIPIENT')` all already defined; `onSchedule` export pattern confirmed from `processDocumentJobs` +- `backend/src/models/migrations/012_create_monitoring_tables.sql` — Verified: migration 012 exists and is the current highest; next migration is 013 +- `backend/src/services/jobProcessorService.ts` lines 329-390 — Verified: `processingTime` and `status` tracked at end of each job; correct hook points for analytics instrumentation +- `backend/src/services/uploadMonitoringService.ts` — Verified: in-memory only, loses data on cold start (PITFALL-1 confirmed) +- `backend/package.json` — Verified: `nodemailer` is NOT installed; must be added +- `backend/vitest.config.ts` — Verified: test glob includes `src/__tests__/**/*.{test,spec}.{ts,js}`; timeout 30s +- `.planning/research/PITFALLS.md` — Verified: PITFALL-1 through PITFALL-10 all considered in this research +- `.planning/research/STACK.md` — Verified: Email decision (Nodemailer fallback), node-cron vs Firebase Cloud Scheduler + +### Secondary (MEDIUM confidence) +- `nodemailer` SMTP pattern: Standard Node.js email library; `createTransport` + `sendMail` API is stable and well-documented. Confidence HIGH from training data; verified against package docs as of August 2025. 
+- Firebase `defineSecret()` runtime injection timing: Firebase Secrets are injected at function invocation time, not module load time — confirmed behavior from Firebase Functions v2 documentation patterns. Verified via the `secrets:` array requirement in `onSchedule` config. + +### Tertiary (LOW confidence) +- Specific LLM probe cost calculation: Estimated from Anthropic public pricing as of training data. Actual cost may vary — verify with Anthropic API pricing page before deploying. + +--- + +## Metadata + +**Confidence breakdown:** +- Standard stack: HIGH — all libraries verified in `package.json`; only `nodemailer` is new +- Architecture: HIGH — Cloud Function export pattern verified from existing `processDocumentJobs`; model methods verified from Phase 1 source +- Pitfalls: HIGH — PITFALL-1 through PITFALL-10 verified against codebase; Firebase Secret timing is documented Firebase behavior + +**Research date:** 2026-02-24 +**Valid until:** 2026-03-25 (30 days — Firebase Functions v2 and Supabase patterns are stable)
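+
+---
+
+## Appendix: Alert Cooldown Check (Illustrative)
+
+The ALRT-02 deduplication described above reduces to a single timestamp comparison once `AlertEventModel.findRecentByService()` supplies the most recent alert for a service. The sketch below is illustrative only: `shouldSendAlert`, its parameters, and the 60-minute window are assumptions for planning purposes, not the Phase 1 API.

```typescript
// Illustrative cooldown gate for alert deduplication (ALRT-02).
// Assumption: `lastAlertAt` would come from AlertEventModel.findRecentByService();
// the helper name and the 60-minute window are hypothetical, not Phase 1 API.

const COOLDOWN_MINUTES = 60; // assumed cooldown; the actual value is a Phase 2 decision

// Fire a new alert only when no alert for this service exists inside the window.
function shouldSendAlert(lastAlertAt: Date | null, now: Date = new Date()): boolean {
  if (lastAlertAt === null) return true; // no prior alert on record
  const elapsedMinutes = (now.getTime() - lastAlertAt.getTime()) / 60_000;
  return elapsedMinutes >= COOLDOWN_MINUTES;
}
```

+Under this gate a continuously failing service emits at most one email per cooldown window, which is the behavior ALRT-02 asks for.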