Phase 2: Backend Services - Research
Researched: 2026-02-24
Domain: Firebase Cloud Functions scheduling, health probes, email alerting (Nodemailer/SMTP), fire-and-forget analytics, alert deduplication, 30-day data retention
Confidence: HIGH
<phase_requirements>
Phase Requirements
| ID | Description | Research Support |
|---|---|---|
| HLTH-02 | Each health probe makes a real authenticated API call, not just config checks | Verified: existing /monitoring/diagnostics only checks initialization, not live connectivity; each probe must make a real call (Document AI list processors, Anthropic minimal message, Supabase SELECT 1, Firebase Auth verify-token attempt) |
| HLTH-03 | Health probes run on a scheduled interval, separate from document processing | Verified: processDocumentJobs export pattern in index.ts shows how to add a second named Cloud Function export; onSchedule from firebase-functions/v2/scheduler is the correct mechanism; PITFALL-2 mandates decoupling |
| HLTH-04 | Health probe results persist to Supabase and survive cold starts | Verified: HealthCheckModel.create() exists from Phase 1 with correct insert signature; service_health_checks table exists via migration 012; cold-start survival is automatic once persisted |
| ALRT-01 | Admin receives email alert when a service goes down or degrades | Verified: SMTP config already defined in index.ts (emailHost, emailUser, emailPass, emailPort, emailSecure); nodemailer is the correct library (no other email SDK installed; SMTP credentials are pre-configured); nodemailer is NOT yet in package.json — must be installed |
| ALRT-02 | Alert deduplication prevents repeat emails for the same ongoing issue (cooldown period) | Verified: AlertEventModel.findRecentByService() from Phase 1 exists and accepts withinMinutes — built exactly for this use case; check it before firing email and before creating new alert_events row |
| ALRT-04 | Alert recipient stored as configuration, not hardcoded | Verified: EMAIL_WEEKLY_RECIPIENT defineString already exists in index.ts with default jpressnell@bluepointcapital.com; alert service must read process.env.EMAIL_WEEKLY_RECIPIENT (or process.env.ALERT_RECIPIENT) — do NOT hardcode the string in service source |
| ANLY-01 | Document processing events persist to Supabase at write time (not in-memory only) | Verified: uploadMonitoringService.ts is in-memory only (confirmed PITFALL-1); a document_processing_events table is NOT yet in any migration — Phase 2 must add migration 013 for it; jobProcessorService.ts has instrumentation hooks (lines 329-390) to attach fire-and-forget writes |
| ANLY-03 | Analytics instrumentation is non-blocking (fire-and-forget, never delays processing pipeline) | Verified: PITFALL-6 documents the 14-min timeout risk; pattern is void supabase.from(...).insert(...) — no await; existing jobProcessorService.ts processes in ~10 minutes, so blocking even 200ms per checkpoint is risky |
| INFR-03 | 30-day rolling data retention cleanup runs on schedule | Verified: HealthCheckModel.deleteOlderThan(30) and AlertEventModel.deleteOlderThan(30) exist from Phase 1; a third call for document_processing_events needs to be added; must be a separate named Cloud Function export (PITFALL-7: separate from processDocumentJobs) |
</phase_requirements>
Summary
Phase 2 is a service-implementation phase. All database infrastructure (tables, models) was built in Phase 1. This phase builds three new service modules (healthProbeService, alertService, analyticsService) and two new Firebase Cloud Function exports. The work falls into four groups:
Group 1 — Health Probes (healthProbeService.ts): Four probers (Document AI, Anthropic/OpenAI LLM, Supabase, Firebase Auth) each making a real authenticated API call using the already-configured credentials. Results are written to Supabase via HealthCheckModel.create(). PITFALL-5 is the key risk: existing diagnostics only check initialization — new probes must make live API calls.
Group 2 — Alert Service (alertService.ts): Reads health probe results, checks if an alert already exists within cooldown using AlertEventModel.findRecentByService(), creates an alert_events row if not, and sends email via nodemailer (SMTP credentials already defined as Firebase defineString/defineSecret). Alert recipient read from process.env.EMAIL_WEEKLY_RECIPIENT (or a new ALERT_RECIPIENT env var).
Group 3 — Analytics Collector (analyticsService.ts): A recordProcessingEvent() function that writes to a new document_processing_events Supabase table using fire-and-forget (void not await). Requires migration 013. The jobProcessorService.ts already has the right instrumentation points (lines 329-390 track processingTime and status).
Group 4 — Schedulers (new Cloud Function exports in index.ts): runHealthProbes (every 5 minutes, separate export) and runRetentionCleanup (weekly, separate export). Both must be completely decoupled from processDocumentJobs.
Primary recommendation: Install nodemailer + @types/nodemailer first. Build services in dependency order: analytics migration → analyticsService → healthProbeService → alertService → schedulers.
Standard Stack
Core
| Library | Version | Purpose | Why Standard |
|---|---|---|---|
| nodemailer | ^6.9.x | SMTP email sending | SMTP config already pre-wired in index.ts (emailHost, emailUser, emailPass, emailPort, emailSecure via defineString/defineSecret); no other email library installed; Nodemailer is the standard Node.js SMTP library |
| @supabase/supabase-js | Already installed (2.53.0) | Writing health checks and analytics to Supabase | Already the only DB client; HealthCheckModel and AlertEventModel from Phase 1 wrap all writes |
| firebase-admin | Already installed (13.4.0) | Firebase Auth probe (verify-token attempt) | Already initialized via config/firebase.ts |
| firebase-functions | Already installed (7.0.5) | onSchedule v2 for scheduled Cloud Function exports | Existing processDocumentJobs uses the exact same pattern |
| @google-cloud/documentai | Already installed (9.3.0) | Document AI health probe (list processors call) | Already initialized in documentAiProcessor.ts |
| @anthropic-ai/sdk | Already installed (0.57.0) | LLM health probe (minimal token message) | Already initialized in llmService.ts |
| openai | Already installed (5.10.2) | OpenAI health probe fallback | Available when LLM_PROVIDER=openai |
| pg | Already installed (8.11.3) | Supabase health probe (direct SELECT 1 query) | Direct pool already available via getPostgresPool() in config/supabase.ts |
| winston | Already installed (3.11.0) | All service logging | Project-wide convention; NEVER console.log |
New Packages Required
| Library | Version | Purpose | Installation |
|---|---|---|---|
| nodemailer | ^6.9.x | SMTP email transport | npm install nodemailer |
| @types/nodemailer | ^6.4.x | TypeScript types | npm install --save-dev @types/nodemailer |
Alternatives Considered
| Instead of | Could Use | Tradeoff |
|---|---|---|
| nodemailer (SMTP) | Resend SDK | Resend is the STACK.md recommendation for new setups, but Gmail SMTP credentials are already fully configured in index.ts (EMAIL_HOST, EMAIL_USER, EMAIL_PASS, etc.); switching to Resend requires new DNS records and a new API key, all of which Nodemailer avoids |
| Separate onSchedule export | node-cron inside existing function | PITFALL-2: probe scheduling inside processDocumentJobs creates availability coupling; Firebase Cloud Scheduler + separate export is the correct architecture |
| getPostgresPool() for Supabase health probe | Supabase PostgREST client | Direct PostgreSQL SELECT 1 is a better health signal than PostgREST (tests TCP+auth rather than the REST layer); getPostgresPool() already exists for this purpose |
Installation:
```bash
cd backend
npm install nodemailer
npm install --save-dev @types/nodemailer
```
Architecture Patterns
Recommended Project Structure
New files slot into existing service layer:
```
backend/src/
├── models/
│   └── migrations/
│       └── 013_create_processing_events_table.sql   # NEW — analytics events table
├── services/
│   ├── healthProbeService.ts    # NEW — probe orchestrator + individual probers
│   ├── alertService.ts          # NEW — deduplication + email + alert_events writer
│   └── analyticsService.ts      # NEW — fire-and-forget event writer
├── index.ts                     # UPDATE — add runHealthProbes + runRetentionCleanup exports
└── __tests__/
    ├── models/                  # (Phase 1 tests already here)
    └── unit/
        ├── healthProbeService.test.ts   # NEW
        ├── alertService.test.ts         # NEW
        └── analyticsService.test.ts     # NEW
```
Pattern 1: Real Health Probe (HLTH-02)
What: Each probe makes a real authenticated API call. Returns a structured ProbeResult with status, latency_ms, and error_message. Probe then calls HealthCheckModel.create() to persist.
Key insight: The probe itself has no alert logic — that lives in alertService.ts. The probe only measures and records.
```typescript
// Source: derived from existing documentAiProcessor.ts and llmService.ts patterns
import { DocumentProcessorServiceClient } from '@google-cloud/documentai';
import Anthropic from '@anthropic-ai/sdk';
import * as admin from 'firebase-admin';
import { getPostgresPool } from '../config/supabase';

interface ProbeResult {
  service_name: string;
  status: 'healthy' | 'degraded' | 'down';
  latency_ms: number;
  error_message?: string;
  probe_details?: Record<string, unknown>;
}

// Document AI probe — list processors is a cheap read that tests auth + API availability
async function probeDocumentAI(): Promise<ProbeResult> {
  const start = Date.now();
  try {
    // Project id source is an assumption — reuse however documentAiProcessor.ts resolves it
    const projectId = process.env.GOOGLE_CLOUD_PROJECT;
    const client = new DocumentProcessorServiceClient();
    await client.listProcessors({ parent: `projects/${projectId}/locations/us` });
    const latency_ms = Date.now() - start;
    return { service_name: 'document_ai', status: latency_ms > 2000 ? 'degraded' : 'healthy', latency_ms };
  } catch (err) {
    return {
      service_name: 'document_ai',
      status: 'down',
      latency_ms: Date.now() - start,
      error_message: err instanceof Error ? err.message : String(err),
    };
  }
}

// Supabase probe — direct PostgreSQL SELECT 1 via existing pg pool
async function probeSupabase(): Promise<ProbeResult> {
  const start = Date.now();
  try {
    const pool = getPostgresPool();
    await pool.query('SELECT 1');
    const latency_ms = Date.now() - start;
    return { service_name: 'supabase', status: latency_ms > 2000 ? 'degraded' : 'healthy', latency_ms };
  } catch (err) {
    return {
      service_name: 'supabase',
      status: 'down',
      latency_ms: Date.now() - start,
      error_message: err instanceof Error ? err.message : String(err),
    };
  }
}

// LLM probe — minimal API call (1-word message) to verify key validity
async function probeLLM(): Promise<ProbeResult> {
  const start = Date.now();
  try {
    // Use whichever provider is configured
    const client = new Anthropic({ apiKey: process.env.ANTHROPIC_API_KEY });
    await client.messages.create({
      model: 'claude-haiku-4-5',
      max_tokens: 5,
      messages: [{ role: 'user', content: 'Hi' }],
    });
    const latency_ms = Date.now() - start;
    return { service_name: 'llm_api', status: latency_ms > 5000 ? 'degraded' : 'healthy', latency_ms };
  } catch (err) {
    // 429 = degraded (rate limit), not 'down'
    const is429 = err instanceof Error && err.message.includes('429');
    return {
      service_name: 'llm_api',
      status: is429 ? 'degraded' : 'down',
      latency_ms: Date.now() - start,
      error_message: err instanceof Error ? err.message : String(err),
    };
  }
}

// Firebase Auth probe — verify a known-invalid token; expect auth/argument-error, not a network error
async function probeFirebaseAuth(): Promise<ProbeResult> {
  const start = Date.now();
  try {
    await admin.auth().verifyIdToken('invalid-token-probe-check');
    // Should never reach here — always throws
    return { service_name: 'firebase_auth', status: 'healthy', latency_ms: Date.now() - start };
  } catch (err) {
    const latency_ms = Date.now() - start;
    const errMsg = err instanceof Error ? err.message : String(err);
    // 'argument-error' or 'auth/argument-error' = SDK is alive and Auth is reachable
    const isExpectedError = errMsg.includes('argument') || errMsg.includes('INVALID');
    return {
      service_name: 'firebase_auth',
      status: isExpectedError ? 'healthy' : 'down',
      latency_ms,
      error_message: isExpectedError ? undefined : errMsg,
    };
  }
}
```
Pattern 2: Alert Deduplication (ALRT-02)
What: Before sending an email, check AlertEventModel.findRecentByService() for a matching alert within the cooldown window. If found, suppress. Uses alert_events table (already exists from Phase 1).
Important: Deduplication check must happen before BOTH the alert_events row creation AND the email send — otherwise a suppressed email still creates a duplicate row.
```typescript
// Source: AlertEventModel.findRecentByService() from Phase 1 (verified)
// Cooldown: 60 minutes (configurable via env var ALERT_COOLDOWN_MINUTES)
import { AlertEventModel } from '../models/AlertEventModel';
import { logger } from '../utils/logger';

const ALERT_COOLDOWN_MINUTES = parseInt(process.env.ALERT_COOLDOWN_MINUTES ?? '60', 10);

async function maybeSendAlert(
  serviceName: string,
  alertType: 'service_down' | 'service_degraded',
  message: string
): Promise<void> {
  // 1. Check deduplication window
  const existing = await AlertEventModel.findRecentByService(
    serviceName,
    alertType,
    ALERT_COOLDOWN_MINUTES
  );
  if (existing) {
    logger.info('alertService: suppressing duplicate alert within cooldown', {
      serviceName, alertType, existingAlertId: existing.id, cooldownMinutes: ALERT_COOLDOWN_MINUTES,
    });
    return; // suppress: both row creation AND email
  }
  // 2. Create alert_events row
  await AlertEventModel.create({ service_name: serviceName, alert_type: alertType, message });
  // 3. Send email
  await sendAlertEmail(serviceName, alertType, message);
}
```
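The glue between probe results and `maybeSendAlert` is not specified anywhere above. A minimal sketch of that evaluator — the function name `evaluateProbeResult` and the message format are assumptions for illustration, not existing project code:

```typescript
// Hypothetical glue between Pattern 1 and Pattern 2: map a probe result to an
// alert type and delegate to the dedup-aware sender. Healthy results never alert.
interface ProbeResult {
  service_name: string;
  status: 'healthy' | 'degraded' | 'down';
  latency_ms: number;
  error_message?: string;
}

type AlertType = 'service_down' | 'service_degraded';

async function evaluateProbeResult(
  result: ProbeResult,
  sendAlert: (service: string, type: AlertType, message: string) => Promise<void>
): Promise<void> {
  if (result.status === 'healthy') return; // no alert for healthy probes
  const alertType: AlertType = result.status === 'down' ? 'service_down' : 'service_degraded';
  const detail = result.error_message ? `: ${result.error_message}` : '';
  await sendAlert(
    result.service_name,
    alertType,
    `${result.service_name} is ${result.status} (latency ${result.latency_ms}ms)${detail}`
  );
}
```

In `healthProbeService.runAllProbes()`, each persisted result would be passed through this evaluator with `maybeSendAlert` injected as `sendAlert`, keeping the probe code free of alert logic.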
Pattern 3: Email via SMTP (Nodemailer + existing Firebase config) (ALRT-01, ALRT-04)
What: Nodemailer transporter created using Firebase defineString/defineSecret values already in index.ts. Alert recipient from process.env.EMAIL_WEEKLY_RECIPIENT (non-hardcoded, satisfies ALRT-04).
Key insight: The SMTP credentials (EMAIL_HOST, EMAIL_USER, EMAIL_PASS, EMAIL_PORT, EMAIL_SECURE) are already defined as Firebase params in index.ts. The service reads them from process.env — Firebase makes defineString values available there automatically.
```typescript
// Source: Firebase Functions v2 defineString/defineSecret pattern — verified in index.ts
import nodemailer from 'nodemailer';
import { logger } from '../utils/logger';

function createTransporter() {
  return nodemailer.createTransport({
    host: process.env.EMAIL_HOST ?? 'smtp.gmail.com',
    port: parseInt(process.env.EMAIL_PORT ?? '587', 10),
    secure: process.env.EMAIL_SECURE === 'true',
    auth: {
      user: process.env.EMAIL_USER,
      pass: process.env.EMAIL_PASS, // Firebase Secret — available in process.env
    },
  });
}

async function sendAlertEmail(serviceName: string, alertType: string, message: string): Promise<void> {
  const recipient = process.env.EMAIL_WEEKLY_RECIPIENT; // ALRT-04: read from config, not hardcoded
  if (!recipient) {
    logger.warn('alertService.sendAlertEmail: no recipient configured, skipping email', { serviceName });
    return;
  }
  const transporter = createTransporter();
  try {
    await transporter.sendMail({
      from: process.env.EMAIL_FROM ?? process.env.EMAIL_USER,
      to: recipient,
      subject: `[CIM Summary] Alert: ${serviceName} — ${alertType}`,
      text: message,
      html: `<p><strong>${serviceName}</strong>: ${message}</p>`,
    });
    logger.info('alertService.sendAlertEmail: sent', { serviceName, alertType, recipient });
  } catch (err) {
    logger.error('alertService.sendAlertEmail: failed', {
      error: err instanceof Error ? err.message : String(err),
      serviceName, alertType,
    });
    // Do NOT re-throw — email failure should not break the probe run
  }
}
```
Pattern 4: Fire-and-Forget Analytics (ANLY-03)
What: analyticsService.recordProcessingEvent() uses void (no await) so the Supabase write is completely detached from the processing pipeline. The function signature returns void to make it impossible to accidentally await it.
Critical rule: The function MUST be called with void or not awaited anywhere it's used. TypeScript enforcing void return type ensures this.
```typescript
// Source: PITFALL-6 pattern — fire-and-forget is mandatory
import { getSupabaseServiceClient } from '../config/supabase'; // export location is an assumption
import { logger } from '../utils/logger';

export interface ProcessingEventData {
  document_id: string;
  user_id: string;
  event_type: 'upload_started' | 'processing_started' | 'completed' | 'failed';
  duration_ms?: number;
  error_message?: string;
  stage?: string;
}

// Return type is void (not Promise<void>) — cannot be awaited
export function recordProcessingEvent(data: ProcessingEventData): void {
  const supabase = getSupabaseServiceClient();
  void supabase
    .from('document_processing_events')
    .insert({
      document_id: data.document_id,
      user_id: data.user_id,
      event_type: data.event_type,
      duration_ms: data.duration_ms ?? null,
      error_message: data.error_message ?? null,
      stage: data.stage ?? null,
      created_at: new Date().toISOString(),
    })
    .then(({ error }) => {
      if (error) {
        // Never throw — log only (analytics failure must not affect processing)
        logger.error('analyticsService.recordProcessingEvent: write failed', {
          error: error.message, data,
        });
      }
    });
}
```
Pattern 5: Scheduled Cloud Function Export (HLTH-03, INFR-03)
What: Two new onSchedule exports added to index.ts. Each is a separate named export, completely decoupled from processDocumentJobs.
Important: New exports must include the same secrets array as processDocumentJobs (all needed Firebase Secrets must be explicitly listed). defineString values are auto-available but defineSecret values require explicit listing.
```typescript
// Source: Existing processDocumentJobs pattern in index.ts (verified)
// Add AFTER processDocumentJobs export

// Health probe scheduler — separate from document processing (PITFALL-2)
export const runHealthProbes = onSchedule({
  schedule: 'every 5 minutes',
  timeoutSeconds: 60,
  memory: '256MiB',
  secrets: [
    anthropicApiKey,    // for LLM probe
    openaiApiKey,       // for OpenAI probe fallback
    databaseUrl,        // for Supabase probe
    supabaseServiceKey,
    supabaseAnonKey,
  ],
}, async (_event) => {
  const { healthProbeService } = await import('./services/healthProbeService');
  await healthProbeService.runAllProbes();
});

// Retention cleanup — weekly (PITFALL-7: separate from document processing scheduler)
export const runRetentionCleanup = onSchedule({
  schedule: 'every monday 02:00',
  timeoutSeconds: 120,
  memory: '256MiB',
  secrets: [databaseUrl, supabaseServiceKey, supabaseAnonKey],
}, async (_event) => {
  const { HealthCheckModel } = await import('./models/HealthCheckModel');
  const { AlertEventModel } = await import('./models/AlertEventModel');
  const { analyticsService } = await import('./services/analyticsService');
  const [hcCount, alertCount, eventCount] = await Promise.all([
    HealthCheckModel.deleteOlderThan(30),
    AlertEventModel.deleteOlderThan(30),
    analyticsService.deleteProcessingEventsOlderThan(30),
  ]);
  logger.info('runRetentionCleanup: complete', { hcCount, alertCount, eventCount });
});
```
Pattern 6: Analytics Migration (ANLY-01)
What: Migration 013_create_processing_events_table.sql adds the document_processing_events table. Follows the migration 012 pattern exactly.
```sql
-- Source: backend/src/models/migrations/012_create_monitoring_tables.sql (verified pattern)
CREATE TABLE IF NOT EXISTS document_processing_events (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  document_id UUID NOT NULL,
  user_id UUID NOT NULL,
  event_type TEXT NOT NULL CHECK (event_type IN ('upload_started', 'processing_started', 'completed', 'failed')),
  duration_ms INTEGER,
  error_message TEXT,
  stage TEXT,
  created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP
);

CREATE INDEX IF NOT EXISTS idx_document_processing_events_created_at
  ON document_processing_events(created_at);
CREATE INDEX IF NOT EXISTS idx_document_processing_events_document_id
  ON document_processing_events(document_id);

ALTER TABLE document_processing_events ENABLE ROW LEVEL SECURITY;
```
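Pattern 5's retention scheduler calls `analyticsService.deleteProcessingEventsOlderThan(30)`, which Phase 2 still has to write. A sketch under assumptions: the helper name comes from Pattern 5, but the `DeleteClient` seam is invented here so the cutoff math is testable without a live Supabase connection; the real implementation would run `DELETE FROM document_processing_events WHERE created_at < cutoff` through the existing client.

```typescript
// Hypothetical seam standing in for the Supabase client in this sketch.
export interface DeleteClient {
  deleteEventsBefore(cutoffIso: string): Promise<number>; // resolves to deleted row count
}

// Pure cutoff computation — `days` back from `now`, as an ISO timestamp.
export function retentionCutoffIso(days: number, now: Date = new Date()): string {
  return new Date(now.getTime() - days * 24 * 60 * 60 * 1000).toISOString();
}

// Assumed-name retention helper used by runRetentionCleanup (Pattern 5).
export async function deleteProcessingEventsOlderThan(
  days: number,
  client: DeleteClient
): Promise<number> {
  return client.deleteEventsBefore(retentionCutoffIso(days));
}
```

Returning the deleted-row count lets `runRetentionCleanup` log `eventCount` alongside the Phase 1 model counts.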
Anti-Patterns to Avoid
- Probing config existence instead of live connectivity (PITFALL-5): Any check of `if (process.env.ANTHROPIC_API_KEY)` is not a health probe. Must make a real API call.
- Awaiting analytics writes (PITFALL-6): `await analyticsService.recordProcessingEvent(...)` will block the processing pipeline. Must use `void analyticsService.recordProcessingEvent(...)`, or the function must not return a Promise.
- Piggybacking health probes on `processDocumentJobs` (PITFALL-2): Health probes mixed into the document processing function create availability coupling. Must be a separate `onSchedule` export.
- Hardcoding alert recipient (PITFALL-8): Never write `to: 'jpressnell@bluepointcapital.com'` in source. Always `process.env.EMAIL_WEEKLY_RECIPIENT`.
- Alert storms (PITFALL-3): Sending email on every failed probe run is a mistake. Must check `AlertEventModel.findRecentByService()` with the cooldown window before every send.
- Creating the nodemailer transporter at module level: The Firebase Secret `EMAIL_PASS` is only available inside a Cloud Function invocation (it's injected at runtime). Create the transporter inside each email call or on first use inside a function execution, not at module initialization time.
Don't Hand-Roll
| Problem | Don't Build | Use Instead | Why |
|---|---|---|---|
| Alert deduplication state | Custom in-memory `Map<service, lastAlertTime>` | `AlertEventModel.findRecentByService()` (already exists, Phase 1) | In-memory state resets on cold start (PITFALL-1); DB-backed deduplication survives restarts |
| SMTP transport | Custom HTTP calls to Gmail API | `nodemailer` with existing SMTP config | Gmail API requires OAuth flow; SMTP App Password already configured and working |
| Health check result storage | Custom logging or in-memory | `HealthCheckModel.create()` (already exists, Phase 1) | Already written, tested, and connected to the right table |
| Cron scheduling | `setInterval` inside function body | `onSchedule` Firebase Cloud Scheduler | `setInterval` does not work in serverless (instances spin up/down); Cloud Scheduler is the correct mechanism |
| Alert creation | Direct Supabase insert | `AlertEventModel.create()` (already exists, Phase 1) | Already written with input validation and error handling |
Key insight: Phase 1 built the entire model layer specifically so Phase 2 only has to write service logic. Use every model method; don't bypass them.
Common Pitfalls
Pitfall A: Firebase Secret Unavailable at Module Load Time
What goes wrong: Nodemailer transporter created at module top level with process.env.EMAIL_PASS — at module load time (cold start initialization), the Firebase Secret hasn't been injected yet. EMAIL_PASS is undefined. All email attempts fail.
Why it happens: Firebase Functions v2 defineSecret() values are injected into process.env when the function invocation starts, not when the module is first imported.
How to avoid: Create the nodemailer transporter lazily — inside the function that sends email, not at module level. Alternatively, use a factory function called at send time.
Warning signs: nodemailer error "authentication failed" or "invalid credentials" on first cold start; works on warm invocations.
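A generic lazy-initialization helper makes the fix mechanical. This helper is a sketch, not existing project code; wrapping the transporter factory with it defers reading `process.env.EMAIL_PASS` until a Cloud Function invocation is active:

```typescript
// Memoized-lazy helper: the factory runs exactly once, on first call,
// never at module load time.
function lazy<T>(factory: () => T): () => T {
  let created = false;
  let value!: T;
  return () => {
    if (!created) {
      value = factory();
      created = true;
    }
    return value;
  };
}

// Assumed usage inside alertService.ts:
//   const getTransporter = lazy(() => nodemailer.createTransport({ /* SMTP config */ }));
//   ...later, inside sendAlertEmail: await getTransporter().sendMail({ ... });
```

Either this memoized form or creating a fresh transporter per send (as in Pattern 3) avoids the cold-start secret-timing bug; the memoized form just reuses the SMTP connection pool across warm invocations.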
Pitfall B: LLM Probe Cost
What goes wrong: LLM health probe uses the same model as document processing (e.g., claude-opus-4-1). Running every 5 minutes costs ~$0.01 × 288 calls/day = ~$2.88/day just for probing.
Why it happens: Copy-pasting the model name from llmService.ts.
How to avoid: Use the cheapest available model for probes: claude-haiku-4-5 (Anthropic) or gpt-3.5-turbo (OpenAI). The probe only needs to verify API key validity and reachability — response quality doesn't matter. Set max_tokens: 5.
Warning signs: Anthropic API bill spikes after deploying runHealthProbes.
Pitfall C: Supabase PostgREST vs Direct Postgres for Health Probe
What goes wrong: Using getSupabaseServiceClient() (PostgREST) for the Supabase health probe instead of getPostgresPool(). PostgREST adds an HTTP layer — if the Supabase API is overloaded but the DB is healthy, the probe returns "down" incorrectly.
Why it happens: PostgREST client is the default Supabase client used everywhere else.
How to avoid: Use getPostgresPool().query('SELECT 1') — this tests TCP connectivity to the database directly, which is the true health signal for data persistence operations.
Warning signs: Supabase probe reports "down" while the DB is healthy; health check latency fluctuates widely.
Pitfall D: Analytics Migration Naming Conflict
What goes wrong: Phase 2 creates 013_create_processing_events_table.sql but another developer or future migration already used 013. The migrator runs both or skips one.
Why it happens: Not verifying the highest current migration number.
How to avoid: Current highest is 012_create_monitoring_tables.sql (created in Phase 1). Next migration MUST be 013_. Confirmed safe.
Warning signs: Migration run shows "already applied" for 013_ without the table existing.
Pitfall E: Probe Errors Swallowed Silently
What goes wrong: A probe throws an uncaught exception. The runHealthProbes Cloud Function catches it at the top level and does nothing. No health check record is written. The admin dashboard shows no data.
Why it happens: Each individual probe can fail independently — if one throws, the others should still run.
How to avoid: Wrap each probe call in try/catch inside healthProbeService.runAllProbes(). A probe error should create a status: 'down' result with the error in error_message, then persist that to Supabase. The probe orchestrator must never throw; it must always complete all probes.
Warning signs: One service's health checks stop appearing in Supabase while others continue.
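A throw-proof orchestrator can be sketched with `Promise.allSettled`, which isolates each probe so one rejection never sinks the others. The shape below is an assumption about how `healthProbeService.runAllProbes()` might look (probes and the persist callback are injected for illustration), not verified project code:

```typescript
interface ProbeResult {
  service_name: string;
  status: 'healthy' | 'degraded' | 'down';
  latency_ms: number;
  error_message?: string;
}

type Probe = { name: string; run: () => Promise<ProbeResult> };

async function runAllProbes(
  probes: Probe[],
  persist: (r: ProbeResult) => Promise<void>
): Promise<ProbeResult[]> {
  // allSettled: a probe that throws cannot abort the other probes
  const settled = await Promise.allSettled(probes.map((p) => p.run()));
  const results: ProbeResult[] = settled.map((s, i) =>
    s.status === 'fulfilled'
      ? s.value
      : {
          // A probe that threw (instead of returning a 'down' result) still
          // yields a 'down' row — nothing is silently swallowed
          service_name: probes[i].name,
          status: 'down' as const,
          latency_ms: 0,
          error_message: s.reason instanceof Error ? s.reason.message : String(s.reason),
        }
  );
  // Persist every result; persistence failures are the persist impl's problem
  // to log — never re-thrown here, so the orchestrator itself never throws.
  await Promise.allSettled(results.map(persist));
  return results;
}
```

With this shape, `persist` would wrap `HealthCheckModel.create()` and the Cloud Function body stays a single awaited call.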
Pitfall F: deleteOlderThan Without Batching on Large Tables
What goes wrong: After 30 days of operation with health probes running every 5 minutes, service_health_checks holds ~8,640 rows per probed service (288/day × 30), roughly 35,000 across four probers. A single DELETE WHERE created_at < cutoff is fine at this scale. Even if cleanup lapses for six months, the table stays well under 250k rows — still manageable with the created_at index. No batching needed at Phase 2 scale.
Why it happens: Concern about DB timeout on large deletes.
How to avoid: Index on created_at (exists from Phase 1 migration 012) makes the DELETE efficient. For Phase 2 scale, a single DELETE is correct. Only consider batching if the table grows to millions of rows.
Warning signs: N/A at Phase 2 scale. Log deletedCount for visibility.
Code Examples
Verified patterns from codebase and official Node.js/Firebase docs:
Adding a Second Cloud Function Export (index.ts)
```typescript
// Source: Existing processDocumentJobs pattern in backend/src/index.ts (verified)
// New export follows the same onSchedule structure:
import { onSchedule } from 'firebase-functions/v2/scheduler';

export const runHealthProbes = onSchedule({
  schedule: 'every 5 minutes',
  timeoutSeconds: 60,
  memory: '256MiB',
  retryCount: 0, // Probes should not retry — they run again in 5 minutes anyway
  secrets: [anthropicApiKey, openaiApiKey, databaseUrl, supabaseServiceKey, supabaseAnonKey],
}, async (_event) => {
  // Dynamic import (same pattern as processDocumentJobs)
  const { healthProbeService } = await import('./services/healthProbeService');
  await healthProbeService.runAllProbes();
});
```
HealthCheckModel.create() — Already Available (Phase 1)
```typescript
// Source: backend/src/models/HealthCheckModel.ts (verified, Phase 1)
await HealthCheckModel.create({
  service_name: 'document_ai',
  status: 'healthy',
  latency_ms: 234,
  probe_details: { processor_count: 1 },
});
```
AlertEventModel.findRecentByService() — Already Available (Phase 1)
```typescript
// Source: backend/src/models/AlertEventModel.ts (verified, Phase 1)
const recent = await AlertEventModel.findRecentByService(
  'document_ai',   // service name
  'service_down',  // alert type
  60               // within last 60 minutes
);
if (recent) {
  // suppress — cooldown active
}
```
Nodemailer SMTP — Using Existing Firebase Config
```typescript
// Source: Firebase defineString/defineSecret pattern verified in index.ts lines 220-225
// process.env.EMAIL_HOST, EMAIL_USER, EMAIL_PASS, EMAIL_PORT, EMAIL_SECURE all available
import nodemailer from 'nodemailer';

async function sendEmail(to: string, subject: string, html: string): Promise<void> {
  // Transporter created INSIDE function call (not at module level) — Firebase Secret timing
  const transporter = nodemailer.createTransport({
    host: process.env.EMAIL_HOST ?? 'smtp.gmail.com',
    port: parseInt(process.env.EMAIL_PORT ?? '587', 10),
    secure: process.env.EMAIL_SECURE === 'true',
    auth: {
      user: process.env.EMAIL_USER,
      pass: process.env.EMAIL_PASS,
    },
  });
  await transporter.sendMail({ from: process.env.EMAIL_FROM, to, subject, html });
}
```
Fire-and-Forget Write Pattern
```typescript
// Source: PITFALL-6 prevention — void prevents awaiting
// This is the ONLY correct way to write fire-and-forget to Supabase

// CORRECT — non-blocking:
void analyticsService.recordProcessingEvent({ document_id, user_id, event_type: 'completed', duration_ms });

// WRONG — blocks processing pipeline:
await analyticsService.recordProcessingEvent(...); // DO NOT DO THIS

// ALSO WRONG — return type must be void, not Promise<void>:
async function recordProcessingEvent(...): Promise<void> { ... } // enables accidental await
```
State of the Art
| Old Approach | Current Approach | When Changed | Impact |
|---|---|---|---|
| `uploadMonitoringService.ts` in-memory event store | Persistent `document_processing_events` Supabase table | Phase 2 introduces this | Analytics survives cold starts; 30-day history available |
| Configuration-only health check (`/monitoring/diagnostics`) | Live API call probers (`healthProbeService.ts`) | Phase 2 introduces this | Actually detects downed services and revoked credentials |
| No email alerting | SMTP email via `nodemailer` + Firebase SMTP config | Phase 2 introduces this | Admin notified of outages |
| No scheduled probe function | `runHealthProbes` Cloud Function export | Phase 2 introduces this | Probes run independently of document processing |
Existing but unused: The performance_metrics table (migration 010) is scoped to agentic RAG sessions (has a FK to agentic_rag_sessions). It is NOT suitable for general document processing analytics — use the new document_processing_events table instead.
Open Questions
- Probe frequency for LLM (HLTH-03)
  - What we know: A 5-minute probe interval is specified for `runHealthProbes`. An Anthropic probe every 5 minutes at minimal tokens costs ~$0.001/call × 288 = ~$0.29/day. Acceptable.
  - What's unclear: Whether to probe BOTH Anthropic and OpenAI each run (depends on active provider) or always probe both.
  - Recommendation: Probe the active LLM provider (from `process.env.LLM_PROVIDER`) plus always probe Supabase and Document AI. Probing inactive providers is useful for failover readiness but not required by HLTH-02.
- Alert recipient variable name: `EMAIL_WEEKLY_RECIPIENT` vs `ALERT_RECIPIENT`
  - What we know: `EMAIL_WEEKLY_RECIPIENT` is already defined as a Firebase `defineString` in `index.ts`. It has the correct default value.
  - What's unclear: The name implies "weekly", which is misleading for health alerts. Should this be a separate `ALERT_RECIPIENT` env var?
  - Recommendation: Reuse `EMAIL_WEEKLY_RECIPIENT` for the alert recipient to avoid adding another Firebase param. Document that it's dual-purpose. If a separate `ALERT_RECIPIENT` is desired, add it as a new `defineString` in `index.ts` alongside the existing one.
- `runHealthProbes` secrets list
  - What we know: `defineSecret()` values must be listed in each function's `secrets:` array to be available in `process.env` during that function's execution.
  - What's unclear: The LLM probe needs `ANTHROPIC_API_KEY` or `OPENAI_API_KEY` depending on config. The Supabase probe needs `DATABASE_URL`, `SUPABASE_SERVICE_KEY`, `SUPABASE_ANON_KEY`.
  - Recommendation: Include all potentially needed secrets: `anthropicApiKey`, `openaiApiKey`, `databaseUrl`, `supabaseServiceKey`, `supabaseAnonKey`. Unused secrets don't cause issues; missing ones cause failures.
- Should `runRetentionCleanup` also delete from `performance_metrics` / `session_events`?
  - What we know: `performance_metrics` (migration 010) tracks agentic RAG sessions. It has no 30-day retention requirement specified.
  - What's unclear: INFR-03 says "30-day rolling data retention cleanup" — does this apply only to monitoring tables or all analytics tables?
  - Recommendation: Phase 2 only manages tables introduced in the monitoring feature: `service_health_checks`, `alert_events`, `document_processing_events`. Leave `performance_metrics`, `session_events`, `execution_events` out of scope; INFR-03 is monitoring-specific.
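The active-provider recommendation above reduces to a small selection function. A sketch with the probes injected and the environment passed as a plain record; defaulting to 'anthropic' when LLM_PROVIDER is unset is an assumption about this project's config, not verified behavior:

```typescript
// Selects which LLM probe to run based on LLM_PROVIDER.
type LlmProvider = 'anthropic' | 'openai';
type Env = Record<string, string | undefined>;

function activeLlmProvider(env: Env): LlmProvider {
  // Assumption: anything other than an explicit 'openai' means Anthropic
  return env.LLM_PROVIDER === 'openai' ? 'openai' : 'anthropic';
}

async function probeActiveLlm<R>(
  probes: Record<LlmProvider, () => Promise<R>>,
  env: Env
): Promise<R> {
  return probes[activeLlmProvider(env)]();
}
```

`runAllProbes` would then register `probeActiveLlm` once instead of both provider probes, calling it with `process.env`.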
Validation Architecture
(Nyquist validation not configured — workflow.nyquist_validation not present in .planning/config.json. This section is omitted.)
Sources
Primary (HIGH confidence)
- `backend/src/models/HealthCheckModel.ts` — Verified: `create()`, `findLatestByService()`, `findAll()`, `deleteOlderThan()` signatures and behavior
- `backend/src/models/AlertEventModel.ts` — Verified: `create()`, `findActive()`, `findRecentByService()`, `deleteOlderThan()` signatures; deduplication method ready for Phase 2
- `backend/src/index.ts` lines 208-265 — Verified: `defineSecret('EMAIL_PASS')`, `defineString('EMAIL_HOST')`, `defineString('EMAIL_USER')`, `defineString('EMAIL_PORT')`, `defineString('EMAIL_SECURE')`, `defineString('EMAIL_WEEKLY_RECIPIENT')` all already defined; `onSchedule` export pattern confirmed from `processDocumentJobs`
- `backend/src/models/migrations/012_create_monitoring_tables.sql` — Verified: migration 012 exists and is the current highest; next migration is 013
- `backend/src/services/jobProcessorService.ts` lines 329-390 — Verified: `processingTime` and `status` tracked at the end of each job; correct hook points for analytics instrumentation
- `backend/src/services/uploadMonitoringService.ts` — Verified: in-memory only, loses data on cold start (PITFALL-1 confirmed)
- `backend/package.json` — Verified: `nodemailer` is NOT installed; must be added
- `backend/vitest.config.ts` — Verified: test glob includes `src/__tests__/**/*.{test,spec}.{ts,js}`; timeout 30s
- `.planning/research/PITFALLS.md` — Verified: PITFALL-1 through PITFALL-10 all considered in this research
- `.planning/research/STACK.md` — Verified: email decision (Nodemailer fallback), node-cron vs Firebase Cloud Scheduler
Secondary (MEDIUM confidence)
- `nodemailer` SMTP pattern: Standard Node.js email library; `createTransport` + `sendMail` API is stable and well-documented. Confidence HIGH from training data; verified against package docs as of August 2025.
- Firebase `defineSecret()` runtime injection timing: Firebase Secrets are injected at function invocation time, not module load time — confirmed behavior from Firebase Functions v2 documentation patterns. Verified via the `secrets:` array requirement in `onSchedule` config.
Tertiary (LOW confidence)
- Specific LLM probe cost calculation: Estimated from Anthropic public pricing as of training data. Actual cost may vary — verify with Anthropic API pricing page before deploying.
Metadata
Confidence breakdown:
- Standard stack: HIGH — all libraries verified in `package.json`; only `nodemailer` is new
- Architecture: HIGH — Cloud Function export pattern verified from existing `processDocumentJobs`; model methods verified from Phase 1 source
- Pitfalls: HIGH — PITFALL-1 through PITFALL-10 verified against codebase; Firebase Secret timing is documented Firebase behavior
Research date: 2026-02-24
Valid until: 2026-03-25 (30 days — Firebase Functions v2 and Supabase patterns are stable)