Phase 2: Backend Services - Research
Researched: 2026-02-24
Domain: Firebase Cloud Functions scheduling, health probes, email alerting (Nodemailer/SMTP), fire-and-forget analytics, alert deduplication, 30-day data retention
Confidence: HIGH
<phase_requirements>
Phase Requirements
| ID | Description | Research Support |
|---|---|---|
| HLTH-02 | Each health probe makes a real authenticated API call, not just config checks | Verified: existing /monitoring/diagnostics only checks initialization, not live connectivity; each probe must make a real call (Document AI list processors, Anthropic minimal message, Supabase SELECT 1, Firebase Auth verify-token attempt) |
| HLTH-03 | Health probes run on a scheduled interval, separate from document processing | Verified: processDocumentJobs export pattern in index.ts shows how to add a second named Cloud Function export; onSchedule from firebase-functions/v2/scheduler is the correct mechanism; PITFALL-2 mandates decoupling |
| HLTH-04 | Health probe results persist to Supabase and survive cold starts | Verified: HealthCheckModel.create() exists from Phase 1 with correct insert signature; service_health_checks table exists via migration 012; cold-start survival is automatic once persisted |
| ALRT-01 | Admin receives email alert when a service goes down or degrades | Verified: SMTP config already defined in index.ts (emailHost, emailUser, emailPass, emailPort, emailSecure); nodemailer is the correct library (no other email SDK installed; SMTP credentials are pre-configured); nodemailer is NOT yet in package.json — must be installed |
| ALRT-02 | Alert deduplication prevents repeat emails for the same ongoing issue (cooldown period) | Verified: AlertEventModel.findRecentByService() from Phase 1 exists and accepts withinMinutes — built exactly for this use case; check it before firing email and before creating new alert_events row |
| ALRT-04 | Alert recipient stored as configuration, not hardcoded | Verified: EMAIL_WEEKLY_RECIPIENT defineString already exists in index.ts with default jpressnell@bluepointcapital.com; alert service must read process.env.EMAIL_WEEKLY_RECIPIENT (or process.env.ALERT_RECIPIENT) — do NOT hardcode the string in service source |
| ANLY-01 | Document processing events persist to Supabase at write time (not in-memory only) | Verified: uploadMonitoringService.ts is in-memory only (confirmed PITFALL-1); a document_processing_events table is NOT yet in any migration — Phase 2 must add migration 013 for it; jobProcessorService.ts has instrumentation hooks (lines 329-390) to attach fire-and-forget writes |
| ANLY-03 | Analytics instrumentation is non-blocking (fire-and-forget, never delays processing pipeline) | Verified: PITFALL-6 documents the 14-min timeout risk; pattern is void supabase.from(...).insert(...) — no await; existing jobProcessorService.ts processes in ~10 minutes, so blocking even 200ms per checkpoint is risky |
| INFR-03 | 30-day rolling data retention cleanup runs on schedule | Verified: HealthCheckModel.deleteOlderThan(30) and AlertEventModel.deleteOlderThan(30) exist from Phase 1; a third call for document_processing_events needs to be added; must be a separate named Cloud Function export (PITFALL-7: separate from processDocumentJobs) |
</phase_requirements>
Summary
Phase 2 is a service-implementation phase. All database infrastructure (tables, models) was built in Phase 1. This phase builds three new service modules (healthProbeService, alertService, analyticsService) and two new Firebase Cloud Function exports. The work falls into four groups:
Group 1 — Health Probes (healthProbeService.ts): Four probers (Document AI, Anthropic/OpenAI LLM, Supabase, Firebase Auth) each making a real authenticated API call using the already-configured credentials. Results are written to Supabase via HealthCheckModel.create(). PITFALL-5 is the key risk: existing diagnostics only check initialization — new probes must make live API calls.
Group 2 — Alert Service (alertService.ts): Reads health probe results, checks if an alert already exists within cooldown using AlertEventModel.findRecentByService(), creates an alert_events row if not, and sends email via nodemailer (SMTP credentials already defined as Firebase defineString/defineSecret). Alert recipient read from process.env.EMAIL_WEEKLY_RECIPIENT (or a new ALERT_RECIPIENT env var).
Group 3 — Analytics Collector (analyticsService.ts): A recordProcessingEvent() function that writes to a new document_processing_events Supabase table using fire-and-forget (void not await). Requires migration 013. The jobProcessorService.ts already has the right instrumentation points (lines 329-390 track processingTime and status).
Group 4 — Schedulers (new Cloud Function exports in index.ts): runHealthProbes (every 5 minutes, separate export) and runRetentionCleanup (weekly, separate export). Both must be completely decoupled from processDocumentJobs.
Primary recommendation: Install nodemailer + @types/nodemailer first. Build services in dependency order: analytics migration → analyticsService → healthProbeService → alertService → schedulers.
Standard Stack
Core
| Library | Version | Purpose | Why Standard |
|---|---|---|---|
| nodemailer | ^6.9.x | SMTP email sending | SMTP config already pre-wired in index.ts (emailHost, emailUser, emailPass, emailPort, emailSecure via defineString/defineSecret); no other email library installed; Nodemailer is the standard Node.js SMTP library |
| @supabase/supabase-js | Already installed (2.53.0) | Writing health checks and analytics to Supabase | Already the only DB client; HealthCheckModel and AlertEventModel from Phase 1 wrap all writes |
| firebase-admin | Already installed (13.4.0) | Firebase Auth probe (verify-token attempt) | Already initialized via config/firebase.ts |
| firebase-functions | Already installed (7.0.5) | onSchedule v2 for scheduled Cloud Function exports | Existing processDocumentJobs uses the exact same pattern |
| @google-cloud/documentai | Already installed (9.3.0) | Document AI health probe (list processors call) | Already initialized in documentAiProcessor.ts |
| @anthropic-ai/sdk | Already installed (0.57.0) | LLM health probe (minimal token message) | Already initialized in llmService.ts |
| openai | Already installed (5.10.2) | OpenAI health probe fallback | Available when LLM_PROVIDER=openai |
| pg | Already installed (8.11.3) | Supabase health probe (direct SELECT 1 query) | Direct pool already available via getPostgresPool() in config/supabase.ts |
| winston | Already installed (3.11.0) | All service logging | Project-wide convention; NEVER console.log |
New Packages Required
| Library | Version | Purpose | Installation |
|---|---|---|---|
| nodemailer | ^6.9.x | SMTP email transport | npm install nodemailer |
| @types/nodemailer | ^6.4.x | TypeScript types | npm install --save-dev @types/nodemailer |
Alternatives Considered
| Instead of | Could Use | Tradeoff |
|---|---|---|
| nodemailer (SMTP) | Resend SDK | Resend is the STACK.md recommendation for new setups, but Gmail SMTP credentials are already fully configured in index.ts (EMAIL_HOST, EMAIL_USER, EMAIL_PASS, etc.); switching to Resend requires new DNS records and a new API key, all of which Nodemailer avoids |
| Separate onSchedule export | node-cron inside existing function | PITFALL-2: probe scheduling inside processDocumentJobs creates availability coupling; Firebase Cloud Scheduler + separate export is the correct architecture |
| getPostgresPool() for Supabase health probe | Supabase PostgREST client | Direct PostgreSQL SELECT 1 is a better health signal than PostgREST (tests TCP+auth rather than the REST layer); getPostgresPool() already exists for this purpose |
Installation:
```bash
cd backend
npm install nodemailer
npm install --save-dev @types/nodemailer
```
Architecture Patterns
Recommended Project Structure
New files slot into existing service layer:
```
backend/src/
├── models/
│   └── migrations/
│       └── 013_create_processing_events_table.sql   # NEW — analytics events table
├── services/
│   ├── healthProbeService.ts    # NEW — probe orchestrator + individual probers
│   ├── alertService.ts          # NEW — deduplication + email + alert_events writer
│   └── analyticsService.ts      # NEW — fire-and-forget event writer
├── index.ts                     # UPDATE — add runHealthProbes + runRetentionCleanup exports
└── __tests__/
    ├── models/                  # (Phase 1 tests already here)
    └── unit/
        ├── healthProbeService.test.ts   # NEW
        ├── alertService.test.ts         # NEW
        └── analyticsService.test.ts     # NEW
```
Pattern 1: Real Health Probe (HLTH-02)
What: Each probe makes a real authenticated API call. Returns a structured ProbeResult with status, latency_ms, and error_message. Probe then calls HealthCheckModel.create() to persist.
Key insight: The probe itself has no alert logic — that lives in alertService.ts. The probe only measures and records.
```typescript
// Source: derived from existing documentAiProcessor.ts and llmService.ts patterns
import { DocumentProcessorServiceClient } from '@google-cloud/documentai';
import Anthropic from '@anthropic-ai/sdk';
import * as admin from 'firebase-admin';
import { getPostgresPool } from '../config/supabase';

interface ProbeResult {
  service_name: string;
  status: 'healthy' | 'degraded' | 'down';
  latency_ms: number;
  error_message?: string;
  probe_details?: Record<string, unknown>;
}

// Document AI probe — list processors is a cheap read that tests auth + API availability
async function probeDocumentAI(): Promise<ProbeResult> {
  const start = Date.now();
  try {
    // Project id source is an assumption — reuse however documentAiProcessor.ts resolves it
    const projectId = process.env.GOOGLE_CLOUD_PROJECT;
    const client = new DocumentProcessorServiceClient();
    await client.listProcessors({ parent: `projects/${projectId}/locations/us` });
    const latency_ms = Date.now() - start;
    return { service_name: 'document_ai', status: latency_ms > 2000 ? 'degraded' : 'healthy', latency_ms };
  } catch (err) {
    return {
      service_name: 'document_ai',
      status: 'down',
      latency_ms: Date.now() - start,
      error_message: err instanceof Error ? err.message : String(err),
    };
  }
}

// Supabase probe — direct PostgreSQL SELECT 1 via existing pg pool
async function probeSupabase(): Promise<ProbeResult> {
  const start = Date.now();
  try {
    const pool = getPostgresPool();
    await pool.query('SELECT 1');
    const latency_ms = Date.now() - start;
    return { service_name: 'supabase', status: latency_ms > 2000 ? 'degraded' : 'healthy', latency_ms };
  } catch (err) {
    return {
      service_name: 'supabase',
      status: 'down',
      latency_ms: Date.now() - start,
      error_message: err instanceof Error ? err.message : String(err),
    };
  }
}

// LLM probe — minimal API call (1-word message) to verify key validity
async function probeLLM(): Promise<ProbeResult> {
  const start = Date.now();
  try {
    // Use whichever provider is configured
    const client = new Anthropic({ apiKey: process.env.ANTHROPIC_API_KEY });
    await client.messages.create({
      model: 'claude-haiku-4-5',
      max_tokens: 5,
      messages: [{ role: 'user', content: 'Hi' }],
    });
    const latency_ms = Date.now() - start;
    return { service_name: 'llm_api', status: latency_ms > 5000 ? 'degraded' : 'healthy', latency_ms };
  } catch (err) {
    // 429 = degraded (rate limit), not 'down'
    const is429 = err instanceof Error && err.message.includes('429');
    return {
      service_name: 'llm_api',
      status: is429 ? 'degraded' : 'down',
      latency_ms: Date.now() - start,
      error_message: err instanceof Error ? err.message : String(err),
    };
  }
}

// Firebase Auth probe — verify a known-invalid token; expect auth/argument-error, not a network error
async function probeFirebaseAuth(): Promise<ProbeResult> {
  const start = Date.now();
  try {
    await admin.auth().verifyIdToken('invalid-token-probe-check');
    // Should never reach here — always throws
    return { service_name: 'firebase_auth', status: 'healthy', latency_ms: Date.now() - start };
  } catch (err) {
    const latency_ms = Date.now() - start;
    const errMsg = err instanceof Error ? err.message : String(err);
    // 'argument-error' or 'auth/argument-error' = SDK is alive and Auth is reachable
    const isExpectedError = errMsg.includes('argument') || errMsg.includes('INVALID');
    return {
      service_name: 'firebase_auth',
      status: isExpectedError ? 'healthy' : 'down',
      latency_ms,
      error_message: isExpectedError ? undefined : errMsg,
    };
  }
}
```
Pattern 2: Alert Deduplication (ALRT-02)
What: Before sending an email, check AlertEventModel.findRecentByService() for a matching alert within the cooldown window. If found, suppress. Uses alert_events table (already exists from Phase 1).
Important: Deduplication check must happen before BOTH the alert_events row creation AND the email send — otherwise a suppressed email still creates a duplicate row.
```typescript
// Source: AlertEventModel.findRecentByService() from Phase 1 (verified)
// Cooldown: 60 minutes (configurable via env var ALERT_COOLDOWN_MINUTES)
import { AlertEventModel } from '../models/AlertEventModel';
import { logger } from '../utils/logger';

const ALERT_COOLDOWN_MINUTES = parseInt(process.env.ALERT_COOLDOWN_MINUTES ?? '60', 10);

async function maybeSendAlert(
  serviceName: string,
  alertType: 'service_down' | 'service_degraded',
  message: string
): Promise<void> {
  // 1. Check deduplication window
  const existing = await AlertEventModel.findRecentByService(
    serviceName,
    alertType,
    ALERT_COOLDOWN_MINUTES
  );
  if (existing) {
    logger.info('alertService: suppressing duplicate alert within cooldown', {
      serviceName, alertType, existingAlertId: existing.id, cooldownMinutes: ALERT_COOLDOWN_MINUTES,
    });
    return; // suppress: both row creation AND email
  }
  // 2. Create alert_events row
  await AlertEventModel.create({ service_name: serviceName, alert_type: alertType, message });
  // 3. Send email
  await sendAlertEmail(serviceName, alertType, message);
}
```
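The glue between probe results and `maybeSendAlert` is not specified anywhere above. A minimal sketch of that evaluator — the function name `evaluateProbeResult` and the message format are assumptions for illustration, not existing project code:

```typescript
// Hypothetical glue between Pattern 1 and Pattern 2: map a probe result to an
// alert type and delegate to the dedup-aware sender. Healthy results never alert.
interface ProbeResult {
  service_name: string;
  status: 'healthy' | 'degraded' | 'down';
  latency_ms: number;
  error_message?: string;
}

type AlertType = 'service_down' | 'service_degraded';

async function evaluateProbeResult(
  result: ProbeResult,
  sendAlert: (service: string, type: AlertType, message: string) => Promise<void>
): Promise<void> {
  if (result.status === 'healthy') return; // no alert for healthy probes
  const alertType: AlertType = result.status === 'down' ? 'service_down' : 'service_degraded';
  const detail = result.error_message ? `: ${result.error_message}` : '';
  await sendAlert(
    result.service_name,
    alertType,
    `${result.service_name} is ${result.status} (latency ${result.latency_ms}ms)${detail}`
  );
}
```

In `healthProbeService.runAllProbes()`, each persisted result would be passed through this evaluator with `maybeSendAlert` injected as `sendAlert`, keeping the probe code free of alert logic.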
Pattern 3: Email via SMTP (Nodemailer + existing Firebase config) (ALRT-01, ALRT-04)
What: Nodemailer transporter created using Firebase defineString/defineSecret values already in index.ts. Alert recipient from process.env.EMAIL_WEEKLY_RECIPIENT (non-hardcoded, satisfies ALRT-04).
Key insight: The SMTP credentials (EMAIL_HOST, EMAIL_USER, EMAIL_PASS, EMAIL_PORT, EMAIL_SECURE) are already defined as Firebase params in index.ts. The service reads them from process.env — Firebase makes defineString values available there automatically.
```typescript
// Source: Firebase Functions v2 defineString/defineSecret pattern — verified in index.ts
import nodemailer from 'nodemailer';
import { logger } from '../utils/logger';

function createTransporter() {
  return nodemailer.createTransport({
    host: process.env.EMAIL_HOST ?? 'smtp.gmail.com',
    port: parseInt(process.env.EMAIL_PORT ?? '587', 10),
    secure: process.env.EMAIL_SECURE === 'true',
    auth: {
      user: process.env.EMAIL_USER,
      pass: process.env.EMAIL_PASS, // Firebase Secret — available in process.env
    },
  });
}

async function sendAlertEmail(serviceName: string, alertType: string, message: string): Promise<void> {
  const recipient = process.env.EMAIL_WEEKLY_RECIPIENT; // ALRT-04: read from config, not hardcoded
  if (!recipient) {
    logger.warn('alertService.sendAlertEmail: no recipient configured, skipping email', { serviceName });
    return;
  }
  const transporter = createTransporter();
  try {
    await transporter.sendMail({
      from: process.env.EMAIL_FROM ?? process.env.EMAIL_USER,
      to: recipient,
      subject: `[CIM Summary] Alert: ${serviceName} — ${alertType}`,
      text: message,
      html: `<p><strong>${serviceName}</strong>: ${message}</p>`,
    });
    logger.info('alertService.sendAlertEmail: sent', { serviceName, alertType, recipient });
  } catch (err) {
    logger.error('alertService.sendAlertEmail: failed', {
      error: err instanceof Error ? err.message : String(err),
      serviceName, alertType,
    });
    // Do NOT re-throw — email failure should not break the probe run
  }
}
```
Pattern 4: Fire-and-Forget Analytics (ANLY-03)
What: analyticsService.recordProcessingEvent() uses void (no await) so the Supabase write is completely detached from the processing pipeline. The function signature returns void to make it impossible to accidentally await it.
Critical rule: The function MUST be called with void or not awaited anywhere it's used. TypeScript enforcing void return type ensures this.
```typescript
// Source: PITFALL-6 pattern — fire-and-forget is mandatory
import { getSupabaseServiceClient } from '../config/supabase'; // export location is an assumption
import { logger } from '../utils/logger';

export interface ProcessingEventData {
  document_id: string;
  user_id: string;
  event_type: 'upload_started' | 'processing_started' | 'completed' | 'failed';
  duration_ms?: number;
  error_message?: string;
  stage?: string;
}

// Return type is void (not Promise<void>) — cannot be awaited
export function recordProcessingEvent(data: ProcessingEventData): void {
  const supabase = getSupabaseServiceClient();
  void supabase
    .from('document_processing_events')
    .insert({
      document_id: data.document_id,
      user_id: data.user_id,
      event_type: data.event_type,
      duration_ms: data.duration_ms ?? null,
      error_message: data.error_message ?? null,
      stage: data.stage ?? null,
      created_at: new Date().toISOString(),
    })
    .then(({ error }) => {
      if (error) {
        // Never throw — log only (analytics failure must not affect processing)
        logger.error('analyticsService.recordProcessingEvent: write failed', {
          error: error.message, data,
        });
      }
    });
}
```
Pattern 5: Scheduled Cloud Function Export (HLTH-03, INFR-03)
What: Two new onSchedule exports added to index.ts. Each is a separate named export, completely decoupled from processDocumentJobs.
Important: New exports must include the same secrets array as processDocumentJobs (all needed Firebase Secrets must be explicitly listed). defineString values are auto-available but defineSecret values require explicit listing.
```typescript
// Source: Existing processDocumentJobs pattern in index.ts (verified)
// Add AFTER processDocumentJobs export

// Health probe scheduler — separate from document processing (PITFALL-2)
export const runHealthProbes = onSchedule({
  schedule: 'every 5 minutes',
  timeoutSeconds: 60,
  memory: '256MiB',
  secrets: [
    anthropicApiKey,    // for LLM probe
    openaiApiKey,       // for OpenAI probe fallback
    databaseUrl,        // for Supabase probe
    supabaseServiceKey,
    supabaseAnonKey,
  ],
}, async (_event) => {
  const { healthProbeService } = await import('./services/healthProbeService');
  await healthProbeService.runAllProbes();
});

// Retention cleanup — weekly (PITFALL-7: separate from document processing scheduler)
export const runRetentionCleanup = onSchedule({
  schedule: 'every monday 02:00',
  timeoutSeconds: 120,
  memory: '256MiB',
  secrets: [databaseUrl, supabaseServiceKey, supabaseAnonKey],
}, async (_event) => {
  const { HealthCheckModel } = await import('./models/HealthCheckModel');
  const { AlertEventModel } = await import('./models/AlertEventModel');
  const { analyticsService } = await import('./services/analyticsService');
  const [hcCount, alertCount, eventCount] = await Promise.all([
    HealthCheckModel.deleteOlderThan(30),
    AlertEventModel.deleteOlderThan(30),
    analyticsService.deleteProcessingEventsOlderThan(30),
  ]);
  logger.info('runRetentionCleanup: complete', { hcCount, alertCount, eventCount });
});
```
Pattern 6: Analytics Migration (ANLY-01)
What: Migration 013_create_processing_events_table.sql adds the document_processing_events table. Follows the migration 012 pattern exactly.
```sql
-- Source: backend/src/models/migrations/012_create_monitoring_tables.sql (verified pattern)
CREATE TABLE IF NOT EXISTS document_processing_events (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  document_id UUID NOT NULL,
  user_id UUID NOT NULL,
  event_type TEXT NOT NULL CHECK (event_type IN ('upload_started', 'processing_started', 'completed', 'failed')),
  duration_ms INTEGER,
  error_message TEXT,
  stage TEXT,
  created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP
);

CREATE INDEX IF NOT EXISTS idx_document_processing_events_created_at
  ON document_processing_events(created_at);
CREATE INDEX IF NOT EXISTS idx_document_processing_events_document_id
  ON document_processing_events(document_id);

ALTER TABLE document_processing_events ENABLE ROW LEVEL SECURITY;
```
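Pattern 5's retention scheduler calls `analyticsService.deleteProcessingEventsOlderThan(30)`, which Phase 2 still has to write. A sketch under assumptions: the helper name comes from Pattern 5, but the `DeleteClient` seam is invented here so the cutoff math is testable without a live Supabase connection; the real implementation would run `DELETE FROM document_processing_events WHERE created_at < cutoff` through the existing client.

```typescript
// Hypothetical seam standing in for the Supabase client in this sketch.
export interface DeleteClient {
  deleteEventsBefore(cutoffIso: string): Promise<number>; // resolves to deleted row count
}

// Pure cutoff computation — `days` back from `now`, as an ISO timestamp.
export function retentionCutoffIso(days: number, now: Date = new Date()): string {
  return new Date(now.getTime() - days * 24 * 60 * 60 * 1000).toISOString();
}

// Assumed-name retention helper used by runRetentionCleanup (Pattern 5).
export async function deleteProcessingEventsOlderThan(
  days: number,
  client: DeleteClient
): Promise<number> {
  return client.deleteEventsBefore(retentionCutoffIso(days));
}
```

Returning the deleted-row count lets `runRetentionCleanup` log `eventCount` alongside the Phase 1 model counts.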
Anti-Patterns to Avoid
- Probing config existence instead of live connectivity (PITFALL-5): Any check of `if (process.env.ANTHROPIC_API_KEY)` is not a health probe. Must make a real API call.
- Awaiting analytics writes (PITFALL-6): `await analyticsService.recordProcessingEvent(...)` will block the processing pipeline. Must use `void analyticsService.recordProcessingEvent(...)`, or the function must not return a Promise.
- Piggybacking health probes on `processDocumentJobs` (PITFALL-2): Health probes mixed into the document processing function create availability coupling. Must be a separate `onSchedule` export.
- Hardcoding alert recipient (PITFALL-8): Never write `to: 'jpressnell@bluepointcapital.com'` in source. Always `process.env.EMAIL_WEEKLY_RECIPIENT`.
- Alert storms (PITFALL-3): Sending email on every failed probe run is a mistake. Must check `AlertEventModel.findRecentByService()` with the cooldown window before every send.
- Creating the nodemailer transporter at module level: The Firebase Secret `EMAIL_PASS` is only available inside a Cloud Function invocation (it's injected at runtime). Create the transporter inside each email call or on first use inside a function execution, not at module initialization time.
Don't Hand-Roll
| Problem | Don't Build | Use Instead | Why |
|---|---|---|---|
| Alert deduplication state | Custom in-memory `Map<service, lastAlertTime>` | `AlertEventModel.findRecentByService()` (already exists, Phase 1) | In-memory state resets on cold start (PITFALL-1); DB-backed deduplication survives restarts |
| SMTP transport | Custom HTTP calls to Gmail API | `nodemailer` with existing SMTP config | Gmail API requires OAuth flow; SMTP App Password already configured and working |
| Health check result storage | Custom logging or in-memory | `HealthCheckModel.create()` (already exists, Phase 1) | Already written, tested, and connected to the right table |
| Cron scheduling | `setInterval` inside function body | `onSchedule` Firebase Cloud Scheduler | `setInterval` does not work in serverless (instances spin up/down); Cloud Scheduler is the correct mechanism |
| Alert creation | Direct Supabase insert | `AlertEventModel.create()` (already exists, Phase 1) | Already written with input validation and error handling |
Key insight: Phase 1 built the entire model layer specifically so Phase 2 only has to write service logic. Use every model method; don't bypass them.
Common Pitfalls
Pitfall A: Firebase Secret Unavailable at Module Load Time
What goes wrong: Nodemailer transporter created at module top level with process.env.EMAIL_PASS — at module load time (cold start initialization), the Firebase Secret hasn't been injected yet. EMAIL_PASS is undefined. All email attempts fail.
Why it happens: Firebase Functions v2 defineSecret() values are injected into process.env when the function invocation starts, not when the module is first imported.
How to avoid: Create the nodemailer transporter lazily — inside the function that sends email, not at module level. Alternatively, use a factory function called at send time.
Warning signs: nodemailer error "authentication failed" or "invalid credentials" on first cold start; works on warm invocations.
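A generic lazy-initialization helper makes the fix mechanical. This helper is a sketch, not existing project code; wrapping the transporter factory with it defers reading `process.env.EMAIL_PASS` until a Cloud Function invocation is active:

```typescript
// Memoized-lazy helper: the factory runs exactly once, on first call,
// never at module load time.
function lazy<T>(factory: () => T): () => T {
  let created = false;
  let value!: T;
  return () => {
    if (!created) {
      value = factory();
      created = true;
    }
    return value;
  };
}

// Assumed usage inside alertService.ts:
//   const getTransporter = lazy(() => nodemailer.createTransport({ /* SMTP config */ }));
//   ...later, inside sendAlertEmail: await getTransporter().sendMail({ ... });
```

Either this memoized form or creating a fresh transporter per send (as in Pattern 3) avoids the cold-start secret-timing bug; the memoized form just reuses the SMTP connection pool across warm invocations.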
Pitfall B: LLM Probe Cost
What goes wrong: LLM health probe uses the same model as document processing (e.g., claude-opus-4-1). Running every 5 minutes costs ~$0.01 × 288 calls/day = ~$2.88/day just for probing.
Why it happens: Copy-pasting the model name from llmService.ts.
How to avoid: Use the cheapest available model for probes: claude-haiku-4-5 (Anthropic) or gpt-3.5-turbo (OpenAI). The probe only needs to verify API key validity and reachability — response quality doesn't matter. Set max_tokens: 5.
Warning signs: Anthropic API bill spikes after deploying runHealthProbes.
Pitfall C: Supabase PostgREST vs Direct Postgres for Health Probe
What goes wrong: Using getSupabaseServiceClient() (PostgREST) for the Supabase health probe instead of getPostgresPool(). PostgREST adds an HTTP layer — if the Supabase API is overloaded but the DB is healthy, the probe returns "down" incorrectly.
Why it happens: PostgREST client is the default Supabase client used everywhere else.
How to avoid: Use getPostgresPool().query('SELECT 1') — this tests TCP connectivity to the database directly, which is the true health signal for data persistence operations.
Warning signs: Supabase probe reports "down" while the DB is healthy; health check latency fluctuates widely.
Pitfall D: Analytics Migration Naming Conflict
What goes wrong: Phase 2 creates 013_create_processing_events_table.sql but another developer or future migration already used 013. The migrator runs both or skips one.
Why it happens: Not verifying the highest current migration number.
How to avoid: Current highest is 012_create_monitoring_tables.sql (created in Phase 1). Next migration MUST be 013_. Confirmed safe.
Warning signs: Migration run shows "already applied" for 013_ without the table existing.
Pitfall E: Probe Errors Swallowed Silently
What goes wrong: A probe throws an uncaught exception. The runHealthProbes Cloud Function catches it at the top level and does nothing. No health check record is written. The admin dashboard shows no data.
Why it happens: Each individual probe can fail independently — if one throws, the others should still run.
How to avoid: Wrap each probe call in try/catch inside healthProbeService.runAllProbes(). A probe error should create a status: 'down' result with the error in error_message, then persist that to Supabase. The probe orchestrator must never throw; it must always complete all probes.
Warning signs: One service's health checks stop appearing in Supabase while others continue.
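A throw-proof orchestrator can be sketched with `Promise.allSettled`, which isolates each probe so one rejection never sinks the others. The shape below is an assumption about how `healthProbeService.runAllProbes()` might look (probes and the persist callback are injected for illustration), not verified project code:

```typescript
interface ProbeResult {
  service_name: string;
  status: 'healthy' | 'degraded' | 'down';
  latency_ms: number;
  error_message?: string;
}

type Probe = { name: string; run: () => Promise<ProbeResult> };

async function runAllProbes(
  probes: Probe[],
  persist: (r: ProbeResult) => Promise<void>
): Promise<ProbeResult[]> {
  // allSettled: a probe that throws cannot abort the other probes
  const settled = await Promise.allSettled(probes.map((p) => p.run()));
  const results: ProbeResult[] = settled.map((s, i) =>
    s.status === 'fulfilled'
      ? s.value
      : {
          // A probe that threw (instead of returning a 'down' result) still
          // yields a 'down' row — nothing is silently swallowed
          service_name: probes[i].name,
          status: 'down' as const,
          latency_ms: 0,
          error_message: s.reason instanceof Error ? s.reason.message : String(s.reason),
        }
  );
  // Persist every result; persistence failures are the persist impl's problem
  // to log — never re-thrown here, so the orchestrator itself never throws.
  await Promise.allSettled(results.map(persist));
  return results;
}
```

With this shape, `persist` would wrap `HealthCheckModel.create()` and the Cloud Function body stays a single awaited call.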
Pitfall F: deleteOlderThan Without Batching on Large Tables
What goes wrong: After 30 days of operation with health probes running every 5 minutes, service_health_checks holds ~8,640 rows per probed service (288/day × 30), roughly 35,000 across four probers. A single DELETE WHERE created_at < cutoff is fine at this scale. Even if cleanup lapses for six months, the table stays well under 250k rows — still manageable with the created_at index. No batching needed at Phase 2 scale.
Why it happens: Concern about DB timeout on large deletes.
How to avoid: Index on created_at (exists from Phase 1 migration 012) makes the DELETE efficient. For Phase 2 scale, a single DELETE is correct. Only consider batching if the table grows to millions of rows.
Warning signs: N/A at Phase 2 scale. Log deletedCount for visibility.
Code Examples
Verified patterns from codebase and official Node.js/Firebase docs:
Adding a Second Cloud Function Export (index.ts)
```typescript
// Source: Existing processDocumentJobs pattern in backend/src/index.ts (verified)
// New export follows the same onSchedule structure:
import { onSchedule } from 'firebase-functions/v2/scheduler';

export const runHealthProbes = onSchedule({
  schedule: 'every 5 minutes',
  timeoutSeconds: 60,
  memory: '256MiB',
  retryCount: 0, // Probes should not retry — they run again in 5 minutes anyway
  secrets: [anthropicApiKey, openaiApiKey, databaseUrl, supabaseServiceKey, supabaseAnonKey],
}, async (_event) => {
  // Dynamic import (same pattern as processDocumentJobs)
  const { healthProbeService } = await import('./services/healthProbeService');
  await healthProbeService.runAllProbes();
});
```
HealthCheckModel.create() — Already Available (Phase 1)
```typescript
// Source: backend/src/models/HealthCheckModel.ts (verified, Phase 1)
await HealthCheckModel.create({
  service_name: 'document_ai',
  status: 'healthy',
  latency_ms: 234,
  probe_details: { processor_count: 1 },
});
```
AlertEventModel.findRecentByService() — Already Available (Phase 1)
```typescript
// Source: backend/src/models/AlertEventModel.ts (verified, Phase 1)
const recent = await AlertEventModel.findRecentByService(
  'document_ai',   // service name
  'service_down',  // alert type
  60               // within last 60 minutes
);
if (recent) {
  // suppress — cooldown active
}
```
Nodemailer SMTP — Using Existing Firebase Config
```typescript
// Source: Firebase defineString/defineSecret pattern verified in index.ts lines 220-225
// process.env.EMAIL_HOST, EMAIL_USER, EMAIL_PASS, EMAIL_PORT, EMAIL_SECURE all available
import nodemailer from 'nodemailer';

async function sendEmail(to: string, subject: string, html: string): Promise<void> {
  // Transporter created INSIDE function call (not at module level) — Firebase Secret timing
  const transporter = nodemailer.createTransport({
    host: process.env.EMAIL_HOST ?? 'smtp.gmail.com',
    port: parseInt(process.env.EMAIL_PORT ?? '587', 10),
    secure: process.env.EMAIL_SECURE === 'true',
    auth: {
      user: process.env.EMAIL_USER,
      pass: process.env.EMAIL_PASS,
    },
  });
  await transporter.sendMail({ from: process.env.EMAIL_FROM, to, subject, html });
}
```
Fire-and-Forget Write Pattern
```typescript
// Source: PITFALL-6 prevention — void prevents awaiting
// This is the ONLY correct way to write fire-and-forget to Supabase

// CORRECT — non-blocking:
void analyticsService.recordProcessingEvent({ document_id, user_id, event_type: 'completed', duration_ms });

// WRONG — blocks processing pipeline:
await analyticsService.recordProcessingEvent(...); // DO NOT DO THIS

// ALSO WRONG — return type must be void, not Promise<void>:
async function recordProcessingEvent(...): Promise<void> { ... } // enables accidental await
```
State of the Art
| Old Approach | Current Approach | When Changed | Impact |
|---|---|---|---|
| `uploadMonitoringService.ts` in-memory event store | Persistent `document_processing_events` Supabase table | Phase 2 introduces this | Analytics survives cold starts; 30-day history available |
| Configuration-only health check (`/monitoring/diagnostics`) | Live API call probers (`healthProbeService.ts`) | Phase 2 introduces this | Actually detects downed services and revoked credentials |
| No email alerting | SMTP email via `nodemailer` + Firebase SMTP config | Phase 2 introduces this | Admin notified of outages |
| No scheduled probe function | `runHealthProbes` Cloud Function export | Phase 2 introduces this | Probes run independently of document processing |
Existing but unused: The performance_metrics table (migration 010) is scoped to agentic RAG sessions (has a FK to agentic_rag_sessions). It is NOT suitable for general document processing analytics — use the new document_processing_events table instead.
Open Questions
- Probe frequency for LLM (HLTH-03)
  - What we know: A 5-minute probe interval is specified for `runHealthProbes`. An Anthropic probe every 5 minutes at minimal tokens costs ~$0.001/call × 288 = ~$0.29/day. Acceptable.
  - What's unclear: Whether to probe BOTH Anthropic and OpenAI each run (depends on active provider) or always probe both.
  - Recommendation: Probe the active LLM provider (from `process.env.LLM_PROVIDER`) plus always probe Supabase and Document AI. Probing inactive providers is useful for failover readiness but not required by HLTH-02.
- Alert recipient variable name: `EMAIL_WEEKLY_RECIPIENT` vs `ALERT_RECIPIENT`
  - What we know: `EMAIL_WEEKLY_RECIPIENT` is already defined as a Firebase `defineString` in `index.ts`. It has the correct default value.
  - What's unclear: The name implies "weekly", which is misleading for health alerts. Should this be a separate `ALERT_RECIPIENT` env var?
  - Recommendation: Reuse `EMAIL_WEEKLY_RECIPIENT` for the alert recipient to avoid adding another Firebase param. Document that it's dual-purpose. If a separate `ALERT_RECIPIENT` is desired, add it as a new `defineString` in `index.ts` alongside the existing one.
- `runHealthProbes` secrets list
  - What we know: `defineSecret()` values must be listed in each function's `secrets:` array to be available in `process.env` during that function's execution.
  - What's unclear: The LLM probe needs `ANTHROPIC_API_KEY` or `OPENAI_API_KEY` depending on config. The Supabase probe needs `DATABASE_URL`, `SUPABASE_SERVICE_KEY`, `SUPABASE_ANON_KEY`.
  - Recommendation: Include all potentially needed secrets: `anthropicApiKey`, `openaiApiKey`, `databaseUrl`, `supabaseServiceKey`, `supabaseAnonKey`. Unused secrets don't cause issues; missing ones cause failures.
- Should `runRetentionCleanup` also delete from `performance_metrics` / `session_events`?
  - What we know: `performance_metrics` (migration 010) tracks agentic RAG sessions. It has no 30-day retention requirement specified.
  - What's unclear: INFR-03 says "30-day rolling data retention cleanup" — does this apply only to monitoring tables or all analytics tables?
  - Recommendation: Phase 2 only manages tables introduced in the monitoring feature: `service_health_checks`, `alert_events`, `document_processing_events`. Leave `performance_metrics`, `session_events`, `execution_events` out of scope; INFR-03 is monitoring-specific.
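The active-provider recommendation above reduces to a small selection function. A sketch with the probes injected and the environment passed as a plain record; defaulting to 'anthropic' when LLM_PROVIDER is unset is an assumption about this project's config, not verified behavior:

```typescript
// Selects which LLM probe to run based on LLM_PROVIDER.
type LlmProvider = 'anthropic' | 'openai';
type Env = Record<string, string | undefined>;

function activeLlmProvider(env: Env): LlmProvider {
  // Assumption: anything other than an explicit 'openai' means Anthropic
  return env.LLM_PROVIDER === 'openai' ? 'openai' : 'anthropic';
}

async function probeActiveLlm<R>(
  probes: Record<LlmProvider, () => Promise<R>>,
  env: Env
): Promise<R> {
  return probes[activeLlmProvider(env)]();
}
```

`runAllProbes` would then register `probeActiveLlm` once instead of both provider probes, calling it with `process.env`.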
Validation Architecture
(Nyquist validation not configured — workflow.nyquist_validation not present in .planning/config.json. This section is omitted.)
Sources
Primary (HIGH confidence)
- `backend/src/models/HealthCheckModel.ts` — Verified: `create()`, `findLatestByService()`, `findAll()`, `deleteOlderThan()` signatures and behavior
- `backend/src/models/AlertEventModel.ts` — Verified: `create()`, `findActive()`, `findRecentByService()`, `deleteOlderThan()` signatures; deduplication method ready for Phase 2
- `backend/src/index.ts` lines 208-265 — Verified: `defineSecret('EMAIL_PASS')`, `defineString('EMAIL_HOST')`, `defineString('EMAIL_USER')`, `defineString('EMAIL_PORT')`, `defineString('EMAIL_SECURE')`, `defineString('EMAIL_WEEKLY_RECIPIENT')` all already defined; `onSchedule` export pattern confirmed from `processDocumentJobs`
- `backend/src/models/migrations/012_create_monitoring_tables.sql` — Verified: migration 012 exists and is the current highest; next migration is 013
- `backend/src/services/jobProcessorService.ts` lines 329-390 — Verified: `processingTime` and `status` tracked at the end of each job; correct hook points for analytics instrumentation
- `backend/src/services/uploadMonitoringService.ts` — Verified: in-memory only, loses data on cold start (PITFALL-1 confirmed)
- `backend/package.json` — Verified: `nodemailer` is NOT installed; must be added
- `backend/vitest.config.ts` — Verified: test glob includes `src/__tests__/**/*.{test,spec}.{ts,js}`; timeout 30s
- `.planning/research/PITFALLS.md` — Verified: PITFALL-1 through PITFALL-10 all considered in this research
- `.planning/research/STACK.md` — Verified: email decision (Nodemailer fallback), node-cron vs Firebase Cloud Scheduler
Secondary (MEDIUM confidence)
- `nodemailer` SMTP pattern: Standard Node.js email library; `createTransport` + `sendMail` API is stable and well-documented. Confidence HIGH from training data; verified against package docs as of August 2025.
- Firebase `defineSecret()` runtime injection timing: Firebase Secrets are injected at function invocation time, not module load time — confirmed behavior from Firebase Functions v2 documentation patterns. Verified via the `secrets:` array requirement in `onSchedule` config.
Tertiary (LOW confidence)
- Specific LLM probe cost calculation: Estimated from Anthropic public pricing as of training data. Actual cost may vary — verify with Anthropic API pricing page before deploying.
Metadata
Confidence breakdown:
- Standard stack: HIGH — all libraries verified in `package.json`; only `nodemailer` is new
- Architecture: HIGH — Cloud Function export pattern verified from existing `processDocumentJobs`; model methods verified from Phase 1 source
- Pitfalls: HIGH — PITFALL-1 through PITFALL-10 verified against codebase; Firebase Secret timing is documented Firebase behavior
Research date: 2026-02-24
Valid until: 2026-03-25 (30 days — Firebase Functions v2 and Supabase patterns are stable)