chore: complete v1.0 Analytics & Monitoring milestone

Archive milestone artifacts (roadmap, requirements, audit, phase directories)
to .planning/milestones/. Evolve PROJECT.md with validated requirements and
decision outcomes. Create MILESTONES.md and RETROSPECTIVE.md.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
admin
2026-02-25 10:34:18 -05:00
parent 8bad951d63
commit 38a0f0619d
39 changed files with 299 additions and 186 deletions

.planning/MILESTONES.md (new file)

@@ -0,0 +1,25 @@
# Milestones
## v1.0 Analytics & Monitoring (Shipped: 2026-02-25)
**Phases completed:** 5 phases, 10 plans
**Timeline:** 2 days (2026-02-24 → 2026-02-25)
**Commits:** 42 (e606027..8bad951)
**Codebase:** 31,184 LOC TypeScript
**Delivered:** Persistent analytics dashboard and service health monitoring for the CIM Summary application — the admin knows immediately when any external service breaks and sees processing metrics at a glance.
**Key accomplishments:**
1. Database foundation with monitoring tables (service_health_checks, alert_events, document_processing_events) and typed models
2. Fire-and-forget analytics service for non-blocking document processing event tracking
3. Health probe system with real authenticated API calls to Document AI, Claude/OpenAI, Supabase, and Firebase Auth
4. Alert service with email delivery, deduplication cooldown, and config-driven recipients
5. Admin-authenticated API layer with health, analytics, and alerts endpoints (404 for non-admin)
6. Frontend admin dashboard with service health grid, analytics summary, and critical alert banner
7. Tech debt cleanup: env-driven config, consolidated retention cleanup, removed hardcoded defaults
**Requirements:** 15/15 satisfied
**Git range:** e606027..8bad951
---
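Accomplishment 2 above (the fire-and-forget analytics service) hinges on a `void` return type that stops callers from awaiting the write. A minimal sketch, with an assumed event shape and a stubbed writer standing in for the real Supabase insert:

```typescript
// Sketch of the fire-and-forget pattern: the `void` return type (not
// Promise<void>) means the type system stops callers from awaiting this
// on the critical path. Event shape and writer are illustrative stubs.
type ProcessingEvent = {
  documentId: string;
  stage: string;
  status: "started" | "succeeded" | "failed";
};

// Stand-in for the real Supabase insert: deliberately slow (500 ms).
async function insertEvent(_event: ProcessingEvent): Promise<void> {
  await new Promise((resolve) => setTimeout(resolve, 500));
}

function recordProcessingEvent(event: ProcessingEvent): void {
  // Start the write and swallow failures: monitoring must never
  // break or slow the processing pipeline.
  insertEvent(event).catch((err) => console.error("analytics write failed", err));
}

// Caller sees ~0 ms of added latency even though the write takes 500 ms.
const start = Date.now();
recordProcessingEvent({ documentId: "doc-1", stage: "ocr", status: "started" });
const elapsedMs = Date.now() - start;
```

Returning `void` instead of `Promise<void>` makes `await recordProcessingEvent(...)` a type error at the call site, which is what the audit relied on.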


@@ -2,7 +2,7 @@
 ## What This Is
-An analytics dashboard and service health monitoring system for the existing CIM Summary application. Provides document processing metrics, user activity tracking, real-time service health detection, scheduled health probes, and email + in-app alerting when APIs or credentials need attention.
+An analytics dashboard and service health monitoring system for the CIM Summary application. Provides persistent document processing metrics, scheduled health probes for all 4 external services, email + in-app alerting when APIs or credentials need attention, and an admin-only monitoring dashboard.
 ## Core Value
@@ -22,16 +22,25 @@ When something breaks — an API key expires, a service goes down, a credential
 - ✓ Structured logging with Winston and correlation IDs — existing
 - ✓ Basic health endpoints (`/health`, `/health/config`, `/monitoring/dashboard`) — existing
 - ✓ PDF generation and export — existing
+- ✓ Admin can view live health status for all 4 services (HLTH-01) — v1.0
+- ✓ Health probes make real authenticated API calls (HLTH-02) — v1.0
+- ✓ Scheduled periodic health probes (HLTH-03) — v1.0
+- ✓ Health probe results persist to Supabase (HLTH-04) — v1.0
+- ✓ Email alert on service down/degraded (ALRT-01) — v1.0
+- ✓ Alert deduplication within cooldown (ALRT-02) — v1.0
+- ✓ In-app alert banner for critical issues (ALRT-03) — v1.0
+- ✓ Alert recipient from config, not hardcoded (ALRT-04) — v1.0
+- ✓ Processing events persist at write time (ANLY-01) — v1.0
+- ✓ Admin can view processing summary (ANLY-02) — v1.0
+- ✓ Analytics instrumentation non-blocking (ANLY-03) — v1.0
+- ✓ DB migrations with indexes on created_at (INFR-01) — v1.0
+- ✓ Admin API routes protected by Firebase Auth (INFR-02) — v1.0
+- ✓ 30-day rolling data retention cleanup (INFR-03) — v1.0
+- ✓ Analytics use existing Supabase connection (INFR-04) — v1.0
 ### Active
-- [ ] In-app admin analytics dashboard (processing metrics + user activity)
-- [ ] Service health monitoring for Google Document AI, Claude/OpenAI, Supabase, Firebase Auth
-- [ ] Real-time auth failure detection with actionable alerts
-- [ ] Scheduled periodic health probes for all 4 services
-- [ ] Email alerting for critical service issues
-- [ ] In-app alert notifications for admin
-- [ ] 30-day rolling data retention for analytics
+(None — next milestone not yet defined. Run `/gsd:new-milestone` to plan.)
 ### Out of Scope
@@ -40,37 +49,42 @@ When something breaks — an API key expires, a service goes down, a credential
 - Mobile push notifications — email + in-app sufficient
 - Historical analytics beyond 30 days — lean storage, can extend later
 - Real-time WebSocket updates — polling is sufficient for admin dashboard
+- ML-based anomaly detection — threshold-based alerting sufficient at this scale
 ## Context
-The CIM Summary application already has basic health endpoints and structured logging with correlation IDs. The existing `/monitoring/dashboard` endpoint provides some system metrics. The `performance_metrics` table in Supabase already exists for storing system performance data. Winston logging captures errors with context, but there's no alerting mechanism — errors are logged but nobody gets notified.
-The admin user is jpressnell@bluepointcapital.com. This is a single-admin system for now.
-Four external services need monitoring:
-1. **Google Document AI** — uses service account credentials, can expire or lose permissions
-2. **Claude/OpenAI** — API keys can be revoked, rate limited, or run out of credits
-3. **Supabase** — connection pool issues, service key rotation, pgvector availability
-4. **Firebase Auth** — project config changes, token verification failures
+Shipped v1.0 with 31,184 LOC TypeScript across Express.js backend and React frontend.
+Tech stack: Express.js, React, Supabase (PostgreSQL + pgvector), Firebase Auth, Firebase Cloud Functions, Google Document AI, Anthropic/OpenAI LLMs, nodemailer, Tailwind CSS.
+Four external services monitored with real authenticated probes:
+1. **Google Document AI** — service account credential validation
+2. **Claude/OpenAI** — API key validation via cheapest model (claude-haiku-4-5, max_tokens 5)
+3. **Supabase** — direct PostgreSQL pool query (`SELECT 1`)
+4. **Firebase Auth** — SDK liveness via verifyIdToken error classification
+Admin user: jpressnell@bluepointcapital.com (config-driven, not hardcoded).
 ## Constraints
-- **Tech stack**: Must integrate with existing Express.js backend and React frontend
-- **Auth**: Admin-only access, use existing Firebase Auth with role check for jpressnell@bluepointcapital.com
-- **Storage**: Use existing Supabase PostgreSQL — no new database infrastructure
-- **Email**: Need an email sending service (SendGrid, Resend, or similar) for alerts
-- **Deployment**: Must work within Firebase Cloud Functions 14-minute timeout
-- **Data retention**: 30-day rolling window to keep storage costs low
+- **Tech stack**: Express.js backend + React frontend
+- **Auth**: Admin-only access via Firebase Auth with config-driven email check
+- **Storage**: Supabase PostgreSQL — no new database infrastructure
+- **Email**: nodemailer for alert delivery
+- **Deployment**: Firebase Cloud Functions (14-minute timeout)
+- **Data retention**: 30-day rolling window
 ## Key Decisions
 | Decision | Rationale | Outcome |
 |----------|-----------|---------|
-| In-app dashboard over external tools | Simpler setup, no additional infrastructure, admin can see everything in one place | — Pending |
-| Email + in-app dual alerting | Redundancy for critical issues — in-app for when you're already looking, email for when you're not | — Pending |
-| 30-day retention | Balances useful trend data with storage efficiency | — Pending |
-| Single admin (jpressnell@bluepointcapital.com) | Simple RBAC for now, can extend later | — Pending |
-| Real-time detection + scheduled probes | Catches failures as they happen AND proactively tests services before users hit them | — Pending |
+| In-app dashboard over external tools | Simpler setup, no additional infrastructure | ✓ Good — admin sees everything in one place |
+| Email + in-app dual alerting | Redundancy for critical issues | ✓ Good — covers both active and passive monitoring |
+| 30-day retention | Balances useful trend data with storage efficiency | ✓ Good — consolidated into single cleanup function |
+| Single admin (config-driven) | Simple RBAC, can extend later | ✓ Good — email now env-driven after tech debt cleanup |
+| Scheduled probes + fire-and-forget analytics | Decouples monitoring from processing | ✓ Good — zero impact on processing pipeline latency |
+| 404 (not 403) for non-admin routes | Does not reveal admin routes exist | ✓ Good — security through obscurity at API level |
+| void return type for analytics writes | Prevents accidental await on critical path | ✓ Good — type system enforces fire-and-forget pattern |
+| Promise.allSettled for probe orchestration | All 4 probes run even if one throws | ✓ Good — partial results better than total failure |
 ---
-*Last updated: 2026-02-24 after initialization*
+*Last updated: 2026-02-25 after v1.0 milestone*


@@ -0,0 +1,66 @@
# Project Retrospective
*A living document updated after each milestone. Lessons feed forward into future planning.*
## Milestone: v1.0 — Analytics & Monitoring
**Shipped:** 2026-02-25
**Phases:** 5 | **Plans:** 10 | **Sessions:** ~4
### What Was Built
- Database foundation with 3 monitoring tables (service_health_checks, alert_events, document_processing_events) and typed TypeScript models
- Health probe system with real authenticated API calls to Document AI, Claude/OpenAI, Supabase, and Firebase Auth
- Alert service with email delivery via nodemailer, deduplication cooldown, and config-driven recipients
- Fire-and-forget analytics service for non-blocking document processing event tracking
- Admin-authenticated API layer with health, analytics, and alerts endpoints
- Frontend admin dashboard with service health grid, analytics summary, and critical alert banner
- Tech debt cleanup: env-driven config, consolidated retention, removed hardcoded defaults
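The deduplication cooldown in the alert service above can be sketched as follows. The shipped service checks the `alert_events` table; here an in-memory `Map` stands in for that lookup, and the 30-minute window and `shouldSuppress` name are illustrative assumptions:

```typescript
// Illustrative cooldown deduplication: a repeat alert for the same
// service + type within the cooldown window is suppressed entirely.
const COOLDOWN_MS = 30 * 60 * 1000; // assumed 30-minute cooldown
const lastAlertAt = new Map<string, number>();

function shouldSuppress(service: string, alertType: string, now: number): boolean {
  const key = `${service}:${alertType}`;
  const prev = lastAlertAt.get(key);
  if (prev !== undefined && now - prev < COOLDOWN_MS) {
    return true; // within cooldown: skip both the DB row and the email
  }
  lastAlertAt.set(key, now); // record this send for future checks
  return false;
}

const t0 = Date.now();
const first = shouldSuppress("document-ai", "service_down", t0);
const repeat = shouldSuppress("document-ai", "service_down", t0 + 60_000);
const afterCooldown = shouldSuppress("document-ai", "service_down", t0 + COOLDOWN_MS);
```

A suppressed alert skips both the `alert_events` insert and the email send, matching the "skip BOTH" decision recorded in the state file.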
### What Worked
- Strict dependency ordering (data → services → API → frontend) prevented integration surprises — each phase consumed exactly what the prior phase provided
- Fire-and-forget pattern enforced at the type level (void return) caught potential performance issues at compile time
- GSD audit-milestone workflow caught 5 tech debt items before shipping — all resolved
- 2-day milestone completion shows GSD workflow is efficient for well-scoped work
### What Was Inefficient
- Phase 5 (tech debt) was added to the roadmap but executed as a direct commit — the GSD plan/execute overhead wasn't warranted for 3 small fixes
- Summary one-liner extraction returned null for all summaries — frontmatter format may not match what gsd-tools expects
### Patterns Established
- Static class model pattern for Supabase (no instantiation, getSupabaseServiceClient per-method)
- makeSupabaseChain() factory for Vitest mocking of Supabase client
- requireAdminEmail middleware returns 404 (not 403) to hide admin routes
- Firebase Secrets read inside function body, never at module level
- void return type to prevent accidental await on fire-and-forget operations
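The `requireAdminEmail` pattern listed above can be sketched like this. The minimal request/response shapes stand in for Express types, and the body is an assumption rather than the shipped middleware:

```typescript
// Sketch of requireAdminEmail: config-driven admin check that answers
// 404 (not 403) so a non-admin cannot tell the route even exists.
type Req = { userEmail?: string };
type Res = { status(code: number): Res; json(body: unknown): Res };

function requireAdminEmail(req: Req, res: Res, next: () => void): void {
  // Read config inside the function body, never at module level:
  // Firebase Secrets are not available at module load time.
  const adminEmail = process.env.ADMIN_EMAIL;
  if (!adminEmail || req.userEmail !== adminEmail) {
    res.status(404).json({ error: "Not found" }); // hide the route rather than reveal it with 403
    return;
  }
  next();
}
```

In the real app this would run after Firebase token verification, which is where `userEmail` would come from.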
### Key Lessons
1. Small tech debt fixes don't need full GSD plan/execute — direct commits are fine when the audit already defines the scope
2. Type-level enforcement (void vs Promise<void>) is more reliable than code review for architectural constraints
3. Promise.allSettled is the right pattern when partial results are better than total failure (health probes)
4. Admin email should always be config-driven from day one — hardcoding "just for now" creates tech debt immediately
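Lesson 3's `Promise.allSettled` pattern, as a minimal sketch with stand-in probe functions (the real probes hit Document AI, the LLM APIs, Supabase, and Firebase Auth):

```typescript
// All probes run to completion even if one throws; a rejection becomes a
// structured "down" result instead of sinking the whole run.
type ProbeResult = { service: string; status: "healthy" | "down"; error?: string };

async function probe(service: string, fails: boolean): Promise<ProbeResult> {
  if (fails) throw new Error(`${service} unreachable`);
  return { service, status: "healthy" };
}

async function runAllProbes(): Promise<ProbeResult[]> {
  // Second tuple element simulates which probes fail in this sketch.
  const services: Array<[string, boolean]> = [
    ["document-ai", false], ["llm", true], ["supabase", false], ["firebase-auth", false],
  ];
  const settled = await Promise.allSettled(services.map(([name, fails]) => probe(name, fails)));
  return settled.map((outcome, i) =>
    outcome.status === "fulfilled"
      ? outcome.value
      : { service: services[i][0], status: "down", error: String(outcome.reason) }
  );
}
```

With `Promise.all` the single failing probe would reject the whole batch; `allSettled` is what makes partial results possible.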
### Cost Observations
- Model mix: ~80% sonnet (execution), ~20% haiku (research/verification)
- Sessions: ~4
- Notable: Phase 4 (frontend) completed fastest — well-defined API contracts from Phase 3 made UI wiring straightforward
---
## Cross-Milestone Trends
### Process Evolution
| Milestone | Sessions | Phases | Key Change |
|-----------|----------|--------|------------|
| v1.0 | ~4 | 5 | First milestone — established patterns |
### Cumulative Quality
| Milestone | Tests | Coverage | Zero-Dep Additions |
|-----------|-------|----------|-------------------|
| v1.0 | 14+ | — | 3 tables, 5 services, 4 routes, 3 components |
### Top Lessons (Verified Across Milestones)
1. Type-level enforcement > code review for architectural constraints
2. Strict phase dependency ordering prevents integration surprises


@@ -1,112 +1,28 @@
# Roadmap: CIM Summary — Analytics & Monitoring
## Milestones
- **v1.0 Analytics & Monitoring** — Phases 1-5 (shipped 2026-02-25)
## Phases
<details>
<summary>✅ v1.0 Analytics & Monitoring (Phases 1-5) — SHIPPED 2026-02-25</summary>

- [x] Phase 1: Data Foundation (2/2 plans) — completed 2026-02-24
- [x] Phase 2: Backend Services (4/4 plans) — completed 2026-02-24
- [x] Phase 3: API Layer (2/2 plans) — completed 2026-02-24
- [x] Phase 4: Frontend (2/2 plans) — completed 2026-02-25
- [x] Phase 5: Tech Debt Cleanup (direct commit) — completed 2026-02-25

</details>
## Progress
| Phase | Milestone | Plans Complete | Status | Completed |
|-------|-----------|----------------|--------|-----------|
| 1. Data Foundation | v1.0 | 2/2 | Complete | 2026-02-24 |
| 2. Backend Services | v1.0 | 4/4 | Complete | 2026-02-24 |
| 3. API Layer | v1.0 | 2/2 | Complete | 2026-02-24 |
| 4. Frontend | v1.0 | 2/2 | Complete | 2026-02-25 |
| 5. Tech Debt Cleanup | v1.0 | — | Complete | 2026-02-25 |


@@ -1,27 +1,39 @@
+---
+gsd_state_version: 1.0
+milestone: v1.0
+milestone_name: Analytics & Monitoring
+status: shipped
+last_updated: "2026-02-25"
+progress:
+  total_phases: 5
+  completed_phases: 5
+  total_plans: 10
+  completed_plans: 10
+---
 # Project State
 ## Project Reference
-See: .planning/PROJECT.md (updated 2026-02-24)
+See: .planning/PROJECT.md (updated 2026-02-25)
 **Core value:** When something breaks — an API key expires, a service goes down, a credential needs reauthorization — the admin knows immediately and knows exactly what to fix.
-**Current focus:** Phase 5 — Tech Debt Cleanup
+**Current focus:** v1.0 shipped — next milestone not yet defined
 ## Current Position
-Phase: 5 of 5 (Tech Debt Cleanup)
-Plan: 0 of 0 in current phase
-Status: Not planned yet
-Last activity: 2026-02-25 — Phase 5 added for tech debt closure from v1.0 audit
+Phase: 5 of 5 (all complete)
+Plan: All plans complete
+Status: v1.0 milestone shipped
+Last activity: 2026-02-25 — v1.0 milestone archived
 Progress: [██████████] 100%
 ## Performance Metrics
 **Velocity:**
-- Total plans completed: 5
-- Average duration: ~17 min
-- Total execution time: ~1.4 hours
+- Total plans completed: 10
+- Timeline: 2 days (2026-02-24 → 2026-02-25)
 **By Phase:**
@@ -29,67 +41,26 @@ Progress: [██████████] 100%
 |-------|-------|-------|----------|
 | 01-data-foundation | 2 | ~34 min | ~17 min |
 | 02-backend-services | 4 | ~51 min | ~13 min |
+| 03-api-layer | 2 | ~16 min | ~8 min |
+| 04-frontend | 2 | ~4 min | ~2 min |
+| 05-tech-debt-cleanup | — | direct commit | — |
-**Recent Trend:**
-- Last 5 plans: 01-02 (26 min), 02-01 (20 min), 02-02 (18 min), 02-03 (12 min), 02-04 (1 min)
-- Trend: Stable ~15 min/plan
 *Updated after each plan completion*
-| Phase 03-api-layer P01 | 8 | 2 tasks | 4 files |
-| Phase 04-frontend P01 | 2 | 2 tasks | 3 files |
-| Phase 04-frontend P02 | 2 | 1 tasks | 1 files |
 ## Accumulated Context
 ### Decisions
-Decisions are logged in PROJECT.md Key Decisions table.
-Recent decisions affecting current work:
-- Roadmap: 4 phases following data layer → services → API → frontend dependency order
-- Architecture: Health probes decoupled from document processing as separate Cloud Function export
-- Architecture: Analytics writes are always fire-and-forget (never await on critical path)
-- Architecture: Alert recipient stored in config, not hardcoded (PITFALL-8 prevention)
-- 01-01: TEXT + CHECK constraint used for enum columns (not PostgreSQL ENUM types)
-- 01-01: getSupabaseServiceClient() called per-method, never cached at module level
-- 01-01: checked_at column separate from created_at on service_health_checks (probe time vs DB write time)
-- 01-01: Forward-only migrations only (no rollback scripts)
-- 01-02: Supabase mock uses chain.then (thenability) so both .single() and direct await patterns work from one mock
-- 01-02: makeSupabaseChain() factory per test — no shared mock state between tests
-- 01-02: vi.mock() factories must use only inline vi.fn() to avoid Vitest hoisting TDZ errors
-- 02-02: LLM probe uses claude-haiku-4-5 with max_tokens 5 (cheapest model, prevents expensive accidental probes)
-- 02-02: Supabase probe uses getPostgresPool().query('SELECT 1') not PostgREST (tests actual DB connectivity)
-- 02-02: Firebase Auth probe: verifyIdToken always throws; 'INVALID'/'Decoding'/'argument' in message = SDK alive = healthy
-- 02-02: Promise.allSettled for probe orchestration — all 4 probes run even if one throws outside its own try/catch
-- 02-02: Per-probe HealthCheckModel.create failure swallowed with logger.error — probe results still returned to caller
-- [Phase 02-backend-services]: 02-01: recordProcessingEvent return type is void (not Promise<void>) — type system prevents accidental await on critical path
-- 02-03: Transporter created inside sendAlertEmail() on each call (not cached at module level) — Firebase Secrets not available at module load time
-- 02-03: Suppressed alerts skip BOTH AlertEventModel.create() AND sendMail — prevents duplicate DB rows plus duplicate emails
-- 02-03: Email failure caught and logged, never re-thrown — probe pipeline must continue regardless of email outage
-- [Phase 02-backend-services]: runHealthProbes is a separate onSchedule Cloud Function from processDocumentJobs (PITFALL-2 compliance)
-- [Phase 02-backend-services]: retryCount: 0 on runHealthProbes — 5-minute schedule makes retry unnecessary
-- [Phase 02-backend-services]: runRetentionCleanup uses Promise.all() for parallel deletes across three independent monitoring tables
-- [Phase 03-api-layer]: 03-02: recordProcessingEvent() instrumentation uses void return — no await at 3 lifecycle points in processJob (PITFALL-6 compliance)
-- [Phase 03-api-layer]: requireAdminEmail returns 404 not 403 — does not reveal admin routes exist
-- [Phase 03-api-layer]: getPostgresPool() used for aggregate SQL — Supabase JS client does not support COUNT/AVG
-- [Phase 03-api-layer]: Admin env vars read inside function body not module level — Firebase Secrets timing constraint
-- [Phase 04-frontend]: AlertBanner filters to active service_down/service_degraded only — recovery type is informational, not critical
-- [Phase 04-frontend]: AlertEvent uses snake_case (backend raw model), ServiceHealthEntry/AnalyticsSummary use camelCase (backend admin.ts remaps)
-- [Phase 04-frontend]: AdminMonitoringDashboard is self-contained with no required props
-- [Phase 04-frontend]: AlertBanner placed before nav element so it shows across all tabs when admin has active critical alerts
-- [Phase 04-frontend]: Alert fetch gated by isAdmin in useEffect dependency array — non-admin users never call getAlerts
+All v1.0 decisions validated — see PROJECT.md Key Decisions table for outcomes.
 ### Pending Todos
-None yet.
+None.
 ### Blockers/Concerns
-- PITFALL-6: Each analytics instrumentation point must be void/fire-and-forget — reviewer must check this in Phase 3
-- PITFALL-10: All new tables need `created_at` indexes in Phase 1 migrations — query performance depends on this from day one
+None — v1.0 shipped.
 ## Session Continuity
-Last session: 2026-02-24
+Last session: 2026-02-25
-Stopped at: Completed 04-02-PLAN.md — AlertBanner and AdminMonitoringDashboard wired into App.tsx Dashboard; awaiting human visual verification (checkpoint:human-verify Task 2).
+Stopped at: v1.0 milestone archived and tagged
 Resume file: None


@@ -1,3 +1,12 @@
# Requirements Archive: v1.0 Analytics & Monitoring
**Archived:** 2026-02-25
**Status:** SHIPPED
For current requirements, see `.planning/REQUIREMENTS.md`.
---
# Requirements: CIM Summary — Analytics & Monitoring
**Defined:** 2026-02-24


@@ -0,0 +1,112 @@
# Roadmap: CIM Summary — Analytics & Monitoring
## Overview
This milestone adds persistent analytics and service health monitoring to the existing CIM Summary application. The work proceeds in four phases that respect hard dependency constraints: database schema must exist before services can write to it, services must exist before routes can expose them, and routes must be stable before the frontend can be wired up. Each phase delivers a complete, independently testable layer.
## Phases
**Phase Numbering:**
- Integer phases (1, 2, 3): Planned milestone work
- Decimal phases (2.1, 2.2): Urgent insertions (marked with INSERTED)
Decimal phases appear between their surrounding integers in numeric order.
- [x] **Phase 1: Data Foundation** - Create schema, DB models, and verify existing Supabase connection wiring (completed 2026-02-24)
- [x] **Phase 2: Backend Services** - Health probers, alert trigger, email sender, analytics collector, scheduler, retention cleanup (completed 2026-02-24)
- [x] **Phase 3: API Layer** - Admin-gated routes exposing all services, instrumentation hooks in existing processors (completed 2026-02-24)
- [x] **Phase 4: Frontend** - Admin dashboard page, health panel, processing metrics, alert notification banner (completed 2026-02-25)
- [ ] **Phase 5: Tech Debt Cleanup** - Config-driven admin email, consolidate retention cleanup, remove hardcoded defaults
## Phase Details
### Phase 1: Data Foundation
**Goal**: The database schema for monitoring exists and the existing Supabase connection is the only data infrastructure used
**Depends on**: Nothing (first phase)
**Requirements**: INFR-01, INFR-04
**Success Criteria** (what must be TRUE):
1. `service_health_checks` and `alert_events` tables exist in Supabase with indexes on `created_at`
2. All new tables use the existing Supabase client from `config/supabase.ts` — no new database connections added
3. `AlertEventModel.ts` exists and its CRUD methods can be called in isolation without errors
4. Migration SQL can be run against the live Supabase instance and produces the expected schema
**Plans:** 2/2 plans executed
Plans:
- [x] 01-01-PLAN.md — Migration SQL + HealthCheckModel + AlertEventModel
- [x] 01-02-PLAN.md — Unit tests for both monitoring models
### Phase 2: Backend Services
**Goal**: All monitoring logic runs correctly — health probes make real API calls, alerts fire with deduplication, analytics events write non-blocking to Supabase, and data is cleaned up on schedule
**Depends on**: Phase 1
**Requirements**: HLTH-02, HLTH-03, HLTH-04, ALRT-01, ALRT-02, ALRT-04, ANLY-01, ANLY-03, INFR-03
**Success Criteria** (what must be TRUE):
1. Each health probe makes a real authenticated API call to its target service and returns a structured result (status, latency_ms, error_message)
2. Health probe results are written to Supabase and survive a simulated cold start (data present after function restart)
3. An alert email is sent when a service probe returns degraded or down, and a second probe failure within the cooldown period does not send a duplicate email
4. Alert recipient is read from configuration (environment variable or Supabase config row), not hardcoded in source
5. Analytics events fire as fire-and-forget calls — a deliberately introduced 500ms Supabase delay does not increase processing pipeline duration
6. A scheduled probe function and a weekly retention cleanup function exist as separate Firebase Cloud Function exports
**Plans:** 4/4 plans complete
Plans:
- [x] 02-01-PLAN.md — Analytics migration + analyticsService (fire-and-forget)
- [x] 02-02-PLAN.md — Health probe service (4 real API probers + orchestrator)
- [x] 02-03-PLAN.md — Alert service (deduplication + email via nodemailer)
- [x] 02-04-PLAN.md — Cloud Function exports (runHealthProbes + runRetentionCleanup)
### Phase 3: API Layer
**Goal**: Admin-authenticated HTTP endpoints expose health status, alerts, and processing analytics; existing service processors emit analytics instrumentation
**Depends on**: Phase 2
**Requirements**: INFR-02, HLTH-01, ANLY-02
**Success Criteria** (what must be TRUE):
1. `GET /admin/health` returns current health status for all four services; a request with a non-admin Firebase token receives 403
2. `GET /admin/analytics` returns processing summary (upload counts, success/failure rates, avg processing time) sourced from Supabase, not in-memory state
3. `GET /admin/alerts` and `POST /admin/alerts/:id/acknowledge` function correctly and are blocked to non-admin users
4. Document processing in `jobProcessorService.ts` and `llmService.ts` emits analytics events at stage transitions without any change to existing processing behavior
**Plans:** 2/2 plans complete
Plans:
- [x] 03-01-PLAN.md — Admin auth middleware + admin routes (health, analytics, alerts endpoints)
- [x] 03-02-PLAN.md — Analytics instrumentation in jobProcessorService
### Phase 4: Frontend
**Goal**: The admin can see live service health, processing metrics, and active alerts directly in the application UI
**Depends on**: Phase 3
**Requirements**: ALRT-03, ANLY-02 (UI delivery), HLTH-01 (UI delivery)
**Success Criteria** (what must be TRUE):
1. An alert banner appears at the top of the admin UI when there is at least one unacknowledged critical alert, and disappears after the admin acknowledges it
2. The admin dashboard shows health status indicators (green/yellow/red) for all four services, with the last-checked timestamp visible
3. The admin dashboard shows processing metrics (upload counts, success/failure rates, average processing time) sourced from the persistent Supabase backend
4. A non-admin user visiting the admin route is redirected or shown an access-denied state
**Plans:** 2/2 plans complete
Plans:
- [x] 04-01-PLAN.md — AdminService monitoring methods + AlertBanner + AdminMonitoringDashboard components
- [x] 04-02-PLAN.md — Wire components into Dashboard + visual verification checkpoint
## Progress
**Execution Order:**
Phases execute in numeric order: 1 → 2 → 3 → 4 → 5
| Phase | Plans Complete | Status | Completed |
|-------|----------------|--------|-----------|
| 1. Data Foundation | 2/2 | Complete | 2026-02-24 |
| 2. Backend Services | 4/4 | Complete | 2026-02-24 |
| 3. API Layer | 2/2 | Complete | 2026-02-24 |
| 4. Frontend | 2/2 | Complete | 2026-02-25 |
| 5. Tech Debt Cleanup | 0/0 | Not Planned | — |
### Phase 5: Tech Debt Cleanup
**Goal**: All configuration values are env-driven (no hardcoded emails), retention cleanup is consolidated into a single function, and deployment defaults use placeholders
**Depends on**: Phase 4
**Requirements**: None (tech debt from v1.0 audit)
**Gap Closure**: Closes tech debt items from v1.0-MILESTONE-AUDIT.md
**Success Criteria** (what must be TRUE):
1. Frontend `adminService.ts` reads admin email from `import.meta.env.VITE_ADMIN_EMAIL` instead of a hardcoded literal
2. Only one retention cleanup function exists in `index.ts` (the model-layer `runRetentionCleanup`), with the pre-existing raw SQL `cleanupOldData` consolidated or removed
3. `defineString('EMAIL_WEEKLY_RECIPIENT')` default in `index.ts` uses a placeholder (not a personal email address)
**Plans:** 0 plans
Plans:
- [ ] TBD (run /gsd:plan-phase 5 to break down)
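Phase 5's consolidation goal (success criterion 2) can be sketched as a single `runRetentionCleanup` that deletes in parallel across the three monitoring tables. `runRetentionCleanup` and the table names come from the planning docs; the `deleteOlderThan` helper and its body are illustrative assumptions standing in for the model layer:

```typescript
// Consolidated retention cleanup: parallel deletes across the three
// independent monitoring tables, bounded by the 30-day rolling window.
const RETENTION_DAYS = 30;
const TABLES = ["service_health_checks", "alert_events", "document_processing_events"] as const;

// Hypothetical helper: the real version would run
// DELETE FROM <table> WHERE created_at < cutoff against Supabase.
async function deleteOlderThan(_table: string, _cutoff: Date): Promise<number> {
  return 0; // stub: number of rows deleted
}

async function runRetentionCleanup(now: Date = new Date()): Promise<Record<string, number>> {
  const cutoff = new Date(now.getTime() - RETENTION_DAYS * 24 * 60 * 60 * 1000);
  // The tables are independent, so Promise.all runs the deletes in parallel.
  const counts = await Promise.all(TABLES.map((t) => deleteOlderThan(t, cutoff)));
  return Object.fromEntries(TABLES.map((t, i) => [t, counts[i]] as [string, number]));
}
```

Keeping this as one function (rather than a second raw-SQL `cleanupOldData`) is exactly the duplication the audit flagged.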