# CIM Summary — Analytics & Monitoring

## What This Is

An analytics dashboard and service health monitoring system for the CIM Summary application. It provides persistent document-processing metrics, scheduled health probes for all four external services, email and in-app alerts when APIs or credentials need attention, and an admin-only monitoring dashboard.

## Core Value

When something breaks — an API key expires, a service goes down, a credential needs reauthorization — the admin knows immediately and knows exactly what to fix.

## Requirements

### Validated

- ✓ Document upload and processing pipeline — existing
- ✓ Multi-provider LLM integration (Anthropic, OpenAI, OpenRouter) — existing
- ✓ Google Document AI text extraction — existing
- ✓ Supabase PostgreSQL with pgvector for storage and search — existing
- ✓ Firebase Authentication — existing
- ✓ Google Cloud Storage for file management — existing
- ✓ Background job queue with retry logic — existing
- ✓ Structured logging with Winston and correlation IDs — existing
- ✓ Basic health endpoints (`/health`, `/health/config`, `/monitoring/dashboard`) — existing
- ✓ PDF generation and export — existing
- ✓ Admin can view live health status for all 4 services (HLTH-01) — v1.0
- ✓ Health probes make real authenticated API calls (HLTH-02) — v1.0
- ✓ Scheduled periodic health probes (HLTH-03) — v1.0
- ✓ Health probe results persist to Supabase (HLTH-04) — v1.0
- ✓ Email alert on service down/degraded (ALRT-01) — v1.0
- ✓ Alert deduplication within cooldown (ALRT-02) — v1.0
- ✓ In-app alert banner for critical issues (ALRT-03) — v1.0
- ✓ Alert recipient from config, not hardcoded (ALRT-04) — v1.0
- ✓ Processing events persist at write time (ANLY-01) — v1.0
- ✓ Admin can view processing summary (ANLY-02) — v1.0
- ✓ Analytics instrumentation non-blocking (ANLY-03) — v1.0
- ✓ DB migrations with indexes on created_at (INFR-01) — v1.0
- ✓ Admin API routes protected by Firebase Auth (INFR-02) — v1.0
- ✓ 30-day rolling data retention cleanup (INFR-03) — v1.0
- ✓ Analytics use existing Supabase connection (INFR-04) — v1.0
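
Alert deduplication within a cooldown (ALRT-02) can be pictured as a pure check against recently sent alerts. The sketch below is illustrative only: the `SentAlert` shape, the `shouldSendAlert` function, and the one-hour `COOLDOWN_MS` value are assumptions, not the project's actual names or settings.

```typescript
// Hypothetical record of an alert that has already been emailed.
interface SentAlert {
  service: string;
  status: "degraded" | "down";
  sentAt: number; // epoch milliseconds
}

// Assumed cooldown window; the real value is not stated in this document.
const COOLDOWN_MS = 60 * 60 * 1000;

// Returns true when no alert for the same service and status
// was sent within the cooldown window ending at `now`.
function shouldSendAlert(
  history: SentAlert[],
  service: string,
  status: "degraded" | "down",
  now: number,
  cooldownMs: number = COOLDOWN_MS,
): boolean {
  return !history.some(
    (alert) =>
      alert.service === service &&
      alert.status === status &&
      now - alert.sentAt < cooldownMs,
  );
}
```

Keying on both service and status means a change from degraded to down would still send a new email under this sketch; whether the real implementation behaves that way is not confirmed here.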

### Active

(None — next milestone not yet defined. Run `/gsd:new-milestone` to plan.)

### Out of Scope

- External monitoring tools (Grafana, Datadog) — keeping it in-app for simplicity
- Non-admin user analytics views — admin-only for now
- Mobile push notifications — email + in-app sufficient
- Historical analytics beyond 30 days — lean storage, can extend later
- Real-time WebSocket updates — polling is sufficient for admin dashboard
- ML-based anomaly detection — threshold-based alerting sufficient at this scale

## Context

Shipped v1.0 with 31,184 lines of TypeScript across an Express.js backend and a React frontend.
Tech stack: Express.js, React, Supabase (PostgreSQL + pgvector), Firebase Auth, Firebase Cloud Functions, Google Document AI, Anthropic/OpenAI LLMs, nodemailer, Tailwind CSS.

Four external services monitored with real authenticated probes:
1. **Google Document AI** — service account credential validation
2. **Claude/OpenAI** — API key validation via cheapest model (claude-haiku-4-5, max_tokens 5)
3. **Supabase** — direct PostgreSQL pool query (`SELECT 1`)
4. **Firebase Auth** — SDK liveness via verifyIdToken error classification
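
A rough sketch of how the four probes could be run together with `Promise.allSettled`, so that one probe throwing does not hide the results of the others. The type names, service identifiers, and the `fromSettled` and `runAllProbes` helpers are hypothetical; the actual probe code is not shown in this document.

```typescript
// Hypothetical identifiers for the four monitored services.
type ServiceName = "document_ai" | "llm" | "supabase" | "firebase_auth";
type ProbeStatus = "healthy" | "degraded" | "down";

interface ProbeResult {
  service: ServiceName;
  status: ProbeStatus;
  latencyMs: number;
  error?: string;
}

interface Probe {
  service: ServiceName;
  run: () => Promise<ProbeResult>;
}

// Converts one settled promise into a ProbeResult.
// A rejected probe is reported as "down" with its error message.
function fromSettled(
  service: ServiceName,
  settled: PromiseSettledResult<ProbeResult>,
): ProbeResult {
  if (settled.status === "fulfilled") {
    return settled.value;
  }
  const reason: unknown = settled.reason;
  return {
    service,
    status: "down",
    latencyMs: 0,
    error: reason instanceof Error ? reason.message : String(reason),
  };
}

// Runs every probe concurrently; the async wrapper turns a
// synchronous throw inside run() into a rejection as well.
async function runAllProbes(probes: Probe[]): Promise<ProbeResult[]> {
  const settled = await Promise.allSettled(probes.map(async (p) => p.run()));
  return settled.map((s, i) => fromSettled(probes[i].service, s));
}
```

`Promise.allSettled` never rejects, so `runAllProbes` always resolves with one entry per probe, in the same order as the input array.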

Admin user: jpressnell@bluepointcapital.com (config-driven, not hardcoded).

## Constraints

- **Tech stack**: Express.js backend + React frontend
- **Auth**: Admin-only access via Firebase Auth with config-driven email check
- **Storage**: Supabase PostgreSQL — no new database infrastructure
- **Email**: nodemailer for alert delivery
- **Deployment**: Firebase Cloud Functions (14-minute timeout)
- **Data retention**: 30-day rolling window
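
To make the 30-day rolling window concrete, the cutoff timestamp for a cleanup run could be computed as below. This is a minimal sketch: `retentionCutoff` and `isExpired` are hypothetical helper names, and the tables the real cleanup function touches are not listed in this document.

```typescript
const RETENTION_DAYS = 30;
const DAY_MS = 24 * 60 * 60 * 1000;

// Returns the instant before which rows fall outside the rolling window.
function retentionCutoff(now: Date, days: number = RETENTION_DAYS): Date {
  return new Date(now.getTime() - days * DAY_MS);
}

// True when a row's created_at is older than the cutoff.
// A cleanup job could apply the same comparison in SQL against the
// indexed created_at column, passing the cutoff as a query parameter.
function isExpired(createdAt: Date, now: Date): boolean {
  return createdAt.getTime() < retentionCutoff(now).getTime();
}
```

Working in epoch milliseconds keeps the window at exactly 30 × 24 hours regardless of the server's local time zone.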

## Key Decisions

| Decision | Rationale | Outcome |
|----------|-----------|---------|
| In-app dashboard over external tools | Simpler setup, no additional infrastructure | ✓ Good — admin sees everything in one place |
| Email + in-app dual alerting | Redundancy for critical issues | ✓ Good — covers both active and passive monitoring |
| 30-day retention | Balances useful trend data with storage efficiency | ✓ Good — consolidated into single cleanup function |
| Single admin (config-driven) | Simple RBAC, can extend later | ✓ Good — email now env-driven after tech debt cleanup |
| Scheduled probes + fire-and-forget analytics | Decouples monitoring from processing | ✓ Good — zero impact on processing pipeline latency |
| 404 (not 403) for non-admin requests to admin routes | Does not reveal that admin routes exist | ✓ Good — security through obscurity at API level |
| void return type for analytics writes | Prevents accidental await on critical path | ✓ Good — type system enforces fire-and-forget pattern |
| Promise.allSettled for probe orchestration | All 4 probes run even if one throws | ✓ Good — partial results better than total failure |
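
The 404-instead-of-403 decision, combined with the config-driven admin email, might look roughly like the middleware below. It is a sketch under assumptions: `isAdmin`, `requireAdmin`, and the minimal request and response interfaces are hypothetical stand-ins for the project's Express types, and it presumes an earlier Firebase Auth step has already attached the verified email to `req.user`.

```typescript
// Minimal structural stand-ins for Express request/response objects.
interface AuthedRequest {
  user?: { email?: string };
}
interface MinimalResponse {
  status(code: number): MinimalResponse;
  json(body: unknown): void;
}

// Case-insensitive match against the configured admin email.
// An empty configured email never matches, so a missing env var fails closed.
function isAdmin(email: string | undefined, adminEmail: string): boolean {
  const configured = adminEmail.trim().toLowerCase();
  const actual = (email ?? "").trim().toLowerCase();
  return configured.length > 0 && actual === configured;
}

// Builds a middleware that answers 404 to non-admins, so the
// response looks the same as for a route that does not exist.
function requireAdmin(adminEmail: string) {
  return (req: AuthedRequest, res: MinimalResponse, next: () => void): void => {
    if (!isAdmin(req.user?.email, adminEmail)) {
      res.status(404).json({ error: "Not found" });
      return;
    }
    next();
  };
}
```

In an Express app the admin email would be read from configuration once, for example from an environment variable, and the resulting middleware mounted on the admin router.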

---
*Last updated: 2026-02-25 after v1.0 milestone*