diff --git a/.planning/REQUIREMENTS.md b/.planning/REQUIREMENTS.md
new file mode 100644
index 0000000..d8351a7
--- /dev/null
+++ b/.planning/REQUIREMENTS.md
@@ -0,0 +1,101 @@
+# Requirements: CIM Summary — Analytics & Monitoring
+
+**Defined:** 2026-02-24
+**Core Value:** When something breaks — an API key expires, a service goes down, a credential needs reauthorization — the admin knows immediately and knows exactly what to fix.
+
+## v1 Requirements
+
+Requirements for initial release. Each maps to roadmap phases.
+
+### Service Health
+
+- [ ] **HLTH-01**: Admin can view live health status (healthy/degraded/down) for Document AI, Claude/OpenAI, Supabase, and Firebase Auth
+- [ ] **HLTH-02**: Each health probe makes a real authenticated API call, not just config checks
+- [ ] **HLTH-03**: Health probes run on a scheduled interval, separate from document processing
+- [ ] **HLTH-04**: Health probe results persist to Supabase (survive cold starts)
+
+### Alerting
+
+- [ ] **ALRT-01**: Admin receives email alert when a service goes down or degrades
+- [ ] **ALRT-02**: Alert deduplication prevents repeat emails for the same ongoing issue (cooldown period)
+- [ ] **ALRT-03**: Admin sees in-app alert banner for active critical issues
+- [ ] **ALRT-04**: Alert recipient stored as configuration, not hardcoded
+
+### Processing Analytics
+
+- [ ] **ANLY-01**: Document processing events persist to Supabase at write time (not in-memory only)
+- [ ] **ANLY-02**: Admin can view processing summary: upload counts, success/failure rates, avg processing time
+- [ ] **ANLY-03**: Analytics instrumentation is non-blocking (fire-and-forget, never delays processing pipeline)
+
+### Infrastructure
+
+- [ ] **INFR-01**: Database migrations create service_health_checks and alert_events tables with indexes on created_at
+- [ ] **INFR-02**: Admin API routes protected by Firebase Auth with admin email check
+- [ ] **INFR-03**: 30-day rolling data retention cleanup runs on schedule
+- [ ] **INFR-04**: Analytics writes use existing Supabase connection, no new database infrastructure
+
+## v2 Requirements
+
+Deferred to future release. Tracked but not in current roadmap.
+
+### Service Health
+
+- **HLTH-05**: Admin can view 7-day service health history with uptime percentages
+- **HLTH-06**: Real-time auth failure detection classifies auth errors (401/403) vs transient errors (429/503) and alerts immediately on credential issues
+
+### Alerting
+
+- **ALRT-05**: Admin can acknowledge or snooze alerts from the UI
+- **ALRT-06**: Admin receives recovery email when a downed service returns healthy
+
+### Processing Analytics
+
+- **ANLY-04**: Admin can view processing time trend charts over time
+- **ANLY-05**: Admin can view LLM token usage and estimated cost per document and per month
+
+### Infrastructure
+
+- **INFR-05**: Dashboard shows staleness warning when monitoring data stops arriving
+
+## Out of Scope
+
+| Feature | Reason |
+|---------|--------|
+| External monitoring tools (Grafana, Datadog) | Operational overhead unjustified for single-admin app |
+| Multi-user analytics views | One admin user, RBAC complexity for zero benefit |
+| WebSocket/SSE real-time updates | Polling at 60s intervals sufficient; WebSockets complex in Cloud Functions |
+| Mobile push notifications | Email + in-app covers notification needs |
+| Historical analytics beyond 30 days | Storage costs; can extend later |
+| ML-based anomaly detection | Threshold-based alerting sufficient for this scale |
+| Log aggregation / log search UI | Firebase Cloud Logging handles this |
+
+## Traceability
+
+Which phases cover which requirements. Updated during roadmap creation.
+
+| Requirement | Phase | Status |
+|-------------|-------|--------|
+| HLTH-01 | — | Pending |
+| HLTH-02 | — | Pending |
+| HLTH-03 | — | Pending |
+| HLTH-04 | — | Pending |
+| ALRT-01 | — | Pending |
+| ALRT-02 | — | Pending |
+| ALRT-03 | — | Pending |
+| ALRT-04 | — | Pending |
+| ANLY-01 | — | Pending |
+| ANLY-02 | — | Pending |
+| ANLY-03 | — | Pending |
+| INFR-01 | — | Pending |
+| INFR-02 | — | Pending |
+| INFR-03 | — | Pending |
+| INFR-04 | — | Pending |
+
+**Coverage:**
+- v1 requirements: 15 total
+- Mapped to phases: 0
+- Unmapped: 15
+
+---
+*Requirements defined: 2026-02-24*
+*Last updated: 2026-02-24 after initial definition*