Requirements: CIM Summary — Analytics & Monitoring
Defined: 2026-02-24
Core Value: When something breaks (an API key expires, a service goes down, a credential needs reauthorization), the admin knows immediately and knows exactly what to fix.
v1 Requirements
Requirements for initial release. Each maps to roadmap phases.
Service Health
- HLTH-01: Admin can view live health status (healthy/degraded/down) for Document AI, Claude/OpenAI, Supabase, and Firebase Auth
- HLTH-02: Each health probe makes a real authenticated API call, not just config checks
- HLTH-03: Health probes run on a scheduled interval, separate from document processing
- HLTH-04: Health probe results persist to Supabase (survive cold starts)
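The probe requirements above can be illustrated with a minimal TypeScript sketch of how one probe outcome might be classified into the three statuses before it is persisted. The type names, the `toProbeResult` helper, and the 2000 ms latency threshold are assumptions made for illustration; the requirements do not specify them.

```typescript
// Hypothetical names throughout; none of this is taken from the actual codebase.
type HealthStatus = "healthy" | "degraded" | "down";

interface ProbeOutcome {
  ok: boolean;        // did the real authenticated API call succeed? (HLTH-02)
  latencyMs: number;  // round-trip time of that call
  error?: string;     // error message when the call failed
}

interface ProbeResult {
  service: string;
  status: HealthStatus;
  latencyMs: number;
  error: string | null;
  checkedAt: string;  // ISO timestamp for the persisted row (HLTH-04)
}

// Assumed threshold: a successful but slow call is reported as "degraded".
const DEGRADED_LATENCY_MS = 2000;

function toProbeResult(service: string, outcome: ProbeOutcome, checkedAt: Date): ProbeResult {
  let status: HealthStatus;
  if (!outcome.ok) {
    status = "down";
  } else if (outcome.latencyMs > DEGRADED_LATENCY_MS) {
    status = "degraded";
  } else {
    status = "healthy";
  }
  return {
    service,
    status,
    latencyMs: outcome.latencyMs,
    error: outcome.error ?? null,
    checkedAt: checkedAt.toISOString(),
  };
}
```

Keeping classification as a pure function separates it from the network call and from the Supabase write, so the scheduled job (HLTH-03) only has to time the call, classify it, and store the row.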
Alerting
- ALRT-01: Admin receives email alert when a service goes down or degrades
- ALRT-02: Alert deduplication prevents repeat emails for the same ongoing issue (cooldown period)
- ALRT-03: Admin sees in-app alert banner for active critical issues
- ALRT-04: Alert recipient stored as configuration, not hardcoded
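As a rough illustration of ALRT-02 and ALRT-04, the sketch below shows one way a cooldown check and a configured recipient lookup could look. The `AlertEvent` shape, the one-hour cooldown, and the `ALERT_RECIPIENT_EMAIL` variable name are all assumptions; the requirements only state that a cooldown exists and that the recipient is not hardcoded.

```typescript
// Hypothetical shape; the real alert_events schema is not shown in this document.
interface AlertEvent {
  service: string;
  alertType: "down" | "degraded";
  sentAtMs: number; // epoch milliseconds when the email was sent
}

// Assumed cooldown of one hour; the actual period is not specified.
const COOLDOWN_MS = 60 * 60 * 1000;

// Returns true when no alert of the same kind for the same service
// was sent within the cooldown window (ALRT-02).
function shouldSendAlert(
  service: string,
  alertType: AlertEvent["alertType"],
  previous: AlertEvent[],
  nowMs: number,
  cooldownMs: number = COOLDOWN_MS,
): boolean {
  const lastSent = previous
    .filter((e) => e.service === service && e.alertType === alertType)
    .reduce((latest, e) => Math.max(latest, e.sentAtMs), -Infinity);
  return nowMs - lastSent >= cooldownMs;
}

// Reads the recipient from configuration instead of hardcoding it (ALRT-04).
// The variable name is an assumption for this sketch.
function resolveRecipient(env: Record<string, string | undefined>): string | null {
  const value = env.ALERT_RECIPIENT_EMAIL?.trim();
  return value ? value : null;
}
```

Returning `null` for a missing recipient lets the caller skip sending and log the gap rather than fail the health check run.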
Processing Analytics
- ANLY-01: Document processing events persist to Supabase at write time (not in-memory only)
- ANLY-02: Admin can view processing summary: upload counts, success/failure rates, average processing time
- ANLY-03: Analytics instrumentation is non-blocking (fire-and-forget, never delays processing pipeline)
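The analytics requirements above could be approached along the following lines. This is a sketch under stated assumptions: the `ProcessingEvent` shape, the `write` callback (standing in for a Supabase insert), and both function names are hypothetical. It shows a fire-and-forget wrapper for ANLY-03 and a summary calculation for ANLY-02.

```typescript
// Hypothetical event shape; the real table columns are not shown in this document.
interface ProcessingEvent {
  documentId: string;
  eventType: "upload" | "success" | "failure";
  durationMs?: number;
}

// Starts the write and returns immediately; failures go to onError
// and never propagate into the processing pipeline (ANLY-03).
function recordEventNonBlocking(
  event: ProcessingEvent,
  write: (event: ProcessingEvent) => Promise<void>, // e.g. a Supabase insert
  onError: (err: unknown) => void = (err) => console.error("analytics write failed", err),
): void {
  try {
    void write(event).catch(onError);
  } catch (err) {
    onError(err); // the writer threw before returning a promise
  }
}

interface ProcessingSummary {
  uploads: number;
  successRate: number | null;
  failureRate: number | null;
  avgProcessingMs: number | null;
}

// Aggregates persisted events into the figures listed in ANLY-02.
function summarize(events: ProcessingEvent[]): ProcessingSummary {
  const uploads = events.filter((e) => e.eventType === "upload").length;
  const successes = events.filter((e) => e.eventType === "success");
  const failures = events.filter((e) => e.eventType === "failure").length;
  const finished = successes.length + failures;
  const durations = successes
    .map((e) => e.durationMs)
    .filter((d): d is number => typeof d === "number");
  const total = durations.reduce((sum, d) => sum + d, 0);
  return {
    uploads,
    successRate: finished === 0 ? null : successes.length / finished,
    failureRate: finished === 0 ? null : failures / finished,
    avgProcessingMs: durations.length === 0 ? null : total / durations.length,
  };
}
```

The wrapper deliberately returns `void` rather than a promise, so a caller cannot accidentally `await` it and reintroduce latency into document processing.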
Infrastructure
- INFR-01: Database migrations create service_health_checks and alert_events tables with indexes on created_at
- INFR-02: Admin API routes protected by Firebase Auth with admin email check
- INFR-03: 30-day rolling data retention cleanup runs on schedule
- INFR-04: Analytics writes use existing Supabase connection, no new database infrastructure
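Two of the infrastructure requirements above reduce to small pure checks, sketched here in TypeScript. The function names are hypothetical, and the case-insensitive email comparison is an assumption rather than a documented rule; the cutoff would be used in a scheduled delete of rows whose `created_at` is older than it.

```typescript
// Retention window from INFR-03.
const RETENTION_DAYS = 30;
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Returns the ISO timestamp before which rows are eligible for cleanup.
function retentionCutoff(now: Date, days: number = RETENTION_DAYS): string {
  return new Date(now.getTime() - days * MS_PER_DAY).toISOString();
}

// Admin check for INFR-02: compares the email from an already verified
// Firebase Auth token against the configured admin email.
// Assumption: comparison ignores case and surrounding whitespace.
function isAdmin(tokenEmail: string | undefined, adminEmail: string): boolean {
  if (!tokenEmail) return false;
  return tokenEmail.trim().toLowerCase() === adminEmail.trim().toLowerCase();
}
```

`isAdmin` assumes the token has already been verified by Firebase Auth; it only decides whether the verified identity is the admin.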
v2 Requirements
Deferred to future release. Tracked but not in current roadmap.
Service Health
- HLTH-05: Admin can view 7-day service health history with uptime percentages
- HLTH-06: Real-time auth failure detection classifies auth errors (401/403) vs transient errors (429/503) and alerts immediately on credential issues
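The two deferred health items can be sketched as small helpers. The status-code grouping follows HLTH-06 as written (401/403 as auth errors, 429/503 as transient); the `"other"` bucket, the function names, and the choice to count only `healthy` checks as uptime are assumptions for illustration.

```typescript
type FailureKind = "auth" | "transient" | "other";

// HLTH-06: credential problems (401/403) warrant an immediate alert,
// while 429/503 are treated as transient.
function classifyFailure(httpStatus: number): FailureKind {
  if (httpStatus === 401 || httpStatus === 403) return "auth";
  if (httpStatus === 429 || httpStatus === 503) return "transient";
  return "other";
}

type CheckStatus = "healthy" | "degraded" | "down";

// HLTH-05: uptime percentage over a window of stored checks.
// Assumption: only "healthy" counts as up; returns null when there is no data.
function uptimePercent(statuses: CheckStatus[]): number | null {
  if (statuses.length === 0) return null;
  const healthy = statuses.filter((s) => s === "healthy").length;
  return (healthy / statuses.length) * 100;
}
```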
Alerting
- ALRT-05: Admin can acknowledge or snooze alerts from the UI
- ALRT-06: Admin receives recovery email when a downed service returns healthy
Processing Analytics
- ANLY-04: Admin can view processing time trend charts over time
- ANLY-05: Admin can view LLM token usage and estimated cost per document and per month
Infrastructure
- INFR-05: Dashboard shows staleness warning when monitoring data stops arriving
Out of Scope
| Feature | Reason |
|---|---|
| External monitoring tools (Grafana, Datadog) | Operational overhead unjustified for single-admin app |
| Multi-user analytics views | Only one admin user; RBAC would add complexity with no benefit |
| WebSocket/SSE real-time updates | Polling at 60s intervals sufficient; WebSockets complex in Cloud Functions |
| Mobile push notifications | Email + in-app covers notification needs |
| Historical analytics beyond 30 days | Storage costs; can extend later |
| ML-based anomaly detection | Threshold-based alerting sufficient for this scale |
| Log aggregation / log search UI | Firebase Cloud Logging handles this |
Traceability
Which phases cover which requirements. Updated during roadmap creation.
| Requirement | Phase | Status |
|---|---|---|
| INFR-01 | Phase 1 | Complete |
| INFR-04 | Phase 1 | Complete |
| HLTH-02 | Phase 2 | Complete |
| HLTH-03 | Phase 2 | Pending |
| HLTH-04 | Phase 2 | Complete |
| ALRT-01 | Phase 2 | Complete |
| ALRT-02 | Phase 2 | Complete |
| ALRT-04 | Phase 2 | Complete |
| ANLY-01 | Phase 2 | Complete |
| ANLY-03 | Phase 2 | Complete |
| INFR-03 | Phase 2 | Pending |
| INFR-02 | Phase 3 | Pending |
| HLTH-01 | Phase 3 | Pending |
| ANLY-02 | Phase 3 | Pending |
| ALRT-03 | Phase 4 | Pending |
Coverage:
- v1 requirements: 15 total
- Mapped to phases: 15
- Unmapped: 0
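The coverage figures above can be recomputed mechanically. The sketch below copies the v1 requirement IDs and phase numbers from this document; the `coverage` function and its return shape are hypothetical.

```typescript
// IDs from the v1 Requirements section.
const V1_REQUIREMENTS: string[] = [
  "HLTH-01", "HLTH-02", "HLTH-03", "HLTH-04",
  "ALRT-01", "ALRT-02", "ALRT-03", "ALRT-04",
  "ANLY-01", "ANLY-02", "ANLY-03",
  "INFR-01", "INFR-02", "INFR-03", "INFR-04",
];

// Phase numbers from the Traceability table.
const PHASE_BY_REQUIREMENT: Record<string, number> = {
  "INFR-01": 1, "INFR-04": 1,
  "HLTH-02": 2, "HLTH-03": 2, "HLTH-04": 2,
  "ALRT-01": 2, "ALRT-02": 2, "ALRT-04": 2,
  "ANLY-01": 2, "ANLY-03": 2, "INFR-03": 2,
  "INFR-02": 3, "HLTH-01": 3, "ANLY-02": 3,
  "ALRT-03": 4,
};

// Counts how many requirement IDs have a phase and lists any that do not.
function coverage(ids: string[], phases: Record<string, number>) {
  const unmapped = ids.filter((id) => !(id in phases));
  return { total: ids.length, mapped: ids.length - unmapped.length, unmapped };
}
```

Calling `coverage(V1_REQUIREMENTS, PHASE_BY_REQUIREMENT)` against the tables as they stand yields 15 total, 15 mapped, and an empty unmapped list.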
Requirements defined: 2026-02-24
Last updated: 2026-02-24 (traceability mapped after roadmap creation)