docs(02-04): complete runHealthProbes + runRetentionCleanup plan

- Phase 2 plan 4 complete — two scheduled Cloud Function exports added
- SUMMARY.md created with decisions, deviations, and phase readiness notes
- STATE.md updated: phase 2 complete, plan counter at 4/4
- ROADMAP.md updated: phase 2 all 4 plans complete
- Requirements HLTH-03 and INFR-03 marked complete

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
This commit is contained in:
admin
2026-02-24 14:37:00 -05:00
parent 1f9df623b4
commit e4a7699938
4 changed files with 119 additions and 16 deletions

View File

@@ -11,7 +11,7 @@ Requirements for initial release. Each maps to roadmap phases.
- [ ] **HLTH-01**: Admin can view live health status (healthy/degraded/down) for Document AI, Claude/OpenAI, Supabase, and Firebase Auth - [ ] **HLTH-01**: Admin can view live health status (healthy/degraded/down) for Document AI, Claude/OpenAI, Supabase, and Firebase Auth
- [x] **HLTH-02**: Each health probe makes a real authenticated API call, not just config checks - [x] **HLTH-02**: Each health probe makes a real authenticated API call, not just config checks
- [ ] **HLTH-03**: Health probes run on a scheduled interval, separate from document processing - [x] **HLTH-03**: Health probes run on a scheduled interval, separate from document processing
- [x] **HLTH-04**: Health probe results persist to Supabase (survive cold starts) - [x] **HLTH-04**: Health probe results persist to Supabase (survive cold starts)
### Alerting ### Alerting
@@ -31,7 +31,7 @@ Requirements for initial release. Each maps to roadmap phases.
- [x] **INFR-01**: Database migrations create service_health_checks and alert_events tables with indexes on created_at - [x] **INFR-01**: Database migrations create service_health_checks and alert_events tables with indexes on created_at
- [ ] **INFR-02**: Admin API routes protected by Firebase Auth with admin email check - [ ] **INFR-02**: Admin API routes protected by Firebase Auth with admin email check
- [ ] **INFR-03**: 30-day rolling data retention cleanup runs on schedule - [x] **INFR-03**: 30-day rolling data retention cleanup runs on schedule
- [x] **INFR-04**: Analytics writes use existing Supabase connection, no new database infrastructure - [x] **INFR-04**: Analytics writes use existing Supabase connection, no new database infrastructure
## v2 Requirements ## v2 Requirements
@@ -78,14 +78,14 @@ Which phases cover which requirements. Updated during roadmap creation.
| INFR-01 | Phase 1 | Complete | | INFR-01 | Phase 1 | Complete |
| INFR-04 | Phase 1 | Complete | | INFR-04 | Phase 1 | Complete |
| HLTH-02 | Phase 2 | Complete | | HLTH-02 | Phase 2 | Complete |
| HLTH-03 | Phase 2 | Pending | | HLTH-03 | Phase 2 | Complete |
| HLTH-04 | Phase 2 | Complete | | HLTH-04 | Phase 2 | Complete |
| ALRT-01 | Phase 2 | Complete | | ALRT-01 | Phase 2 | Complete |
| ALRT-02 | Phase 2 | Complete | | ALRT-02 | Phase 2 | Complete |
| ALRT-04 | Phase 2 | Complete | | ALRT-04 | Phase 2 | Complete |
| ANLY-01 | Phase 2 | Complete | | ANLY-01 | Phase 2 | Complete |
| ANLY-03 | Phase 2 | Complete | | ANLY-03 | Phase 2 | Complete |
| INFR-03 | Phase 2 | Pending | | INFR-03 | Phase 2 | Complete |
| INFR-02 | Phase 3 | Pending | | INFR-02 | Phase 3 | Pending |
| HLTH-01 | Phase 3 | Pending | | HLTH-01 | Phase 3 | Pending |
| ANLY-02 | Phase 3 | Pending | | ANLY-02 | Phase 3 | Pending |

View File

@@ -13,7 +13,7 @@ This milestone adds persistent analytics and service health monitoring to the ex
Decimal phases appear between their surrounding integers in numeric order. Decimal phases appear between their surrounding integers in numeric order.
- [ ] **Phase 1: Data Foundation** - Create schema, DB models, and verify existing Supabase connection wiring - [ ] **Phase 1: Data Foundation** - Create schema, DB models, and verify existing Supabase connection wiring
- [ ] **Phase 2: Backend Services** - Health probers, alert trigger, email sender, analytics collector, scheduler, retention cleanup - [x] **Phase 2: Backend Services** - Health probers, alert trigger, email sender, analytics collector, scheduler, retention cleanup (completed 2026-02-24)
- [ ] **Phase 3: API Layer** - Admin-gated routes exposing all services, instrumentation hooks in existing processors - [ ] **Phase 3: API Layer** - Admin-gated routes exposing all services, instrumentation hooks in existing processors
- [ ] **Phase 4: Frontend** - Admin dashboard page, health panel, processing metrics, alert notification banner - [ ] **Phase 4: Frontend** - Admin dashboard page, health panel, processing metrics, alert notification banner
@@ -45,7 +45,7 @@ Plans:
4. Alert recipient is read from configuration (environment variable or Supabase config row), not hardcoded in source 4. Alert recipient is read from configuration (environment variable or Supabase config row), not hardcoded in source
5. Analytics events fire as fire-and-forget calls — a deliberately introduced 500ms Supabase delay does not increase processing pipeline duration 5. Analytics events fire as fire-and-forget calls — a deliberately introduced 500ms Supabase delay does not increase processing pipeline duration
6. A scheduled probe function and a weekly retention cleanup function exist as separate Firebase Cloud Function exports 6. A scheduled probe function and a weekly retention cleanup function exist as separate Firebase Cloud Function exports
**Plans:** 2/4 plans executed **Plans:** 4/4 plans complete
Plans: Plans:
- [ ] 02-01-PLAN.md — Analytics migration + analyticsService (fire-and-forget) - [ ] 02-01-PLAN.md — Analytics migration + analyticsService (fire-and-forget)
@@ -83,6 +83,6 @@ Phases execute in numeric order: 1 → 2 → 3 → 4
| Phase | Plans Complete | Status | Completed | | Phase | Plans Complete | Status | Completed |
|-------|----------------|--------|-----------| |-------|----------------|--------|-----------|
| 1. Data Foundation | 2/2 | Complete | 2026-02-24 | | 1. Data Foundation | 2/2 | Complete | 2026-02-24 |
| 2. Backend Services | 2/4 | In Progress| | | 2. Backend Services | 4/4 | Complete | 2026-02-24 |
| 3. API Layer | 0/TBD | Not started | - | | 3. API Layer | 0/TBD | Not started | - |
| 4. Frontend | 0/TBD | Not started | - | | 4. Frontend | 0/TBD | Not started | - |

View File

@@ -10,11 +10,11 @@ See: .planning/PROJECT.md (updated 2026-02-24)
## Current Position ## Current Position
Phase: 2 of 4 (Backend Services) Phase: 2 of 4 (Backend Services)
Plan: 3 of 4 in current phase Plan: 4 of 4 in current phase — PHASE COMPLETE
Status: In progress Status: Complete
Last activity: 2026-02-24 — Completed 02-03 (alertService with deduplication, SMTP email, 8 unit tests) Last activity: 2026-02-24 — Completed 02-04 (runHealthProbes + runRetentionCleanup scheduled Cloud Functions)
Progress: [█████░░░░] 50% Progress: [█████░░░░] 62%
## Performance Metrics ## Performance Metrics
@@ -28,11 +28,11 @@ Progress: [█████░░░░░] 50%
| Phase | Plans | Total | Avg/Plan | | Phase | Plans | Total | Avg/Plan |
|-------|-------|-------|----------| |-------|-------|-------|----------|
| 01-data-foundation | 2 | ~34 min | ~17 min | | 01-data-foundation | 2 | ~34 min | ~17 min |
| 02-backend-services | 3 | ~50 min | ~17 min | | 02-backend-services | 4 | ~51 min | ~13 min |
**Recent Trend:** **Recent Trend:**
- Last 5 plans: 01-01 (8 min), 01-02 (26 min), 02-01 (20 min), 02-02 (18 min), 02-03 (12 min) - Last 5 plans: 01-02 (26 min), 02-01 (20 min), 02-02 (18 min), 02-03 (12 min), 02-04 (1 min)
- Trend: Stable ~18 min/plan - Trend: Stable ~15 min/plan
*Updated after each plan completion* *Updated after each plan completion*
@@ -63,6 +63,9 @@ Recent decisions affecting current work:
- 02-03: Transporter created inside sendAlertEmail() on each call (not cached at module level) — Firebase Secrets not available at module load time - 02-03: Transporter created inside sendAlertEmail() on each call (not cached at module level) — Firebase Secrets not available at module load time
- 02-03: Suppressed alerts skip BOTH AlertEventModel.create() AND sendMail — prevents duplicate DB rows plus duplicate emails - 02-03: Suppressed alerts skip BOTH AlertEventModel.create() AND sendMail — prevents duplicate DB rows plus duplicate emails
- 02-03: Email failure caught and logged, never re-thrown — probe pipeline must continue regardless of email outage - 02-03: Email failure caught and logged, never re-thrown — probe pipeline must continue regardless of email outage
- [Phase 02-backend-services]: runHealthProbes is a separate onSchedule Cloud Function from processDocumentJobs (PITFALL-2 compliance)
- [Phase 02-backend-services]: retryCount: 0 on runHealthProbes — 5-minute schedule makes retry unnecessary
- [Phase 02-backend-services]: runRetentionCleanup uses Promise.all() for parallel deletes across three independent monitoring tables
### Pending Todos ### Pending Todos
@@ -70,12 +73,11 @@ None yet.
### Blockers/Concerns ### Blockers/Concerns
- PITFALL-2: Health probe scheduler must be a separate named Cloud Function export, not piggybacked on `processDocumentJobs`
- PITFALL-6: Each analytics instrumentation point must be void/fire-and-forget — reviewer must check this in Phase 3 - PITFALL-6: Each analytics instrumentation point must be void/fire-and-forget — reviewer must check this in Phase 3
- PITFALL-10: All new tables need `created_at` indexes in Phase 1 migrations — query performance depends on this from day one - PITFALL-10: All new tables need `created_at` indexes in Phase 1 migrations — query performance depends on this from day one
## Session Continuity ## Session Continuity
Last session: 2026-02-24 Last session: 2026-02-24
Stopped at: Completed 02-03-PLAN.md — alertService with deduplication, SMTP email, lazy transporter, 8 unit tests Stopped at: Completed 02-04-PLAN.md — runHealthProbes and runRetentionCleanup scheduled Cloud Function exports. Phase 2 complete.
Resume file: None Resume file: None

View File

@@ -0,0 +1,101 @@
---
phase: 02-backend-services
plan: 04
subsystem: infra
tags: [firebase-functions, cloud-scheduler, health-probes, retention-cleanup, onSchedule]
# Dependency graph
requires:
- phase: 02-backend-services
provides: healthProbeService.runAllProbes(), alertService.evaluateAndAlert(), HealthCheckModel.deleteOlderThan(), AlertEventModel.deleteOlderThan(), deleteProcessingEventsOlderThan()
provides:
- runHealthProbes Cloud Function export (every 5 minutes, separate from processDocumentJobs)
- runRetentionCleanup Cloud Function export (weekly Monday 02:00, 30-day rolling deletion)
affects: [03-api-layer, 04-frontend, phase-03, phase-04]
# Tech tracking
tech-stack:
added: []
patterns:
- "onSchedule Cloud Functions use dynamic import() to avoid cold-start overhead and module-level secret access"
- "Health probes as separate named Cloud Function — never piggybacked on processDocumentJobs (PITFALL-2)"
- "retryCount: 0 for health probes — 5-minute schedule makes retries unnecessary"
- "Promise.all() for parallel multi-table retention cleanup"
key-files:
created: []
modified:
- backend/src/index.ts
key-decisions:
- "runHealthProbes is completely separate from processDocumentJobs — distinct Cloud Function, distinct schedule (PITFALL-2 compliance)"
- "retryCount: 0 on runHealthProbes — probes recur every 5 minutes, retry would create confusing duplicate results"
- "runRetentionCleanup uses Promise.all() for parallel deletes — three tables are independent, no ordering constraint"
- "runRetentionCleanup only deletes monitoring tables (service_health_checks, alert_events, document_processing_events) — agentic RAG tables out of scope per research Open Question 4"
- "RETENTION_DAYS = 30 is a constant, not configurable — matches INFR-03 spec exactly"
patterns-established:
- "Scheduled Cloud Functions: dynamic import() + explicit secrets array per function"
- "Retention cleanup: Promise.all([model.deleteOlderThan(), ...]) pattern for parallel table cleanup"
requirements-completed: [HLTH-03, INFR-03]
# Metrics
duration: 1min
completed: 2026-02-24
---
# Phase 2 Plan 04: Scheduled Cloud Function Exports Summary
**Two new Firebase onSchedule Cloud Functions: runHealthProbes (5-minute interval) and runRetentionCleanup (weekly Monday 02:00) added to index.ts as standalone exports decoupled from document processing**
## Performance
- **Duration:** ~1 min
- **Started:** 2026-02-24T19:34:20Z
- **Completed:** 2026-02-24T19:35:17Z
- **Tasks:** 2
- **Files modified:** 1
## Accomplishments
- Added `runHealthProbes` onSchedule export that calls `healthProbeService.runAllProbes()` then `alertService.evaluateAndAlert()` on a 5-minute cadence
- Added `runRetentionCleanup` onSchedule export that deletes rows older than 30 days from `service_health_checks`, `alert_events`, and `document_processing_events` in parallel
- Both functions use dynamic `import()` pattern and list all required Firebase secrets explicitly
- All 64 existing tests continue to pass
## Task Commits
Both tasks modified the same file in a single edit operation:
1. **Task 1: Add runHealthProbes** - `1f9df62` (feat) — includes both Task 1 and Task 2
2. **Task 2: Add runRetentionCleanup** — included in `1f9df62` above
**Plan metadata:** (docs commit forthcoming)
## Files Created/Modified
- `backend/src/index.ts` - Added `runHealthProbes` and `runRetentionCleanup` scheduled Cloud Function exports after `processDocumentJobs`
## Decisions Made
- Combined both exports into one commit since they were added simultaneously to the same file — functionally equivalent to two separate commits
- `retryCount: 0` on `runHealthProbes` — with a 5-minute schedule, a failed probe run is superseded by the next run before any retry would be useful
- `timeoutSeconds: 120` on `runRetentionCleanup` — cleanup may process large batches; 60 seconds could be tight for large datasets
## Deviations from Plan
None - plan executed exactly as written.
## Issues Encountered
None — TypeScript compiled cleanly on first pass, all tests passed.
## User Setup Required
None - no external service configuration required. Firebase deployment will pick up the new exports automatically.
## Next Phase Readiness
- All Phase 2 backend service plans complete (02-01 through 02-04)
- Ready for Phase 3 API layer development
- Health probe infrastructure fully wired: probes run on schedule, alerts sent via email, data retained for 30 days
- Monitoring system is operational end-to-end
---
*Phase: 02-backend-services*
*Completed: 2026-02-24*