Add acceptance tests and align defaults to Sonnet 4

admin
2026-02-23 14:45:57 -05:00
parent 14d5c360e5
commit 9480a3c994
12 changed files with 10034 additions and 85 deletions

TODO_AND_OPTIMIZATIONS.md Normal file

@@ -0,0 +1,18 @@
# Operational To-Dos & Optimization Backlog
## To-Do List (as of 2026-02-23)
- **Wire Firebase Functions secrets**: Attach `ANTHROPIC_API_KEY`, `OPENAI_API_KEY`, `OPENROUTER_API_KEY`, `SUPABASE_SERVICE_KEY`, `SUPABASE_ANON_KEY`, `DATABASE_URL`, `EMAIL_PASS`, and `FIREBASE_SERVICE_ACCOUNT` to every deployed function so the runtime no longer depends on local `.env` values.
- **Set `GCLOUD_PROJECT_ID` explicitly**: Export `GCLOUD_PROJECT_ID=cim-summarizer` (or the active project) for local scripts and production functions so Document AI processor paths stop defaulting to `projects/undefined`.
- **Acceptance-test expansion**: Add additional CIM/output fixture pairs (beyond Handi Foods) so the automated acceptance suite enforces coverage across diverse deal structures.
- **Backend log hygiene**: Keep tailing `logs/error.log` after each deploy to confirm the service account + Anthropic credential fixes remain in place; document notable findings in deployment notes.
- **Infrastructure deployment checklist**: Update `DEPLOYMENT_GUIDE.md` with the exact Firebase/GCP commands used to fetch secrets and run Sonnet validation so future deploys stay reproducible.
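The `GCLOUD_PROJECT_ID` item above can be enforced with a small fail-fast helper, so a missing variable raises immediately instead of silently producing `projects/undefined` processor paths. This is a minimal illustrative sketch; the function name and default location are assumptions, not the project's actual code.

```python
import os

def docai_processor_path(processor_id: str, location: str = "us") -> str:
    """Build a Document AI processor resource path, failing fast when the
    project ID is unset instead of emitting 'projects/undefined'.
    (Illustrative helper; name and default location are assumptions.)"""
    project = os.environ.get("GCLOUD_PROJECT_ID")
    if not project:
        raise RuntimeError(
            "GCLOUD_PROJECT_ID is not set; export it (e.g. the active "
            "project) before building Document AI paths"
        )
    return f"projects/{project}/locations/{location}/processors/{processor_id}"
```

Calling this at startup surfaces the misconfiguration in one clear error rather than in a failed Document AI request downstream.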
## Optimization Backlog (ordered by Accuracy → Speed → Cost benefit vs. implementation risk)
1. **Deterministic financial parser enhancements** (status: partially addressed). Continue improving token alignment (multi-row tables, negative numbers) to reduce dependence on LLM retries. Risk: low, limited to parser module.
2. **Retrieval gating per agentic pass**. Replace the "top-N chunk blast" with similarity search keyed to each prompt (deal overview, market, thesis). Benefit: higher accuracy + lower token count. Risk: medium; needs robust Supabase RPC fallbacks.
3. **Embedding cache keyed by document checksum**. Skip re-embedding when a document/version is unchanged to cut processing time/cost on retries. Risk: medium; requires schema changes to store content hashes.
4. **Field-level validation & dependency checks prior to gap filling**. Enforce numeric relationships (e.g., EBITDA margin = EBITDA / Revenue) and re-query only the failing sections. Benefit: accuracy; risk: medium (adds validator & targeted prompts).
5. **Stream Document AI chunks directly into the chunker**. Avoid writing intermediate PDFs to disk/GCS when splitting CIMs longer than 30 pages. Benefit: speed/cost; risk: medium-high because it touches PDF splitting + Document AI integration.
6. **Parallelize independent multi-pass queries** (e.g., run Pass 2 and Pass 3 concurrently when quota allows). Benefit: lower latency; risk: medium-high due to Anthropic rate limits & merge ordering.
7. **Expose per-pass metrics via `/health/agentic-rag`**. Surface timing/token/cost data so regressions are visible. Benefit: operational accuracy; risk: low.
8. **Structured comparison harness for CIM outputs**. Reuse the acceptance-test fixtures to generate diff reports for human reviewers (baseline vs. new model). Benefit: accuracy guardrail; risk: low once additional fixtures exist.
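The retrieval gating in item 2 amounts to ranking stored chunks by cosine similarity against a per-pass query embedding and keeping only the top k, rather than sending one fixed top-N set to every prompt. A minimal sketch, assuming chunks arrive as `(text, vector)` pairs; in production the ranking would live in a Supabase RPC:

```python
import math

def top_k_chunks(query_vec, chunks, k=3):
    """Rank chunks by cosine similarity to a per-pass query embedding
    (item 2). `chunks` is a list of (text, vector) pairs; all names here
    are illustrative, not the project's actual API."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c[1]), reverse=True)
    return [text for text, _ in ranked[:k]]
```

Keying `query_vec` to each pass's prompt is what drops irrelevant chunks from the context window and shrinks token counts.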
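The checksum-keyed embedding cache in item 3 can be sketched as below. This in-memory version only illustrates the lookup discipline; the actual change requires persisting the content hash alongside each document row, as the item notes. Class and method names are assumptions.

```python
import hashlib

class EmbeddingCache:
    """Skip re-embedding when document content is unchanged (item 3).
    Illustrative in-memory sketch; production would persist hashes."""
    def __init__(self):
        self._store = {}  # sha256 hex digest -> embedding vector

    @staticmethod
    def checksum(content: bytes) -> str:
        return hashlib.sha256(content).hexdigest()

    def get_or_embed(self, content: bytes, embed_fn):
        key = self.checksum(content)
        if key not in self._store:  # only pay for embedding on a cache miss
            self._store[key] = embed_fn(content)
        return self._store[key]
```

On a retry of an unchanged document, `embed_fn` is never called, which is where the time and cost savings come from.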
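The numeric-relationship check in item 4 (EBITDA margin = EBITDA / Revenue) reduces to comparing the reported value against the derived one within a tolerance, and re-querying only when they disagree. A hedged sketch, with an assumed tolerance and function name:

```python
def ebitda_margin_consistent(revenue, ebitda, reported_margin, tol=0.005):
    """Return True when the reported EBITDA margin agrees with
    EBITDA / Revenue within `tol` (item 4). A False result would mark
    just that section for a targeted re-query. Illustrative names."""
    if not revenue:
        return False  # cannot derive a margin without revenue
    derived = ebitda / revenue
    return abs(derived - reported_margin) <= tol
```

Running such checks before gap filling means only the failing fields trigger new LLM calls, instead of re-running whole passes.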
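The pass parallelization in item 6 can be kept inside rate limits with a semaphore, while `asyncio.gather` preserves argument order so merge ordering stays deterministic regardless of which pass finishes first. A minimal sketch under those assumptions:

```python
import asyncio

async def run_passes_concurrently(pass2, pass3, max_concurrent=2):
    """Run two independent passes concurrently (item 6). The semaphore
    caps in-flight calls for provider rate limits; gather returns results
    in argument order, keeping merges deterministic. Illustrative sketch."""
    sem = asyncio.Semaphore(max_concurrent)

    async def gated(coro_fn):
        async with sem:
            return await coro_fn()

    return await asyncio.gather(gated(pass2), gated(pass3))
```

Lowering `max_concurrent` to 1 degrades gracefully back to sequential behavior when quota is tight.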