docs: map existing codebase

admin
2026-02-24 10:28:22 -05:00
parent 9a906763c7
commit e6e1b1fa6f
7 changed files with 1969 additions and 0 deletions

@@ -0,0 +1,243 @@
# Architecture
**Analysis Date:** 2026-02-24
## Pattern Overview
**Overall:** Full-stack distributed system combining Express.js backend with React frontend, implementing a **multi-stage document processing pipeline** with queued background jobs and real-time monitoring.
**Key Characteristics:**
- Server-rendered PDF generation with single-pass LLM processing
- Asynchronous job queue for background document processing (max 3 concurrent)
- Firebase authentication with Supabase PostgreSQL + pgvector for embeddings
- Multi-provider LLM support (Anthropic, OpenAI, OpenRouter)
- Structured schema extraction using Zod and LLM-driven analysis
- Google Document AI for OCR and text extraction
- Real-time upload progress tracking via SSE/polling
- Correlation ID tracking throughout distributed pipeline
## Layers
**API Layer (Express + TypeScript):**
- Purpose: HTTP request routing, authentication, and response handling
- Location: `backend/src/index.ts`, `backend/src/routes/`, `backend/src/controllers/`
- Contains: Route definitions, request validation, error handling
- Depends on: Middleware (auth, validation), Services
- Used by: Frontend and external clients
**Authentication Layer:**
- Purpose: Firebase ID token verification and user identity validation
- Location: `backend/src/middleware/firebaseAuth.ts`, `backend/src/config/firebase.ts`
- Contains: Token verification, service account initialization, session recovery
- Depends on: Firebase Admin SDK, configuration
- Used by: All protected routes via `verifyFirebaseToken` middleware
**Controller Layer:**
- Purpose: Request handling, input validation, service orchestration
- Location: `backend/src/controllers/documentController.ts`, `backend/src/controllers/authController.ts`
- Contains: `getUploadUrl()`, `processDocument()`, `getDocumentStatus()` handlers
- Depends on: Models, Services, Middleware
- Used by: Routes
**Service Layer:**
- Purpose: Business logic, external API integration, document processing orchestration
- Location: `backend/src/services/`
- Contains:
- `unifiedDocumentProcessor.ts` - Main orchestrator, strategy selection
- `singlePassProcessor.ts` - 2-LLM-call extraction (pass 1 + quality check)
- `documentAiProcessor.ts` - Google Document AI text extraction
- `llmService.ts` - LLM API calls with retry logic (3 attempts, exponential backoff)
- `jobQueueService.ts` - Background job processing (EventEmitter-based)
- `fileStorageService.ts` - Google Cloud Storage signed URLs and uploads
- `vectorDatabaseService.ts` - Supabase vector embeddings and search
- `pdfGenerationService.ts` - Puppeteer-based PDF rendering
- `csvExportService.ts` - Financial data export
- Depends on: Models, Config, Utilities
- Used by: Controllers, Job Queue
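
The retry behavior described for `llmService.ts` (3 attempts, exponential backoff) follows a common pattern that can be sketched generically. The helper name `withRetry` and its defaults are illustrative, not the actual implementation:

```typescript
// Generic retry helper in the spirit of llmService's 3-attempt
// exponential backoff: delays grow as base, 2x base, 4x base.
async function withRetry<T>(
  fn: () => Promise<T>,
  attempts = 3,
  baseDelayMs = 1000,
): Promise<T> {
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      // Sleep only between attempts, not after the final failure.
      if (i < attempts - 1) {
        await new Promise((r) => setTimeout(r, baseDelayMs * 2 ** i));
      }
    }
  }
  throw lastError;
}
```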
**Model Layer (Data Access):**
- Purpose: Database interactions, query execution, schema validation
- Location: `backend/src/models/`
- Contains: `DocumentModel.ts`, `ProcessingJobModel.ts`, `UserModel.ts`, `VectorDatabaseModel.ts`
- Depends on: Supabase client, configuration
- Used by: Services, Controllers
**Job Queue Layer:**
- Purpose: Asynchronous background processing with priority and retry handling
- Location: `backend/src/services/jobQueueService.ts`, `backend/src/services/jobProcessorService.ts`
- Contains: In-memory queue, worker pool (max 3 concurrent), Firebase scheduled function trigger
- Depends on: Services (document processor), Models
- Used by: Controllers (to enqueue work), Scheduled functions (to trigger processing)
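
A minimal sketch of the EventEmitter-based in-memory queue shape described above, capped at three concurrent workers. Class, method, and event names here are illustrative, not the actual `jobQueueService` API:

```typescript
import { EventEmitter } from "node:events";

type Job = { id: string; run: () => Promise<void> };

// In-memory queue: jobs wait in `pending` until one of the
// (at most 3) worker slots frees up; results surface as events.
class JobQueue extends EventEmitter {
  private pending: Job[] = [];
  private active = 0;

  constructor(private readonly maxConcurrent = 3) {
    super();
  }

  enqueue(job: Job): void {
    this.pending.push(job);
    this.drain();
  }

  private drain(): void {
    while (this.active < this.maxConcurrent && this.pending.length > 0) {
      const job = this.pending.shift()!;
      this.active++;
      job
        .run()
        .then(() => this.emit("completed", job.id))
        .catch((err) => this.emit("failed", job.id, err))
        .finally(() => {
          this.active--;
          this.drain(); // pull the next pending job into the freed slot
        });
    }
  }
}
```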
**Frontend Layer (React + TypeScript):**
- Purpose: User interface for document upload, processing monitoring, and review
- Location: `frontend/src/`
- Contains: Components (Upload, List, Viewer, Analytics), Services, Contexts
- Depends on: Backend API, Firebase Auth, Axios
- Used by: Web browsers
## Data Flow
**Document Upload & Processing Flow:**
1. **Upload Initiation** (Frontend)
- User selects PDF file via `DocumentUpload` component
- Calls `documentService.getUploadUrl()` → Backend `/documents/upload-url` endpoint
- Backend creates document record (status: 'uploading') and generates signed GCS URL
2. **File Upload** (Frontend → GCS)
- Frontend uploads file directly to Google Cloud Storage via signed URL
- Frontend polls `documentService.getDocumentStatus()` for upload completion
- `UploadMonitoringDashboard` displays real-time progress
3. **Processing Trigger** (Frontend → Backend)
- Frontend calls `POST /documents/{id}/process` once upload complete
- Controller creates processing job and enqueues to `jobQueueService`
- Controller immediately returns job ID
4. **Background Job Execution** (Job Queue)
- Scheduled Firebase function (`processDocumentJobs`) runs every 1 minute
- Calls `jobProcessorService.processJobs()` to dequeue and execute
- For each queued document:
- Fetch file from GCS
- Update status to 'extracting_text'
- Call `unifiedDocumentProcessor.processDocument()`
5. **Document Processing** (Single-Pass Strategy)
- **Pass 1 - LLM Extraction:**
- `documentAiProcessor.extractText()` (if needed) - Google Document AI OCR
- `llmService.processCIMDocument()` - Claude/OpenAI structured extraction
- Produces `CIMReview` object with financial, market, management data
- Updates document status to 'processing_llm'
- **Pass 2 - Quality Check:**
- `llmService.validateCIMReview()` - Verify completeness and accuracy
- Updates status to 'quality_validation'
- **PDF Generation:**
- `pdfGenerationService.generatePDF()` - Puppeteer renders HTML template
- Uploads PDF to GCS
- Updates status to 'generating_pdf'
- **Vector Indexing (Background):**
- `vectorDatabaseService.createDocumentEmbedding()` - Generate 3072-dim embeddings
- Chunk document semantically, store in Supabase with vector index
- Status moves to 'vector_indexing' then 'completed'
6. **Result Delivery** (Backend → Frontend)
- Frontend polls `GET /documents/{id}` to check completion
- When status = 'completed', fetches summary and analysis data
- `DocumentViewer` displays results, allows regeneration with feedback
**State Management:**
- Backend: Document status progresses through `uploading → extracting_text → processing_llm → quality_validation → generating_pdf → vector_indexing → completed`, or moves to `failed` at any step
- Frontend: AuthContext manages user/token, component state tracks selected document and loading states
- Job Queue: In-memory queue with EventEmitter for state transitions
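
The status progression above can be expressed as a small transition table. This is a sketch of the state machine implied by the pipeline, not code from the repo; the `canTransition` helper is hypothetical:

```typescript
type DocumentStatus =
  | "uploading"
  | "extracting_text"
  | "processing_llm"
  | "quality_validation"
  | "generating_pdf"
  | "vector_indexing"
  | "completed"
  | "failed";

// Each status may only advance to the next pipeline stage.
const transitions: Record<DocumentStatus, DocumentStatus[]> = {
  uploading: ["extracting_text"],
  extracting_text: ["processing_llm"],
  processing_llm: ["quality_validation"],
  quality_validation: ["generating_pdf"],
  generating_pdf: ["vector_indexing"],
  vector_indexing: ["completed"],
  completed: [],
  failed: [],
};

function canTransition(from: DocumentStatus, to: DocumentStatus): boolean {
  // Any in-flight state may fail; terminal states stay terminal.
  if (to === "failed") return from !== "completed" && from !== "failed";
  return transitions[from].includes(to);
}
```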
## Key Abstractions
**Unified Processor:**
- Purpose: Strategy pattern for document processing (single-pass vs. agentic RAG vs. simple)
- Examples: `singlePassProcessor`, `simpleDocumentProcessor`, `optimizedAgenticRAGProcessor`
- Pattern: Pluggable strategies via `ProcessingStrategy` selection in config
**LLM Service:**
- Purpose: Unified interface for multiple LLM providers with retry logic
- Examples: `backend/src/services/llmService.ts` (Anthropic, OpenAI, OpenRouter)
- Pattern: Provider-agnostic API with `processCIMDocument()` returning structured `CIMReview`
**Vector Database Abstraction:**
- Purpose: PostgreSQL pgvector operations via Supabase for semantic search
- Examples: `backend/src/services/vectorDatabaseService.ts`
- Pattern: Embedding + chunking → vector search via cosine similarity
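
The ranking metric behind this abstraction is plain cosine similarity (pgvector's `<=>` operator returns cosine *distance*, i.e. 1 minus similarity). A self-contained reference implementation for intuition:

```typescript
// Cosine similarity between two equal-length vectors:
// 1 = same direction, 0 = orthogonal, -1 = opposite.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```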
**File Storage Abstraction:**
- Purpose: Google Cloud Storage operations with signed URLs
- Examples: `backend/src/services/fileStorageService.ts`
- Pattern: Signed upload/download URLs for temporary access without IAM burden
**Job Queue Pattern:**
- Purpose: Async processing with retry and priority handling
- Examples: `backend/src/services/jobQueueService.ts` (EventEmitter-based)
- Pattern: Priority queue with exponential backoff retry
## Entry Points
**API Entry Point:**
- Location: `backend/src/index.ts`
- Triggers: Process startup or Firebase Functions invocation
- Responsibilities:
- Initialize Express app
- Set up middleware (CORS, helmet, rate limiting, authentication)
- Register routes (`/documents`, `/vector`, `/monitoring`, `/api/audit`)
- Start job queue service
- Export Firebase Functions v2 handlers (`api`, `processDocumentJobs`)
**Scheduled Job Processing:**
- Location: `backend/src/index.ts` (line 252: `processDocumentJobs` function export)
- Triggers: Firebase Cloud Scheduler every 1 minute
- Responsibilities:
- Health check database connection
- Detect stuck jobs (processing > 15 min, pending > 2 min)
- Call `jobProcessorService.processJobs()`
- Log metrics and errors
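
The stuck-job thresholds above (processing > 15 min, pending > 2 min) reduce to a simple timestamp filter. A sketch, assuming a `JobRecord` shape with a last-updated epoch timestamp (the type and function names are illustrative):

```typescript
type JobRecord = { id: string; status: "pending" | "processing"; updatedAt: number };

// A job is "stuck" when it has sat in processing for >15 minutes
// or in pending for >2 minutes, relative to `now` (epoch ms).
function findStuckJobs(jobs: JobRecord[], now: number): JobRecord[] {
  return jobs.filter(
    (j) =>
      (j.status === "processing" && now - j.updatedAt > 15 * 60_000) ||
      (j.status === "pending" && now - j.updatedAt > 2 * 60_000),
  );
}
```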
**Frontend Entry Point:**
- Location: `frontend/src/main.tsx`
- Triggers: Browser navigation
- Responsibilities:
- Initialize React app with AuthProvider
- Set up Firebase client
- Render routing structure (Login → Dashboard)
**Document Processing Controller:**
- Location: `backend/src/controllers/documentController.ts`
- Route: `POST /documents/{id}/process`
- Responsibilities:
- Validate user authentication
- Enqueue processing job
- Return job ID to client
## Error Handling
**Strategy:** Multi-layer error recovery with structured logging and graceful degradation
**Patterns:**
- **Retry Logic:** DocumentModel uses exponential backoff (1s → 2s → 4s) for network errors
- **LLM Retry:** `llmService` retries API calls 3 times with exponential backoff
- **Firebase Auth Recovery:** `firebaseAuth.ts` attempts session recovery on token verify failure
- **Job Queue Retry:** Jobs retry up to 3 times with configurable backoff (5s → 300s max)
- **Structured Error Logging:** All errors include correlation ID, stack trace, and context metadata
- **Circuit Breaker Pattern:** Database health check in `processDocumentJobs` prevents cascading failures
**Error Boundaries:**
- Global error handler at end of Express middleware chain (`errorHandler`)
- Try/catch in all async functions with context-aware logging
- Unhandled rejection listener at process level (line 24 of `index.ts`)
## Cross-Cutting Concerns
**Logging:**
- Framework: Winston (JSON output; console transport in development)
- Approach: Structured logger with correlation IDs, Winston transports for error/upload logs
- Location: `backend/src/utils/logger.ts`
- Pattern: `logger.info()`, `logger.error()`, `StructuredLogger` for operations
**Validation:**
- Approach: Joi schema in environment config, Zod for API request/response types
- Location: `backend/src/config/env.ts`, `backend/src/services/llmSchemas.ts`
- Pattern: Joi for config, Zod for runtime validation
**Authentication:**
- Approach: Firebase ID tokens verified via `verifyFirebaseToken` middleware
- Location: `backend/src/middleware/firebaseAuth.ts`
- Pattern: Bearer token in Authorization header, cached in req.user
**Correlation Tracking:**
- Approach: UUID correlation ID added to all requests, propagated through job processing
- Location: `backend/src/middleware/validation.ts` (addCorrelationId)
- Pattern: X-Correlation-ID header or generated UUID, included in all logs
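
The header-or-UUID pattern can be sketched as a small pure function; `resolveCorrelationId` is a hypothetical name, not the actual `addCorrelationId` middleware:

```typescript
import { randomUUID } from "node:crypto";

// Reuse the caller-supplied X-Correlation-ID header when present
// (Express lowercases header names); otherwise mint a fresh UUID.
function resolveCorrelationId(headers: Record<string, string | undefined>): string {
  const incoming = headers["x-correlation-id"];
  return incoming && incoming.trim().length > 0 ? incoming : randomUUID();
}
```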
---
*Architecture analysis: 2026-02-24*

@@ -0,0 +1,329 @@
# Codebase Concerns
**Analysis Date:** 2026-02-24
## Tech Debt
**Console.log Debug Statements in Controllers:**
- Issue: Excessive `console.log()` calls with emoji prefixes left throughout `documentController.ts` instead of using proper structured logging via Winston logger
- Files: `backend/src/controllers/documentController.ts` (lines 12-80, multiple scattered instances)
- Impact: Production logs become noisy and unstructured; debug output leaks to stdout/stderr; makes it harder to parse logs for errors and metrics
- Fix approach: Replace all `console.log()` calls with `logger.info()`, `logger.debug()`, `logger.error()` via imported `logger` from `utils/logger.ts`. Follow pattern established in other services.
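
The fix in miniature: emit one parseable JSON line per event, carrying level and context, instead of emoji-prefixed free text. This hand-rolled `logLine` is a stand-in for the Winston logger, for illustration only:

```typescript
type LogLevel = "debug" | "info" | "warn" | "error";

// Structured log entry: one JSON line with level, message, timestamp,
// and arbitrary context (e.g. correlationId), parseable downstream --
// unlike ad-hoc console.log output.
function logLine(
  level: LogLevel,
  message: string,
  context: Record<string, unknown> = {},
): string {
  return JSON.stringify({
    level,
    message,
    timestamp: new Date().toISOString(),
    ...context,
  });
}
```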
**Incomplete Job Statistics Tracking:**
- Issue: `jobQueueService.ts` and `jobProcessorService.ts` both have TODO markers indicating completed/failed job counts are not tracked (lines 606-607, 635-636)
- Files: `backend/src/services/jobQueueService.ts`, `backend/src/services/jobProcessorService.ts`
- Impact: Job queue health metrics are incomplete; cannot audit success/failure rates; monitoring dashboards will show incomplete data
- Fix approach: Implement `completedJobs` and `failedJobs` counters in both services using persistent storage or Redis. Update schema if needed.
**Config Migration Debug Cruft:**
- Issue: Multiple `console.log()` debug statements in `config/env.ts` (lines 23, 46, 51, 292) for Firebase Functions v1→v2 migration are still present
- Files: `backend/src/config/env.ts`
- Impact: Production logs polluted with migration warnings; makes it harder to spot real issues; clutters server startup output
- Fix approach: Remove all `[CONFIG DEBUG]` console.log statements once migration to Firebase Functions v2 is confirmed complete. Wrap remaining fallback logic in logger.debug() if diagnostics needed.
**Hardcoded Processing Strategy:**
- Issue: Historical commit shows processing strategy was hardcoded, potential for incomplete refactoring
- Files: `backend/src/services/`, controller logic
- Impact: May not correctly use configured strategy; processing may default unexpectedly
- Fix approach: Verify all processing paths read from `config.processingStrategy` and have proper fallback logic
**Type Safety Issues - `any` Type Usage:**
- Issue: 378 instances of `any` or `unknown` types found across backend TypeScript files
- Files: Widespread including `optimizedAgenticRAGProcessor.ts:17`, `pdfGenerationService.ts`, `vectorDatabaseService.ts`
- Impact: Loses type safety guarantees; harder to catch errors at compile time; refactoring becomes risky
- Fix approach: Gradually replace `any` with proper types. Start with service boundaries and public APIs. Create typed interfaces for common patterns.
## Known Bugs
**Project Panther CIM KPI Missing After Processing:**
- Symptoms: Document `Project Panther - Confidential Information Memorandum_vBluePoint.pdf` processed but dashboard shows "Not specified in CIM" for Revenue, EBITDA, Employees, Founded even though numeric tables exist in PDF
- Files: `backend/src/services/optimizedAgenticRAGProcessor.ts` (dealOverview mapper), processing pipeline
- Trigger: Process Project Panther test document through full agentic RAG pipeline
- Impact: Dashboard KPI cards remain empty; users see incomplete summaries
- Workaround: Manual data entry in dashboard; skip financial summary display for affected documents
- Fix approach: Trace through `optimizedAgenticRAGProcessor.generateLLMAnalysisMultiPass()` into the `dealOverview` mapper. Add regression test for this specific document. Check if structured table extraction is working correctly.
**10+ Minute Processing Latency Regression:**
- Symptoms: Document `document-55c4a6e2-8c08-4734-87f6-24407cea50ac.pdf` (Project Panther) took ~10 minutes end-to-end despite typical processing being 2-3 minutes
- Files: `backend/src/services/unifiedDocumentProcessor.ts`, `optimizedAgenticRAGProcessor.ts`, `documentAiProcessor.ts`, `llmService.ts`
- Trigger: Large or complex CIM documents (30+ pages with tables)
- Impact: Users experience timeouts; processing approaching or exceeding 14-minute Firebase Functions limit
- Workaround: None currently; document fails to process if latency exceeds timeout
- Fix approach: Instrument each pipeline phase (PDF chunking, Document AI extraction, RAG passes, financial parser) with timing logs. Identify bottleneck(s). Profile GCS upload retries, Anthropic fallbacks. Consider parallel multi-pass queries within quota limits.
**Vector Search Timeouts After Index Growth:**
- Symptoms: Supabase vector search RPC calls timeout after 30 seconds; fallback to document-scoped search with limited results
- Files: `backend/src/services/vectorDatabaseService.ts` (lines 122-182)
- Trigger: Large embedded document collections (1000+ chunks); similarity search under load
- Impact: Retrieval quality degrades as index grows; fallback search returns fewer contextual chunks; RAG quality suffers
- Workaround: Fallback query uses document-scoped filtering and direct embedding lookup
- Fix approach: Implement query batching, result caching by content hash, or query optimization. Consider Pinecone migration if Supabase vector performance doesn't improve. Add metrics to track timeout frequency.
## Security Considerations
**Unencrypted Debug Logs in Production:**
- Risk: Sensitive document content, user IDs, and processing details may be exposed in logs if debug mode enabled in production
- Files: `backend/src/middleware/firebaseAuth.ts` (AUTH_DEBUG flag), `backend/src/config/env.ts`, `backend/src/controllers/documentController.ts`
- Current mitigation: Debug logging controlled by `AUTH_DEBUG` environment variable; not enabled by default
- Recommendations:
1. Ensure `AUTH_DEBUG` is never set to `true` in production
2. Implement log redaction middleware to strip PII (API keys, document content, user data)
3. Use correlation IDs instead of logging full request bodies
4. Add log level enforcement (error/warn only in production)
**Hardcoded Service Account Credentials Path:**
- Risk: If service account key JSON is accidentally committed or exposed, attacker gains full GCS and Document AI access
- Files: `backend/src/config/env.ts`, `backend/src/utils/googleServiceAccount.ts`
- Current mitigation: `.env` file in `.gitignore`; credentials path via env var
- Recommendations:
1. Use Firebase Function secrets (defineSecret()) instead of env files
2. Implement credential rotation policy
3. Add pre-commit hook to prevent `.json` key files in commits
4. Audit GCS bucket permissions quarterly
**Concurrent LLM Rate Limiting Insufficient:**
- Risk: Although `llmService.ts` limits concurrent calls to 1 (line 52), burst requests could still trigger Anthropic 429 rate limit errors during high load
- Files: `backend/src/services/llmService.ts` (MAX_CONCURRENT_LLM_CALLS = 1)
- Current mitigation: Max 1 concurrent call; retry with exponential backoff (3 attempts)
- Recommendations:
1. Queue requests with enforced spacing between calls during peak hours, rather than firing each as soon as the single slot frees
2. Add request batching for multi-pass analysis
3. Implement circuit breaker pattern for cascading failures
4. Monitor token spend and throttle proactively
**No Request Rate Limiting on Upload Endpoint:**
- Risk: Attackers (including authenticated users or holders of stolen tokens) could flood the `/documents/upload-url` endpoint to exhaust quota or fill storage
- Files: `backend/src/controllers/documentController.ts` (getUploadUrl endpoint), `backend/src/routes/documents.ts`
- Current mitigation: Firebase Auth check; file size limit enforced
- Recommendations:
1. Add rate limiter middleware (e.g., express-rate-limit) with per-user quotas
2. Implement request signing for upload URLs
3. Add CORS restrictions to known frontend domains
4. Monitor upload rate and alert on anomalies
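
Recommendation 1 above amounts to a per-user sliding-window limiter. A dependency-free sketch of the idea (a real deployment would likely use express-rate-limit; this class and its names are illustrative):

```typescript
// Sliding-window rate limiter: at most `limit` calls per key
// within the trailing `windowMs` milliseconds.
class RateLimiter {
  private hits = new Map<string, number[]>();

  constructor(
    private readonly limit: number,
    private readonly windowMs: number,
  ) {}

  allow(userId: string, now = Date.now()): boolean {
    // Keep only timestamps still inside the window.
    const recent = (this.hits.get(userId) ?? []).filter((t) => now - t < this.windowMs);
    if (recent.length >= this.limit) {
      this.hits.set(userId, recent);
      return false; // over quota
    }
    recent.push(now);
    this.hits.set(userId, recent);
    return true;
  }
}
```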
## Performance Bottlenecks
**Large File PDF Chunking Memory Usage:**
- Problem: Documents larger than 50 MB may cause OOM errors during chunking; no memory limit guards
- Files: `backend/src/services/optimizedAgenticRAGProcessor.ts` (line 35, 4000-char chunks), `backend/src/services/unifiedDocumentProcessor.ts`
- Cause: Entire document text loaded into memory before chunking; large overlap between chunks multiplies footprint
- Improvement path:
1. Implement streaming chunk processing from GCS (read chunks, embed, write to DB before next chunk)
2. Reduce overlap from 200 to 100 characters or make dynamic based on document size
3. Add memory threshold checks; fail early with user-friendly error if approaching limit
4. Profile heap usage in tests with 50+ MB documents
**Embedding Generation for Large Documents:**
- Problem: Embedding 1000+ chunks sequentially takes 2-3 minutes; no concurrency despite `maxConcurrentEmbeddings = 5` setting
- Files: `backend/src/services/optimizedAgenticRAGProcessor.ts` (lines 37, 172-180 region)
- Cause: Batch size of 10 may be inefficient; OpenAI/Anthropic API concurrency not fully utilized
- Improvement path:
1. Increase batch size to 25-50 chunks per concurrent request (test quota limits)
2. Use Promise.all() instead of sequential embedding calls
3. Cache embeddings by content hash to skip re-embedding on retries
4. Add progress callback to track batch completion
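
The batching improvement (point 2) can be sketched as a generic helper: slice the chunk list into batches and run each batch concurrently with `Promise.all`, instead of awaiting every embedding call in sequence. The helper name is hypothetical:

```typescript
// Process items in batches of `batchSize`; items inside a batch run
// concurrently, batches themselves run one after another so total
// concurrency stays bounded.
async function processInBatches<T, R>(
  items: T[],
  batchSize: number,
  worker: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = [];
  for (let i = 0; i < items.length; i += batchSize) {
    const batch = items.slice(i, i + batchSize);
    results.push(...(await Promise.all(batch.map(worker))));
  }
  return results;
}
```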
**Multiple LLM Retries on Network Failure:**
- Problem: 3 retry attempts for each LLM call with exponential backoff means up to 30+ seconds per call; multi-pass analysis does 3+ passes
- Files: `backend/src/services/llmService.ts` (retry logic, lines 320+), `backend/src/services/optimizedAgenticRAGProcessor.ts` (line 83 multi-pass)
- Cause: No circuit breaker; all retries execute even if service degraded
- Improvement path:
1. Track consecutive failures; disable retries if failure rate >50% in last minute
2. Use adaptive retry backoff (double wait time only after first failure)
3. Implement multi-pass fallback: if Pass 2 fails, use Pass 1 results instead of failing entire document
4. Add metrics endpoint to show retry frequency and success rates
**PDF Generation Memory Leak with Puppeteer Page Pool:**
- Problem: Page pool in `pdfGenerationService.ts` may not properly release browser resources; max pool size 5 but no eviction policy
- Files: `backend/src/services/pdfGenerationService.ts` (lines 66-71, page pool)
- Cause: Pages may not be closed if PDF generation errors mid-stream; no cleanup on timeout
- Improvement path:
1. Implement LRU eviction: close oldest page if pool reaches max size
2. Add page timeout with forced close after 30s
3. Add memory monitoring; close all pages if heap >500MB
4. Log page pool stats every 5 minutes to detect leaks
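
The mid-stream leak scenario (a page never released when rendering throws) is exactly what a try/finally acquire-release wrapper prevents. A sketch with a minimal page interface standing in for a Puppeteer page; all names here are hypothetical:

```typescript
interface PooledPage {
  close(): Promise<void>;
  render(html: string): Promise<Buffer>;
}

// Run `fn` with an acquired page and guarantee the page is closed
// afterwards, even when rendering throws mid-stream.
async function withPage<T>(
  acquire: () => Promise<PooledPage>,
  fn: (page: PooledPage) => Promise<T>,
): Promise<T> {
  const page = await acquire();
  try {
    return await fn(page);
  } finally {
    await page.close(); // release happens on success AND on error
  }
}
```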
## Fragile Areas
**Job Queue State Machine:**
- Files: `backend/src/services/jobQueueService.ts`, `backend/src/services/jobProcessorService.ts`, `backend/src/models/ProcessingJobModel.ts`
- Why fragile:
1. Job status transitions (pending → processing → completed) not atomic; race condition if two workers pick same job
2. Stuck job detection relies on timestamp comparison; clock skew or server restart breaks detection
3. No idempotency tokens; job retry on network error could trigger duplicate processing
- Safe modification:
1. Add database-level unique constraint on job ID + processing timestamp
2. Use database transactions for status updates
3. Implement idempotency with request deduplication ID
- Test coverage:
1. No unit tests found for concurrent job processing scenario
2. No integration tests with actual database
3. Add tests for: concurrent workers, stuck job reset, duplicate submissions
**Document Processing Pipeline Error Handling:**
- Files: `backend/src/controllers/documentController.ts` (lines 200+), `backend/src/services/unifiedDocumentProcessor.ts`
- Why fragile:
1. Hybrid approach tries job queue then fallback to immediate processing; error in job queue doesn't fully propagate
2. Document status not updated if processing fails mid-pipeline (remains 'processing_llm')
3. No compensating transaction to roll back partial results
- Safe modification:
1. Separate job submission from immediate processing; always update document status atomically
2. Add processing stage tracking (document_ai → chunking → embedding → llm → pdf)
3. Implement rollback logic: delete chunks and embeddings if LLM stage fails
- Test coverage:
1. Add tests for each pipeline stage failure
2. Test document status consistency after each failure
3. Add integration test with network failure injection
**Vector Database Search Fallback Chain:**
- Files: `backend/src/services/vectorDatabaseService.ts` (lines 110-182)
- Why fragile:
1. Three-level fallback (RPC search → document-scoped search → direct lookup) masks underlying issues
2. If Supabase RPC is degraded, system degrades silently instead of alerting
3. Fallback search may return stale or incorrect results without indication
- Safe modification:
1. Add circuit breaker: if timeout happens 3x in 5 minutes, stop trying RPC search
2. Return metadata flag indicating which fallback was used (for logging/debugging)
3. Wrap the search call in an explicit timeout with try/catch rather than relying on Promise.race() (cleaner control flow)
- Test coverage:
1. Mock Supabase timeout at each RPC level
2. Verify correct fallback is triggered
3. Add performance benchmarks for each search method
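
Safe modification 1 above ("if timeout happens 3x in 5 minutes, stop trying RPC search") is a windowed circuit breaker. A minimal sketch, with hypothetical names:

```typescript
// Windowed circuit breaker: once `threshold` failures land inside the
// trailing `windowMs`, the breaker opens and callers should skip the
// degraded path until old failures age out of the window.
class CircuitBreaker {
  private failures: number[] = [];

  constructor(
    private readonly threshold: number,
    private readonly windowMs: number,
  ) {}

  recordFailure(now = Date.now()): void {
    this.failures.push(now);
  }

  isOpen(now = Date.now()): boolean {
    // Drop failures that have aged out of the window.
    this.failures = this.failures.filter((t) => now - t < this.windowMs);
    return this.failures.length >= this.threshold;
  }
}
```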
**Config Initialization Race Condition:**
- Files: `backend/src/config/env.ts` (lines 15-52)
- Why fragile:
1. Firebase Functions v1 fallback (`functions.config()`) may not be thread-safe
2. If multiple instances start simultaneously, config merge may be incomplete
3. No validation that config merge was successful
- Safe modification:
1. Remove v1 fallback entirely; require explicit Firebase Functions v2 setup
2. Validate all critical env vars before allowing service startup
3. Fail fast with clear error message if required vars missing
- Test coverage:
1. Add test for missing required env vars
2. Test with incomplete config to verify error message clarity
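
Safe modifications 2 and 3 (validate critical vars, fail fast with a clear message) can be sketched as a single startup check; `requireEnv` is a hypothetical helper, not code from `env.ts`:

```typescript
// Fail-fast env validation: collect EVERY missing variable and abort
// startup with one clear message, instead of failing later mid-request.
function requireEnv(
  env: Record<string, string | undefined>,
  keys: string[],
): Record<string, string> {
  const missing = keys.filter((k) => !env[k]);
  if (missing.length > 0) {
    throw new Error(`Missing required environment variables: ${missing.join(", ")}`);
  }
  return Object.fromEntries(keys.map((k) => [k, env[k]!]));
}
```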
## Scaling Limits
**Supabase Concurrent Vector Search Connections:**
- Current capacity: RPC timeout 30 seconds; Supabase connection pool typically 100 max
- Limit: With 3 concurrent workers × multiple users, could exhaust connection pool during peak load
- Scaling path:
1. Implement connection pooling via PgBouncer (already in Supabase Pro tier)
2. Reduce timeout from 30s to 10s; fail faster and retry
3. Migrate to Pinecone if vector search becomes >30% of workload
**Firebase Functions Timeout (~14 minutes usable):**
- Current capacity: Serverless function execution is capped at 15 minutes; the pipeline treats ~14 minutes as the practical ceiling (1-minute safety buffer before the hard timeout)
- Limit: Document processing hitting ~10 minutes; adding new features could exceed limit
- Scaling path:
1. Move processing to Cloud Run (1 hour limit) for large documents
2. Implement processing timeout failover: if a job approaches 12 minutes, checkpoint and requeue
3. Add background worker pool for long-running jobs (separate from request path)
**LLM API Rate Limits (Anthropic/OpenAI):**
- Current capacity: 1 concurrent call; 3 retries per call; no per-minute or per-second throttling beyond single-call serialization
- Limit: Burst requests from multiple users could trigger 429 rate limit errors
- Scaling path:
1. Negotiate higher rate limits with API providers
2. Implement request queuing with exponential backoff per user
3. Add cost monitoring and soft-limit alerts (warn at 80% of quota)
**PDF Generation Browser Pool:**
- Current capacity: 5 browser pages maximum
- Limit: With 3+ concurrent document processing jobs, pool contention causes delays (queue wait time)
- Scaling path:
1. Increase pool size to 10 (requires more memory)
2. Move PDF generation to separate worker queue (decouple from request path)
3. Implement adaptive pool sizing based on available memory
**GCS Upload/Download Throughput:**
- Current capacity: Single-threaded upload/download; file transfer waits on GCS API latency
- Limit: Large documents (50+ MB) may timeout or be slow
- Scaling path:
1. Implement resumable uploads with multi-part chunks
2. Add parallel chunk uploads for files >10 MB
3. Cache frequently accessed documents in Redis
## Dependencies at Risk
**Firebase Functions v1 Deprecation (EOL Dec 31, 2025):**
- Risk: Runtime will be decommissioned; Node.js 20 support ending Oct 30, 2026 (warning already surfaced)
- Impact: Functions will stop working after deprecation date; forced migration required
- Migration plan:
1. Migrate to Firebase Functions v2 runtime (already partially done; fallback code still present)
2. Update `firebase-functions` package to latest major version
3. Remove deprecated `functions.config()` fallback once migration confirmed
4. Test all functions after upgrade
**Puppeteer Version Pinning:**
- Risk: Puppeteer has frequent security updates; pinned version likely outdated
- Impact: Browser vulnerabilities in PDF generation; potential sandbox bypass
- Migration plan:
1. Audit current Puppeteer version in `package.json`
2. Test upgrade path (may have breaking API changes)
3. Implement automated dependency security scanning
**Document AI API Versioning:**
- Risk: Google Cloud Document AI API may deprecate current processor version
- Impact: Processing pipeline breaks if processor ID no longer valid
- Migration plan:
1. Document current processor version and creation date
2. Subscribe to Google Cloud deprecation notices
3. Add feature flag to switch processor versions
4. Test new processor version before migration
## Missing Critical Features
**Job Processing Observability:**
- Problem: No metrics for job success rate, average processing time per stage, or failure breakdown by error type
- Blocks: Cannot diagnose performance regressions; cannot identify bottlenecks
- Implementation: Add `/health/agentic-rag` endpoint exposing per-pass timing, token usage, cost data
**Document Version History:**
- Problem: Processing pipeline overwrites `analysis_data` on each run; no ability to compare old vs. new results
- Blocks: Cannot detect if new model version improves accuracy; hard to debug regression
- Implementation: Add `document_versions` table; keep historical results; implement diff UI
**Retry Mechanism for Failed Documents:**
- Problem: Failed documents stay in failed state; no way to retry after infrastructure recovers
- Blocks: User must re-upload document; processing failures are permanent per upload
- Implementation: Add "Retry" button to failed document status; re-queue without user re-upload
## Test Coverage Gaps
**End-to-End Pipeline with Large Documents:**
- What's not tested: Full processing pipeline with 50+ MB documents, covering PDF chunking, Document AI extraction, embeddings, LLM analysis, and PDF generation
- Files: No integration test covering full flow with large fixture
- Risk: Cannot detect if scaling to large documents introduces timeouts or memory issues
- Priority: High (Project Panther regression was not caught by tests)
**Concurrent Job Processing:**
- What's not tested: Multiple jobs submitted simultaneously; verify no race conditions in job queue or database
- Files: `backend/src/services/jobQueueService.ts`, `backend/src/models/ProcessingJobModel.ts`
- Risk: Race condition causes duplicate processing or lost job state in production
- Priority: High (affects reliability)
**Vector Database Fallback Scenarios:**
- What's not tested: Simulate Supabase RPC timeout and verify correct fallback search is executed
- Files: `backend/src/services/vectorDatabaseService.ts` (lines 110-182)
- Risk: Fallback search silent failures or incorrect results not detected
- Priority: Medium (affects search quality)
**LLM API Provider Switching:**
- What's not tested: Switch between Anthropic, OpenAI, OpenRouter; verify each provider works correctly
- Files: `backend/src/services/llmService.ts` (provider selection logic)
- Risk: Provider-specific bugs not caught until production usage
- Priority: Medium (currently only Anthropic heavily used)
**Error Propagation in Hybrid Processing:**
- What's not tested: Job queue failure → immediate processing fallback; verify document status and error reporting
- Files: `backend/src/controllers/documentController.ts` (lines 200+)
- Risk: Silent failures or incorrect status updates if fallback error not properly handled
- Priority: High (affects user experience)
---
*Concerns audit: 2026-02-24*

@@ -0,0 +1,286 @@
# Coding Conventions
**Analysis Date:** 2026-02-24
## Naming Patterns
**Files:**
- Backend service files: `camelCase.ts` (e.g., `llmService.ts`, `unifiedDocumentProcessor.ts`, `vectorDatabaseService.ts`)
- Backend middleware/controllers: `camelCase.ts` (e.g., `errorHandler.ts`, `firebaseAuth.ts`)
- Frontend components: `PascalCase.tsx` (e.g., `DocumentUpload.tsx`, `LoginForm.tsx`, `ProtectedRoute.tsx`)
- Frontend utility files: `camelCase.ts` (e.g., `cn.ts` for class name utilities)
- Ambient type declaration files: camelCase with a `.d.ts` suffix (e.g., `express.d.ts`)
- Model files: `PascalCase.ts` in `backend/src/models/` (e.g., `DocumentModel.ts`)
- Config files: `camelCase.ts` (e.g., `env.ts`, `firebase.ts`, `supabase.ts`)
**Functions:**
- Both backend and frontend use camelCase: `processDocument()`, `validateUUID()`, `handleUpload()`
- React components are PascalCase: `DocumentUpload`, `ErrorHandler`
- Handler functions use `handle` or verb prefix: `handleVisibilityChange()`, `onDrop()`
- Async functions use descriptive names: `fetchDocuments()`, `uploadDocument()`, `processDocument()`
**Variables:**
- camelCase for all variables: `documentId`, `correlationId`, `isUploading`, `uploadedFiles`
- Constants occasionally use UPPER_SNAKE_CASE: `MAX_CONCURRENT_LLM_CALLS`, `MAX_TOKEN_LIMITS`
- Boolean prefixes: `is*` (isUploading, isAdmin), `has*` (hasError), `can*` (canProcess)
**Types:**
- Interfaces use PascalCase: `LLMRequest`, `UploadedFile`, `DocumentUploadProps`, `CIMReview`
- Type unions use PascalCase: `ErrorCategory`, `ProcessingStrategy`
- Generic types use single uppercase letter or descriptive name: `T`, `K`, `V`
- Enum values use UPPER_SNAKE_CASE: `ErrorCategory.VALIDATION`, `ErrorCategory.AUTHENTICATION`
**Interfaces vs Types:**
- **Interfaces** for object shapes that represent entities or components: `interface Document`, `interface UploadedFile`
- **Types** for unions, primitives, and specialized patterns: `type ProcessingStrategy = 'document_ai_agentic_rag' | 'simple_full_document'`
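A minimal sketch of this convention, using the `UploadedFile` and `ProcessingStrategy` names from above (field names here are illustrative, not the actual definitions):

```typescript
// Interface for an entity shape (illustrative fields).
interface UploadedFile {
  id: string;
  fileName: string;
  fileSize: number;
}

// Type alias for a closed union of specialized values.
type ProcessingStrategy = 'document_ai_agentic_rag' | 'simple_full_document';

function describeFile(file: UploadedFile, strategy: ProcessingStrategy): string {
  return `${file.fileName} via ${strategy}`;
}
```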
## Code Style
**Formatting:**
- No formal Prettier config detected in the repo (formatting varies slightly between files)
- 2-space indentation (observed in TypeScript files)
- Semicolons required at end of statements
- Single quotes for strings in TypeScript, double quotes in JSX attributes
- Line length: preferably under 100 characters but not enforced
**Linting:**
- Tool: ESLint with TypeScript support
- Config: `.eslintrc.js` in backend
- Key rules:
- `@typescript-eslint/no-unused-vars`: error (allows leading underscore for intentionally unused)
- `@typescript-eslint/no-explicit-any`: warn (use `unknown` instead)
- `@typescript-eslint/no-non-null-assertion`: warn (use proper type guards)
- `no-console`: off in backend (logging used via Winston)
- `no-undef`: error (strict undefined checking)
- Frontend ESLint reports unused disable directives and fails on any warning (`--max-warnings 0`)
**TypeScript Standards:**
- Strict mode not fully enabled (noImplicitAny disabled in tsconfig.json for legacy reasons)
- Prefer explicit typing over `any`: use `unknown` when type is truly unknown
- Type guards required for safety checks: `error instanceof Error ? error.message : String(error)`
- No type assertions with `as` for complex types; use proper type narrowing
## Import Organization
**Order:**
1. External framework/library imports (`express`, `react`, `winston`)
2. Google Cloud/Firebase imports (`@google-cloud/storage`, `firebase-admin`)
3. Third-party service imports (`axios`, `zod`, `joi`)
4. Internal config imports (`'../config/env'`, `'../config/firebase'`)
5. Internal utility imports (`'../utils/logger'`, `'../utils/cn'`)
6. Internal model imports (`'../models/DocumentModel'`)
7. Internal service imports (`'../services/llmService'`)
8. Internal middleware/helper imports (`'../middleware/errorHandler'`)
9. Type-only imports at the end: `import type { ProcessingStrategy } from '...'`
**Examples:**
Backend service pattern from `optimizedAgenticRAGProcessor.ts`:
```typescript
import { logger } from '../utils/logger';
import { vectorDatabaseService } from './vectorDatabaseService';
import { VectorDatabaseModel } from '../models/VectorDatabaseModel';
import { llmService } from './llmService';
import { CIMReview } from './llmSchemas';
import { config } from '../config/env';
import type { ParsedFinancials } from './financialTableParser';
import type { StructuredTable } from './documentAiProcessor';
```
Frontend component pattern from `DocumentList.tsx`:
```typescript
import React from 'react';
import {
FileText,
Eye,
Download,
Trash2,
Calendar,
User,
Clock
} from 'lucide-react';
import { cn } from '../utils/cn';
```
**Path Aliases:**
- No @ alias imports detected; all use relative `../` patterns
- Monorepo structure: frontend and backend in separate directories with independent module resolution
## Error Handling
**Patterns:**
1. **Structured Error Objects with Categories:**
- Use `ErrorCategory` enum for classification: `VALIDATION`, `AUTHENTICATION`, `AUTHORIZATION`, `NOT_FOUND`, `EXTERNAL_SERVICE`, `PROCESSING`, `DATABASE`, `SYSTEM`
- Attach `AppError` interface properties: `statusCode`, `isOperational`, `code`, `correlationId`, `category`, `retryable`, `context`
- Example from `errorHandler.ts`:
```typescript
const enhancedError: AppError = {
category: ErrorCategory.VALIDATION,
statusCode: 400,
code: 'INVALID_UUID_FORMAT',
retryable: false
};
```
2. **Try-Catch with Structured Logging:**
- Always catch errors with explicit type checking
- Log with structured data including correlation ID
- Example pattern:
```typescript
try {
await operation();
} catch (error) {
logger.error('Operation failed', {
error: error instanceof Error ? error.message : String(error),
stack: error instanceof Error ? error.stack : undefined,
context: { documentId, userId }
});
throw error;
}
```
3. **HTTP Response Pattern:**
- Success responses: `{ success: true, data: {...} }`
- Error responses: `{ success: false, error: { code, message, details, correlationId, timestamp, retryable } }`
- User-friendly messages mapped by error category
- Include `X-Correlation-ID` header in responses
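The two envelope shapes can be sketched as small helpers (a sketch of the documented pattern; field contents are illustrative):

```typescript
interface ApiSuccess<T> { success: true; data: T; }
interface ApiError {
  success: false;
  error: {
    code: string;
    message: string;
    correlationId: string;
    timestamp: string;
    retryable: boolean;
    details?: unknown;
  };
}

// Build the documented success envelope.
function ok<T>(data: T): ApiSuccess<T> {
  return { success: true, data };
}

// Build the documented error envelope with a correlation ID for tracing.
function fail(code: string, message: string, correlationId: string, retryable = false): ApiError {
  return {
    success: false,
    error: { code, message, correlationId, timestamp: new Date().toISOString(), retryable },
  };
}
```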
4. **Retry Logic:**
- LLM service implements concurrency limiting: max 1 concurrent call to prevent rate limits
- 3 retry attempts for LLM API calls with exponential backoff (see `llmService.ts` lines 236-450)
- Jobs respect 14-minute timeout limit with graceful status updates
5. **External Service Errors:**
- Firebase Auth errors: extract from `error.message` and `error.name` (TokenExpiredError, JsonWebTokenError)
- Supabase errors: check `error.code` and `error.message`, handle UUID validation errors
- GCS errors: extract from error objects with proper null checks
## Logging
**Framework:** Winston logger from `backend/src/utils/logger.ts`
**Levels:**
- `logger.debug()`: Detailed diagnostic info (disabled in production)
- `logger.info()`: Normal operation information, upload start/completion, processing status
- `logger.warn()`: Warning conditions, CORS rejections, non-critical issues
- `logger.error()`: Error conditions with full context and stack traces
**Structured Logging Pattern:**
```typescript
logger.info('Message', {
correlationId: correlationId,
category: 'operation_type',
operation: 'specific_action',
documentId: documentId,
userId: userId,
metadata: value,
timestamp: new Date().toISOString()
});
```
**StructuredLogger Class:**
- Use for operations requiring correlation ID tracking
- Constructor: `const logger = new StructuredLogger(correlationId)`
- Specialized methods:
- `uploadStart()`, `uploadSuccess()`, `uploadError()` - for file operations
- `processingStart()`, `processingSuccess()`, `processingError()` - for document processing
- `storageOperation()` - for file storage operations
- `jobQueueOperation()` - for background jobs
- `info()`, `warn()`, `error()`, `debug()` - general logging
- All methods automatically attach correlation ID to metadata
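The core of the class can be approximated as follows. This is only a sketch of the behaviour described above: the real implementation lives in `backend/src/utils/logger.ts`, and the injected `base` logger parameter is an assumption made here so the example is self-contained.

```typescript
type Meta = Record<string, unknown>;

// Minimal stand-in for the underlying Winston logger interface.
interface BaseLogger {
  info(message: string, meta: Meta): void;
  error(message: string, meta: Meta): void;
}

class StructuredLogger {
  constructor(
    private readonly correlationId: string,
    private readonly base: BaseLogger,
  ) {}

  // Every method merges the correlation ID into the metadata.
  private withId(meta: Meta): Meta {
    return { ...meta, correlationId: this.correlationId };
  }

  info(message: string, meta: Meta = {}): void {
    this.base.info(message, this.withId(meta));
  }

  uploadStart(fileName: string): void {
    this.info('Upload started', { category: 'upload', fileName });
  }
}
```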
**What NOT to Log:**
- Credentials, API keys, or sensitive data
- Large file contents or binary data
- User passwords or tokens (log only presence: "token available" or "NO_TOKEN")
- Request body contents (sanitized in error handler - only whitelisted fields: documentId, id, status, fileName, fileSize, contentType, correlationId)
**Console Usage:**
- Backend: `console.log` disabled by ESLint in production code; only Winston logger used
- Frontend: `console.log` used in development (observed in DocumentUpload, App components)
- Special case: logger initialization may use console.warn for setup diagnostics
## Comments
**When to Comment:**
- Complex algorithms or business logic: explain "why", not "what" the code does
- Non-obvious type conversions or workarounds
- Links to related issues, tickets, or documentation
- Critical security considerations or performance implications
- TODO items for incomplete work (format: `// TODO: [description]`)
**JSDoc/TSDoc:**
- Used for function and class documentation in utility and service files
- Function signature example from `test-helpers.ts`:
```typescript
/**
* Creates a mock correlation ID for testing
*/
export function createMockCorrelationId(): string
```
- Parameter and return types documented via TypeScript typing (preferred over verbose JSDoc)
- Service classes include operation summaries: `/** Process document using Document AI + Agentic RAG strategy */`
## Function Design
**Size:**
- Keep functions focused on single responsibility
- Long services (300+ lines) separate concerns into helper methods
- Controller/middleware functions stay under 50 lines
**Parameters:**
- Max 3-4 required parameters; use object for additional config
- Example: `processDocument(documentId: string, userId: string, text: string, options?: { strategy?: string })`
- Use destructuring for config objects: `{ strategy, maxTokens, temperature }`
**Return Values:**
- Async operations return Promise with typed success/error objects
- Pattern: `Promise<{ success: boolean; data: T; error?: string }>`
- Avoid throwing in service methods; return error in object
- Controllers/middleware can throw for Express error handler
**Type Signatures:**
- Always specify parameter and return types (no implicit `any`)
- Use generics for reusable patterns: `Promise<T>`, `Array<Document>`
- Union types for multiple possibilities: `'uploading' | 'uploaded' | 'processing' | 'completed' | 'error'`
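The result-object return pattern described above can be sketched like this (the `fetchStatus` function is hypothetical, named only for illustration):

```typescript
interface ServiceResult<T> {
  success: boolean;
  data?: T;
  error?: string;
}

type DocumentStatus = 'uploading' | 'uploaded' | 'processing' | 'completed' | 'error';

// Hypothetical service method: returns the error in the object instead of throwing.
async function fetchStatus(documentId: string): Promise<ServiceResult<DocumentStatus>> {
  if (!documentId) {
    return { success: false, error: 'documentId is required' };
  }
  return { success: true, data: 'completed' };
}
```

Callers branch on `success` rather than wrapping every service call in try-catch; only controllers and middleware throw, so the Express error handler stays the single place that maps failures to HTTP responses.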
## Module Design
**Exports:**
- Services exported as singleton instances: `export const llmService = new LLMService()`
- Utility functions exported as named exports: `export function validateUUID() { ... }`
- Type definitions exported from dedicated type files or alongside implementation
- Classes exported as default or named based on usage pattern
**Barrel Files:**
- Not consistently used; services import directly from implementation files
- Example: `import { llmService } from './llmService'` not from `./services/index`
- Consider adding for cleaner imports when services directory grows
**Service Singletons:**
- All services instantiated once and exported as singletons
- Examples:
- `backend/src/services/llmService.ts`: `export const llmService = new LLMService()`
- `backend/src/services/fileStorageService.ts`: `export const fileStorageService = new FileStorageService()`
- `backend/src/services/vectorDatabaseService.ts`: `export const vectorDatabaseService = new VectorDatabaseService()`
- Prevents multiple initialization and enables dependency sharing
**Frontend Context Pattern:**
- React Context for auth: `AuthContext` exports a `useAuth()` hook
- API access goes through `documentService`, a shared object of API methods
- Unlike the backend, the frontend has no class-based service singletons; other instances are recreated as needed
## Deprecated Patterns (DO NOT USE)
- ❌ Direct PostgreSQL connections - Use Supabase client instead
- ❌ JWT authentication - Use Firebase Auth tokens
- ❌ `console.log` in production code - Use Winston logger
- ❌ Type assertions with `as` for complex types - Use type guards
- ❌ Manual error handling without correlation IDs
- ❌ Redis caching - Not used in current architecture
- ❌ Jest testing - Use Vitest instead
---
*Convention analysis: 2026-02-24*

# External Integrations
**Analysis Date:** 2026-02-24
## APIs & External Services
**Document Processing:**
- Google Document AI
- Purpose: OCR and text extraction from PDF documents with entity recognition and table parsing
- Client: `@google-cloud/documentai` 9.3.0
- Implementation: `backend/src/services/documentAiProcessor.ts`
- Auth: Google Application Credentials via `GOOGLE_APPLICATION_CREDENTIALS` or default credentials
- Configuration: Processor ID from `DOCUMENT_AI_PROCESSOR_ID`, location from `DOCUMENT_AI_LOCATION` (default: 'us')
- Max pages per chunk: 15 pages (configurable)
**Large Language Models:**
- OpenAI
- Purpose: LLM analysis of document content, embeddings for vector search
- SDK/Client: `openai` 5.10.2
- Auth: API key from `OPENAI_API_KEY`
- Models: Default `gpt-4-turbo`, embeddings via `text-embedding-3-small`
- Implementation: `backend/src/services/llmService.ts` with provider abstraction
- Retry: 3 attempts with exponential backoff
- Anthropic Claude
- Purpose: LLM analysis and document summary generation
- SDK/Client: `@anthropic-ai/sdk` 0.57.0
- Auth: API key from `ANTHROPIC_API_KEY`
- Models: Default `claude-sonnet-4-20250514` (configurable via `LLM_MODEL`)
- Implementation: `backend/src/services/llmService.ts`
- Concurrency: Max 1 concurrent LLM call to prevent rate limiting (Anthropic 429 errors)
- Retry: 3 attempts with exponential backoff
- OpenRouter
- Purpose: Alternative LLM provider supporting multiple models through single API
- SDK/Client: HTTP requests via `axios` to OpenRouter API
- Auth: `OPENROUTER_API_KEY` or optional Bring-Your-Own-Key mode (`OPENROUTER_USE_BYOK`)
- Configuration: `LLM_PROVIDER: 'openrouter'` activates this provider
- Implementation: `backend/src/services/llmService.ts`
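The retry behaviour shared by these providers can be sketched as a generic helper. This is a simplified illustration of "3 attempts with exponential backoff", not the actual code in `llmService.ts`; the base delay is shortened here so the example runs instantly.

```typescript
// Retry an async call up to `attempts` times, doubling the delay after each failure.
async function withRetry<T>(
  fn: () => Promise<T>,
  attempts = 3,
  baseDelayMs = 1,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < attempts; attempt++) {
    try {
      return await fn();
    } catch (error) {
      lastError = error;
      // Exponential backoff: 1x, 2x, 4x the base delay.
      const delay = baseDelayMs * 2 ** attempt;
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
  throw lastError;
}
```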
**File Storage:**
- Google Cloud Storage (GCS)
- Purpose: Store uploaded PDFs, processed documents, and generated PDFs
- SDK/Client: `@google-cloud/storage` 7.16.0
- Auth: Google Application Credentials via `GOOGLE_APPLICATION_CREDENTIALS`
- Buckets:
- Input: `GCS_BUCKET_NAME` for uploaded documents
- Output: `DOCUMENT_AI_OUTPUT_BUCKET_NAME` for processing results
- Implementation: `backend/src/services/fileStorageService.ts` and `backend/src/services/documentAiProcessor.ts`
- Max file size: 100MB (configurable via `MAX_FILE_SIZE`)
## Data Storage
**Databases:**
- Supabase PostgreSQL
- Connection: `SUPABASE_URL` for PostgREST API, `DATABASE_URL` for direct PostgreSQL
- Client: `@supabase/supabase-js` 2.53.0 for REST API, `pg` 8.11.3 for direct pool connections
- Auth: `SUPABASE_ANON_KEY` for client operations, `SUPABASE_SERVICE_KEY` for server operations
- Implementation:
- `backend/src/config/supabase.ts` - Client initialization with 30-second request timeout
- `backend/src/models/` - All data models (DocumentModel, UserModel, ProcessingJobModel, VectorDatabaseModel)
- Vector Support: pgvector extension for semantic search
- Tables:
- `users` - User accounts and authentication data
- `documents` - CIM documents with status tracking
- `document_chunks` - Text chunks with embeddings for vector search
- `document_feedback` - User feedback on summaries
- `document_versions` - Document version history
- `document_audit_logs` - Audit trail for compliance
- `processing_jobs` - Background job queue with status tracking
- `performance_metrics` - System performance data
- Connection pooling: Max 5 connections, 30-second idle timeout, 2-second connection timeout
**Vector Database:**
- Supabase pgvector (built into PostgreSQL)
- Purpose: Semantic search and RAG context retrieval
- Implementation: `backend/src/services/vectorDatabaseService.ts`
- Embedding generation: Via OpenAI `text-embedding-3-small` (embedded in service)
- Search: Cosine similarity via Supabase RPC calls
- Semantic cache: 1-hour TTL for cached embeddings
**File Storage:**
- Google Cloud Storage (primary storage above)
- Local filesystem (fallback for development, stored in `uploads/` directory)
**Caching:**
- In-memory semantic cache (Supabase vector embeddings) with 1-hour TTL
- No external cache service (Redis, Memcached) currently used
## Authentication & Identity
**Auth Provider:**
- Firebase Authentication
- Purpose: User authentication, JWT token generation and verification
- Client: `firebase` 12.0.0 (frontend at `frontend/src/config/firebase.ts`)
- Admin: `firebase-admin` 13.4.0 (backend at `backend/src/config/firebase.ts`)
- Implementation:
- Frontend: `frontend/src/services/authService.ts` - Login, logout, token refresh
- Backend: `backend/src/middleware/firebaseAuth.ts` - Token verification middleware
- Project: `cim-summarizer` (hardcoded in config)
- Flow: User logs in with Firebase, receives ID token, frontend sends token in Authorization header
**Token-Based Auth:**
- JWT (JSON Web Tokens)
- Purpose: API request authentication
- Implementation: `backend/src/middleware/firebaseAuth.ts`
- Verification: Firebase Admin SDK verifies token signature and expiration
- Header: `Authorization: Bearer <token>`
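Before the Admin SDK can verify anything, the middleware has to pull the token out of that header; a minimal version of that first step (the verification call itself is delegated to Firebase and omitted here):

```typescript
// Extract the bearer token from an Authorization header value.
// Returns null when the header is missing or not in `Bearer <token>` form.
function extractBearerToken(header: string | undefined): string | null {
  if (!header) return null;
  const [scheme, token] = header.split(' ');
  if (scheme !== 'Bearer' || !token) return null;
  return token;
}
```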
**Fallback Auth (for service-to-service):**
- API Key based (not currently exposed but framework supports it in `backend/src/config/env.ts`)
## Monitoring & Observability
**Error Tracking:**
- No external error tracking service configured
- Errors logged via Winston logger with correlation IDs for tracing
**Logs:**
- Winston logger 3.11.0 - Structured JSON logging at `backend/src/utils/logger.ts`
- Transports: Console (development), File-based for production logs
- Correlation ID middleware at `backend/src/middleware/errorHandler.ts` - Every request traced
- Request logging: Morgan 1.10.0 with Winston transport
- Firebase Functions Cloud Logging: Automatic integration for Cloud Functions deployments
**Monitoring Endpoints:**
- `GET /health` - Basic health check with uptime and environment info
- `GET /health/config` - Configuration validation status
- `GET /health/agentic-rag` - Agentic RAG system health (placeholder)
- `GET /monitoring/dashboard` - Aggregated system metrics (queryable by time range)
## CI/CD & Deployment
**Hosting:**
- **Backend**:
- Firebase Cloud Functions (default, Node.js 20 runtime)
- Google Cloud Run (alternative containerized deployment)
- Configuration: `backend/firebase.json` defines function source, runtime, and predeploy hooks
- **Frontend**:
- Firebase Hosting (CDN-backed static hosting)
- Configuration: Defined in `frontend/` directory with `firebase.json`
**Deployment Commands:**
```bash
# Backend deployment
npm run deploy:firebase # Deploy functions to Firebase
npm run deploy:cloud-run # Deploy to Cloud Run
npm run docker:build # Build Docker image
npm run docker:push # Push to GCR
# Frontend deployment
npm run deploy:firebase # Deploy to Firebase Hosting
npm run deploy:preview # Deploy to preview channel
# Emulator
npm run emulator # Run Firebase emulator locally
npm run emulator:ui # Run emulator with UI
```
**Build Pipeline:**
- TypeScript compilation: `tsc` targets ES2020
- Predeploy: Defined in `firebase.json` - runs `npm run build`
- Docker image for Cloud Run: `Dockerfile` in backend root
## Environment Configuration
**Required env vars (Production):**
```
NODE_ENV=production
LLM_PROVIDER=anthropic
GCLOUD_PROJECT_ID=cim-summarizer
DOCUMENT_AI_PROCESSOR_ID=<processor-id>
GCS_BUCKET_NAME=<bucket-name>
DOCUMENT_AI_OUTPUT_BUCKET_NAME=<output-bucket>
SUPABASE_URL=https://<project>.supabase.co
SUPABASE_ANON_KEY=<anon-key>
SUPABASE_SERVICE_KEY=<service-key>
DATABASE_URL=postgresql://postgres:<password>@aws-0-us-central-1.pooler.supabase.com:6543/postgres
ANTHROPIC_API_KEY=sk-ant-...
OPENAI_API_KEY=sk-...
FIREBASE_PROJECT_ID=cim-summarizer
```
**Optional env vars:**
```
DOCUMENT_AI_LOCATION=us
VECTOR_PROVIDER=supabase
LLM_MODEL=claude-sonnet-4-20250514
LLM_MAX_TOKENS=16000
LLM_TEMPERATURE=0.1
OPENROUTER_API_KEY=<key>
OPENROUTER_USE_BYOK=true
GOOGLE_APPLICATION_CREDENTIALS=/path/to/credentials.json
```
**Secrets location:**
- Development: `.env` file (gitignored, never committed)
- Production: Firebase Functions secrets via `firebase functions:secrets:set`
- Google Credentials: `backend/serviceAccountKey.json` for local dev, service account in Cloud Functions environment
## Webhooks & Callbacks
**Incoming:**
- No external webhooks currently configured
- All document processing triggered by HTTP POST to `POST /documents/upload`
**Outgoing:**
- No outgoing webhooks implemented
- Document processing is synchronous (within 14-minute Cloud Function timeout) or async via job queue
**Real-time Monitoring:**
- Server-Sent Events (SSE) not implemented
- Polling endpoints for progress:
- `GET /documents/{id}/progress` - Document processing progress
- `GET /documents/queue/status` - Job queue status (frontend polls every 5 seconds)
## Rate Limiting & Quotas
**API Rate Limits:**
- Express rate limiter: 1000 requests per 15 minutes per IP
- LLM provider limits: Anthropic limited to 1 concurrent call (application-level throttling)
- OpenAI rate limits: Handled by SDK with backoff
**File Upload Limits:**
- Max file size: 100MB (configurable via `MAX_FILE_SIZE`)
- Allowed MIME types: `application/pdf` (configurable via `ALLOWED_FILE_TYPES`)
## Network Configuration
**CORS Origins (Allowed):**
- `https://cim-summarizer.web.app` (production)
- `https://cim-summarizer.firebaseapp.com` (production)
- `http://localhost:3000` (development)
- `http://localhost:5173` (development)
- `https://localhost:3000` (SSL local dev)
- `https://localhost:5173` (SSL local dev)
**Port Mappings:**
- Frontend dev: Port 5173 (Vite dev server)
- Backend dev: Port 5001 (Firebase Functions emulator)
- Backend API: Port 5000 (Express in standard deployment)
- Vite proxy to backend: `/api` routes proxied from port 5173 to `http://localhost:5000`
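The proxy described above corresponds to a Vite config along these lines (a sketch under the port assumptions listed above; the actual `vite.config.ts` may differ):

```typescript
// vite.config.ts (sketch)
import { defineConfig } from 'vite';
import react from '@vitejs/plugin-react';

export default defineConfig({
  plugins: [react()],
  server: {
    port: 5173,
    proxy: {
      // Forward /api requests to the Express backend on port 5000.
      '/api': {
        target: 'http://localhost:5000',
        changeOrigin: true,
      },
    },
  },
});
```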
---
*Integration audit: 2026-02-24*

# Technology Stack
**Analysis Date:** 2026-02-24
## Languages
**Primary:**
- TypeScript 5.2.2 - Both backend and frontend; strict mode enabled (with `noImplicitAny` relaxed for legacy code)
- JavaScript (CommonJS) - Build outputs and configuration
**Supporting:**
- SQL - Supabase PostgreSQL database via migrations in `backend/src/models/migrations/`
## Runtime
**Environment:**
- Node.js 20 (specified in `backend/firebase.json`)
- Browser (ES2020 target for both client and server)
**Package Manager:**
- npm - Primary package manager for both backend and frontend
- Lockfile: `package-lock.json` present in both `backend/` and `frontend/`
## Frameworks
**Backend - Core:**
- Express.js 4.18.2 - HTTP server and REST API framework at `backend/src/index.ts`
- Firebase Admin SDK 13.4.0 - Authentication and service account management at `backend/src/config/firebase.ts`
- Firebase Functions 6.4.0 - Cloud Functions deployment runtime at port 5001
**Frontend - Core:**
- React 18.2.0 - UI framework with TypeScript support
- Vite 4.5.0 - Build tool and dev server (port 5173 for dev, port 3000 production)
**Backend - Testing:**
- Vitest 2.1.0 - Test runner with v8 coverage provider at `backend/vitest.config.ts`
- Configuration: Global test environment set to 'node', 30-second test timeout
**Backend - Build/Dev:**
- ts-node 10.9.2 - TypeScript execution for scripts
- ts-node-dev 2.0.0 - Live reload development server with `--transpile-only` flag
- TypeScript Compiler (tsc) 5.2.2 - Strict type checking, ES2020 target
**Frontend - Build/Dev:**
- Vite React plugin 4.1.1 - React JSX transformation
- TailwindCSS 3.3.5 - Utility-first CSS framework with PostCSS 8.4.31
## Key Dependencies
**Critical Infrastructure:**
- `@google-cloud/documentai` 9.3.0 - Google Document AI OCR/text extraction at `backend/src/services/documentAiProcessor.ts`
- `@google-cloud/storage` 7.16.0 - Google Cloud Storage (GCS) for file uploads and processing
- `@supabase/supabase-js` 2.53.0 - PostgreSQL database client with vector support at `backend/src/config/supabase.ts`
- `pg` 8.11.3 - Direct PostgreSQL connection pool for critical operations bypassing PostgREST
**LLM & AI:**
- `@anthropic-ai/sdk` 0.57.0 - Claude API integration with support for Anthropic provider
- `openai` 5.10.2 - OpenAI API and embeddings (text-embedding-3-small)
- Both providers abstracted via `backend/src/services/llmService.ts`
**PDF Processing:**
- `pdf-lib` 1.17.1 - PDF generation and manipulation at `backend/src/services/pdfGenerationService.ts`
- `pdf-parse` 1.1.1 - PDF text extraction
- `pdfkit` 0.17.1 - PDF document creation
**Document Processing:**
- `puppeteer` 21.11.0 - Headless Chrome for HTML/PDF conversion
**Security & Authentication:**
- `firebase` 12.0.0 (frontend) - Firebase client SDK for authentication at `frontend/src/config/firebase.ts`
- `firebase-admin` 13.4.0 (backend) - Admin SDK for token verification at `backend/src/middleware/firebaseAuth.ts`
- `jsonwebtoken` 9.0.2 - JWT token creation and verification
- `bcryptjs` 2.4.3 - Password hashing with 12 rounds default
**API & HTTP:**
- `axios` 1.11.0 - HTTP client for both frontend and backend
- `cors` 2.8.5 - Cross-Origin Resource Sharing middleware for Express
- `helmet` 7.1.0 - Security headers middleware
- `morgan` 1.10.0 - HTTP request logging middleware
- `express-rate-limit` 7.1.5 - Rate limiting middleware (1000 requests per 15 minutes)
**Data Validation & Schema:**
- `zod` 3.25.76 - TypeScript-first schema validation at `backend/src/services/llmSchemas.ts`
- `zod-to-json-schema` 3.24.6 - Convert Zod schemas to JSON Schema for LLM structured output
- `joi` 17.11.0 - Environment variable validation in `backend/src/config/env.ts`
**Logging & Monitoring:**
- `winston` 3.11.0 - Structured logging framework with multiple transports at `backend/src/utils/logger.ts`
**Frontend - UI Components:**
- `lucide-react` 0.294.0 - Icon library
- `react-dom` 18.2.0 - React rendering for web
- `react-router-dom` 6.20.1 - Client-side routing
- `react-dropzone` 14.3.8 - File upload handling
- `clsx` 2.0.0 - Conditional className utility
- `tailwind-merge` 2.0.0 - Merge Tailwind classes with conflict resolution
**Utilities:**
- `uuid` 11.1.0 - Unique identifier generation
- `dotenv` 16.3.1 - Environment variable loading from `.env` files
## Configuration
**Environment:**
- **.env file support** - Dotenv loads from `.env` for local development in `backend/src/config/env.ts`
- **Environment validation** - Joi schema at `backend/src/config/env.ts` validates all required/optional env vars
- **Firebase Functions v2** - Uses `defineString()` and `defineSecret()` for secure configuration (migration from v1 functions.config())
**Key Configuration Variables (Backend):**
- `NODE_ENV` - 'development' | 'production' | 'test'
- `LLM_PROVIDER` - 'openai' | 'anthropic' | 'openrouter' (default: 'openai')
- `GCLOUD_PROJECT_ID` - Google Cloud project ID (required)
- `DOCUMENT_AI_PROCESSOR_ID` - Document AI processor ID (required)
- `GCS_BUCKET_NAME` - Google Cloud Storage bucket (required)
- `SUPABASE_URL`, `SUPABASE_ANON_KEY`, `SUPABASE_SERVICE_KEY` - Supabase PostgreSQL connection
- `DATABASE_URL` - Direct PostgreSQL connection string for bypass operations
- `OPENAI_API_KEY` - OpenAI API key for embeddings and models
- `ANTHROPIC_API_KEY` - Anthropic Claude API key
- `OPENROUTER_API_KEY` - OpenRouter API key (optional, uses BYOK with Anthropic key)
**Key Configuration Variables (Frontend):**
- `VITE_API_BASE_URL` - Backend API endpoint
- `VITE_FIREBASE_*` - Firebase configuration (API key, auth domain, project ID, etc.)
**Build Configuration:**
- **Backend**: `backend/tsconfig.json` - Strict TypeScript, CommonJS module output, ES2020 target
- **Frontend**: `frontend/tsconfig.json` - ES2020 target, JSX React support, path alias `@/*`
- **Firebase**: `backend/firebase.json` - Node.js 20 runtime, Firebase Functions emulator on port 5001
## Platform Requirements
**Development:**
- Node.js 20.x
- npm 9+
- Google Cloud credentials (for Document AI and GCS)
- Firebase project credentials (service account key)
- Supabase project URL and keys
**Production:**
- **Backend**: Firebase Cloud Functions (Node.js 20 runtime) or Google Cloud Run
- **Frontend**: Firebase Hosting (CDN-backed static hosting)
- **Database**: Supabase PostgreSQL with pgvector extension for vector search
- **Storage**: Google Cloud Storage for documents and generated PDFs
- **Memory Limits**: Backend configured with `--max-old-space-size=8192` for large document processing
---
*Stack analysis: 2026-02-24*

# Codebase Structure
**Analysis Date:** 2026-02-24
## Directory Layout
```
cim_summary/
├── backend/ # Express.js + TypeScript backend (Node.js)
│ ├── src/
│ │ ├── index.ts # Express app + Firebase Functions exports
│ │ ├── controllers/ # Request handlers
│ │ ├── models/ # Database access + schema
│ │ ├── services/ # Business logic + external integrations
│ │ ├── routes/ # Express route definitions
│ │ ├── middleware/ # Express middleware (auth, validation, error)
│ │ ├── config/ # Configuration (env, firebase, supabase)
│ │ ├── utils/ # Utilities (logger, validation, parsing)
│ │ ├── types/ # TypeScript type definitions
│ │ ├── scripts/ # One-off CLI scripts (diagnostics, setup)
│ │ ├── assets/ # Static assets (HTML templates)
│ │ └── __tests__/ # Test suites (unit, integration, acceptance)
│ ├── package.json # Node dependencies
│ ├── tsconfig.json # TypeScript config
│ ├── .eslintrc.json # ESLint config
│ └── dist/ # Compiled JavaScript (generated)
├── frontend/ # React + Vite + TypeScript frontend
│ ├── src/
│ │ ├── main.tsx # React entry point
│ │ ├── App.tsx # Root component with routing
│ │ ├── components/ # React components (UI)
│ │ ├── services/ # API clients (documentService, authService)
│ │ ├── contexts/ # React Context (AuthContext)
│ │ ├── config/ # Configuration (env, firebase)
│ │ ├── types/ # TypeScript interfaces
│ │ ├── utils/ # Utilities (validation, cn, auth debug)
│ │ └── assets/ # Static images and icons
│ ├── package.json # Node dependencies
│ ├── tsconfig.json # TypeScript config
│ ├── vite.config.ts # Vite bundler config
│ ├── eslintrc.json # ESLint config
│ ├── tailwind.config.js # Tailwind CSS config
│ ├── postcss.config.js # PostCSS config
│ └── dist/ # Built static assets (generated)
├── .planning/ # GSD planning directory
│ └── codebase/ # Codebase analysis documents
├── package.json # Monorepo root package (if used)
├── .git/ # Git repository
├── .gitignore # Git ignore rules
├── .cursorrules # Cursor IDE configuration
├── README.md # Project overview
├── CONFIGURATION_GUIDE.md # Setup instructions
├── CODEBASE_ARCHITECTURE_SUMMARY.md # Existing architecture notes
└── [PDF documents] # Sample CIM documents for testing
```
## Directory Purposes
**backend/src/:**
- Purpose: All backend server code
- Contains: TypeScript source files
- Key files: `index.ts` (main app), routes, controllers, services, models
**backend/src/controllers/:**
- Purpose: HTTP request handlers
- Contains: `documentController.ts`, `authController.ts`
- Functions: Map HTTP requests to service calls, handle validation, construct responses
**backend/src/services/:**
- Purpose: Business logic and external integrations
- Contains: Document processing, LLM integration, file storage, database, job queue
- Key files:
- `unifiedDocumentProcessor.ts` - Orchestrator, strategy selection
- `singlePassProcessor.ts` - 2-LLM extraction (current default)
- `optimizedAgenticRAGProcessor.ts` - Advanced agentic processing (stub)
- `documentAiProcessor.ts` - Google Document AI OCR
- `llmService.ts` - LLM API calls (Anthropic/OpenAI/OpenRouter)
- `jobQueueService.ts` - Async job queue (in-memory, EventEmitter)
- `jobProcessorService.ts` - Dequeue and execute jobs
- `fileStorageService.ts` - GCS signed URLs and upload
- `vectorDatabaseService.ts` - Supabase pgvector operations
- `pdfGenerationService.ts` - Puppeteer PDF rendering
- `uploadProgressService.ts` - Track upload status
- `uploadMonitoringService.ts` - Monitor processing progress
- `llmSchemas.ts` - Zod schemas for LLM extraction (CIMReview, financial data)
**backend/src/models/:**
- Purpose: Database access layer and schema definitions
- Contains: Document, User, ProcessingJob, Feedback models
- Key files:
- `types.ts` - TypeScript interfaces (Document, ProcessingJob, ProcessingStatus)
- `DocumentModel.ts` - Document CRUD with retry logic
- `ProcessingJobModel.ts` - Job tracking in database
- `UserModel.ts` - User management
- `VectorDatabaseModel.ts` - Vector embedding queries
- `migrate.ts` - Database migrations
- `seed.ts` - Test data seeding
- `migrations/` - SQL migration files
**backend/src/routes/:**
- Purpose: Express route definitions
- Contains: Route handlers and middleware bindings
- Key files:
- `documents.ts` - GET/POST/PUT/DELETE document endpoints
- `vector.ts` - Vector search endpoints
- `monitoring.ts` - Health and status endpoints
- `documentAudit.ts` - Audit log endpoints
**backend/src/middleware/:**
- Purpose: Express middleware for cross-cutting concerns
- Contains: Authentication, validation, error handling
- Key files:
- `firebaseAuth.ts` - Firebase ID token verification
- `errorHandler.ts` - Global error handling + correlation ID
- `notFoundHandler.ts` - 404 handler
- `validation.ts` - Request validation (UUID, pagination)
**backend/src/config/:**
- Purpose: Configuration and initialization
- Contains: Environment setup, service initialization
- Key files:
- `env.ts` - Environment variable validation (Joi schema)
- `firebase.ts` - Firebase Admin SDK initialization
- `supabase.ts` - Supabase client and pool setup
- `database.ts` - PostgreSQL connection (legacy)
- `errorConfig.ts` - Error handling config
**backend/src/utils/:**
- Purpose: Shared utility functions
- Contains: Logging, validation, parsing
- Key files:
- `logger.ts` - Winston logger setup (console + file transports)
- `validation.ts` - UUID and pagination validators
- `googleServiceAccount.ts` - Google Cloud credentials resolution
- `financialExtractor.ts` - Financial data parsing (superseded by the single-pass processor)
- `templateParser.ts` - CIM template utilities
- `auth.ts` - Authentication helpers
**backend/src/scripts/:**
- Purpose: One-off CLI scripts for diagnostics and setup
- Contains: Database setup, testing, monitoring
- Key files:
- `setup-database.ts` - Initialize database schema
- `monitor-document-processing.ts` - Watch job queue status
- `check-current-job.ts` - Debug stuck jobs
- `test-full-llm-pipeline.ts` - End-to-end testing
- `comprehensive-diagnostic.ts` - System health check
**backend/src/__tests__/:**
- Purpose: Test suites
- Contains: Unit, integration, acceptance tests
- Subdirectories:
- `unit/` - Isolated component tests
- `integration/` - Multi-component tests
- `acceptance/` - End-to-end flow tests
- `mocks/` - Mock data and fixtures
- `utils/` - Test utilities
**frontend/src/:**
- Purpose: All frontend code
- Contains: React components, services, types
**frontend/src/components/:**
- Purpose: React UI components
- Contains: Page components, reusable widgets
- Key files:
- `DocumentUpload.tsx` - File upload UI with drag-and-drop
- `DocumentList.tsx` - List of processed documents
- `DocumentViewer.tsx` - View and edit extracted data
- `ProcessingProgress.tsx` - Real-time processing status
- `UploadMonitoringDashboard.tsx` - Admin view of active jobs
- `LoginForm.tsx` - Firebase auth login UI
- `ProtectedRoute.tsx` - Route guard for authenticated pages
- `Analytics.tsx` - Document analytics and statistics
- `CIMReviewTemplate.tsx` - Display extracted CIM review data
**frontend/src/services/:**
- Purpose: API clients and external service integration
- Contains: HTTP clients for backend
- Key files:
- `documentService.ts` - Document API calls (upload, list, process, status)
- `authService.ts` - Firebase authentication (login, logout, token)
- `adminService.ts` - Admin-only operations
**frontend/src/contexts/:**
- Purpose: React Context for global state
- Contains: AuthContext for user and authentication state
- Key files:
- `AuthContext.tsx` - User, token, login/logout state
**frontend/src/config/:**
- Purpose: Configuration
- Contains: Environment variables, Firebase setup
- Key files:
- `env.ts` - VITE_API_BASE_URL and other env vars
- `firebase.ts` - Firebase client initialization
**frontend/src/types/:**
- Purpose: TypeScript interfaces
- Contains: API response types, component props
- Key files:
- `auth.ts` - User, LoginCredentials, AuthContextType
**frontend/src/utils/:**
- Purpose: Shared utility functions
- Contains: Validation, CSS utilities
- Key files:
- `validation.ts` - Email, password validators
- `cn.ts` - Classname merger (clsx wrapper)
- `authDebug.ts` - Authentication debugging helpers
## Key File Locations
**Entry Points:**
- `backend/src/index.ts` - Main Express app and Firebase Functions exports
- `frontend/src/main.tsx` - React entry point
- `frontend/src/App.tsx` - Root component with routing
**Configuration:**
- `backend/src/config/env.ts` - Environment variable schema and validation
- `backend/src/config/firebase.ts` - Firebase Admin SDK setup
- `backend/src/config/supabase.ts` - Supabase client and connection pool
- `frontend/src/config/firebase.ts` - Firebase client configuration
- `frontend/src/config/env.ts` - Frontend environment variables
**Core Logic:**
- `backend/src/services/unifiedDocumentProcessor.ts` - Main document processing orchestrator
- `backend/src/services/singlePassProcessor.ts` - Single-pass 2-LLM strategy
- `backend/src/services/llmService.ts` - LLM API integration with retry
- `backend/src/services/jobQueueService.ts` - Background job queue
- `backend/src/services/vectorDatabaseService.ts` - Vector search implementation
**Testing:**
- `backend/src/__tests__/unit/` - Unit tests
- `backend/src/__tests__/integration/` - Integration tests
- `backend/src/__tests__/acceptance/` - End-to-end tests
**Database:**
- `backend/src/models/types.ts` - TypeScript type definitions
- `backend/src/models/DocumentModel.ts` - Document CRUD operations
- `backend/src/models/ProcessingJobModel.ts` - Job tracking
- `backend/src/models/migrations/` - SQL migration files
**Middleware:**
- `backend/src/middleware/firebaseAuth.ts` - JWT authentication
- `backend/src/middleware/errorHandler.ts` - Global error handling
- `backend/src/middleware/validation.ts` - Input validation
**Logging:**
- `backend/src/utils/logger.ts` - Winston logger configuration
## Naming Conventions
**Files:**
- Controllers: `{resource}Controller.ts` (e.g., `documentController.ts`)
- Services: `{service}Service.ts` or descriptive (e.g., `llmService.ts`, `singlePassProcessor.ts`)
- Models: `{Entity}Model.ts` (e.g., `DocumentModel.ts`)
- Routes: `{resource}.ts` (e.g., `documents.ts`)
- Middleware: `{purpose}Handler.ts` or `{purpose}.ts` (e.g., `firebaseAuth.ts`)
- Types/Interfaces: `types.ts` or `{name}Types.ts`
- Tests: `{file}.test.ts` or `{file}.spec.ts`
**Directories:**
- Plurals for collections: `services/`, `models/`, `utils/`, `routes/`, `controllers/`
- Singular for specific features: `config/`, `middleware/`, `types/`, `contexts/`
- Nested by feature in larger directories: `__tests__/unit/`, `models/migrations/`
**Functions/Variables:**
- Camel case: `processDocument()`, `getUserId()`, `documentId`
- Constants: UPPER_SNAKE_CASE: `MAX_RETRIES`, `TIMEOUT_MS`
- Private methods: Prefix with `_` or use TypeScript `private`: `_retryOperation()`
**Classes:**
- Pascal case: `DocumentModel`, `JobQueueService`, `SinglePassProcessor`
- Service instances exported as singletons: `export const llmService = new LLMService()`
**React Components:**
- Pascal case: `DocumentUpload.tsx`, `ProtectedRoute.tsx`
- Hooks: `use{Feature}` (e.g., `useAuth` from AuthContext)
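The conventions above can be combined in one hypothetical service (all names are illustrative, not files from this repo): a PascalCase class, camelCase methods, an UPPER_SNAKE_CASE constant, TypeScript `private` for internal methods, and a singleton instance export in the `export const llmService = new LLMService()` style.

```typescript
// Illustrative only: demonstrates the naming conventions, not a real service.
const MAX_RETRIES = 3;

export class ExampleRetryService {
  // TypeScript `private` rather than an underscore prefix.
  private async attemptOnce(fn: () => Promise<string>): Promise<string> {
    return fn();
  }

  async fetchWithRetry(fn: () => Promise<string>): Promise<string> {
    let lastError: unknown;
    for (let attempt = 1; attempt <= MAX_RETRIES; attempt++) {
      try {
        return await this.attemptOnce(fn);
      } catch (err) {
        lastError = err;
      }
    }
    throw lastError;
  }
}

// Singleton export, matching the service-instance convention above.
export const exampleRetryService = new ExampleRetryService();
```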
## Where to Add New Code
**New Document Processing Strategy:**
- Primary code: `backend/src/services/{strategyName}Processor.ts`
- Schema: Add types to `backend/src/services/llmSchemas.ts`
- Integration: Register in `backend/src/services/unifiedDocumentProcessor.ts`
- Tests: `backend/src/__tests__/integration/{strategyName}.test.ts`
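A new strategy plugs into the orchestrator roughly as follows. This is a dependency-free sketch; the interface and registry names are assumptions for illustration, not the actual `unifiedDocumentProcessor.ts` API:

```typescript
// Hypothetical strategy registration; names are illustrative.
interface DocumentProcessorStrategy {
  name: string;
  process(text: string): Promise<{ summary: string }>;
}

const strategies = new Map<string, DocumentProcessorStrategy>();

function registerStrategy(strategy: DocumentProcessorStrategy): void {
  strategies.set(strategy.name, strategy);
}

async function processWith(name: string, text: string): Promise<{ summary: string }> {
  const strategy = strategies.get(name);
  if (!strategy) throw new Error(`Unknown strategy: ${name}`);
  return strategy.process(text);
}

// A stand-in for a processor module registering itself:
registerStrategy({
  name: 'singlePass',
  async process(text) {
    return { summary: text.slice(0, 80) };
  },
});
```

Registering strategies by name keeps the orchestrator closed to modification: adding a processor means one new file plus one `registerStrategy` call.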
**New API Endpoint:**
- Route: `backend/src/routes/{resource}.ts`
- Controller: `backend/src/controllers/{resource}Controller.ts`
- Service: `backend/src/services/{resource}Service.ts` (if needed)
- Model: `backend/src/models/{Resource}Model.ts` (if database access)
- Tests: `backend/src/__tests__/integration/{endpoint}.test.ts`
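The route → controller → service layering can be sketched with Express's request/response types stubbed down to minimal shapes (`widgetService` and `getWidgetController` are hypothetical names, not part of this codebase):

```typescript
// Minimal stand-ins for express.Request / express.Response.
type Req = { params: Record<string, string | undefined> };
type Res = { status: (code: number) => Res; json: (body: unknown) => void };

// Service layer: business logic only, no HTTP concerns.
const widgetService = {
  async getWidget(id: string): Promise<{ id: string; name: string }> {
    return { id, name: 'example' };
  },
};

// Controller: validate input, call the service, shape the response.
export async function getWidgetController(req: Req, res: Res): Promise<void> {
  const id = req.params.id;
  if (!id) {
    res.status(400).json({ error: 'id is required' });
    return;
  }
  const widget = await widgetService.getWidget(id);
  res.status(200).json({ data: widget });
}
```

The route file would then only bind the path, middleware, and controller, keeping validation and response shaping out of `routes/`.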
**New React Component:**
- Component: `frontend/src/components/{ComponentName}.tsx`
- Types: Add to `frontend/src/types/` or inline in component
- Services: Use existing `frontend/src/services/documentService.ts`
- Tests: `frontend/src/__tests__/{ComponentName}.test.tsx` (if added)
**Shared Utilities:**
- Backend: `backend/src/utils/{utility}.ts`
- Frontend: `frontend/src/utils/{utility}.ts`
- Avoid duplicating logic between backend and frontend; extract shared patterns where practical
**Database Schema Changes:**
- Migration file: `backend/src/models/migrations/{timestamp}_{description}.sql`
- TypeScript interface: Update `backend/src/models/types.ts`
- Model methods: Update corresponding `*Model.ts` file
- Run: `npm run db:migrate` in backend
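The interface-plus-model half of a schema change can be sketched like this (the `reviewed_at` column and field names are made up for illustration, not real schema):

```typescript
// Hypothetical: after a migration adds a reviewed_at column, mirror it
// in the TypeScript interface and the model's row mapper.
interface DocumentRow {
  id: string;
  title: string;
  reviewedAt: string | null; // new field backing the reviewed_at column
}

// Map a raw snake_case DB row to the camelCase interface.
function mapRow(row: Record<string, unknown>): DocumentRow {
  return {
    id: String(row.id),
    title: String(row.title),
    reviewedAt: row.reviewed_at == null ? null : String(row.reviewed_at),
  };
}
```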
**Configuration Changes:**
- Environment: Update `backend/src/config/env.ts` (Joi schema)
- Frontend env: Update `frontend/src/config/env.ts`
- Firebase secrets: Use `firebase functions:secrets:set VAR_NAME`
- Local dev: Add to `.env` file (gitignored)
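The fail-fast idea behind `env.ts` reduces to a check like this dependency-free sketch (the real file uses a Joi schema; variable names here are illustrative):

```typescript
// Illustrative fail-fast env validation; the actual env.ts uses Joi.
function requireEnv(name: string, env: Record<string, string | undefined>): string {
  const value = env[name];
  if (value === undefined || value === '') {
    throw new Error(`Missing required environment variable: ${name}`);
  }
  return value;
}

// Validate everything up front so misconfiguration fails at boot, not mid-request.
function loadConfig(env: Record<string, string | undefined>) {
  return {
    supabaseUrl: requireEnv('SUPABASE_URL', env),
    firebaseProjectId: requireEnv('FIREBASE_PROJECT_ID', env),
  };
}
```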
## Special Directories
**backend/src/__tests__/mocks/:**
- Purpose: Mock data and fixtures for testing
- Generated: No (manually maintained)
- Committed: Yes
- Usage: Import in tests for consistent test data
**backend/src/scripts/:**
- Purpose: One-off CLI utilities for development and operations
- Generated: No (manually maintained)
- Committed: Yes
- Execution: `ts-node src/scripts/{script}.ts` or `npm run {script}`
**backend/src/assets/:**
- Purpose: Static HTML templates for PDF generation
- Generated: No (manually maintained)
- Committed: Yes
- Usage: Rendered by Puppeteer in `pdfGenerationService.ts`
**backend/src/models/migrations/:**
- Purpose: Database schema migration SQL files
- Generated: No (manually created)
- Committed: Yes
- Execution: Run via `npm run db:migrate`
**frontend/src/assets/:**
- Purpose: Images, icons, logos
- Generated: No (manually added)
- Committed: Yes
- Usage: Import in components (e.g., `bluepoint-logo.png`)
**backend/dist/ and frontend/dist/:**
- Purpose: Compiled JavaScript and optimized bundles
- Generated: Yes (build output)
- Committed: No (gitignored)
- Regeneration: `npm run build` in respective directory
**backend/node_modules/ and frontend/node_modules/:**
- Purpose: Installed dependencies
- Generated: Yes (npm install)
- Committed: No (gitignored)
- Regeneration: `npm install`
**backend/logs/:**
- Purpose: Runtime log files
- Generated: Yes (runtime)
- Committed: No (gitignored)
- Contents: `error.log`, `upload.log`, combined logs
---
*Structure analysis: 2026-02-24*
# Testing Patterns
**Analysis Date:** 2026-02-24
## Test Framework
**Runner:**
- Vitest 2.1.0
- Config: No dedicated `vitest.config.ts` found (uses defaults)
- Node.js test environment
**Assertion Library:**
- Vitest native assertions via `expect()`
- Examples: `expect(value).toBe()`, `expect(value).toBeDefined()`, `expect(array).toContain()`
**Run Commands:**
```bash
npm test # Run all tests once
npm run test:watch # Watch mode for continuous testing
npm run test:coverage # Generate coverage report
```
**Coverage Tool:**
- `@vitest/coverage-v8` 2.1.0
- Tracks line, branch, function, and statement coverage
- V8 backend for accurate coverage metrics
## Test File Organization
**Location:**
- Centralized in the `backend/src/__tests__/` directory (not co-located with source files)
- Subdirectories for logical grouping:
- `backend/src/__tests__/utils/` - Utility function tests
- `backend/src/__tests__/mocks/` - Mock implementations
- `backend/src/__tests__/acceptance/` - Acceptance/integration tests
**Naming:**
- Pattern: `[feature].test.ts` or `[feature].spec.ts`
- Examples:
- `backend/src/__tests__/financial-summary.test.ts`
- `backend/src/__tests__/acceptance/handiFoods.acceptance.test.ts`
**Structure:**
```
backend/src/__tests__/
├── utils/
│ └── test-helpers.ts # Test utility functions
├── mocks/
│ └── logger.mock.ts # Mock implementations
└── acceptance/
└── handiFoods.acceptance.test.ts # Acceptance tests
```
## Test Structure
**Suite Organization:**
```typescript
import { describe, test, expect, beforeAll } from 'vitest';
describe('Feature Category', () => {
describe('Nested Behavior Group', () => {
test('should do specific thing', () => {
expect(result).toBe(expected);
});
test('should handle edge case', () => {
expect(edge).toBeDefined();
});
});
});
```
From `financial-summary.test.ts`:
```typescript
describe('Financial Summary Fixes', () => {
describe('Period Ordering', () => {
test('Summary table should display periods in chronological order (FY3 → FY2 → FY1 → LTM)', () => {
const periods = ['fy3', 'fy2', 'fy1', 'ltm'];
const expectedOrder = ['FY3', 'FY2', 'FY1', 'LTM'];
expect(periods[0]).toBe('fy3');
expect(periods[3]).toBe('ltm');
});
});
});
```
**Patterns:**
1. **Setup Pattern:**
- Use `beforeAll()` for shared test data initialization
- Example from `handiFoods.acceptance.test.ts`:
```typescript
beforeAll(() => {
const normalize = (text: string) => text.replace(/\s+/g, ' ').toLowerCase();
const cimRaw = fs.readFileSync(cimTextPath, 'utf-8');
const outputRaw = fs.readFileSync(outputTextPath, 'utf-8');
cimNormalized = normalize(cimRaw);
outputNormalized = normalize(outputRaw);
});
```
2. **Teardown Pattern:**
- Not explicitly shown in current tests
- Use `afterAll()` for resource cleanup if needed
3. **Assertion Pattern:**
- Descriptive test names that read as sentences: `'should display periods in chronological order'`
- Multiple assertions per test acceptable for related checks
- Use `expect().toContain()` for array/string membership
- Use `expect().toBeDefined()` for existence checks
- Use `expect().toBeGreaterThan()` for numeric comparisons
## Mocking
**Framework:** Vitest `vi` mock utilities
**Patterns:**
1. **Mock Logger:**
```typescript
import { vi } from 'vitest';
export const mockLogger = {
debug: vi.fn(),
info: vi.fn(),
warn: vi.fn(),
error: vi.fn(),
};
export const mockStructuredLogger = {
uploadStart: vi.fn(),
uploadSuccess: vi.fn(),
uploadError: vi.fn(),
processingStart: vi.fn(),
processingSuccess: vi.fn(),
processingError: vi.fn(),
storageOperation: vi.fn(),
jobQueueOperation: vi.fn(),
info: vi.fn(),
warn: vi.fn(),
error: vi.fn(),
debug: vi.fn(),
};
```
2. **Mock Service Pattern:**
- Create mock implementations in `backend/src/__tests__/mocks/`
- Export as named exports: `export const mockLogger`, `export const mockStructuredLogger`
- Use `vi.fn()` for all callable methods to track calls and arguments
3. **What to Mock:**
- External services: Firebase Auth, Supabase, Google Cloud APIs
- Logger: always mock to prevent log spam during tests
- File system operations (in unit tests; use real files in acceptance tests)
- LLM API calls: mock responses to avoid quota usage
4. **What NOT to Mock:**
- Core utility functions: use real implementations
- Type definitions: no need to mock types
- Pure functions: test directly without mocks
- Business logic calculations: test with real data
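What `vi.fn()` provides — recorded calls you can assert on — can be shown with a minimal hand-rolled spy. This is purely illustrative of the mechanism; in real tests, use `vi.fn()` directly:

```typescript
// Minimal spy that records its call arguments, like vi.fn() does.
type Spy<T extends unknown[]> = ((...args: T) => void) & { calls: T[] };

function makeSpy<T extends unknown[]>(): Spy<T> {
  const calls: T[] = [];
  return Object.assign((...args: T) => { calls.push(args); }, { calls });
}

// A unit under test that takes its logger as an injected dependency:
function processDocumentStub(id: string, log: (msg: string) => void): void {
  log(`processing ${id}`);
}

const logSpy = makeSpy<[string]>();
processDocumentStub('doc-1', logSpy);
// logSpy.calls[0][0] === 'processing doc-1'
```

Injecting the logger (rather than importing it) is what makes this kind of spying possible without module-level mocking.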
## Fixtures and Factories
**Test Data:**
1. **Helper Factory Pattern:**
From `backend/src/__tests__/utils/test-helpers.ts`:
```typescript
export function createMockCorrelationId(): string {
return `test-correlation-${Date.now()}-${Math.random().toString(36).substr(2, 9)}`;
}
export function createMockUserId(): string {
return `test-user-${Date.now()}-${Math.random().toString(36).substr(2, 9)}`;
}
export function createMockDocumentId(): string {
return `test-doc-${Date.now()}-${Math.random().toString(36).substr(2, 9)}`;
}
export function createMockJobId(): string {
return `test-job-${Date.now()}-${Math.random().toString(36).substr(2, 9)}`;
}
export function wait(ms: number): Promise<void> {
return new Promise((resolve) => setTimeout(resolve, ms));
}
```
2. **Acceptance Test Fixtures:**
- Located in `backend/test-fixtures/` directory
- Example: `backend/test-fixtures/handiFoods/` contains:
- `handi-foods-cim.txt` - Reference CIM content
- `handi-foods-output.txt` - Expected processor output
- Loaded via `fs.readFileSync()` in `beforeAll()` hooks
**Location:**
- Test helpers: `backend/src/__tests__/utils/test-helpers.ts`
- Acceptance fixtures: `backend/test-fixtures/` (outside src)
- Mocks: `backend/src/__tests__/mocks/`
## Coverage
**Requirements:**
- No automated coverage enforcement detected (no threshold in config)
- Manual review recommended for critical paths
**View Coverage:**
```bash
npm run test:coverage
```
## Test Types
**Unit Tests:**
- **Scope:** Individual functions, services, utilities
- **Approach:** Test in isolation with mocks for dependencies
- **Examples:**
- Financial parser tests: parse tables with various formats
- Period ordering tests: verify chronological order logic
- Validate UUID format tests: regex pattern matching
- **Location:** `backend/src/__tests__/[feature].test.ts`
**Integration Tests:**
- **Scope:** Multiple components working together
- **Approach:** May use real Supabase/Firebase or mocks depending on test level
- **Current state:** Not heavily used; only minimal integration test infrastructure exists
- **Pattern:** Could use real database in test environment with cleanup
**Acceptance Tests:**
- **Scope:** End-to-end feature validation with real artifacts
- **Approach:** Load reference files, process through entire pipeline, verify output
- **Example:** `handiFoods.acceptance.test.ts`
- Loads CIM text file
- Loads processor output file
- Validates all reference facts exist in both
- Validates key fields resolved instead of fallback messages
- **Location:** `backend/src/__tests__/acceptance/`
**E2E Tests:**
- Not implemented in current setup
- Would require browser automation (no Playwright/Cypress config found)
- Frontend testing: not currently automated
## Common Patterns
**Async Testing:**
```typescript
test('should process document asynchronously', async () => {
const result = await processDocument(documentId, userId, text);
expect(result.success).toBe(true);
});
```
**Error Testing:**
```typescript
test('should validate UUID format', () => {
const id = 'invalid-id';
const uuidRegex = /^[0-9a-f]{8}-[0-9a-f]{4}-4[0-9a-f]{3}-[89ab][0-9a-f]{3}-[0-9a-f]{12}$/i;
expect(uuidRegex.test(id)).toBe(false);
});
```
**Array/Collection Testing:**
```typescript
test('should extract all financial periods', () => {
const result = parseFinancialsFromText(tableText);
expect(result.data.fy3.revenue).toBeDefined();
expect(result.data.fy2.revenue).toBeDefined();
expect(result.data.fy1.revenue).toBeDefined();
expect(result.data.ltm.revenue).toBeDefined();
});
```
**Text/Content Testing (Acceptance):**
```typescript
test('verifies each reference fact exists in CIM and generated output', () => {
for (const fact of referenceFacts) {
for (const token of fact.tokens) {
expect(cimNormalized).toContain(token);
expect(outputNormalized).toContain(token);
}
}
});
```
**Normalization for Content Testing:**
```typescript
// Normalize whitespace and case for robust text matching
const normalize = (text: string) => text.replace(/\s+/g, ' ').toLowerCase();
const normalizedCIM = normalize(cimRaw);
expect(normalizedCIM).toContain('reference-phrase');
```
## Test Coverage Priorities
**Critical Paths (Test First):**
1. Document upload and file storage operations
2. Firebase authentication and token validation
3. LLM service API interactions with retry logic
4. Error handling and correlation ID tracking
5. Financial data extraction and parsing
6. PDF generation pipeline
**Important Paths (Test Early):**
1. Vector embeddings and database operations
2. Job queue processing and timeout handling
3. Google Document AI text extraction
4. Supabase Row Level Security policies
**Nice-to-Have (Test Later):**
1. UI component rendering (would require React Testing Library)
2. CSS/styling validation
3. Frontend form submission flows
4. Analytics tracking
## Current Testing Gaps
**Untested Areas:**
- Backend services: Most services lack unit tests (llmService, fileStorageService, etc.)
- Database models: No model tests for Supabase operations
- Controllers/Endpoints: No API endpoint tests
- Frontend components: No React component tests
- Integration flows: Document upload through processing to PDF generation
**Missing Patterns:**
- No database integration test setup (fixtures, transactions)
- No API request/response validation tests
- No performance/load tests
- No security tests (auth bypass, XSS, injection)
## Deprecated Test Patterns (DO NOT USE)
- ❌ Jest test suite - Use Vitest instead
- ❌ Direct PostgreSQL connection tests - Use Supabase in test mode
- ❌ Legacy test files referencing removed services - Test only the current implementations
---
*Testing analysis: 2026-02-24*