docs: map existing codebase
243
.planning/codebase/ARCHITECTURE.md
Normal file
@@ -0,0 +1,243 @@
# Architecture

**Analysis Date:** 2026-02-24

## Pattern Overview

**Overall:** Full-stack distributed system combining an Express.js backend with a React frontend, implementing a **multi-stage document processing pipeline** with queued background jobs and real-time monitoring.

**Key Characteristics:**
- Server-rendered PDF generation with single-pass LLM processing
- Asynchronous job queue for background document processing (max 3 concurrent)
- Firebase authentication with Supabase PostgreSQL + pgvector for embeddings
- Multi-provider LLM support (Anthropic, OpenAI, OpenRouter)
- Structured schema extraction using Zod and LLM-driven analysis
- Google Document AI for OCR and text extraction
- Real-time upload progress tracking via SSE/polling
- Correlation ID tracking throughout the distributed pipeline

## Layers

**API Layer (Express + TypeScript):**
- Purpose: HTTP request routing, authentication, and response handling
- Location: `backend/src/index.ts`, `backend/src/routes/`, `backend/src/controllers/`
- Contains: Route definitions, request validation, error handling
- Depends on: Middleware (auth, validation), Services
- Used by: Frontend and external clients

**Authentication Layer:**
- Purpose: Firebase ID token verification and user identity validation
- Location: `backend/src/middleware/firebaseAuth.ts`, `backend/src/config/firebase.ts`
- Contains: Token verification, service account initialization, session recovery
- Depends on: Firebase Admin SDK, configuration
- Used by: All protected routes via `verifyFirebaseToken` middleware

**Controller Layer:**
- Purpose: Request handling, input validation, service orchestration
- Location: `backend/src/controllers/documentController.ts`, `backend/src/controllers/authController.ts`
- Contains: `getUploadUrl()`, `processDocument()`, `getDocumentStatus()` handlers
- Depends on: Models, Services, Middleware
- Used by: Routes

**Service Layer:**
- Purpose: Business logic, external API integration, document processing orchestration
- Location: `backend/src/services/`
- Contains:
  - `unifiedDocumentProcessor.ts` - Main orchestrator, strategy selection
  - `singlePassProcessor.ts` - 2-LLM-call extraction (pass 1 + quality check)
  - `documentAiProcessor.ts` - Google Document AI text extraction
  - `llmService.ts` - LLM API calls with retry logic (3 attempts, exponential backoff)
  - `jobQueueService.ts` - Background job processing (EventEmitter-based)
  - `fileStorageService.ts` - Google Cloud Storage signed URLs and uploads
  - `vectorDatabaseService.ts` - Supabase vector embeddings and search
  - `pdfGenerationService.ts` - Puppeteer-based PDF rendering
  - `csvExportService.ts` - Financial data export
- Depends on: Models, Config, Utilities
- Used by: Controllers, Job Queue

**Model Layer (Data Access):**
- Purpose: Database interactions, query execution, schema validation
- Location: `backend/src/models/`
- Contains: `DocumentModel.ts`, `ProcessingJobModel.ts`, `UserModel.ts`, `VectorDatabaseModel.ts`
- Depends on: Supabase client, configuration
- Used by: Services, Controllers

**Job Queue Layer:**
- Purpose: Asynchronous background processing with priority and retry handling
- Location: `backend/src/services/jobQueueService.ts`, `backend/src/services/jobProcessorService.ts`
- Contains: In-memory queue, worker pool (max 3 concurrent), Firebase scheduled function trigger
- Depends on: Services (document processor), Models
- Used by: Controllers (to enqueue work), Scheduled functions (to trigger processing)

**Frontend Layer (React + TypeScript):**
- Purpose: User interface for document upload, processing monitoring, and review
- Location: `frontend/src/`
- Contains: Components (Upload, List, Viewer, Analytics), Services, Contexts
- Depends on: Backend API, Firebase Auth, Axios
- Used by: Web browsers

## Data Flow

**Document Upload & Processing Flow:**

1. **Upload Initiation** (Frontend)
   - User selects PDF file via `DocumentUpload` component
   - Calls `documentService.getUploadUrl()` → Backend `/documents/upload-url` endpoint
   - Backend creates document record (status: 'uploading') and generates signed GCS URL

2. **File Upload** (Frontend → GCS)
   - Frontend uploads file directly to Google Cloud Storage via signed URL
   - Frontend polls `documentService.getDocumentStatus()` for upload completion
   - `UploadMonitoringDashboard` displays real-time progress

3. **Processing Trigger** (Frontend → Backend)
   - Frontend calls `POST /documents/{id}/process` once upload complete
   - Controller creates processing job and enqueues to `jobQueueService`
   - Controller immediately returns job ID

4. **Background Job Execution** (Job Queue)
   - Scheduled Firebase function (`processDocumentJobs`) runs every 1 minute
   - Calls `jobProcessorService.processJobs()` to dequeue and execute
   - For each queued document:
     - Fetch file from GCS
     - Update status to 'extracting_text'
     - Call `unifiedDocumentProcessor.processDocument()`

5. **Document Processing** (Single-Pass Strategy)
   - **Pass 1 - LLM Extraction:**
     - `documentAiProcessor.extractText()` (if needed) - Google Document AI OCR
     - `llmService.processCIMDocument()` - Claude/OpenAI structured extraction
     - Produces `CIMReview` object with financial, market, management data
     - Updates document status to 'processing_llm'

   - **Pass 2 - Quality Check:**
     - `llmService.validateCIMReview()` - Verify completeness and accuracy
     - Updates status to 'quality_validation'

   - **PDF Generation:**
     - `pdfGenerationService.generatePDF()` - Puppeteer renders HTML template
     - Uploads PDF to GCS
     - Updates status to 'generating_pdf'

   - **Vector Indexing (Background):**
     - `vectorDatabaseService.createDocumentEmbedding()` - Generate 3072-dim embeddings
     - Chunk document semantically, store in Supabase with vector index
     - Status moves to 'vector_indexing' then 'completed'

6. **Result Delivery** (Backend → Frontend)
   - Frontend polls `GET /documents/{id}` to check completion
   - When status = 'completed', fetches summary and analysis data
   - `DocumentViewer` displays results, allows regeneration with feedback

**State Management:**
- Backend: Document status progresses through `uploading → extracting_text → processing_llm → generating_pdf → vector_indexing → completed` or `failed` at any step
- Frontend: AuthContext manages user/token, component state tracks selected document and loading states
- Job Queue: In-memory queue with EventEmitter for state transitions

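The backend status progression can be expressed as a small transition check. This is a minimal sketch, not code from the repository: `PIPELINE`, `DocumentStatus`, and `canTransition` are hypothetical names, and the strictly linear order is an assumption taken from the State Management list (the `quality_validation` status mentioned in step 5 is not in that list and would need its own entry).

```typescript
// Statuses as listed under State Management; the strictly linear order is an assumption.
const PIPELINE = [
  "uploading",
  "extracting_text",
  "processing_llm",
  "generating_pdf",
  "vector_indexing",
  "completed",
] as const;

type PipelineStatus = (typeof PIPELINE)[number];
type DocumentStatus = PipelineStatus | "failed";

// Hypothetical helper: a transition is valid if it moves exactly one step
// forward, or moves any non-terminal status to "failed".
function canTransition(from: DocumentStatus, to: DocumentStatus): boolean {
  if (from === "completed" || from === "failed") return false; // terminal states
  if (to === "failed") return true; // failure is allowed at any step
  return PIPELINE.indexOf(to) === PIPELINE.indexOf(from) + 1;
}
```

A guard like this, called before each status write, would reject skipped or backward transitions instead of silently persisting them.
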
## Key Abstractions

**Unified Processor:**
- Purpose: Strategy pattern for document processing (single-pass vs. agentic RAG vs. simple)
- Examples: `singlePassProcessor`, `simpleDocumentProcessor`, `optimizedAgenticRAGProcessor`
- Pattern: Pluggable strategies via `ProcessingStrategy` selection in config

**LLM Service:**
- Purpose: Unified interface for multiple LLM providers with retry logic
- Examples: `backend/src/services/llmService.ts` (Anthropic, OpenAI, OpenRouter)
- Pattern: Provider-agnostic API with `processCIMDocument()` returning structured `CIMReview`

**Vector Database Abstraction:**
- Purpose: PostgreSQL pgvector operations via Supabase for semantic search
- Examples: `backend/src/services/vectorDatabaseService.ts`
- Pattern: Embedding + chunking → vector search via cosine similarity

**File Storage Abstraction:**
- Purpose: Google Cloud Storage operations with signed URLs
- Examples: `backend/src/services/fileStorageService.ts`
- Pattern: Signed upload/download URLs for temporary access without IAM burden

**Job Queue Pattern:**
- Purpose: Async processing with retry and priority handling
- Examples: `backend/src/services/jobQueueService.ts` (EventEmitter-based)
- Pattern: Priority queue with exponential backoff retry

## Entry Points

**API Entry Point:**
- Location: `backend/src/index.ts`
- Triggers: Process startup or Firebase Functions invocation
- Responsibilities:
  - Initialize Express app
  - Set up middleware (CORS, helmet, rate limiting, authentication)
  - Register routes (`/documents`, `/vector`, `/monitoring`, `/api/audit`)
  - Start job queue service
  - Export Firebase Functions v2 handlers (`api`, `processDocumentJobs`)

**Scheduled Job Processing:**
- Location: `backend/src/index.ts` (line 252: `processDocumentJobs` function export)
- Triggers: Firebase Cloud Scheduler every 1 minute
- Responsibilities:
  - Health check database connection
  - Detect stuck jobs (processing > 15 min, pending > 2 min)
  - Call `jobProcessorService.processJobs()`
  - Log metrics and errors

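The stuck-job thresholds above (processing > 15 min, pending > 2 min) amount to a simple age filter. The following is a sketch under stated assumptions: the `Job` shape, the `updatedAtMs` field, and `findStuckJobs` are hypothetical and do not describe the actual `ProcessingJobModel` schema.

```typescript
type JobStatus = "pending" | "processing" | "completed" | "failed";

interface Job {
  id: string;
  status: JobStatus;
  // Assumed field: epoch milliseconds of the last status change.
  updatedAtMs: number;
}

const PROCESSING_LIMIT_MS = 15 * 60 * 1000; // processing > 15 min
const PENDING_LIMIT_MS = 2 * 60 * 1000; // pending > 2 min

// Returns jobs whose time in their current status exceeds the threshold for that status.
function findStuckJobs(jobs: Job[], nowMs: number): Job[] {
  return jobs.filter((job) => {
    const ageMs = nowMs - job.updatedAtMs;
    if (job.status === "processing") return ageMs > PROCESSING_LIMIT_MS;
    if (job.status === "pending") return ageMs > PENDING_LIMIT_MS;
    return false; // completed and failed jobs are never stuck
  });
}
```

Passing `nowMs` in explicitly, rather than reading the clock inside the function, keeps the check deterministic in tests.
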
**Frontend Entry Point:**
- Location: `frontend/src/main.tsx`
- Triggers: Browser navigation
- Responsibilities:
  - Initialize React app with AuthProvider
  - Set up Firebase client
  - Render routing structure (Login → Dashboard)

**Document Processing Controller:**
- Location: `backend/src/controllers/documentController.ts`
- Route: `POST /documents/{id}/process`
- Responsibilities:
  - Validate user authentication
  - Enqueue processing job
  - Return job ID to client

## Error Handling

**Strategy:** Multi-layer error recovery with structured logging and graceful degradation

**Patterns:**
- **Retry Logic:** DocumentModel uses exponential backoff (1s → 2s → 4s) for network errors
- **LLM Retry:** `llmService` retries API calls 3 times with exponential backoff
- **Firebase Auth Recovery:** `firebaseAuth.ts` attempts session recovery on token verify failure
- **Job Queue Retry:** Jobs retry up to 3 times with configurable backoff (5s → 300s max)
- **Structured Error Logging:** All errors include correlation ID, stack trace, and context metadata
- **Circuit Breaker Pattern:** Database health check in `processDocumentJobs` prevents cascading failures

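The backoff schedules listed above follow the usual capped-doubling formula. A minimal sketch, assuming plain doubling with a cap (the job queue backoff is described only as "configurable", so doubling there is an assumption); `backoffDelayMs` is a hypothetical helper, not a function from the codebase.

```typescript
// Delay before retry number `attempt` (0-based): base * 2^attempt, capped at maxMs.
function backoffDelayMs(attempt: number, baseMs: number, maxMs: number): number {
  return Math.min(baseMs * 2 ** attempt, maxMs);
}

// DocumentModel-style schedule: 1s -> 2s -> 4s
const modelDelays = [0, 1, 2].map((attempt) => backoffDelayMs(attempt, 1000, 4000));
// [1000, 2000, 4000]

// Job-queue-style schedule starting at 5s and capped at 300s (doubling assumed)
const queueDelays = [0, 1, 2, 3, 4, 5, 6].map((attempt) => backoffDelayMs(attempt, 5000, 300000));
// [5000, 10000, 20000, 40000, 80000, 160000, 300000]
```

Production retry loops often add random jitter to these delays so that concurrent workers do not retry in lockstep; that is omitted here for clarity.
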
**Error Boundaries:**
- Global error handler at end of Express middleware chain (`errorHandler`)
- Try/catch in all async functions with context-aware logging
- Unhandled rejection listener at process level (line 24 of `index.ts`)

## Cross-Cutting Concerns

**Logging:**
- Framework: Winston (json + console in dev)
- Approach: Structured logger with correlation IDs, Winston transports for error/upload logs
- Location: `backend/src/utils/logger.ts`
- Pattern: `logger.info()`, `logger.error()`, `StructuredLogger` for operations

**Validation:**
- Approach: Joi schema in environment config, Zod for API request/response types
- Location: `backend/src/config/env.ts`, `backend/src/services/llmSchemas.ts`
- Pattern: Joi for config, Zod for runtime validation

**Authentication:**
- Approach: Firebase ID tokens verified via `verifyFirebaseToken` middleware
- Location: `backend/src/middleware/firebaseAuth.ts`
- Pattern: Bearer token in Authorization header, cached in req.user

**Correlation Tracking:**
- Approach: UUID correlation ID added to all requests, propagated through job processing
- Location: `backend/src/middleware/validation.ts` (addCorrelationId)
- Pattern: X-Correlation-ID header or generated UUID, included in all logs

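The "header or generated UUID" pattern can be sketched independently of Express. This is an illustration only: `resolveCorrelationId` and the `IncomingHeaders` type are hypothetical, and the real `addCorrelationId` middleware may behave differently.

```typescript
import { randomUUID } from "node:crypto";

// Shape of Node's parsed request headers (names arrive lowercased).
type IncomingHeaders = Record<string, string | string[] | undefined>;

// Reuse the caller's X-Correlation-ID when present, otherwise generate a UUID.
function resolveCorrelationId(headers: IncomingHeaders): string {
  const raw = headers["x-correlation-id"];
  const value = Array.isArray(raw) ? raw[0] : raw;
  const trimmed = value ? value.trim() : "";
  return trimmed !== "" ? trimmed : randomUUID();
}
```

In an Express middleware, the resolved ID would typically be attached to the request object and echoed back in the response header so that client and server logs can be joined.
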
---

*Architecture analysis: 2026-02-24*
329
.planning/codebase/CONCERNS.md
Normal file
@@ -0,0 +1,329 @@
# Codebase Concerns

**Analysis Date:** 2026-02-24

## Tech Debt

**Console.log Debug Statements in Controllers:**
- Issue: Excessive `console.log()` calls with emoji prefixes left throughout `documentController.ts` instead of using proper structured logging via Winston logger
- Files: `backend/src/controllers/documentController.ts` (lines 12-80, multiple scattered instances)
- Impact: Production logs become noisy and unstructured; debug output leaks to stdout/stderr; makes it harder to parse logs for errors and metrics
- Fix approach: Replace all `console.log()` calls with `logger.info()`, `logger.debug()`, `logger.error()` via imported `logger` from `utils/logger.ts`. Follow pattern established in other services.

**Incomplete Job Statistics Tracking:**
- Issue: `jobQueueService.ts` and `jobProcessorService.ts` both have TODO markers indicating completed/failed job counts are not tracked (lines 606-607, 635-636)
- Files: `backend/src/services/jobQueueService.ts`, `backend/src/services/jobProcessorService.ts`
- Impact: Job queue health metrics are incomplete; cannot audit success/failure rates; monitoring dashboards will show incomplete data
- Fix approach: Implement `completedJobs` and `failedJobs` counters in both services using persistent storage or Redis. Update schema if needed.

**Config Migration Debug Cruft:**
- Issue: Multiple `console.log()` debug statements in `config/env.ts` (lines 23, 46, 51, 292) for Firebase Functions v1→v2 migration are still present
- Files: `backend/src/config/env.ts`
- Impact: Production logs polluted with migration warnings; makes it harder to spot real issues; clutters server startup output
- Fix approach: Remove all `[CONFIG DEBUG]` `console.log()` statements once migration to Firebase Functions v2 is confirmed complete. Wrap remaining fallback logic in `logger.debug()` if diagnostics are needed.

**Hardcoded Processing Strategy:**
- Issue: A historical commit shows the processing strategy was hardcoded; the refactoring away from it may be incomplete
- Files: `backend/src/services/`, controller logic
- Impact: May not correctly use configured strategy; processing may default unexpectedly
- Fix approach: Verify all processing paths read from `config.processingStrategy` and have proper fallback logic

**Type Safety Issues - `any` Type Usage:**
- Issue: 378 instances of `any` or `unknown` types found across backend TypeScript files
- Files: Widespread including `optimizedAgenticRAGProcessor.ts:17`, `pdfGenerationService.ts`, `vectorDatabaseService.ts`
- Impact: Loses type safety guarantees; harder to catch errors at compile time; refactoring becomes risky
- Fix approach: Gradually replace `any` with proper types. Start with service boundaries and public APIs. Create typed interfaces for common patterns.

## Known Bugs

**Project Panther CIM KPI Missing After Processing:**
- Symptoms: Document `Project Panther - Confidential Information Memorandum_vBluePoint.pdf` processed but dashboard shows "Not specified in CIM" for Revenue, EBITDA, Employees, Founded even though numeric tables exist in PDF
- Files: `backend/src/services/optimizedAgenticRAGProcessor.ts` (dealOverview mapper), processing pipeline
- Trigger: Process Project Panther test document through full agentic RAG pipeline
- Impact: Dashboard KPI cards remain empty; users see incomplete summaries
- Workaround: Manual data entry in dashboard; skip financial summary display for affected documents
- Fix approach: Trace through `optimizedAgenticRAGProcessor.generateLLMAnalysisMultiPass()` → `dealOverview` mapper. Add regression test for this specific document. Check if structured table extraction is working correctly.

**10+ Minute Processing Latency Regression:**
- Symptoms: Document `document-55c4a6e2-8c08-4734-87f6-24407cea50ac.pdf` (Project Panther) took ~10 minutes end-to-end despite typical processing being 2-3 minutes
- Files: `backend/src/services/unifiedDocumentProcessor.ts`, `optimizedAgenticRAGProcessor.ts`, `documentAiProcessor.ts`, `llmService.ts`
- Trigger: Large or complex CIM documents (30+ pages with tables)
- Impact: Users experience timeouts; processing time approaches or exceeds the 14-minute Firebase Functions limit
- Workaround: None currently; document fails to process if latency exceeds timeout
- Fix approach: Instrument each pipeline phase (PDF chunking, Document AI extraction, RAG passes, financial parser) with timing logs. Identify bottleneck(s). Profile GCS upload retries, Anthropic fallbacks. Consider parallel multi-pass queries within quota limits.

**Vector Search Timeouts After Index Growth:**
- Symptoms: Supabase vector search RPC calls time out after 30 seconds; the service falls back to document-scoped search with limited results
- Files: `backend/src/services/vectorDatabaseService.ts` (lines 122-182)
- Trigger: Large embedded document collections (1000+ chunks); similarity search under load
- Impact: Retrieval quality degrades as index grows; fallback search returns fewer contextual chunks; RAG quality suffers
- Workaround: Fallback query uses document-scoped filtering and direct embedding lookup
- Fix approach: Implement query batching, result caching by content hash, or query optimization. Consider Pinecone migration if Supabase vector performance doesn't improve. Add metrics to track timeout frequency.

## Security Considerations

**Unencrypted Debug Logs in Production:**
- Risk: Sensitive document content, user IDs, and processing details may be exposed in logs if debug mode enabled in production
- Files: `backend/src/middleware/firebaseAuth.ts` (AUTH_DEBUG flag), `backend/src/config/env.ts`, `backend/src/controllers/documentController.ts`
- Current mitigation: Debug logging controlled by `AUTH_DEBUG` environment variable; not enabled by default
- Recommendations:
  1. Ensure `AUTH_DEBUG` is never set to `true` in production
  2. Implement log redaction middleware to strip secrets and PII (API keys, document content, user data)
  3. Use correlation IDs instead of logging full request bodies
  4. Add log level enforcement (error/warn only in production)

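The redaction recommendation can be illustrated with a recursive key-based filter. This is a sketch, not the project's logger: `SENSITIVE_KEYS` and `redact` are hypothetical, and the key list is an assumed starting point that would need to match the fields the application actually logs.

```typescript
// Assumed list of sensitive field names, compared case-insensitively.
const SENSITIVE_KEYS = new Set(["authorization", "apikey", "api_key", "token", "password", "documenttext"]);

// Returns a copy of `value` with sensitive fields replaced by a placeholder.
// Only plain objects and arrays are traversed; other values pass through unchanged.
function redact(value: unknown): unknown {
  if (Array.isArray(value)) return value.map(redact);
  if (value !== null && typeof value === "object") {
    const out: Record<string, unknown> = {};
    for (const [key, inner] of Object.entries(value as Record<string, unknown>)) {
      out[key] = SENSITIVE_KEYS.has(key.toLowerCase()) ? "[REDACTED]" : redact(inner);
    }
    return out;
  }
  return value;
}
```

A function like this could be applied to log metadata before it reaches the transport, for example inside a custom Winston format.
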
**Hardcoded Service Account Credentials Path:**
- Risk: If service account key JSON is accidentally committed or exposed, attacker gains full GCS and Document AI access
- Files: `backend/src/config/env.ts`, `backend/src/utils/googleServiceAccount.ts`
- Current mitigation: `.env` file in `.gitignore`; credentials path via env var
- Recommendations:
  1. Use Firebase Functions secrets (`defineSecret()`) instead of env files
  2. Implement credential rotation policy
  3. Add pre-commit hook to prevent `.json` key files in commits
  4. Audit GCS bucket permissions quarterly

**Concurrent LLM Rate Limiting Insufficient:**
- Risk: Although `llmService.ts` limits concurrent calls to 1 (line 52), burst requests could still trigger Anthropic 429 rate limit errors during high load
- Files: `backend/src/services/llmService.ts` (MAX_CONCURRENT_LLM_CALLS = 1)
- Current mitigation: Max 1 concurrent call; retry with exponential backoff (3 attempts)
- Recommendations:
  1. Concurrency is already at the minimum of 1, so throttle further during peak hours by spacing queued calls (for example, a minimum interval between requests)
  2. Add request batching for multi-pass analysis
  3. Implement circuit breaker pattern to prevent cascading failures
  4. Monitor token spend and throttle proactively

**No Request Rate Limiting on Upload Endpoint:**
- Risk: Attackers could flood the `/documents/upload-url` endpoint to exhaust quota or fill storage
- Files: `backend/src/controllers/documentController.ts` (getUploadUrl endpoint), `backend/src/routes/documents.ts`
- Current mitigation: Firebase Auth check; file size limit enforced
- Recommendations:
  1. Add rate limiter middleware (e.g., express-rate-limit) with per-user quotas
  2. Implement request signing for upload URLs
  3. Add CORS restrictions to known frontend domains
  4. Monitor upload rate and alert on anomalies

## Performance Bottlenecks

**Large File PDF Chunking Memory Usage:**
- Problem: Documents larger than 50 MB may cause OOM errors during chunking; no memory limit guards
- Files: `backend/src/services/optimizedAgenticRAGProcessor.ts` (line 35, 4000-char chunks), `backend/src/services/unifiedDocumentProcessor.ts`
- Cause: Entire document text loaded into memory before chunking; large overlap between chunks multiplies footprint
- Improvement path:
  1. Implement streaming chunk processing from GCS (read chunks, embed, write to DB before next chunk)
  2. Reduce overlap from 200 to 100 characters or make dynamic based on document size
  3. Add memory threshold checks; fail early with user-friendly error if approaching limit
  4. Profile heap usage in tests with 50+ MB documents

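The chunk size and overlap figures above (4000 characters, 200-character overlap) describe a sliding window. A minimal sketch of that mechanic follows; `chunkText` is a hypothetical helper that works on an in-memory string, so it shows how overlap drives chunk count rather than the streaming approach proposed in the improvement path.

```typescript
// Splits text into fixed-size windows where each window repeats the last
// `overlap` characters of the previous one.
function chunkText(text: string, chunkSize = 4000, overlap = 200): string[] {
  if (chunkSize <= 0 || overlap < 0 || overlap >= chunkSize) {
    throw new RangeError("require chunkSize > 0 and 0 <= overlap < chunkSize");
  }
  const chunks: string[] = [];
  const step = chunkSize - overlap; // how far the window advances each time
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break; // this window reached the end
  }
  return chunks;
}
```

With these defaults the window advances 3800 characters per chunk, so halving the overlap to 100 reduces the duplicated text per chunk from 5% to 2.5% of the chunk size.
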
**Embedding Generation for Large Documents:**
- Problem: Embedding 1000+ chunks sequentially takes 2-3 minutes; no concurrency despite `maxConcurrentEmbeddings = 5` setting
- Files: `backend/src/services/optimizedAgenticRAGProcessor.ts` (lines 37, 172-180 region)
- Cause: Batch size of 10 may be inefficient; OpenAI/Anthropic API concurrency not fully utilized
- Improvement path:
  1. Increase batch size to 25-50 chunks per concurrent request (test quota limits)
  2. Use Promise.all() instead of sequential embedding calls
  3. Cache embeddings by content hash to skip re-embedding on retries
  4. Add progress callback to track batch completion

**Multiple LLM Retries on Network Failure:**
- Problem: 3 retry attempts for each LLM call with exponential backoff means up to 30+ seconds per call; multi-pass analysis does 3+ passes
- Files: `backend/src/services/llmService.ts` (retry logic, lines 320+), `backend/src/services/optimizedAgenticRAGProcessor.ts` (line 83 multi-pass)
- Cause: No circuit breaker; all retries execute even if service degraded
- Improvement path:
  1. Track consecutive failures; disable retries if failure rate >50% in last minute
  2. Use adaptive retry backoff (double wait time only after first failure)
  3. Implement multi-pass fallback: if Pass 2 fails, use Pass 1 results instead of failing entire document
  4. Add metrics endpoint to show retry frequency and success rates

**PDF Generation Memory Leak with Puppeteer Page Pool:**
- Problem: Page pool in `pdfGenerationService.ts` may not properly release browser resources; max pool size 5 but no eviction policy
- Files: `backend/src/services/pdfGenerationService.ts` (lines 66-71, page pool)
- Cause: Pages may not be closed if PDF generation errors mid-stream; no cleanup on timeout
- Improvement path:
  1. Implement LRU eviction: close oldest page if pool reaches max size
  2. Add page timeout with forced close after 30s
  3. Add memory monitoring; close all pages if heap >500MB
  4. Log page pool stats every 5 minutes to detect leaks

## Fragile Areas

**Job Queue State Machine:**
- Files: `backend/src/services/jobQueueService.ts`, `backend/src/services/jobProcessorService.ts`, `backend/src/models/ProcessingJobModel.ts`
- Why fragile:
  1. Job status transitions (pending → processing → completed) not atomic; race condition if two workers pick same job
  2. Stuck job detection relies on timestamp comparison; clock skew or server restart breaks detection
  3. No idempotency tokens; job retry on network error could trigger duplicate processing
- Safe modification:
  1. Add database-level unique constraint on job ID + processing timestamp
  2. Use database transactions for status updates
  3. Implement idempotency with request deduplication ID
- Test coverage:
  1. No unit tests found for concurrent job processing scenario
  2. No integration tests with actual database
  3. Add tests for: concurrent workers, stuck job reset, duplicate submissions

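The race described above disappears when claiming a job is a single conditional update: only a job that is still `pending` can move to `processing`. The sketch below models that with an in-memory map; `JobRow`, `claimJob`, and the SQL in the comment (including table and column names) are hypothetical, and a real fix would rely on the database executing the check and the write as one statement.

```typescript
type JobStatus = "pending" | "processing" | "completed" | "failed";

interface JobRow {
  id: string;
  status: JobStatus;
  claimedBy?: string;
}

// In-memory stand-in for a conditional update such as (hypothetical schema):
//   UPDATE processing_jobs SET status = 'processing', claimed_by = $2
//   WHERE id = $1 AND status = 'pending'
// The claim succeeds only if the row was still pending.
function claimJob(rows: Map<string, JobRow>, jobId: string, workerId: string): boolean {
  const row = rows.get(jobId);
  if (!row || row.status !== "pending") return false;
  row.status = "processing";
  row.claimedBy = workerId;
  return true;
}
```

A worker that receives `false` skips the job, so two workers polling at the same moment cannot both process it.
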
**Document Processing Pipeline Error Handling:**
- Files: `backend/src/controllers/documentController.ts` (lines 200+), `backend/src/services/unifiedDocumentProcessor.ts`
- Why fragile:
  1. Hybrid approach tries job queue then fallback to immediate processing; error in job queue doesn't fully propagate
  2. Document status not updated if processing fails mid-pipeline (remains 'processing_llm')
  3. No compensating transaction to roll back partial results
- Safe modification:
  1. Separate job submission from immediate processing; always update document status atomically
  2. Add processing stage tracking (document_ai → chunking → embedding → llm → pdf)
  3. Implement rollback logic: delete chunks and embeddings if LLM stage fails
- Test coverage:
  1. Add tests for each pipeline stage failure
  2. Test document status consistency after each failure
  3. Add integration test with network failure injection

**Vector Database Search Fallback Chain:**
- Files: `backend/src/services/vectorDatabaseService.ts` (lines 110-182)
- Why fragile:
  1. Three-level fallback (RPC search → document-scoped search → direct lookup) masks underlying issues
  2. If Supabase RPC is degraded, system degrades silently instead of alerting
  3. Fallback search may return stale or incorrect results without indication
- Safe modification:
  1. Add circuit breaker: if timeout happens 3x in 5 minutes, stop trying RPC search
  2. Return metadata flag indicating which fallback was used (for logging/debugging)
  3. Add explicit timeout wrapped in try/catch, not via Promise.race() (cleaner code)
- Test coverage:
  1. Mock Supabase timeout at each RPC level
  2. Verify correct fallback is triggered
  3. Add performance benchmarks for each search method

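The proposed breaker rule (3 timeouts within 5 minutes stops RPC attempts) can be sketched as a sliding-window counter. `TimeoutCircuitBreaker` is a hypothetical class, not part of `vectorDatabaseService.ts`, and it takes the current time as an argument so the behavior can be tested without real delays.

```typescript
// Opens when `threshold` timeouts fall inside the trailing `windowMs` window.
class TimeoutCircuitBreaker {
  private timeouts: number[] = [];
  private readonly threshold: number;
  private readonly windowMs: number;

  constructor(threshold = 3, windowMs = 5 * 60 * 1000) {
    this.threshold = threshold;
    this.windowMs = windowMs;
  }

  // Call whenever an RPC search times out.
  recordTimeout(nowMs: number): void {
    this.timeouts.push(nowMs);
    this.prune(nowMs);
  }

  // While open, callers should skip the RPC search and go straight to the fallback.
  isOpen(nowMs: number): boolean {
    this.prune(nowMs);
    return this.timeouts.length >= this.threshold;
  }

  // Drops timeouts that are older than the window.
  private prune(nowMs: number): void {
    this.timeouts = this.timeouts.filter((t) => nowMs - t < this.windowMs);
  }
}
```

Because old timeouts age out of the window, the breaker closes again on its own once the RPC path has been quiet long enough; logging each open and close event would address the silent-degradation concern in the list above.
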
**Config Initialization Race Condition:**
- Files: `backend/src/config/env.ts` (lines 15-52)
- Why fragile:
  1. Firebase Functions v1 fallback (`functions.config()`) may not be thread-safe
  2. If multiple instances start simultaneously, config merge may be incomplete
  3. No validation that config merge was successful
- Safe modification:
  1. Remove v1 fallback entirely; require explicit Firebase Functions v2 setup
  2. Validate all critical env vars before allowing service startup
  3. Fail fast with clear error message if required vars missing
- Test coverage:
  1. Add test for missing required env vars
  2. Test with incomplete config to verify error message clarity

## Scaling Limits
|
||||||
|
|
||||||
|
**Supabase Concurrent Vector Search Connections:**
|
||||||
|
- Current capacity: RPC timeout 30 seconds; Supabase connection pool typically 100 max
|
||||||
|
- Limit: With 3 concurrent workers × multiple users, could exhaust connection pool during peak load
|
||||||
|
- Scaling path:
|
||||||
|
1. Implement connection pooling via PgBouncer (already in Supabase Pro tier)
|
||||||
|
2. Reduce timeout from 30s to 10s; fail faster and retry
|
||||||
|
3. Migrate to Pinecone if vector search becomes >30% of workload
|
||||||
|
|
||||||
|
**Firebase Functions Timeout (14 minutes):**
|
||||||
|
- Current capacity: Serverless function execution up to 15 minutes (1 minute buffer before hard timeout)
|
||||||
|
- Limit: Document processing hitting ~10 minutes; adding new features could exceed limit
|
||||||
|
- Scaling path:
|
||||||
|
1. Move processing to Cloud Run (1 hour limit) for large documents
|
||||||
|
2. Implement processing timeout failover: if approach 12 minutes, checkpoint and requeue
|
||||||
|
3. Add background worker pool for long-running jobs (separate from request path)
|
||||||
|
|
||||||
|
**LLM API Rate Limits (Anthropic/OpenAI):**
|
||||||
|
- Current capacity: 1 concurrent call; 3 retries per call; no per-minute or per-second throttling beyond single-call serialization
|
||||||
|
- Limit: Burst requests from multiple users could trigger 429 rate limit errors
|
||||||
|
- Scaling path:
|
||||||
|
1. Negotiate higher rate limits with API providers
|
||||||
|
2. Implement request queuing with exponential backoff per user
|
||||||
|
3. Add cost monitoring and soft-limit alerts (warn at 80% of quota)
|
||||||
|
|
||||||
|
**PDF Generation Browser Pool:**
|
||||||
|
- Current capacity: 5 browser pages maximum
|
||||||
|
- Limit: With 3+ concurrent document processing jobs, pool contention causes delays (queue wait time)
|
||||||
|
- Scaling path:
|
||||||
|
1. Increase pool size to 10 (requires more memory)
|
||||||
|
2. Move PDF generation to separate worker queue (decouple from request path)
|
||||||
|
3. Implement adaptive pool sizing based on available memory
|
||||||
|
|
||||||
|
**GCS Upload/Download Throughput:**
|
||||||
|
- Current capacity: Single-threaded upload/download; file transfer waits on GCS API latency
|
||||||
|
- Limit: Large documents (50+ MB) may timeout or be slow
|
||||||
|
- Scaling path:
|
||||||
|
1. Implement resumable uploads with multi-part chunks
|
||||||
|
2. Add parallel chunk uploads for files >10 MB
|
||||||
|
3. Cache frequently accessed documents in Redis
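
For the chunked upload steps above, the range arithmetic is the part that can be shown without touching the storage client. The sketch below assumes a hypothetical `planChunks` helper; how each range is actually sent to GCS (resumable session, retries) is not covered here and would depend on the client library.

```typescript
// Hypothetical helper: split a file into byte ranges that could be uploaded in parallel.
interface ChunkRange {
  index: number;
  start: number; // inclusive byte offset
  end: number;   // exclusive byte offset
}

function planChunks(fileSize: number, chunkSize: number): ChunkRange[] {
  if (!Number.isInteger(fileSize) || fileSize < 0) {
    throw new RangeError("fileSize must be a non-negative integer");
  }
  if (!Number.isInteger(chunkSize) || chunkSize <= 0) {
    throw new RangeError("chunkSize must be a positive integer");
  }
  const ranges: ChunkRange[] = [];
  for (let start = 0, index = 0; start < fileSize; start += chunkSize, index++) {
    // The last chunk is clamped to the end of the file.
    ranges.push({ index, start, end: Math.min(start + chunkSize, fileSize) });
  }
  return ranges;
}
```

Keeping `end` exclusive means consecutive ranges share a boundary value, which makes it easy to verify that no byte is skipped or sent twice.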
|
||||||
|
|
||||||
|
## Dependencies at Risk
|
||||||
|
|
||||||
|
**Firebase Functions v1 Deprecation (EOL Dec 31, 2025):**
|
||||||
|
- Risk: Runtime will be decommissioned; Node.js 20 support ending Oct 30, 2026 (warning already surfaced)
|
||||||
|
- Impact: Functions will stop working after deprecation date; forced migration required
|
||||||
|
- Migration plan:
|
||||||
|
1. Migrate to Firebase Functions v2 runtime (already partially done; fallback code still present)
|
||||||
|
2. Update `firebase-functions` package to latest major version
|
||||||
|
3. Remove deprecated `functions.config()` fallback once migration confirmed
|
||||||
|
4. Test all functions after upgrade
|
||||||
|
|
||||||
|
**Puppeteer Version Pinning:**
|
||||||
|
- Risk: Puppeteer has frequent security updates; pinned version likely outdated
|
||||||
|
- Impact: Browser vulnerabilities in PDF generation; potential sandbox bypass
|
||||||
|
- Migration plan:
|
||||||
|
1. Audit current Puppeteer version in `package.json`
|
||||||
|
2. Test upgrade path (may have breaking API changes)
|
||||||
|
3. Implement automated dependency security scanning
|
||||||
|
|
||||||
|
**Document AI API Versioning:**
|
||||||
|
- Risk: Google Cloud Document AI API may deprecate current processor version
|
||||||
|
- Impact: Processing pipeline breaks if processor ID no longer valid
|
||||||
|
- Migration plan:
|
||||||
|
1. Document current processor version and creation date
|
||||||
|
2. Subscribe to Google Cloud deprecation notices
|
||||||
|
3. Add feature flag to switch processor versions
|
||||||
|
4. Test new processor version before migration
|
||||||
|
|
||||||
|
## Missing Critical Features
|
||||||
|
|
||||||
|
**Job Processing Observability:**
|
||||||
|
- Problem: No metrics for job success rate, average processing time per stage, or failure breakdown by error type
|
||||||
|
- Blocks: Cannot diagnose performance regressions; cannot identify bottlenecks
|
||||||
|
- Implementation: Add `/health/agentic-rag` endpoint exposing per-pass timing, token usage, cost data
|
||||||
|
|
||||||
|
**Document Version History:**
|
||||||
|
- Problem: Processing pipeline overwrites `analysis_data` on each run; no ability to compare old vs. new results
|
||||||
|
- Blocks: Cannot detect if new model version improves accuracy; hard to debug regression
|
||||||
|
- Implementation: Add `document_versions` table; keep historical results; implement diff UI
|
||||||
|
|
||||||
|
**Retry Mechanism for Failed Documents:**
|
||||||
|
- Problem: Failed documents stay in failed state; no way to retry after infrastructure recovers
|
||||||
|
- Blocks: User must re-upload document; processing failures are permanent per upload
|
||||||
|
- Implementation: Add "Retry" button to failed document status; re-queue without user re-upload
|
||||||
|
|
||||||
|
## Test Coverage Gaps
|
||||||
|
|
||||||
|
**End-to-End Pipeline with Large Documents:**
|
||||||
|
- What's not tested: Full processing pipeline with 50+ MB documents; covers PDF chunking, Document AI extraction, embeddings, LLM analysis, PDF generation
|
||||||
|
- Files: No integration test covering full flow with large fixture
|
||||||
|
- Risk: Cannot detect if scaling to large documents introduces timeouts or memory issues
|
||||||
|
- Priority: High (Project Panther regression was not caught by tests)
|
||||||
|
|
||||||
|
**Concurrent Job Processing:**
|
||||||
|
- What's not tested: Multiple jobs submitted simultaneously; verify no race conditions in job queue or database
|
||||||
|
- Files: `backend/src/services/jobQueueService.ts`, `backend/src/models/ProcessingJobModel.ts`
|
||||||
|
- Risk: Race condition causes duplicate processing or lost job state in production
|
||||||
|
- Priority: High (affects reliability)
|
||||||
|
|
||||||
|
**Vector Database Fallback Scenarios:**
|
||||||
|
- What's not tested: Simulate Supabase RPC timeout and verify correct fallback search is executed
|
||||||
|
- Files: `backend/src/services/vectorDatabaseService.ts` (lines 110-182)
|
||||||
|
- Risk: Silent failures or incorrect results from the fallback search go undetected
|
||||||
|
- Priority: Medium (affects search quality)
|
||||||
|
|
||||||
|
**LLM API Provider Switching:**
|
||||||
|
- What's not tested: Switch between Anthropic, OpenAI, OpenRouter; verify each provider works correctly
|
||||||
|
- Files: `backend/src/services/llmService.ts` (provider selection logic)
|
||||||
|
- Risk: Provider-specific bugs not caught until production usage
|
||||||
|
- Priority: Medium (currently only Anthropic heavily used)
|
||||||
|
|
||||||
|
**Error Propagation in Hybrid Processing:**
|
||||||
|
- What's not tested: Job queue failure → immediate processing fallback; verify document status and error reporting
|
||||||
|
- Files: `backend/src/controllers/documentController.ts` (lines 200+)
|
||||||
|
- Risk: Silent failures or incorrect status updates if fallback error not properly handled
|
||||||
|
- Priority: High (affects user experience)
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
*Concerns audit: 2026-02-24*
|
||||||
286
.planning/codebase/CONVENTIONS.md
Normal file
@@ -0,0 +1,286 @@
|
|||||||
|
# Coding Conventions
|
||||||
|
|
||||||
|
**Analysis Date:** 2026-02-24
|
||||||
|
|
||||||
|
## Naming Patterns
|
||||||
|
|
||||||
|
**Files:**
|
||||||
|
- Backend service files: `camelCase.ts` (e.g., `llmService.ts`, `unifiedDocumentProcessor.ts`, `vectorDatabaseService.ts`)
|
||||||
|
- Backend middleware/controllers: `camelCase.ts` (e.g., `errorHandler.ts`, `firebaseAuth.ts`)
|
||||||
|
- Frontend components: `PascalCase.tsx` (e.g., `DocumentUpload.tsx`, `LoginForm.tsx`, `ProtectedRoute.tsx`)
|
||||||
|
- Frontend utility files: `camelCase.ts` (e.g., `cn.ts` for class name utilities)
|
||||||
|
- Type definition files: `camelCase.ts` with `.d.ts` suffix optional (e.g., `express.d.ts`)
|
||||||
|
- Model files: `PascalCase.ts` in `backend/src/models/` (e.g., `DocumentModel.ts`)
|
||||||
|
- Config files: `camelCase.ts` (e.g., `env.ts`, `firebase.ts`, `supabase.ts`)
|
||||||
|
|
||||||
|
**Functions:**
|
||||||
|
- Both backend and frontend use camelCase: `processDocument()`, `validateUUID()`, `handleUpload()`
|
||||||
|
- React components are PascalCase: `DocumentUpload`, `ErrorHandler`
|
||||||
|
- Handler functions use `handle` or verb prefix: `handleVisibilityChange()`, `onDrop()`
|
||||||
|
- Async functions use descriptive names: `fetchDocuments()`, `uploadDocument()`, `processDocument()`
|
||||||
|
|
||||||
|
**Variables:**
|
||||||
|
- camelCase for all variables: `documentId`, `correlationId`, `isUploading`, `uploadedFiles`
|
||||||
|
- Constants use UPPER_SNAKE_CASE (rare): `MAX_CONCURRENT_LLM_CALLS`, `MAX_TOKEN_LIMITS`
|
||||||
|
- Boolean prefixes: `is*` (isUploading, isAdmin), `has*` (hasError), `can*` (canProcess)
|
||||||
|
|
||||||
|
**Types:**
|
||||||
|
- Interfaces use PascalCase: `LLMRequest`, `UploadedFile`, `DocumentUploadProps`, `CIMReview`
|
||||||
|
- Type unions use PascalCase: `ErrorCategory`, `ProcessingStrategy`
|
||||||
|
- Generic types use single uppercase letter or descriptive name: `T`, `K`, `V`
|
||||||
|
- Enum values use UPPER_SNAKE_CASE: `ErrorCategory.VALIDATION`, `ErrorCategory.AUTHENTICATION`
|
||||||
|
|
||||||
|
**Interfaces vs Types:**
|
||||||
|
- **Interfaces** for object shapes that represent entities or components: `interface Document`, `interface UploadedFile`
|
||||||
|
- **Types** for unions, primitives, and specialized patterns: `type ProcessingStrategy = 'document_ai_agentic_rag' | 'simple_full_document'`
|
||||||
|
|
||||||
|
## Code Style
|
||||||
|
|
||||||
|
**Formatting:**
|
||||||
|
- No formal Prettier config detected in repo (formatting varies and is not enforced)
|
||||||
|
- 2-space indentation (observed in TypeScript files)
|
||||||
|
- Semicolons required at end of statements
|
||||||
|
- Single quotes for strings in TypeScript, double quotes in JSX attributes
|
||||||
|
- Line length: preferably under 100 characters but not enforced
|
||||||
|
|
||||||
|
**Linting:**
|
||||||
|
- Tool: ESLint with TypeScript support
|
||||||
|
- Config: `.eslintrc.js` in backend
|
||||||
|
- Key rules:
|
||||||
|
- `@typescript-eslint/no-unused-vars`: error (allows leading underscore for intentionally unused)
|
||||||
|
- `@typescript-eslint/no-explicit-any`: warn (use `unknown` instead)
|
||||||
|
- `@typescript-eslint/no-non-null-assertion`: warn (use proper type guards)
|
||||||
|
- `no-console`: off in backend (logging used via Winston)
|
||||||
|
- `no-undef`: error (strict undefined checking)
|
||||||
|
- Frontend ESLint ignores unused disable directives and has max-warnings: 0
|
||||||
|
|
||||||
|
**TypeScript Standards:**
|
||||||
|
- Strict mode not fully enabled (noImplicitAny disabled in tsconfig.json for legacy reasons)
|
||||||
|
- Prefer explicit typing over `any`: use `unknown` when type is truly unknown
|
||||||
|
- Type guards required for safety checks: `error instanceof Error ? error.message : String(error)`
|
||||||
|
- No type assertions with `as` for complex types; use proper type narrowing
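
The type-guard convention above could look like the following in practice. `toErrorMessage` wraps the expression quoted in the list; `hasErrorCode` is a hypothetical user-defined guard added here for illustration and is not taken from the codebase. It relies on `in` narrowing for `unknown` property access, which requires TypeScript 4.9 or later.

```typescript
// Narrow an unknown caught value without `as` or `any`.
function toErrorMessage(error: unknown): string {
  return error instanceof Error ? error.message : String(error);
}

// Hypothetical user-defined type guard for values that carry a string `code`.
function hasErrorCode(value: unknown): value is { code: string } {
  return (
    typeof value === "object" &&
    value !== null &&
    "code" in value &&
    typeof value.code === "string"
  );
}

// Example use inside a catch block: the guard narrows before the property access.
function describeFailure(error: unknown): string {
  const message = toErrorMessage(error);
  return hasErrorCode(error) ? `${error.code}: ${message}` : message;
}
```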
|
||||||
|
|
||||||
|
## Import Organization
|
||||||
|
|
||||||
|
**Order:**
|
||||||
|
1. External framework/library imports (`express`, `react`, `winston`)
|
||||||
|
2. Google Cloud/Firebase imports (`@google-cloud/storage`, `firebase-admin`)
|
||||||
|
3. Third-party service imports (`axios`, `zod`, `joi`)
|
||||||
|
4. Internal config imports (`'../config/env'`, `'../config/firebase'`)
|
||||||
|
5. Internal utility imports (`'../utils/logger'`, `'../utils/cn'`)
|
||||||
|
6. Internal model imports (`'../models/DocumentModel'`)
|
||||||
|
7. Internal service imports (`'../services/llmService'`)
|
||||||
|
8. Internal middleware/helper imports (`'../middleware/errorHandler'`)
|
||||||
|
9. Type-only imports at the end: `import type { ProcessingStrategy } from '...'`
|
||||||
|
|
||||||
|
**Examples:**
|
||||||
|
|
||||||
|
Backend service pattern from `optimizedAgenticRAGProcessor.ts`:
|
||||||
|
```typescript
import { logger } from '../utils/logger';
import { vectorDatabaseService } from './vectorDatabaseService';
import { VectorDatabaseModel } from '../models/VectorDatabaseModel';
import { llmService } from './llmService';
import { CIMReview } from './llmSchemas';
import { config } from '../config/env';
import type { ParsedFinancials } from './financialTableParser';
import type { StructuredTable } from './documentAiProcessor';
```
|
||||||
|
|
||||||
|
Frontend component pattern from `DocumentList.tsx`:
|
||||||
|
```typescript
import React from 'react';
import {
  FileText,
  Eye,
  Download,
  Trash2,
  Calendar,
  User,
  Clock
} from 'lucide-react';
import { cn } from '../utils/cn';
```
|
||||||
|
|
||||||
|
**Path Aliases:**
|
||||||
|
- No @ alias imports detected; all use relative `../` patterns
|
||||||
|
- Monorepo structure: frontend and backend in separate directories with independent module resolution
|
||||||
|
|
||||||
|
## Error Handling
|
||||||
|
|
||||||
|
**Patterns:**
|
||||||
|
|
||||||
|
1. **Structured Error Objects with Categories:**
|
||||||
|
- Use `ErrorCategory` enum for classification: `VALIDATION`, `AUTHENTICATION`, `AUTHORIZATION`, `NOT_FOUND`, `EXTERNAL_SERVICE`, `PROCESSING`, `DATABASE`, `SYSTEM`
|
||||||
|
- Attach `AppError` interface properties: `statusCode`, `isOperational`, `code`, `correlationId`, `category`, `retryable`, `context`
|
||||||
|
- Example from `errorHandler.ts`:
|
||||||
|
```typescript
const enhancedError: AppError = {
  category: ErrorCategory.VALIDATION,
  statusCode: 400,
  code: 'INVALID_UUID_FORMAT',
  retryable: false
};
```
|
||||||
|
|
||||||
|
2. **Try-Catch with Structured Logging:**
|
||||||
|
- Always catch errors with explicit type checking
|
||||||
|
- Log with structured data including correlation ID
|
||||||
|
- Example pattern:
|
||||||
|
```typescript
try {
  await operation();
} catch (error) {
  logger.error('Operation failed', {
    error: error instanceof Error ? error.message : String(error),
    stack: error instanceof Error ? error.stack : undefined,
    context: { documentId, userId }
  });
  throw error;
}
```
|
||||||
|
|
||||||
|
3. **HTTP Response Pattern:**
|
||||||
|
- Success responses: `{ success: true, data: {...} }`
|
||||||
|
- Error responses: `{ success: false, error: { code, message, details, correlationId, timestamp, retryable } }`
|
||||||
|
- User-friendly messages mapped by error category
|
||||||
|
- Include `X-Correlation-ID` header in responses
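
The response envelope described above can be expressed as a pair of types and builders. This is a sketch only: `ok`, `fail`, and the interface names are hypothetical, and setting the `X-Correlation-ID` header is left to the caller since it depends on the HTTP framework.

```typescript
// Shapes taken from the success/error patterns listed above.
interface ApiSuccess<T> {
  success: true;
  data: T;
}

interface ApiErrorBody {
  code: string;
  message: string;
  details?: unknown;
  correlationId: string;
  timestamp: string; // ISO 8601
  retryable: boolean;
}

interface ApiFailure {
  success: false;
  error: ApiErrorBody;
}

// Hypothetical builder for success responses.
function ok<T>(data: T): ApiSuccess<T> {
  return { success: true, data };
}

// Hypothetical builder for error responses; the timestamp is generated here.
function fail(
  code: string,
  message: string,
  correlationId: string,
  retryable = false,
  details?: unknown
): ApiFailure {
  return {
    success: false,
    error: {
      code,
      message,
      details,
      correlationId,
      timestamp: new Date().toISOString(),
      retryable
    }
  };
}
```

Because `success` is a literal type on both shapes, a client can narrow on it and get typed access to either `data` or `error`.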
|
||||||
|
|
||||||
|
4. **Retry Logic:**
|
||||||
|
- LLM service implements concurrency limiting: max 1 concurrent call to prevent rate limits
|
||||||
|
- 3 retry attempts for LLM API calls with exponential backoff (see `llmService.ts` lines 236-450)
|
||||||
|
- Jobs respect 14-minute timeout limit with graceful status updates
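
The combination of single-call concurrency and retries with exponential backoff might be sketched as below. None of this is copied from `llmService.ts`: the function names, the 1-second base delay, the 30-second cap, and the reading of "3 retry attempts" as three retries after the first call are all assumptions made for illustration.

```typescript
// Delay before retry number `attempt` (1-based): base, 2x base, 4x base, ... up to the cap.
function backoffDelayMs(attempt: number, baseMs = 1000, capMs = 30000): number {
  return Math.min(capMs, baseMs * 2 ** (attempt - 1));
}

const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Serialize calls: each task starts only after the previous one settles (max 1 in flight).
let tail: Promise<unknown> = Promise.resolve();
function runSerialized<T>(task: () => Promise<T>): Promise<T> {
  const result = tail.then(() => task());
  // Swallow rejections on the chain itself so one failure does not block later tasks.
  tail = result.catch(() => undefined);
  return result;
}

// Retry a task up to `maxRetries` times after the initial attempt.
async function withRetry<T>(
  task: () => Promise<T>,
  maxRetries = 3,
  baseMs = 1000
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await task();
    } catch (error) {
      if (attempt >= maxRetries) throw error;
      await sleep(backoffDelayMs(attempt + 1, baseMs));
    }
  }
}
```

A call site would compose the two, for example `runSerialized(() => withRetry(callProvider))`, where `callProvider` stands in for whatever function performs the actual API request.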
|
||||||
|
|
||||||
|
5. **External Service Errors:**
|
||||||
|
- Firebase Auth errors: extract from `error.message` and `error.name` (TokenExpiredError, JsonWebTokenError)
|
||||||
|
- Supabase errors: check `error.code` and `error.message`, handle UUID validation errors
|
||||||
|
- GCS errors: extract from error objects with proper null checks
|
||||||
|
|
||||||
|
## Logging
|
||||||
|
|
||||||
|
**Framework:** Winston logger from `backend/src/utils/logger.ts`
|
||||||
|
|
||||||
|
**Levels:**
|
||||||
|
- `logger.debug()`: Detailed diagnostic info (disabled in production)
|
||||||
|
- `logger.info()`: Normal operation information, upload start/completion, processing status
|
||||||
|
- `logger.warn()`: Warning conditions, CORS rejections, non-critical issues
|
||||||
|
- `logger.error()`: Error conditions with full context and stack traces
|
||||||
|
|
||||||
|
**Structured Logging Pattern:**
|
||||||
|
```typescript
logger.info('Message', {
  correlationId: correlationId,
  category: 'operation_type',
  operation: 'specific_action',
  documentId: documentId,
  userId: userId,
  metadata: value,
  timestamp: new Date().toISOString()
});
```
|
||||||
|
|
||||||
|
**StructuredLogger Class:**
|
||||||
|
- Use for operations requiring correlation ID tracking
|
||||||
|
- Constructor: `const logger = new StructuredLogger(correlationId)`
|
||||||
|
- Specialized methods:
|
||||||
|
- `uploadStart()`, `uploadSuccess()`, `uploadError()` - for file operations
|
||||||
|
- `processingStart()`, `processingSuccess()`, `processingError()` - for document processing
|
||||||
|
- `storageOperation()` - for file storage operations
|
||||||
|
- `jobQueueOperation()` - for background jobs
|
||||||
|
- `info()`, `warn()`, `error()`, `debug()` - general logging
|
||||||
|
- All methods automatically attach correlation ID to metadata
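
To make the "automatically attach correlation ID" behaviour concrete, here is a simplified stand-in for the class. It is not the implementation in `backend/src/utils/logger.ts`: the name `StructuredLoggerSketch`, the injected `sink` parameter (used instead of Winston so the example is self-contained), and the exact metadata keys are assumptions.

```typescript
type LogLevel = "debug" | "info" | "warn" | "error";
type LogMeta = Record<string, unknown>;
type LogSink = (level: LogLevel, message: string, meta: LogMeta) => void;

// Hypothetical stand-in for the real StructuredLogger.
class StructuredLoggerSketch {
  private readonly correlationId: string;
  private readonly sink: LogSink;

  constructor(correlationId: string, sink: LogSink) {
    this.correlationId = correlationId;
    this.sink = sink;
  }

  private write(level: LogLevel, message: string, meta: LogMeta = {}): void {
    // correlationId is spread last so caller-supplied metadata cannot overwrite it.
    this.sink(level, message, { ...meta, correlationId: this.correlationId });
  }

  info(message: string, meta?: LogMeta): void {
    this.write("info", message, meta);
  }

  warn(message: string, meta?: LogMeta): void {
    this.write("warn", message, meta);
  }

  error(message: string, meta?: LogMeta): void {
    this.write("error", message, meta);
  }

  // Specialized helper: fixes the category/operation fields for upload events.
  uploadStart(fileName: string, fileSize: number): void {
    this.write("info", "Upload started", {
      category: "upload",
      operation: "upload_start",
      fileName,
      fileSize
    });
  }
}
```

Routing every method through one private `write` keeps the correlation ID attachment in a single place, which is the property the list above describes.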
|
||||||
|
|
||||||
|
**What NOT to Log:**
|
||||||
|
- Credentials, API keys, or sensitive data
|
||||||
|
- Large file contents or binary data
|
||||||
|
- User passwords or tokens (log only presence: "token available" or "NO_TOKEN")
|
||||||
|
- Request body contents (sanitized in error handler - only whitelisted fields: documentId, id, status, fileName, fileSize, contentType, correlationId)
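
The whitelist rule for request bodies could be implemented along these lines. The field list is the one given above; the function name `sanitizeForLog` and the decision to return an empty object for non-object bodies are assumptions, not a description of the actual error handler.

```typescript
// Fields that may appear in logs, per the whitelist above.
const LOGGABLE_FIELDS: readonly string[] = [
  "documentId",
  "id",
  "status",
  "fileName",
  "fileSize",
  "contentType",
  "correlationId"
];

// Hypothetical helper: copy only whitelisted top-level fields from a request body.
function sanitizeForLog(body: unknown): Record<string, unknown> {
  if (typeof body !== "object" || body === null) {
    return {};
  }
  const allowed = new Set<string>(LOGGABLE_FIELDS);
  const safe: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(body)) {
    if (allowed.has(key)) {
      safe[key] = value;
    }
  }
  return safe;
}
```

An allow-list is the safer default here: a new sensitive field added to a request later stays out of the logs unless someone explicitly opts it in.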
|
||||||
|
|
||||||
|
**Console Usage:**
|
||||||
|
- Backend: `console.log` avoided in production code in favor of the Winston logger (by convention; the `no-console` ESLint rule is off in the backend)
|
||||||
|
- Frontend: `console.log` used in development (observed in DocumentUpload, App components)
|
||||||
|
- Special case: logger initialization may use console.warn for setup diagnostics
|
||||||
|
|
||||||
|
## Comments
|
||||||
|
|
||||||
|
**When to Comment:**
|
||||||
|
- Complex algorithms or business logic: explain "why", not "what" the code does
|
||||||
|
- Non-obvious type conversions or workarounds
|
||||||
|
- Links to related issues, tickets, or documentation
|
||||||
|
- Critical security considerations or performance implications
|
||||||
|
- TODO items for incomplete work (format: `// TODO: [description]`)
|
||||||
|
|
||||||
|
**JSDoc/TSDoc:**
|
||||||
|
- Used for function and class documentation in utility and service files
|
||||||
|
- Function signature example from `test-helpers.ts`:
|
||||||
|
```typescript
/**
 * Creates a mock correlation ID for testing
 */
export function createMockCorrelationId(): string
```
|
||||||
|
- Parameter and return types documented via TypeScript typing (preferred over verbose JSDoc)
|
||||||
|
- Service classes include operation summaries: `/** Process document using Document AI + Agentic RAG strategy */`
|
||||||
|
|
||||||
|
## Function Design
|
||||||
|
|
||||||
|
**Size:**
|
||||||
|
- Keep functions focused on single responsibility
|
||||||
|
- Long services (300+ lines) separate concerns into helper methods
|
||||||
|
- Controller/middleware functions stay under 50 lines
|
||||||
|
|
||||||
|
**Parameters:**
|
||||||
|
- Max 3-4 required parameters; use object for additional config
|
||||||
|
- Example: `processDocument(documentId: string, userId: string, text: string, options?: { strategy?: string })`
|
||||||
|
- Use destructuring for config objects: `{ strategy, maxTokens, temperature }`
|
||||||
|
|
||||||
|
**Return Values:**
|
||||||
|
- Async operations return Promise with typed success/error objects
|
||||||
|
- Pattern: `Promise<{ success: boolean; data: T; error?: string }>`
|
||||||
|
- Avoid throwing in service methods; return error in object
|
||||||
|
- Controllers/middleware can throw for Express error handler
|
||||||
|
|
||||||
|
**Type Signatures:**
|
||||||
|
- Always specify parameter and return types (no implicit `any`)
|
||||||
|
- Use generics for reusable patterns: `Promise<T>`, `Array<Document>`
|
||||||
|
- Union types for multiple possibilities: `'uploading' | 'uploaded' | 'processing' | 'completed' | 'error'`
|
||||||
|
|
||||||
|
## Module Design
|
||||||
|
|
||||||
|
**Exports:**
|
||||||
|
- Services exported as singleton instances: `export const llmService = new LLMService()`
|
||||||
|
- Utility functions exported as named exports: `export function validateUUID() { ... }`
|
||||||
|
- Type definitions exported from dedicated type files or alongside implementation
|
||||||
|
- Classes exported as default or named based on usage pattern
|
||||||
|
|
||||||
|
**Barrel Files:**
|
||||||
|
- Not consistently used; services import directly from implementation files
|
||||||
|
- Example: `import { llmService } from './llmService'` not from `./services/index`
|
||||||
|
- Consider adding for cleaner imports when services directory grows
|
||||||
|
|
||||||
|
**Service Singletons:**
|
||||||
|
- All services instantiated once and exported as singletons
|
||||||
|
- Examples:
|
||||||
|
- `backend/src/services/llmService.ts`: `export const llmService = new LLMService()`
|
||||||
|
- `backend/src/services/fileStorageService.ts`: `export const fileStorageService = new FileStorageService()`
|
||||||
|
- `backend/src/services/vectorDatabaseService.ts`: `export const vectorDatabaseService = new VectorDatabaseService()`
|
||||||
|
- Prevents multiple initialization and enables dependency sharing
|
||||||
|
|
||||||
|
**Frontend Context Pattern:**
|
||||||
|
- React Context for auth: `AuthContext` exports `useAuth()` hook
|
||||||
|
- Services pattern: `documentService` contains API methods, used as singleton
|
||||||
|
- No service singletons in frontend (class instances recreated as needed)
|
||||||
|
|
||||||
|
## Deprecated Patterns (DO NOT USE)
|
||||||
|
|
||||||
|
- ❌ Direct PostgreSQL connections - Use Supabase client instead
|
||||||
|
- ❌ JWT authentication - Use Firebase Auth tokens
|
||||||
|
- ❌ `console.log` in production code - Use Winston logger
|
||||||
|
- ❌ Type assertions with `as` for complex types - Use type guards
|
||||||
|
- ❌ Manual error handling without correlation IDs
|
||||||
|
- ❌ Redis caching - Not used in current architecture
|
||||||
|
- ❌ Jest testing - Use Vitest instead
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
*Convention analysis: 2026-02-24*
|
||||||
247
.planning/codebase/INTEGRATIONS.md
Normal file
@@ -0,0 +1,247 @@
|
|||||||
|
# External Integrations
|
||||||
|
|
||||||
|
**Analysis Date:** 2026-02-24
|
||||||
|
|
||||||
|
## APIs & External Services
|
||||||
|
|
||||||
|
**Document Processing:**
|
||||||
|
- Google Document AI
|
||||||
|
- Purpose: OCR and text extraction from PDF documents with entity recognition and table parsing
|
||||||
|
- Client: `@google-cloud/documentai` 9.3.0
|
||||||
|
- Implementation: `backend/src/services/documentAiProcessor.ts`
|
||||||
|
- Auth: Google Application Credentials via `GOOGLE_APPLICATION_CREDENTIALS` or default credentials
|
||||||
|
- Configuration: Processor ID from `DOCUMENT_AI_PROCESSOR_ID`, location from `DOCUMENT_AI_LOCATION` (default: 'us')
|
||||||
|
- Max pages per chunk: 15 pages (configurable)
|
||||||
|
|
||||||
|
**Large Language Models:**
|
||||||
|
- OpenAI
|
||||||
|
- Purpose: LLM analysis of document content, embeddings for vector search
|
||||||
|
- SDK/Client: `openai` 5.10.2
|
||||||
|
- Auth: API key from `OPENAI_API_KEY`
|
||||||
|
- Models: Default `gpt-4-turbo`, embeddings via `text-embedding-3-small`
|
||||||
|
- Implementation: `backend/src/services/llmService.ts` with provider abstraction
|
||||||
|
- Retry: 3 attempts with exponential backoff
|
||||||
|
|
||||||
|
- Anthropic Claude
|
||||||
|
- Purpose: LLM analysis and document summary generation
|
||||||
|
- SDK/Client: `@anthropic-ai/sdk` 0.57.0
|
||||||
|
- Auth: API key from `ANTHROPIC_API_KEY`
|
||||||
|
- Models: Default `claude-sonnet-4-20250514` (configurable via `LLM_MODEL`)
|
||||||
|
- Implementation: `backend/src/services/llmService.ts`
|
||||||
|
- Concurrency: Max 1 concurrent LLM call to prevent rate limiting (Anthropic 429 errors)
|
||||||
|
- Retry: 3 attempts with exponential backoff
|
||||||
|
|
||||||
|
- OpenRouter
|
||||||
|
- Purpose: Alternative LLM provider supporting multiple models through single API
|
||||||
|
- SDK/Client: HTTP requests via `axios` to OpenRouter API
|
||||||
|
- Auth: `OPENROUTER_API_KEY` or optional Bring-Your-Own-Key mode (`OPENROUTER_USE_BYOK`)
|
||||||
|
- Configuration: `LLM_PROVIDER: 'openrouter'` activates this provider
|
||||||
|
- Implementation: `backend/src/services/llmService.ts`
|
||||||
|
|
||||||
|
**File Storage:**
|
||||||
|
- Google Cloud Storage (GCS)
|
||||||
|
- Purpose: Store uploaded PDFs, processed documents, and generated PDFs
|
||||||
|
- SDK/Client: `@google-cloud/storage` 7.16.0
|
||||||
|
- Auth: Google Application Credentials via `GOOGLE_APPLICATION_CREDENTIALS`
|
||||||
|
- Buckets:
|
||||||
|
- Input: `GCS_BUCKET_NAME` for uploaded documents
|
||||||
|
- Output: `DOCUMENT_AI_OUTPUT_BUCKET_NAME` for processing results
|
||||||
|
- Implementation: `backend/src/services/fileStorageService.ts` and `backend/src/services/documentAiProcessor.ts`
|
||||||
|
- Max file size: 100MB (configurable via `MAX_FILE_SIZE`)
|
||||||
|
|
||||||
|
## Data Storage
|
||||||
|
|
||||||
|
**Databases:**
|
||||||
|
- Supabase PostgreSQL
|
||||||
|
- Connection: `SUPABASE_URL` for PostgREST API, `DATABASE_URL` for direct PostgreSQL
|
||||||
|
- Client: `@supabase/supabase-js` 2.53.0 for REST API, `pg` 8.11.3 for direct pool connections
|
||||||
|
- Auth: `SUPABASE_ANON_KEY` for client operations, `SUPABASE_SERVICE_KEY` for server operations
|
||||||
|
- Implementation:
|
||||||
|
- `backend/src/config/supabase.ts` - Client initialization with 30-second request timeout
|
||||||
|
- `backend/src/models/` - All data models (DocumentModel, UserModel, ProcessingJobModel, VectorDatabaseModel)
|
||||||
|
- Vector Support: pgvector extension for semantic search
|
||||||
|
- Tables:
|
||||||
|
- `users` - User accounts and authentication data
|
||||||
|
- `documents` - CIM documents with status tracking
|
||||||
|
- `document_chunks` - Text chunks with embeddings for vector search
|
||||||
|
- `document_feedback` - User feedback on summaries
|
||||||
|
- `document_versions` - Document version history
|
||||||
|
- `document_audit_logs` - Audit trail for compliance
|
||||||
|
- `processing_jobs` - Background job queue with status tracking
|
||||||
|
- `performance_metrics` - System performance data
|
||||||
|
- Connection pooling: Max 5 connections, 30-second idle timeout, 2-second connection timeout
|
||||||
|
|
||||||
|
**Vector Database:**
|
||||||
|
- Supabase pgvector (built into PostgreSQL)
|
||||||
|
- Purpose: Semantic search and RAG context retrieval
|
||||||
|
- Implementation: `backend/src/services/vectorDatabaseService.ts`
|
||||||
|
- Embedding generation: Via OpenAI `text-embedding-3-small` (embedded in service)
|
||||||
|
- Search: Cosine similarity via Supabase RPC calls
|
||||||
|
- Semantic cache: 1-hour TTL for cached embeddings
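
Two ideas in this list, cosine similarity and a 1-hour TTL cache, are easy to show in isolation. In the system described here the similarity search runs inside PostgreSQL via RPC, so `cosineSimilarity` below only illustrates the metric; `TtlCache` is a hypothetical in-memory cache with an injectable clock, not the service's actual cache.

```typescript
// Cosine similarity between two equal-length vectors; returns 0 for a zero vector.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length || a.length === 0) {
    throw new RangeError("vectors must be non-empty and of equal length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Hypothetical TTL cache: entries expire `ttlMs` after they are set.
class TtlCache<V> {
  private readonly entries = new Map<string, { value: V; expiresAt: number }>();
  private readonly ttlMs: number;
  private readonly now: () => number;

  constructor(ttlMs: number = 60 * 60 * 1000, now: () => number = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  set(key: string, value: V): void {
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }

  get(key: string): V | undefined {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (this.now() >= entry.expiresAt) {
      // Expired entries are removed lazily on read.
      this.entries.delete(key);
      return undefined;
    }
    return entry.value;
  }
}
```

Lazy expiry on read avoids a background timer, at the cost of keeping stale entries in memory until they are next requested.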
|
||||||
|
|
||||||
|
**File Storage:**
|
||||||
|
- Google Cloud Storage (primary storage above)
|
||||||
|
- Local filesystem (fallback for development, stored in `uploads/` directory)
|
||||||
|
|
||||||
|
**Caching:**
|
||||||
|
- In-memory semantic cache (Supabase vector embeddings) with 1-hour TTL
|
||||||
|
- No external cache service (Redis, Memcached) currently used
|
||||||
|
|
||||||
|
## Authentication & Identity
|
||||||
|
|
||||||
|
**Auth Provider:**
|
||||||
|
- Firebase Authentication
|
||||||
|
- Purpose: User authentication, ID token (JWT) issuance and verification
|
||||||
|
- Client: `firebase` 12.0.0 (frontend at `frontend/src/config/firebase.ts`)
|
||||||
|
- Admin: `firebase-admin` 13.4.0 (backend at `backend/src/config/firebase.ts`)
|
||||||
|
- Implementation:
|
||||||
|
- Frontend: `frontend/src/services/authService.ts` - Login, logout, token refresh
|
||||||
|
- Backend: `backend/src/middleware/firebaseAuth.ts` - Token verification middleware
|
||||||
|
- Project: `cim-summarizer` (hardcoded in config)
|
||||||
|
- Flow: User logs in with Firebase, receives ID token, frontend sends token in Authorization header
|
||||||
|
|
||||||
|
**Token-Based Auth:**
|
||||||
|
- JWT (JSON Web Tokens)
|
||||||
|
- Purpose: API request authentication
|
||||||
|
- Implementation: `backend/src/middleware/firebaseAuth.ts`
|
||||||
|
- Verification: Firebase Admin SDK verifies token signature and expiration
|
||||||
|
- Header: `Authorization: Bearer <token>`
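
Before the Firebase Admin SDK can verify a token, the middleware has to pull it out of the header. A minimal sketch of that parsing step follows; `extractBearerToken` is a hypothetical name, and the verification call itself is deliberately left out.

```typescript
// Hypothetical helper: return the token from an `Authorization: Bearer <token>` header,
// or null when the header is missing or uses a different scheme.
function extractBearerToken(header: string | undefined): string | null {
  if (!header) return null;
  const match = /^Bearer\s+(\S+)$/i.exec(header.trim());
  return match?.[1] ?? null;
}
```

Returning `null` rather than throwing lets the middleware decide how to respond, for example with a 401 and an `AUTHENTICATION` category error.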
|
||||||
|
|
||||||
|
**Fallback Auth (for service-to-service):**
|
||||||
|
- API Key based (not currently exposed but framework supports it in `backend/src/config/env.ts`)
|
||||||
|
|
||||||
|
## Monitoring & Observability
|
||||||
|
|
||||||
|
**Error Tracking:**
|
||||||
|
- No external error tracking service configured
|
||||||
|
- Errors logged via Winston logger with correlation IDs for tracing
|
||||||
|
|
||||||
|
**Logs:**
|
||||||
|
- Winston logger 3.11.0 - Structured JSON logging at `backend/src/utils/logger.ts`
|
||||||
|
- Transports: Console (development), File-based for production logs
|
||||||
|
- Correlation ID middleware at `backend/src/middleware/errorHandler.ts` - Every request traced
|
||||||
|
- Request logging: Morgan 1.10.0 with Winston transport
|
||||||
|
- Firebase Functions Cloud Logging: Automatic integration for Cloud Functions deployments
|
||||||
|
|
||||||
|
**Monitoring Endpoints:**
|
||||||
|
- `GET /health` - Basic health check with uptime and environment info
|
||||||
|
- `GET /health/config` - Configuration validation status
|
||||||
|
- `GET /health/agentic-rag` - Agentic RAG system health (placeholder)
|
||||||
|
- `GET /monitoring/dashboard` - Aggregated system metrics (queryable by time range)
|
||||||
|
|
||||||
|
## CI/CD & Deployment
|
||||||
|
|
||||||
|
**Hosting:**
|
||||||
|
- **Backend**:
|
||||||
|
- Firebase Cloud Functions (default, Node.js 20 runtime)
|
||||||
|
- Google Cloud Run (alternative containerized deployment)
|
||||||
|
- Configuration: `backend/firebase.json` defines function source, runtime, and predeploy hooks
|
||||||
|
|
||||||
|
- **Frontend**:
|
||||||
|
- Firebase Hosting (CDN-backed static hosting)
|
||||||
|
- Configuration: Defined in `frontend/` directory with `firebase.json`
|
||||||
|
|
||||||
|
**Deployment Commands:**
|
||||||
|
```bash
# Backend deployment
npm run deploy:firebase   # Deploy functions to Firebase
npm run deploy:cloud-run  # Deploy to Cloud Run
npm run docker:build      # Build Docker image
npm run docker:push       # Push to GCR

# Frontend deployment
npm run deploy:firebase   # Deploy to Firebase Hosting
npm run deploy:preview    # Deploy to preview channel

# Emulator
npm run emulator          # Run Firebase emulator locally
npm run emulator:ui       # Run emulator with UI
```
|
||||||
|
|
||||||
|
**Build Pipeline:**
|
||||||
|
- TypeScript compilation: `tsc` targets ES2020
|
||||||
|
- Predeploy: Defined in `firebase.json` - runs `npm run build`
|
||||||
|
- Docker image for Cloud Run: `Dockerfile` in backend root
|
||||||
|
|
||||||
|
## Environment Configuration
|
||||||
|
|
||||||
|
**Required env vars (Production):**
|
||||||
|
```
NODE_ENV=production
LLM_PROVIDER=anthropic
GCLOUD_PROJECT_ID=cim-summarizer
DOCUMENT_AI_PROCESSOR_ID=<processor-id>
GCS_BUCKET_NAME=<bucket-name>
DOCUMENT_AI_OUTPUT_BUCKET_NAME=<output-bucket>
SUPABASE_URL=https://<project>.supabase.co
SUPABASE_ANON_KEY=<anon-key>
SUPABASE_SERVICE_KEY=<service-key>
DATABASE_URL=postgresql://postgres:<password>@aws-0-us-central-1.pooler.supabase.com:6543/postgres
ANTHROPIC_API_KEY=sk-ant-...
OPENAI_API_KEY=sk-...
FIREBASE_PROJECT_ID=cim-summarizer
```
|
||||||
|
|
||||||
|
**Optional env vars:**
|
||||||
|
```
DOCUMENT_AI_LOCATION=us
VECTOR_PROVIDER=supabase
LLM_MODEL=claude-sonnet-4-20250514
LLM_MAX_TOKENS=16000
LLM_TEMPERATURE=0.1
OPENROUTER_API_KEY=<key>
OPENROUTER_USE_BYOK=true
GOOGLE_APPLICATION_CREDENTIALS=/path/to/credentials.json
```
|
||||||
|
|
||||||
|
**Secrets location:**
|
||||||
|
- Development: `.env` file (gitignored, never committed)
|
||||||
|
- Production: Firebase Functions secrets via `firebase functions:secrets:set`
|
||||||
|
- Google Credentials: `backend/serviceAccountKey.json` for local dev, service account in Cloud Functions environment
|
||||||
|
|
||||||
|
## Webhooks & Callbacks
|
||||||
|
|
||||||
|
**Incoming:**
|
||||||
|
- No external webhooks currently configured
|
||||||
|
- All document processing is triggered by an HTTP request to `POST /documents/upload`
|
||||||
|
|
||||||
|
**Outgoing:**
|
||||||
|
- No outgoing webhooks implemented
|
||||||
|
- Document processing is synchronous (within 14-minute Cloud Function timeout) or async via job queue
|
||||||
|
|
||||||
|
**Real-time Monitoring:**
|
||||||
|
- Server-Sent Events (SSE) not implemented
|
||||||
|
- Polling endpoints for progress:
|
||||||
|
- `GET /documents/{id}/progress` - Document processing progress
|
||||||
|
- `GET /documents/queue/status` - Job queue status (frontend polls every 5 seconds)
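
A client-side polling loop against the progress endpoint might look like the sketch below. The response shape (`status`, `percent`), the `pollUntilDone` helper, and the attempt limit are hypothetical; the default interval simply mirrors the 5-second queue polling mentioned above. The HTTP call is injected so the loop itself stays testable.

```typescript
// Hypothetical response shape: the real payload of
// GET /documents/{id}/progress is not documented in this section.
interface ProgressResponse {
  status: "processing" | "completed" | "failed";
  percent: number;
}

// The caller supplies the actual HTTP request (e.g. via axios or fetch).
type FetchProgress = (documentId: string) => Promise<ProgressResponse>;

const sleep = (ms: number) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Polls until the document leaves the "processing" state or attempts run out.
async function pollUntilDone(
  documentId: string,
  fetchProgress: FetchProgress,
  intervalMs = 5000,
  maxAttempts = 120,
): Promise<ProgressResponse> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const progress = await fetchProgress(documentId);
    if (progress.status !== "processing") return progress;
    await sleep(intervalMs);
  }
  throw new Error(
    `Gave up polling document ${documentId} after ${maxAttempts} attempts`,
  );
}
```

Bounding the number of attempts avoids a browser tab polling forever if a job is stuck.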
|
||||||
|
|
||||||
|
## Rate Limiting & Quotas
|
||||||
|
|
||||||
|
**API Rate Limits:**
|
||||||
|
- Express rate limiter: 1000 requests per 15 minutes per IP
|
||||||
|
- LLM provider limits: Anthropic limited to 1 concurrent call (application-level throttling)
|
||||||
|
- OpenAI rate limits: Handled by SDK with backoff
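
Application-level throttling to a single concurrent Anthropic call can be pictured as a small promise-based limiter. The `ConcurrencyLimiter` class below is a hypothetical sketch under that assumption; the project's actual throttling code is not shown in this document.

```typescript
// Hypothetical limiter: allows at most `maxConcurrent` tasks in flight.
class ConcurrencyLimiter {
  private active = 0;
  private readonly waiting: Array<() => void> = [];
  private readonly maxConcurrent: number;

  constructor(maxConcurrent: number) {
    this.maxConcurrent = maxConcurrent;
  }

  async run<T>(task: () => Promise<T>): Promise<T> {
    if (this.active >= this.maxConcurrent) {
      // Park the caller until a finishing task hands over its slot.
      await new Promise<void>((resolve) => this.waiting.push(resolve));
    } else {
      this.active++;
    }
    try {
      return await task();
    } finally {
      const next = this.waiting.shift();
      if (next) {
        next(); // slot is transferred, so `active` stays unchanged
      } else {
        this.active--;
      }
    }
  }
}

// One shared instance limited to a single in-flight call.
const anthropicLimiter = new ConcurrencyLimiter(1);
```

Every LLM request would then be wrapped as `anthropicLimiter.run(() => callModel(...))`, where `callModel` stands in for whatever function issues the request.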
|
||||||
|
|
||||||
|
**File Upload Limits:**
|
||||||
|
- Max file size: 100MB (configurable via `MAX_FILE_SIZE`)
|
||||||
|
- Allowed MIME types: `application/pdf` (configurable via `ALLOWED_FILE_TYPES`)
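
The two upload limits can be expressed as a small pre-check. In this sketch the formats of `MAX_FILE_SIZE` (a byte count) and `ALLOWED_FILE_TYPES` (a comma-separated list) are assumptions, as are the helper names and the reading of 100MB as 100 * 1024 * 1024 bytes.

```typescript
interface UploadLimits {
  maxFileSizeBytes: number;
  allowedMimeTypes: string[];
}

// Assumption: MAX_FILE_SIZE is in bytes, ALLOWED_FILE_TYPES is comma-separated.
function readUploadLimits(env: Record<string, string | undefined>): UploadLimits {
  const parsedSize = Number(env.MAX_FILE_SIZE);
  return {
    maxFileSizeBytes:
      Number.isFinite(parsedSize) && parsedSize > 0
        ? parsedSize
        : 100 * 1024 * 1024, // documented default of 100MB
    allowedMimeTypes: (env.ALLOWED_FILE_TYPES ?? "application/pdf")
      .split(",")
      .map((type) => type.trim())
      .filter((type) => type.length > 0),
  };
}

// Returns an error message, or null when the file passes both checks.
function validateUpload(
  file: { size: number; mimeType: string },
  limits: UploadLimits,
): string | null {
  if (file.size > limits.maxFileSizeBytes) {
    return `File exceeds the ${limits.maxFileSizeBytes}-byte limit`;
  }
  if (!limits.allowedMimeTypes.includes(file.mimeType)) {
    return `MIME type ${file.mimeType} is not allowed`;
  }
  return null;
}
```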
|
||||||
|
|
||||||
|
## Network Configuration
|
||||||
|
|
||||||
|
**CORS Origins (Allowed):**
|
||||||
|
- `https://cim-summarizer.web.app` (production)
|
||||||
|
- `https://cim-summarizer.firebaseapp.com` (production)
|
||||||
|
- `http://localhost:3000` (development)
|
||||||
|
- `http://localhost:5173` (development)
|
||||||
|
- `https://localhost:3000` (SSL local dev)
|
||||||
|
- `https://localhost:5173` (SSL local dev)
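
The allowlist above reduces to a set lookup. The sketch below is illustrative only: `isOriginAllowed` and `corsOriginCallback` are hypothetical names, and allowing requests without an `Origin` header is an assumption. The callback has the `(origin, callback)` shape accepted by the `origin` option of the `cors` middleware.

```typescript
// Origins copied from the list above.
const ALLOWED_ORIGINS = new Set<string>([
  "https://cim-summarizer.web.app",
  "https://cim-summarizer.firebaseapp.com",
  "http://localhost:3000",
  "http://localhost:5173",
  "https://localhost:3000",
  "https://localhost:5173",
]);

// Assumption: requests with no Origin header (curl, same-origin) are allowed.
function isOriginAllowed(origin: string | undefined): boolean {
  if (origin === undefined) return true;
  return ALLOWED_ORIGINS.has(origin);
}

// Matches the function form of the `origin` option in the `cors` package.
function corsOriginCallback(
  origin: string | undefined,
  callback: (err: Error | null, allow?: boolean) => void,
): void {
  if (isOriginAllowed(origin)) {
    callback(null, true);
  } else {
    callback(new Error(`Origin ${origin} is not allowed by CORS`));
  }
}
```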
|
||||||
|
|
||||||
|
**Port Mappings:**
|
||||||
|
- Frontend dev: Port 5173 (Vite dev server)
|
||||||
|
- Backend dev: Port 5001 (Firebase Functions emulator)
|
||||||
|
- Backend API: Port 5000 (Express in standard deployment)
|
||||||
|
- Vite proxy to backend: `/api` routes proxied from port 5173 to `http://localhost:5000`
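
The proxy rule corresponds to the `server.proxy` section of a Vite config. The object below is a sketch of that section as plain data; in a real `vite.config.ts` it would be passed to `defineConfig` from `vite`. The `changeOrigin` flag is an assumption and is not stated in this document.

```typescript
// Plain-object sketch of the dev-server section of a Vite config.
const devServerConfig = {
  server: {
    port: 5173, // Vite dev server port
    proxy: {
      // Requests to /api on port 5173 are forwarded to the Express backend.
      "/api": {
        target: "http://localhost:5000",
        changeOrigin: true, // assumption: not confirmed by this document
      },
    },
  },
};
```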
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
*Integration audit: 2026-02-24*
|
||||||
148
.planning/codebase/STACK.md
Normal file
@@ -0,0 +1,148 @@
|
|||||||
|
# Technology Stack
|
||||||
|
|
||||||
|
**Analysis Date:** 2026-02-24
|
||||||
|
|
||||||
|
## Languages
|
||||||
|
|
||||||
|
**Primary:**
|
||||||
|
- TypeScript 5.2.2 - Both backend and frontend, strict mode enabled
|
||||||
|
- JavaScript (CommonJS) - Build outputs and configuration
|
||||||
|
|
||||||
|
**Supporting:**
|
||||||
|
- SQL - Supabase PostgreSQL database via migrations in `backend/src/models/migrations/`
|
||||||
|
|
||||||
|
## Runtime
|
||||||
|
|
||||||
|
**Environment:**
|
||||||
|
- Node.js 20 (specified in `backend/firebase.json`)
|
||||||
|
- Browser (ES2020 target for both client and server)
|
||||||
|
|
||||||
|
**Package Manager:**
|
||||||
|
- npm - Primary package manager for both backend and frontend
|
||||||
|
- Lockfile: `package-lock.json` present in both `backend/` and `frontend/`
|
||||||
|
|
||||||
|
## Frameworks
|
||||||
|
|
||||||
|
**Backend - Core:**
|
||||||
|
- Express.js 4.18.2 - HTTP server and REST API framework at `backend/src/index.ts`
|
||||||
|
- Firebase Admin SDK 13.4.0 - Authentication and service account management at `backend/src/config/firebase.ts`
|
||||||
|
- Firebase Functions 6.4.0 - Cloud Functions deployment runtime (emulator on port 5001)
|
||||||
|
|
||||||
|
**Frontend - Core:**
|
||||||
|
- React 18.2.0 - UI framework with TypeScript support
|
||||||
|
- Vite 4.5.0 - Build tool and dev server (port 5173 for dev, port 3000 production)
|
||||||
|
|
||||||
|
**Backend - Testing:**
|
||||||
|
- Vitest 2.1.0 - Test runner with v8 coverage provider at `backend/vitest.config.ts`
|
||||||
|
- Configuration: Global test environment set to 'node', 30-second test timeout
|
||||||
|
|
||||||
|
**Backend - Build/Dev:**
|
||||||
|
- ts-node 10.9.2 - TypeScript execution for scripts
|
||||||
|
- ts-node-dev 2.0.0 - Live reload development server with `--transpile-only` flag
|
||||||
|
- TypeScript Compiler (tsc) 5.2.2 - Strict type checking, ES2020 target
|
||||||
|
|
||||||
|
**Frontend - Build/Dev:**
|
||||||
|
- Vite React plugin 4.1.1 - React JSX transformation
|
||||||
|
- TailwindCSS 3.3.5 - Utility-first CSS framework with PostCSS 8.4.31
|
||||||
|
|
||||||
|
## Key Dependencies
|
||||||
|
|
||||||
|
**Critical Infrastructure:**
|
||||||
|
- `@google-cloud/documentai` 9.3.0 - Google Document AI OCR/text extraction at `backend/src/services/documentAiProcessor.ts`
|
||||||
|
- `@google-cloud/storage` 7.16.0 - Google Cloud Storage (GCS) for file uploads and processing
|
||||||
|
- `@supabase/supabase-js` 2.53.0 - PostgreSQL database client with vector support at `backend/src/config/supabase.ts`
|
||||||
|
- `pg` 8.11.3 - Direct PostgreSQL connection pool for critical operations bypassing PostgREST
|
||||||
|
|
||||||
|
**LLM & AI:**
|
||||||
|
- `@anthropic-ai/sdk` 0.57.0 - Claude API client for the Anthropic provider
|
||||||
|
- `openai` 5.10.2 - OpenAI API and embeddings (text-embedding-3-small)
|
||||||
|
- Both providers abstracted via `backend/src/services/llmService.ts`
|
||||||
|
|
||||||
|
**PDF Processing:**
|
||||||
|
- `pdf-lib` 1.17.1 - PDF generation and manipulation at `backend/src/services/pdfGenerationService.ts`
|
||||||
|
- `pdf-parse` 1.1.1 - PDF text extraction
|
||||||
|
- `pdfkit` 0.17.1 - PDF document creation
|
||||||
|
|
||||||
|
**Document Processing:**
|
||||||
|
- `puppeteer` 21.11.0 - Headless Chrome for HTML/PDF conversion
|
||||||
|
|
||||||
|
**Security & Authentication:**
|
||||||
|
- `firebase` 12.0.0 (frontend) - Firebase client SDK for authentication at `frontend/src/config/firebase.ts`
|
||||||
|
- `firebase-admin` 13.4.0 (backend) - Admin SDK for token verification at `backend/src/middleware/firebaseAuth.ts`
|
||||||
|
- `jsonwebtoken` 9.0.2 - JWT token creation and verification
|
||||||
|
- `bcryptjs` 2.4.3 - Password hashing (12 rounds by default)
|
||||||
|
|
||||||
|
**API & HTTP:**
|
||||||
|
- `axios` 1.11.0 - HTTP client for both frontend and backend
|
||||||
|
- `cors` 2.8.5 - Cross-Origin Resource Sharing middleware for Express
|
||||||
|
- `helmet` 7.1.0 - Security headers middleware
|
||||||
|
- `morgan` 1.10.0 - HTTP request logging middleware
|
||||||
|
- `express-rate-limit` 7.1.5 - Rate limiting middleware (1000 requests per 15 minutes)
|
||||||
|
|
||||||
|
**Data Validation & Schema:**
|
||||||
|
- `zod` 3.25.76 - TypeScript-first schema validation at `backend/src/services/llmSchemas.ts`
|
||||||
|
- `zod-to-json-schema` 3.24.6 - Convert Zod schemas to JSON Schema for LLM structured output
|
||||||
|
- `joi` 17.11.0 - Environment variable validation in `backend/src/config/env.ts`
|
||||||
|
|
||||||
|
**Logging & Monitoring:**
|
||||||
|
- `winston` 3.11.0 - Structured logging framework with multiple transports at `backend/src/utils/logger.ts`
|
||||||
|
|
||||||
|
**Frontend - UI Components:**
|
||||||
|
- `lucide-react` 0.294.0 - Icon library
|
||||||
|
- `react-dom` 18.2.0 - React rendering for web
|
||||||
|
- `react-router-dom` 6.20.1 - Client-side routing
|
||||||
|
- `react-dropzone` 14.3.8 - File upload handling
|
||||||
|
- `clsx` 2.0.0 - Conditional className utility
|
||||||
|
- `tailwind-merge` 2.0.0 - Merge Tailwind classes with conflict resolution
|
||||||
|
|
||||||
|
**Utilities:**
|
||||||
|
- `uuid` 11.1.0 - Unique identifier generation
|
||||||
|
- `dotenv` 16.3.1 - Environment variable loading from `.env` files
|
||||||
|
|
||||||
|
## Configuration
|
||||||
|
|
||||||
|
**Environment:**
|
||||||
|
- **.env file support** - Dotenv loads from `.env` for local development in `backend/src/config/env.ts`
|
||||||
|
- **Environment validation** - Joi schema at `backend/src/config/env.ts` validates all required/optional env vars
|
||||||
|
- **Firebase Functions v2** - Uses `defineString()` and `defineSecret()` for secure configuration (migrated from the v1 `functions.config()` API)
|
||||||
|
|
||||||
|
**Key Configuration Variables (Backend):**
|
||||||
|
- `NODE_ENV` - 'development' | 'production' | 'test'
|
||||||
|
- `LLM_PROVIDER` - 'openai' | 'anthropic' | 'openrouter' (default: 'openai')
|
||||||
|
- `GCLOUD_PROJECT_ID` - Google Cloud project ID (required)
|
||||||
|
- `DOCUMENT_AI_PROCESSOR_ID` - Document AI processor ID (required)
|
||||||
|
- `GCS_BUCKET_NAME` - Google Cloud Storage bucket (required)
|
||||||
|
- `SUPABASE_URL`, `SUPABASE_ANON_KEY`, `SUPABASE_SERVICE_KEY` - Supabase PostgreSQL connection
|
||||||
|
- `DATABASE_URL` - Direct PostgreSQL connection string for bypass operations
|
||||||
|
- `OPENAI_API_KEY` - OpenAI API key for embeddings and models
|
||||||
|
- `ANTHROPIC_API_KEY` - Anthropic Claude API key
|
||||||
|
- `OPENROUTER_API_KEY` - OpenRouter API key (optional, uses BYOK with Anthropic key)
|
||||||
|
|
||||||
|
**Key Configuration Variables (Frontend):**
|
||||||
|
- `VITE_API_BASE_URL` - Backend API endpoint
|
||||||
|
- `VITE_FIREBASE_*` - Firebase configuration (API key, auth domain, project ID, etc.)
|
||||||
|
|
||||||
|
**Build Configuration:**
|
||||||
|
- **Backend**: `backend/tsconfig.json` - Strict TypeScript, CommonJS module output, ES2020 target
|
||||||
|
- **Frontend**: `frontend/tsconfig.json` - ES2020 target, JSX React support, path alias `@/*`
|
||||||
|
- **Firebase**: `backend/firebase.json` - Node.js 20 runtime, Firebase Functions emulator on port 5001
|
||||||
|
|
||||||
|
## Platform Requirements
|
||||||
|
|
||||||
|
**Development:**
|
||||||
|
- Node.js 20.x
|
||||||
|
- npm 9+
|
||||||
|
- Google Cloud credentials (for Document AI and GCS)
|
||||||
|
- Firebase project credentials (service account key)
|
||||||
|
- Supabase project URL and keys
|
||||||
|
|
||||||
|
**Production:**
|
||||||
|
- **Backend**: Firebase Cloud Functions (Node.js 20 runtime) or Google Cloud Run
|
||||||
|
- **Frontend**: Firebase Hosting (CDN-backed static hosting)
|
||||||
|
- **Database**: Supabase PostgreSQL with pgvector extension for vector search
|
||||||
|
- **Storage**: Google Cloud Storage for documents and generated PDFs
|
||||||
|
- **Memory Limits**: Backend configured with `--max-old-space-size=8192` for large document processing
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
*Stack analysis: 2026-02-24*
|
||||||
374
.planning/codebase/STRUCTURE.md
Normal file
@@ -0,0 +1,374 @@
|
|||||||
|
# Codebase Structure
|
||||||
|
|
||||||
|
**Analysis Date:** 2026-02-24
|
||||||
|
|
||||||
|
## Directory Layout
|
||||||
|
|
||||||
|
```
cim_summary/
├── backend/                              # Express.js + TypeScript backend (Node.js)
│   ├── src/
│   │   ├── index.ts                      # Express app + Firebase Functions exports
│   │   ├── controllers/                  # Request handlers
│   │   ├── models/                       # Database access + schema
│   │   ├── services/                     # Business logic + external integrations
│   │   ├── routes/                       # Express route definitions
│   │   ├── middleware/                   # Express middleware (auth, validation, error)
│   │   ├── config/                       # Configuration (env, firebase, supabase)
│   │   ├── utils/                        # Utilities (logger, validation, parsing)
│   │   ├── types/                        # TypeScript type definitions
│   │   ├── scripts/                      # One-off CLI scripts (diagnostics, setup)
│   │   ├── assets/                       # Static assets (HTML templates)
│   │   └── __tests__/                    # Test suites (unit, integration, acceptance)
│   ├── package.json                      # Node dependencies
│   ├── tsconfig.json                     # TypeScript config
│   ├── .eslintrc.json                    # ESLint config
│   └── dist/                             # Compiled JavaScript (generated)
│
├── frontend/                             # React + Vite + TypeScript frontend
│   ├── src/
│   │   ├── main.tsx                      # React entry point
│   │   ├── App.tsx                       # Root component with routing
│   │   ├── components/                   # React components (UI)
│   │   ├── services/                     # API clients (documentService, authService)
│   │   ├── contexts/                     # React Context (AuthContext)
│   │   ├── config/                       # Configuration (env, firebase)
│   │   ├── types/                        # TypeScript interfaces
│   │   ├── utils/                        # Utilities (validation, cn, auth debug)
│   │   └── assets/                       # Static images and icons
│   ├── package.json                      # Node dependencies
│   ├── tsconfig.json                     # TypeScript config
│   ├── vite.config.ts                    # Vite bundler config
│   ├── eslintrc.json                     # ESLint config
│   ├── tailwind.config.js                # Tailwind CSS config
│   ├── postcss.config.js                 # PostCSS config
│   └── dist/                             # Built static assets (generated)
│
├── .planning/                            # GSD planning directory
│   └── codebase/                         # Codebase analysis documents
│
├── package.json                          # Monorepo root package (if used)
├── .git/                                 # Git repository
├── .gitignore                            # Git ignore rules
├── .cursorrules                          # Cursor IDE configuration
├── README.md                             # Project overview
├── CONFIGURATION_GUIDE.md                # Setup instructions
├── CODEBASE_ARCHITECTURE_SUMMARY.md      # Existing architecture notes
└── [PDF documents]                       # Sample CIM documents for testing
```
|
||||||
|
|
||||||
|
## Directory Purposes
|
||||||
|
|
||||||
|
**backend/src/:**
|
||||||
|
- Purpose: All backend server code
|
||||||
|
- Contains: TypeScript source files
|
||||||
|
- Key files: `index.ts` (main app), routes, controllers, services, models
|
||||||
|
|
||||||
|
**backend/src/controllers/:**
|
||||||
|
- Purpose: HTTP request handlers
|
||||||
|
- Contains: `documentController.ts`, `authController.ts`
|
||||||
|
- Functions: Map HTTP requests to service calls, handle validation, construct responses
|
||||||
|
|
||||||
|
**backend/src/services/:**
|
||||||
|
- Purpose: Business logic and external integrations
|
||||||
|
- Contains: Document processing, LLM integration, file storage, database, job queue
|
||||||
|
- Key files:
|
||||||
|
- `unifiedDocumentProcessor.ts` - Orchestrator, strategy selection
|
||||||
|
- `singlePassProcessor.ts` - 2-LLM extraction (current default)
|
||||||
|
- `optimizedAgenticRAGProcessor.ts` - Advanced agentic processing (stub)
|
||||||
|
- `documentAiProcessor.ts` - Google Document AI OCR
|
||||||
|
- `llmService.ts` - LLM API calls (Anthropic/OpenAI/OpenRouter)
|
||||||
|
- `jobQueueService.ts` - Async job queue (in-memory, EventEmitter)
|
||||||
|
- `jobProcessorService.ts` - Dequeue and execute jobs
|
||||||
|
- `fileStorageService.ts` - GCS signed URLs and upload
|
||||||
|
- `vectorDatabaseService.ts` - Supabase pgvector operations
|
||||||
|
- `pdfGenerationService.ts` - Puppeteer PDF rendering
|
||||||
|
- `uploadProgressService.ts` - Track upload status
|
||||||
|
- `uploadMonitoringService.ts` - Monitor processing progress
|
||||||
|
- `llmSchemas.ts` - Zod schemas for LLM extraction (CIMReview, financial data)
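
The note that `jobQueueService.ts` is an in-memory, EventEmitter-based queue can be illustrated with a minimal sketch. The `InMemoryJobQueue` class, the `Job` fields, and the event names `job:added` and `job:started` are all hypothetical and are not taken from the real service.

```typescript
import { EventEmitter } from "node:events";

// Hypothetical job shape for illustration.
interface Job {
  id: string;
  documentId: string;
}

// Jobs live only in process memory, so they are lost on restart.
class InMemoryJobQueue extends EventEmitter {
  private readonly pending: Job[] = [];
  private nextId = 1;

  // Adds a job to the tail of the queue and notifies listeners.
  enqueue(documentId: string): Job {
    const job: Job = { id: `job-${this.nextId++}`, documentId };
    this.pending.push(job);
    this.emit("job:added", job);
    return job;
  }

  // Removes the oldest job (FIFO) and notifies listeners that it started.
  dequeue(): Job | undefined {
    const job = this.pending.shift();
    if (job) this.emit("job:started", job);
    return job;
  }

  get size(): number {
    return this.pending.length;
  }
}
```

A processor in the role of `jobProcessorService.ts` would subscribe to the added event and call `dequeue()` whenever it has spare capacity.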
|
||||||
|
|
||||||
|
**backend/src/models/:**
|
||||||
|
- Purpose: Database access layer and schema definitions
|
||||||
|
- Contains: Document, User, ProcessingJob, Feedback models
|
||||||
|
- Key files:
|
||||||
|
- `types.ts` - TypeScript interfaces (Document, ProcessingJob, ProcessingStatus)
|
||||||
|
- `DocumentModel.ts` - Document CRUD with retry logic
|
||||||
|
- `ProcessingJobModel.ts` - Job tracking in database
|
||||||
|
- `UserModel.ts` - User management
|
||||||
|
- `VectorDatabaseModel.ts` - Vector embedding queries
|
||||||
|
- `migrate.ts` - Database migrations
|
||||||
|
- `seed.ts` - Test data seeding
|
||||||
|
- `migrations/` - SQL migration files
|
||||||
|
|
||||||
|
**backend/src/routes/:**
|
||||||
|
- Purpose: Express route definitions
|
||||||
|
- Contains: Route handlers and middleware bindings
|
||||||
|
- Key files:
|
||||||
|
- `documents.ts` - GET/POST/PUT/DELETE document endpoints
|
||||||
|
- `vector.ts` - Vector search endpoints
|
||||||
|
- `monitoring.ts` - Health and status endpoints
|
||||||
|
- `documentAudit.ts` - Audit log endpoints
|
||||||
|
|
||||||
|
**backend/src/middleware/:**
|
||||||
|
- Purpose: Express middleware for cross-cutting concerns
|
||||||
|
- Contains: Authentication, validation, error handling
|
||||||
|
- Key files:
|
||||||
|
- `firebaseAuth.ts` - Firebase ID token verification
|
||||||
|
- `errorHandler.ts` - Global error handling + correlation ID
|
||||||
|
- `notFoundHandler.ts` - 404 handler
|
||||||
|
- `validation.ts` - Request validation (UUID, pagination)
|
||||||
|
|
||||||
|
**backend/src/config/:**
|
||||||
|
- Purpose: Configuration and initialization
|
||||||
|
- Contains: Environment setup, service initialization
|
||||||
|
- Key files:
|
||||||
|
- `env.ts` - Environment variable validation (Joi schema)
|
||||||
|
- `firebase.ts` - Firebase Admin SDK initialization
|
||||||
|
- `supabase.ts` - Supabase client and pool setup
|
||||||
|
- `database.ts` - PostgreSQL connection (legacy)
|
||||||
|
- `errorConfig.ts` - Error handling config
|
||||||
|
|
||||||
|
**backend/src/utils/:**
|
||||||
|
- Purpose: Shared utility functions
|
||||||
|
- Contains: Logging, validation, parsing
|
||||||
|
- Key files:
|
||||||
|
- `logger.ts` - Winston logger setup (console + file transports)
|
||||||
|
- `validation.ts` - UUID and pagination validators
|
||||||
|
- `googleServiceAccount.ts` - Google Cloud credentials resolution
|
||||||
|
- `financialExtractor.ts` - Financial data parsing (deprecated for single-pass)
|
||||||
|
- `templateParser.ts` - CIM template utilities
|
||||||
|
- `auth.ts` - Authentication helpers
|
||||||
|
|
||||||
|
**backend/src/scripts/:**
|
||||||
|
- Purpose: One-off CLI scripts for diagnostics and setup
|
||||||
|
- Contains: Database setup, testing, monitoring
|
||||||
|
- Key files:
|
||||||
|
- `setup-database.ts` - Initialize database schema
|
||||||
|
- `monitor-document-processing.ts` - Watch job queue status
|
||||||
|
- `check-current-job.ts` - Debug stuck jobs
|
||||||
|
- `test-full-llm-pipeline.ts` - End-to-end testing
|
||||||
|
- `comprehensive-diagnostic.ts` - System health check
|
||||||
|
|
||||||
|
**backend/src/__tests__/:**
|
||||||
|
- Purpose: Test suites
|
||||||
|
- Contains: Unit, integration, acceptance tests
|
||||||
|
- Subdirectories:
|
||||||
|
- `unit/` - Isolated component tests
|
||||||
|
- `integration/` - Multi-component tests
|
||||||
|
- `acceptance/` - End-to-end flow tests
|
||||||
|
- `mocks/` - Mock data and fixtures
|
||||||
|
- `utils/` - Test utilities
|
||||||
|
|
||||||
|
**frontend/src/:**
|
||||||
|
- Purpose: All frontend code
|
||||||
|
- Contains: React components, services, types
|
||||||
|
|
||||||
|
**frontend/src/components/:**
|
||||||
|
- Purpose: React UI components
|
||||||
|
- Contains: Page components, reusable widgets
|
||||||
|
- Key files:
|
||||||
|
- `DocumentUpload.tsx` - File upload UI with drag-and-drop
|
||||||
|
- `DocumentList.tsx` - List of processed documents
|
||||||
|
- `DocumentViewer.tsx` - View and edit extracted data
|
||||||
|
- `ProcessingProgress.tsx` - Real-time processing status
|
||||||
|
- `UploadMonitoringDashboard.tsx` - Admin view of active jobs
|
||||||
|
- `LoginForm.tsx` - Firebase auth login UI
|
||||||
|
- `ProtectedRoute.tsx` - Route guard for authenticated pages
|
||||||
|
- `Analytics.tsx` - Document analytics and statistics
|
||||||
|
- `CIMReviewTemplate.tsx` - Display extracted CIM review data
|
||||||
|
|
||||||
|
**frontend/src/services/:**
|
||||||
|
- Purpose: API clients and external service integration
|
||||||
|
- Contains: HTTP clients for backend
|
||||||
|
- Key files:
|
||||||
|
- `documentService.ts` - Document API calls (upload, list, process, status)
|
||||||
|
- `authService.ts` - Firebase authentication (login, logout, token)
|
||||||
|
- `adminService.ts` - Admin-only operations
|
||||||
|
|
||||||
|
**frontend/src/contexts/:**
|
||||||
|
- Purpose: React Context for global state
|
||||||
|
- Contains: AuthContext for user and authentication state
|
||||||
|
- Key files:
|
||||||
|
- `AuthContext.tsx` - User, token, login/logout state
|
||||||
|
|
||||||
|
**frontend/src/config/:**
|
||||||
|
- Purpose: Configuration
|
||||||
|
- Contains: Environment variables, Firebase setup
|
||||||
|
- Key files:
|
||||||
|
- `env.ts` - VITE_API_BASE_URL and other env vars
|
||||||
|
- `firebase.ts` - Firebase client initialization
|
||||||
|
|
||||||
|
**frontend/src/types/:**
|
||||||
|
- Purpose: TypeScript interfaces
|
||||||
|
- Contains: API response types, component props
|
||||||
|
- Key files:
|
||||||
|
- `auth.ts` - User, LoginCredentials, AuthContextType
|
||||||
|
|
||||||
|
**frontend/src/utils/:**
|
||||||
|
- Purpose: Shared utility functions
|
||||||
|
- Contains: Validation, CSS utilities
|
||||||
|
- Key files:
|
||||||
|
- `validation.ts` - Email, password validators
|
||||||
|
- `cn.ts` - Classname merger (clsx wrapper)
|
||||||
|
- `authDebug.ts` - Authentication debugging helpers
|
||||||
|
|
||||||
|
## Key File Locations
|
||||||
|
|
||||||
|
**Entry Points:**
|
||||||
|
- `backend/src/index.ts` - Main Express app and Firebase Functions exports
|
||||||
|
- `frontend/src/main.tsx` - React entry point
|
||||||
|
- `frontend/src/App.tsx` - Root component with routing
|
||||||
|
|
||||||
|
**Configuration:**
|
||||||
|
- `backend/src/config/env.ts` - Environment variable schema and validation
|
||||||
|
- `backend/src/config/firebase.ts` - Firebase Admin SDK setup
|
||||||
|
- `backend/src/config/supabase.ts` - Supabase client and connection pool
|
||||||
|
- `frontend/src/config/firebase.ts` - Firebase client configuration
|
||||||
|
- `frontend/src/config/env.ts` - Frontend environment variables
|
||||||
|
|
||||||
|
**Core Logic:**
|
||||||
|
- `backend/src/services/unifiedDocumentProcessor.ts` - Main document processing orchestrator
|
||||||
|
- `backend/src/services/singlePassProcessor.ts` - Single-pass 2-LLM strategy
|
||||||
|
- `backend/src/services/llmService.ts` - LLM API integration with retry
|
||||||
|
- `backend/src/services/jobQueueService.ts` - Background job queue
|
||||||
|
- `backend/src/services/vectorDatabaseService.ts` - Vector search implementation
|
||||||
|
|
||||||
|
**Testing:**
|
||||||
|
- `backend/src/__tests__/unit/` - Unit tests
|
||||||
|
- `backend/src/__tests__/integration/` - Integration tests
|
||||||
|
- `backend/src/__tests__/acceptance/` - End-to-end tests
|
||||||
|
|
||||||
|
**Database:**
|
||||||
|
- `backend/src/models/types.ts` - TypeScript type definitions
|
||||||
|
- `backend/src/models/DocumentModel.ts` - Document CRUD operations
|
||||||
|
- `backend/src/models/ProcessingJobModel.ts` - Job tracking
|
||||||
|
- `backend/src/models/migrations/` - SQL migration files
|
||||||
|
|
||||||
|
**Middleware:**
|
||||||
|
- `backend/src/middleware/firebaseAuth.ts` - Firebase ID token (JWT) verification
|
||||||
|
- `backend/src/middleware/errorHandler.ts` - Global error handling
|
||||||
|
- `backend/src/middleware/validation.ts` - Input validation
|
||||||
|
|
||||||
|
**Logging:**
|
||||||
|
- `backend/src/utils/logger.ts` - Winston logger configuration
|
||||||
|
|
||||||
|
## Naming Conventions
|
||||||
|
|
||||||
|
**Files:**
|
||||||
|
- Controllers: `{resource}Controller.ts` (e.g., `documentController.ts`)
|
||||||
|
- Services: `{service}Service.ts` or descriptive (e.g., `llmService.ts`, `singlePassProcessor.ts`)
|
||||||
|
- Models: `{Entity}Model.ts` (e.g., `DocumentModel.ts`)
|
||||||
|
- Routes: `{resource}.ts` (e.g., `documents.ts`)
|
||||||
|
- Middleware: `{purpose}Handler.ts` or `{purpose}.ts` (e.g., `firebaseAuth.ts`)
|
||||||
|
- Types/Interfaces: `types.ts` or `{name}Types.ts`
|
||||||
|
- Tests: `{file}.test.ts` or `{file}.spec.ts`
|
||||||
|
|
||||||
|
**Directories:**
|
||||||
|
- Plural for collections: `services/`, `models/`, `utils/`, `routes/`, `controllers/`, `types/`, `contexts/`
|
||||||
|
- Singular for specific features: `config/`, `middleware/`
|
||||||
|
- Nested by feature in larger directories: `__tests__/unit/`, `models/migrations/`
|
||||||
|
|
||||||
|
**Functions/Variables:**
|
||||||
|
- Camel case: `processDocument()`, `getUserId()`, `documentId`
|
||||||
|
- Constants: UPPER_SNAKE_CASE: `MAX_RETRIES`, `TIMEOUT_MS`
|
||||||
|
- Private methods: Prefix with `_` or use TypeScript `private`: `_retryOperation()`
|
||||||
|
|
||||||
|
**Classes:**
|
||||||
|
- Pascal case: `DocumentModel`, `JobQueueService`, `SinglePassProcessor`
|
||||||
|
- Service instances exported as singletons: `export const llmService = new LLMService()`
|
||||||
|
|
||||||
|
**React Components:**
|
||||||
|
- Pascal case: `DocumentUpload.tsx`, `ProtectedRoute.tsx`
|
||||||
|
- Hooks: `use{Feature}` (e.g., `useAuth` from AuthContext)
|
||||||
|
|
||||||
|
## Where to Add New Code
|
||||||
|
|
||||||
|
**New Document Processing Strategy:**
|
||||||
|
- Primary code: `backend/src/services/{strategyName}Processor.ts`
|
||||||
|
- Schema: Add types to `backend/src/services/llmSchemas.ts`
|
||||||
|
- Integration: Register in `backend/src/services/unifiedDocumentProcessor.ts`
|
||||||
|
- Tests: `backend/src/__tests__/integration/{strategyName}.test.ts`
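
The steps for adding a strategy can be sketched as an interface plus a registry. Everything here is hypothetical: the real contract in `unifiedDocumentProcessor.ts` is not shown in this document, and `DocumentProcessorStrategy`, `StrategyRegistry`, and the toy `wordCountProcessor` exist only to show the shape of the registration step.

```typescript
// Hypothetical result and strategy contracts.
interface ProcessingResult {
  strategy: string;
  summary: string;
}

interface DocumentProcessorStrategy {
  readonly name: string;
  process(documentText: string): Promise<ProcessingResult>;
}

// Stands in for the orchestrator's strategy selection.
class StrategyRegistry {
  private readonly strategies = new Map<string, DocumentProcessorStrategy>();

  register(strategy: DocumentProcessorStrategy): void {
    if (this.strategies.has(strategy.name)) {
      throw new Error(`Strategy already registered: ${strategy.name}`);
    }
    this.strategies.set(strategy.name, strategy);
  }

  resolve(name: string): DocumentProcessorStrategy {
    const strategy = this.strategies.get(name);
    if (!strategy) throw new Error(`Unknown strategy: ${name}`);
    return strategy;
  }
}

// Toy strategy living in its own {strategyName}Processor.ts file.
const wordCountProcessor: DocumentProcessorStrategy = {
  name: "word_count",
  async process(documentText: string): Promise<ProcessingResult> {
    const words = documentText.split(/\s+/).filter(Boolean).length;
    return { strategy: "word_count", summary: `${words} words` };
  },
};
```

Keeping each strategy behind one interface lets the orchestrator switch processors by name without changing its callers.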
|
||||||
|
|
||||||
|
**New API Endpoint:**
|
||||||
|
- Route: `backend/src/routes/{resource}.ts`
|
||||||
|
- Controller: `backend/src/controllers/{resource}Controller.ts`
|
||||||
|
- Service: `backend/src/services/{resource}Service.ts` (if needed)
|
||||||
|
- Model: `backend/src/models/{Resource}Model.ts` (if database access)
|
||||||
|
- Tests: `backend/src/__tests__/integration/{endpoint}.test.ts`
|
||||||
|
|
||||||
|
**New React Component:**
|
||||||
|
- Component: `frontend/src/components/{ComponentName}.tsx`
|
||||||
|
- Types: Add to `frontend/src/types/` or inline in component
|
||||||
|
- Services: Use existing `frontend/src/services/documentService.ts`
|
||||||
|
- Tests: `frontend/src/__tests__/{ComponentName}.test.tsx` (if added)
|
||||||
|
|
||||||
|
**Shared Utilities:**
|
||||||
|
- Backend: `backend/src/utils/{utility}.ts`
|
||||||
|
- Frontend: `frontend/src/utils/{utility}.ts`
|
||||||
|
- Avoid code duplication - consider extracting common patterns
|
||||||
|
|
||||||
|
**Database Schema Changes:**
|
||||||
|
- Migration file: `backend/src/models/migrations/{timestamp}_{description}.sql`
|
||||||
|
- TypeScript interface: Update `backend/src/models/types.ts`
|
||||||
|
- Model methods: Update corresponding `*Model.ts` file
|
||||||
|
- Run: `npm run db:migrate` in backend
|
||||||
|
|
||||||
|
**Configuration Changes:**
|
||||||
|
- Environment: Update `backend/src/config/env.ts` (Joi schema)
|
||||||
|
- Frontend env: Update `frontend/src/config/env.ts`
|
||||||
|
- Firebase secrets: Use `firebase functions:secrets:set VAR_NAME`
|
||||||
|
- Local dev: Add to `.env` file (gitignored)
|
||||||
|
|
||||||
|
## Special Directories
|
||||||
|
|
||||||
|
**backend/src/__tests__/mocks/:**
|
||||||
|
- Purpose: Mock data and fixtures for testing
|
||||||
|
- Generated: No (manually maintained)
|
||||||
|
- Committed: Yes
|
||||||
|
- Usage: Import in tests for consistent test data
|
||||||
|
|
||||||
|
**backend/src/scripts/:**
|
||||||
|
- Purpose: One-off CLI utilities for development and operations
|
||||||
|
- Generated: No (manually maintained)
|
||||||
|
- Committed: Yes
|
||||||
|
- Execution: `ts-node src/scripts/{script}.ts` or `npm run {script}`
|
||||||
|
|
||||||
|
**backend/src/assets/:**
|
||||||
|
- Purpose: Static HTML templates for PDF generation
|
||||||
|
- Generated: No (manually maintained)
|
||||||
|
- Committed: Yes
|
||||||
|
- Usage: Rendered by Puppeteer in `pdfGenerationService.ts`
|
||||||
|
|
||||||
|
**backend/src/models/migrations/:**
|
||||||
|
- Purpose: Database schema migration SQL files
|
||||||
|
- Generated: No (manually created)
|
||||||
|
- Committed: Yes
|
||||||
|
- Execution: Run via `npm run db:migrate`
|
||||||
|
|
||||||
|
**frontend/src/assets/:**
|
||||||
|
- Purpose: Images, icons, logos
|
||||||
|
- Generated: No (manually added)
|
||||||
|
- Committed: Yes
|
||||||
|
- Usage: Import in components (e.g., `bluepoint-logo.png`)
|
||||||
|
|
||||||
|
**backend/dist/ and frontend/dist/:**
|
||||||
|
- Purpose: Compiled JavaScript and optimized bundles
|
||||||
|
- Generated: Yes (build output)
|
||||||
|
- Committed: No (gitignored)
|
||||||
|
- Regeneration: `npm run build` in respective directory
|
||||||
|
|
||||||
|
**backend/node_modules/ and frontend/node_modules/:**
|
||||||
|
- Purpose: Installed dependencies
|
||||||
|
- Generated: Yes (npm install)
|
||||||
|
- Committed: No (gitignored)
|
||||||
|
- Regeneration: `npm install`
|
||||||
|
|
||||||
|
**backend/logs/:**
|
||||||
|
- Purpose: Runtime log files
|
||||||
|
- Generated: Yes (runtime)
|
||||||
|
- Committed: No (gitignored)
|
||||||
|
- Contents: `error.log`, `upload.log`, combined logs
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
*Structure analysis: 2026-02-24*
|
||||||
342
.planning/codebase/TESTING.md
Normal file
@@ -0,0 +1,342 @@
|
|||||||
|
# Testing Patterns
|
||||||
|
|
||||||
|
**Analysis Date:** 2026-02-24
|
||||||
|
|
||||||
|
## Test Framework
|
||||||
|
|
||||||
|
**Runner:**
|
||||||
|
- Vitest 2.1.0
|
||||||
|
- Config: No dedicated `vitest.config.ts` found (uses defaults)
|
||||||
|
- Node.js test environment
|
||||||
|
|
||||||
|
**Assertion Library:**
|
||||||
|
- Vitest native assertions via `expect()`
|
||||||
|
- Examples: `expect(value).toBe()`, `expect(value).toBeDefined()`, `expect(array).toContain()`
|
||||||
|
|
||||||
|
**Run Commands:**
|
||||||
|
```bash
npm test               # Run all tests once
npm run test:watch     # Watch mode for continuous testing
npm run test:coverage  # Generate coverage report
```
|
||||||
|
|
||||||
|
**Coverage Tool:**
|
||||||
|
- `@vitest/coverage-v8` 2.1.0
|
||||||
|
- Tracks line, branch, function, and statement coverage
|
||||||
|
- V8 backend for accurate coverage metrics
|
||||||
|
|
||||||
|
## Test File Organization
|
||||||
|
|
||||||
|
**Location:**
|
||||||
|
- Centralized in the `backend/src/__tests__/` directory rather than placed next to source files
|
||||||
|
- Subdirectories for logical grouping:
|
||||||
|
- `backend/src/__tests__/utils/` - Utility function tests
|
||||||
|
- `backend/src/__tests__/mocks/` - Mock implementations
|
||||||
|
- `backend/src/__tests__/acceptance/` - Acceptance/integration tests
|
||||||
|
|
||||||
|
**Naming:**
|
||||||
|
- Pattern: `[feature].test.ts` or `[feature].spec.ts`
|
||||||
|
- Examples:
|
||||||
|
- `backend/src/__tests__/financial-summary.test.ts`
|
||||||
|
- `backend/src/__tests__/acceptance/handiFoods.acceptance.test.ts`
|
||||||
|
|
||||||
|
**Structure:**
|
||||||
|
```
backend/src/__tests__/
├── utils/
│   └── test-helpers.ts                 # Test utility functions
├── mocks/
│   └── logger.mock.ts                  # Mock implementations
└── acceptance/
    └── handiFoods.acceptance.test.ts   # Acceptance tests
```
|
||||||
|
|
||||||
|
## Test Structure
|
||||||
|
|
||||||
|
**Suite Organization:**
|
||||||
|
```typescript
import { describe, test, expect, beforeAll } from 'vitest';

describe('Feature Category', () => {
  describe('Nested Behavior Group', () => {
    test('should do specific thing', () => {
      expect(result).toBe(expected);
    });

    test('should handle edge case', () => {
      expect(edge).toBeDefined();
    });
  });
});
```
|
||||||
|
|
||||||
|
From `financial-summary.test.ts`:
|
||||||
|
```typescript
describe('Financial Summary Fixes', () => {
  describe('Period Ordering', () => {
    test('Summary table should display periods in chronological order (FY3 → FY2 → FY1 → LTM)', () => {
      const periods = ['fy3', 'fy2', 'fy1', 'ltm'];
      const expectedOrder = ['FY3', 'FY2', 'FY1', 'LTM'];

      expect(periods[0]).toBe('fy3');
      expect(periods[3]).toBe('ltm');
    });
  });
});
```
|
||||||
|
|
||||||
|
**Patterns:**
|
||||||
|
|
||||||
|
1. **Setup Pattern:**
|
||||||
|
- Use `beforeAll()` for shared test data initialization
|
||||||
|
- Example from `handiFoods.acceptance.test.ts`:
|
||||||
|
```typescript
beforeAll(() => {
  const normalize = (text: string) => text.replace(/\s+/g, ' ').toLowerCase();
  const cimRaw = fs.readFileSync(cimTextPath, 'utf-8');
  const outputRaw = fs.readFileSync(outputTextPath, 'utf-8');
  cimNormalized = normalize(cimRaw);
  outputNormalized = normalize(outputRaw);
});
```
|
||||||
|
|
||||||
|
2. **Teardown Pattern:**
|
||||||
|
- Not explicitly shown in current tests
|
||||||
|
- Use `afterAll()` for resource cleanup if needed
|
||||||
|
|
||||||
|
3. **Assertion Pattern:**
|
||||||
|
- Descriptive test names that read as sentences: `'should display periods in chronological order'`
|
||||||
|
- Multiple assertions per test acceptable for related checks
|
||||||
|
- Use `expect().toContain()` for array/string membership
|
||||||
|
- Use `expect().toBeDefined()` for existence checks
|
||||||
|
- Use `expect().toBeGreaterThan()` for numeric comparisons
|
||||||
|
|
||||||
|
## Mocking

**Framework:** Vitest `vi` mock utilities

**Patterns:**

1. **Mock Logger:**
   ```typescript
   import { vi } from 'vitest';

   export const mockLogger = {
     debug: vi.fn(),
     info: vi.fn(),
     warn: vi.fn(),
     error: vi.fn(),
   };

   export const mockStructuredLogger = {
     uploadStart: vi.fn(),
     uploadSuccess: vi.fn(),
     uploadError: vi.fn(),
     processingStart: vi.fn(),
     processingSuccess: vi.fn(),
     processingError: vi.fn(),
     storageOperation: vi.fn(),
     jobQueueOperation: vi.fn(),
     info: vi.fn(),
     warn: vi.fn(),
     error: vi.fn(),
     debug: vi.fn(),
   };
   ```

2. **Mock Service Pattern:**
   - Create mock implementations in `backend/src/__tests__/mocks/`
   - Export as named exports: `export const mockLogger`, `export const mockStructuredLogger`
   - Use `vi.fn()` for all callable methods to track calls and arguments

3. **What to Mock:**
   - External services: Firebase Auth, Supabase, Google Cloud APIs
   - Logger: always mock to prevent log spam during tests
   - File system operations (in unit tests; use real files in acceptance tests)
   - LLM API calls: mock responses to avoid quota usage

4. **What NOT to Mock:**
   - Core utility functions: use real implementations
   - Type definitions: no need to mock types
   - Pure functions: test directly without mocks
   - Business logic calculations: test with real data

## Fixtures and Factories

**Test Data:**

1. **Helper Factory Pattern:**
   From `backend/src/__tests__/utils/test-helpers.ts`:
   ```typescript
   export function createMockCorrelationId(): string {
     return `test-correlation-${Date.now()}-${Math.random().toString(36).substr(2, 9)}`;
   }

   export function createMockUserId(): string {
     return `test-user-${Date.now()}-${Math.random().toString(36).substr(2, 9)}`;
   }

   export function createMockDocumentId(): string {
     return `test-doc-${Date.now()}-${Math.random().toString(36).substr(2, 9)}`;
   }

   export function createMockJobId(): string {
     return `test-job-${Date.now()}-${Math.random().toString(36).substr(2, 9)}`;
   }

   export function wait(ms: number): Promise<void> {
     return new Promise((resolve) => setTimeout(resolve, ms));
   }
   ```

2. **Acceptance Test Fixtures:**
   - Located in `backend/test-fixtures/` directory
   - Example: `backend/test-fixtures/handiFoods/` contains:
     - `handi-foods-cim.txt` - Reference CIM content
     - `handi-foods-output.txt` - Expected processor output
   - Loaded via `fs.readFileSync()` in `beforeAll()` hooks

**Location:**
- Test helpers: `backend/src/__tests__/utils/test-helpers.ts`
- Acceptance fixtures: `backend/test-fixtures/` (outside src)
- Mocks: `backend/src/__tests__/mocks/`
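
The four ID factories in `test-helpers.ts` differ only in their prefix, and they rely on `String.prototype.substr`, which is a deprecated legacy method. One possible consolidation is sketched below, assuming the prefixes stay exactly as quoted; `createMockId` and `MockIdKind` are hypothetical names that do not exist in the helper file.

```typescript
// Hypothetical prefix kinds, matching the prefixes used by the existing helpers.
type MockIdKind = 'correlation' | 'user' | 'doc' | 'job';

// Hypothetical shared factory; not part of test-helpers.ts.
function createMockId(kind: MockIdKind): string {
  // slice(2, 11) keeps up to nine base-36 characters after the leading "0.",
  // which is what substr(2, 9) returned.
  const suffix = Math.random().toString(36).slice(2, 11);
  return `test-${kind}-${Date.now()}-${suffix}`;
}

// Thin wrappers that keep the existing function names and prefixes.
const createMockCorrelationId = (): string => createMockId('correlation');
const createMockUserId = (): string => createMockId('user');
const createMockDocumentId = (): string => createMockId('doc');
const createMockJobId = (): string => createMockId('job');
```

Keeping the wrappers means existing call sites would not need to change; only the string-building logic moves into one place.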

## Coverage

**Requirements:**
- No automated coverage enforcement detected (no threshold in config)
- Manual review recommended for critical paths

**View Coverage:**
```bash
npm run test:coverage
```
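
If automated enforcement is wanted later, Vitest can fail the coverage run when totals drop below configured thresholds. The fragment below is only a sketch: it assumes Vitest 1.x or later (where thresholds live under `coverage.thresholds`), that the `@vitest/coverage-v8` package is installed, and that the config file is named `vitest.config.ts`. The percentages are placeholders, not values taken from this project.

```typescript
// vitest.config.ts (hypothetical file; adjust to the project's actual config)
import { defineConfig } from 'vitest/config';

export default defineConfig({
  test: {
    coverage: {
      provider: 'v8',
      reporter: ['text', 'html'],
      // Placeholder thresholds: the coverage run fails if any total falls below these.
      thresholds: {
        lines: 70,
        functions: 70,
        branches: 60,
        statements: 70,
      },
    },
  },
});
```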

## Test Types

**Unit Tests:**
- **Scope:** Individual functions, services, utilities
- **Approach:** Test in isolation with mocks for dependencies
- **Examples:**
  - Financial parser tests: parse tables with various formats
  - Period ordering tests: verify chronological order logic
  - UUID format validation tests: regex pattern matching
- **Location:** `backend/src/__tests__/[feature].test.ts`

**Integration Tests:**
- **Scope:** Multiple components working together
- **Approach:** May use real Supabase/Firebase or mocks depending on test level
- **Not heavily used:** minimal integration test infrastructure
- **Pattern:** Could use real database in test environment with cleanup

**Acceptance Tests:**
- **Scope:** End-to-end feature validation with real artifacts
- **Approach:** Load reference files, process through entire pipeline, verify output
- **Example:** `handiFoods.acceptance.test.ts`
  - Loads CIM text file
  - Loads processor output file
  - Validates that every reference fact appears in both files
  - Validates that key fields are resolved rather than left as fallback messages
- **Location:** `backend/src/__tests__/acceptance/`

**E2E Tests:**
- Not implemented in current setup
- Would require browser automation (no Playwright/Cypress config found)
- Frontend testing: not currently automated

## Common Patterns

**Async Testing:**
```typescript
test('should process document asynchronously', async () => {
  const result = await processDocument(documentId, userId, text);
  expect(result.success).toBe(true);
});
```

**Error Testing:**
```typescript
test('should validate UUID format', () => {
  const id = 'invalid-id';
  const uuidRegex = /^[0-9a-f]{8}-[0-9a-f]{4}-4[0-9a-f]{3}-[89ab][0-9a-f]{3}-[0-9a-f]{12}$/i;
  expect(uuidRegex.test(id)).toBe(false);
});
```
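
The example above covers only the rejection case, and the regex accepts version 4 UUIDs specifically (the third group must start with `4`, the fourth with `8`, `9`, `a`, or `b`). A minimal sketch that wraps the same pattern so both the accepting and the throwing path can be tested follows; `isUuidV4` and `assertUuidV4` are hypothetical helper names, not functions from the codebase.

```typescript
// Same pattern as the regex above: version nibble fixed to 4, variant in [89ab].
const UUID_V4_REGEX = /^[0-9a-f]{8}-[0-9a-f]{4}-4[0-9a-f]{3}-[89ab][0-9a-f]{3}-[0-9a-f]{12}$/i;

// Hypothetical helper; returns true only for well-formed v4 UUID strings.
function isUuidV4(value: string): boolean {
  return UUID_V4_REGEX.test(value);
}

// Hypothetical helper; throws for malformed IDs so a test can assert on the error path.
function assertUuidV4(value: string): void {
  if (!isUuidV4(value)) {
    throw new Error(`Invalid UUID: ${value}`);
  }
}
```

With Vitest, the error path could then be checked with `expect(() => assertUuidV4('invalid-id')).toThrow('Invalid UUID')`, alongside a positive case using a known-good v4 value.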

**Array/Collection Testing:**
```typescript
test('should extract all financial periods', () => {
  const result = parseFinancialsFromText(tableText);
  expect(result.data.fy3.revenue).toBeDefined();
  expect(result.data.fy2.revenue).toBeDefined();
  expect(result.data.fy1.revenue).toBeDefined();
  expect(result.data.ltm.revenue).toBeDefined();
});
```

**Text/Content Testing (Acceptance):**
```typescript
test('verifies each reference fact exists in CIM and generated output', () => {
  for (const fact of referenceFacts) {
    for (const token of fact.tokens) {
      expect(cimNormalized).toContain(token);
      expect(outputNormalized).toContain(token);
    }
  }
});
```

**Normalization for Content Testing:**
```typescript
// Normalize whitespace and case for robust text matching
const normalize = (text: string) => text.replace(/\s+/g, ' ').toLowerCase();
const normalizedCIM = normalize(cimRaw);
expect(normalizedCIM).toContain('reference-phrase');
```
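
Asserting token by token stops at the first miss, which makes it hard to see how many reference facts are absent. One possible refinement, sketched here under the assumption that each fact carries a `tokens` array as in the acceptance test above, collects every missing token first; the `label` field, the `ReferenceFact` interface, and `findMissingTokens` are hypothetical and not taken from the codebase.

```typescript
// Collapse whitespace runs and lower-case, matching the normalize helper above.
const normalize = (text: string): string => text.replace(/\s+/g, ' ').toLowerCase();

// Hypothetical shape for a reference fact: a label plus the tokens that must appear.
interface ReferenceFact {
  label: string;
  tokens: string[];
}

// Returns "label: token" entries for every token absent from the normalized text.
function findMissingTokens(text: string, facts: ReferenceFact[]): string[] {
  const haystack = normalize(text);
  const missing: string[] = [];
  for (const fact of facts) {
    for (const token of fact.tokens) {
      if (!haystack.includes(normalize(token))) {
        missing.push(`${fact.label}: ${token}`);
      }
    }
  }
  return missing;
}
```

A test could then assert `expect(findMissingTokens(outputRaw, referenceFacts)).toEqual([])`, so a failure lists all missing tokens in a single diff.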

## Test Coverage Priorities

**Critical Paths (Test First):**
1. Document upload and file storage operations
2. Firebase authentication and token validation
3. LLM service API interactions with retry logic
4. Error handling and correlation ID tracking
5. Financial data extraction and parsing
6. PDF generation pipeline

**Important Paths (Test Early):**
1. Vector embeddings and database operations
2. Job queue processing and timeout handling
3. Google Document AI text extraction
4. Supabase Row Level Security policies

**Nice-to-Have (Test Later):**
1. UI component rendering (would require React Testing Library)
2. CSS/styling validation
3. Frontend form submission flows
4. Analytics tracking

## Current Testing Gaps

**Untested Areas:**
- Backend services: Most services lack unit tests (llmService, fileStorageService, etc.)
- Database models: No model tests for Supabase operations
- Controllers/Endpoints: No API endpoint tests
- Frontend components: No React component tests
- Integration flows: Document upload through processing to PDF generation

**Missing Patterns:**
- No database integration test setup (fixtures, transactions)
- No API request/response validation tests
- No performance/load tests
- No security tests (auth bypass, XSS, injection)

## Deprecated Test Patterns (DO NOT USE)

- ❌ Jest test suite - Use Vitest instead
- ❌ Direct PostgreSQL connection tests - Use Supabase in test mode
- ❌ Legacy test files referencing removed services - Use only the updated implementations

---

*Testing analysis: 2026-02-24*