Architecture

Analysis Date: 2026-02-24

Pattern Overview

Overall: Full-stack distributed system combining Express.js backend with React frontend, implementing a multi-stage document processing pipeline with queued background jobs and real-time monitoring.

Key Characteristics:

  • Server-rendered PDF generation (Puppeteer) with single-pass LLM processing (one extraction call plus a quality-check call)
  • Asynchronous job queue for background document processing (max 3 concurrent)
  • Firebase authentication with Supabase PostgreSQL + pgvector for embeddings
  • Multi-provider LLM support (Anthropic, OpenAI, OpenRouter)
  • Structured schema extraction using Zod and LLM-driven analysis
  • Google Document AI for OCR and text extraction
  • Real-time upload progress tracking via SSE/polling
  • Correlation ID tracking throughout distributed pipeline

Layers

API Layer (Express + TypeScript):

  • Purpose: HTTP request routing, authentication, and response handling
  • Location: backend/src/index.ts, backend/src/routes/, backend/src/controllers/
  • Contains: Route definitions, request validation, error handling
  • Depends on: Middleware (auth, validation), Services
  • Used by: Frontend and external clients

Authentication Layer:

  • Purpose: Firebase ID token verification and user identity validation
  • Location: backend/src/middleware/firebaseAuth.ts, backend/src/config/firebase.ts
  • Contains: Token verification, service account initialization, session recovery
  • Depends on: Firebase Admin SDK, configuration
  • Used by: All protected routes via verifyFirebaseToken middleware
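
As a rough illustration of the middleware described above, the sketch below extracts a bearer token and hands it to an injected verifier. The verifier parameter stands in for the Firebase Admin SDK call, the request and response interfaces are simplified stand-ins for the Express types, and the name makeVerifyToken is hypothetical. Session recovery is omitted.

```typescript
// Simplified stand-ins for the Express request/response types (assumptions).
interface AuthedRequest {
  headers: Record<string, string | undefined>;
  user?: { uid: string };
}
interface MinimalResponse {
  status(code: number): MinimalResponse;
  json(body: unknown): void;
}
// Injected verifier; in the real service this would wrap the Firebase Admin SDK.
type TokenVerifier = (idToken: string) => Promise<{ uid: string }>;

// Returns the token from an "Authorization: Bearer <token>" value, or null.
function extractBearerToken(header: string | undefined): string | null {
  if (!header) return null;
  const match = /^Bearer\s+(\S+)$/i.exec(header.trim());
  return match ? match[1] : null;
}

// Builds a middleware that rejects with 401 unless the token verifies.
function makeVerifyToken(verify: TokenVerifier) {
  return async (
    req: AuthedRequest,
    res: MinimalResponse,
    next: () => void,
  ): Promise<void> => {
    const token = extractBearerToken(req.headers["authorization"]);
    if (!token) {
      res.status(401).json({ error: "Missing bearer token" });
      return;
    }
    try {
      req.user = await verify(token); // attach the verified identity
      next();
    } catch {
      res.status(401).json({ error: "Invalid or expired token" });
    }
  };
}
```

Injecting the verifier keeps the sketch testable without Firebase credentials; the actual firebaseAuth.ts may be structured differently.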

Controller Layer:

  • Purpose: Request handling, input validation, service orchestration
  • Location: backend/src/controllers/documentController.ts, backend/src/controllers/authController.ts
  • Contains: getUploadUrl(), processDocument(), getDocumentStatus() handlers
  • Depends on: Models, Services, Middleware
  • Used by: Routes

Service Layer:

  • Purpose: Business logic, external API integration, document processing orchestration
  • Location: backend/src/services/
  • Contains:
    • unifiedDocumentProcessor.ts - Main orchestrator, strategy selection
    • singlePassProcessor.ts - 2-LLM-call extraction (pass 1 + quality check)
    • documentAiProcessor.ts - Google Document AI text extraction
    • llmService.ts - LLM API calls with retry logic (3 attempts, exponential backoff)
    • jobQueueService.ts - Background job processing (EventEmitter-based)
    • fileStorageService.ts - Google Cloud Storage signed URLs and uploads
    • vectorDatabaseService.ts - Supabase vector embeddings and search
    • pdfGenerationService.ts - Puppeteer-based PDF rendering
    • csvExportService.ts - Financial data export
  • Depends on: Models, Config, Utilities
  • Used by: Controllers, Job Queue

Model Layer (Data Access):

  • Purpose: Database interactions, query execution, schema validation
  • Location: backend/src/models/
  • Contains: DocumentModel.ts, ProcessingJobModel.ts, UserModel.ts, VectorDatabaseModel.ts
  • Depends on: Supabase client, configuration
  • Used by: Services, Controllers

Job Queue Layer:

  • Purpose: Asynchronous background processing with priority and retry handling
  • Location: backend/src/services/jobQueueService.ts, backend/src/services/jobProcessorService.ts
  • Contains: In-memory queue, worker pool (max 3 concurrent), Firebase scheduled function trigger
  • Depends on: Services (document processor), Models
  • Used by: Controllers (to enqueue work), Scheduled functions (to trigger processing)
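
The worker pool limit can be illustrated with a small EventEmitter-based queue. This is a minimal sketch, not the contents of jobQueueService.ts: the class name, the Job shape, and the event names are all hypothetical, and priority and retry handling are left out.

```typescript
import { EventEmitter } from "node:events";

// Hypothetical job shape; the real job model is not shown in this document.
interface Job {
  id: string;
  run: () => Promise<void>;
}

class InMemoryJobQueue extends EventEmitter {
  private readonly pending: Job[] = [];
  private active = 0;
  private readonly maxConcurrent: number;

  constructor(maxConcurrent = 3) {
    super();
    this.maxConcurrent = maxConcurrent;
  }

  enqueue(job: Job): void {
    this.pending.push(job);
    this.emit("job:queued", job.id);
    this.drain();
  }

  // Start pending jobs until the concurrency limit is reached.
  private drain(): void {
    while (this.active < this.maxConcurrent && this.pending.length > 0) {
      const job = this.pending.shift() as Job;
      this.active++;
      this.emit("job:started", job.id);
      Promise.resolve()
        .then(() => job.run())
        .then(
          () => this.emit("job:completed", job.id),
          (err: unknown) => this.emit("job:failed", job.id, err),
        )
        .finally(() => {
          this.active--;
          this.drain(); // a slot is free, so pull the next job if any
          if (this.active === 0 && this.pending.length === 0) this.emit("idle");
        });
    }
  }
}
```

Because the queue lives in process memory, anything still pending is lost on restart, which is presumably why the scheduled function also checks for stuck jobs.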

Frontend Layer (React + TypeScript):

  • Purpose: User interface for document upload, processing monitoring, and review
  • Location: frontend/src/
  • Contains: Components (Upload, List, Viewer, Analytics), Services, Contexts
  • Depends on: Backend API, Firebase Auth, Axios
  • Used by: Web browsers

Data Flow

Document Upload & Processing Flow:

  1. Upload Initiation (Frontend)

    • User selects PDF file via DocumentUpload component
    • Calls documentService.getUploadUrl() → Backend /documents/upload-url endpoint
    • Backend creates document record (status: 'uploading') and generates signed GCS URL
  2. File Upload (Frontend → GCS)

    • Frontend uploads file directly to Google Cloud Storage via signed URL
    • Frontend polls documentService.getDocumentStatus() for upload completion
    • UploadMonitoringDashboard displays real-time progress
  3. Processing Trigger (Frontend → Backend)

    • Frontend calls POST /documents/{id}/process once upload complete
    • Controller creates processing job and enqueues to jobQueueService
    • Controller immediately returns job ID
  4. Background Job Execution (Job Queue)

    • Scheduled Firebase function (processDocumentJobs) runs every 1 minute
    • Calls jobProcessorService.processJobs() to dequeue and execute
    • For each queued document:
      • Fetch file from GCS
      • Update status to 'extracting_text'
      • Call unifiedDocumentProcessor.processDocument()
  5. Document Processing (Single-Pass Strategy)

    • Pass 1 - LLM Extraction:

      • documentAiProcessor.extractText() (if needed) - Google Document AI OCR
      • llmService.processCIMDocument() - Claude/OpenAI structured extraction
      • Produces CIMReview object with financial, market, management data
      • Updates document status to 'processing_llm'
    • Pass 2 - Quality Check:

      • llmService.validateCIMReview() - Verify completeness and accuracy
      • Updates status to 'quality_validation'
    • PDF Generation:

      • pdfGenerationService.generatePDF() - Puppeteer renders HTML template
      • Uploads PDF to GCS
      • Updates status to 'generating_pdf'
    • Vector Indexing (Background):

      • vectorDatabaseService.createDocumentEmbedding() - Generate 3072-dim embeddings
      • Chunk document semantically, store in Supabase with vector index
      • Status moves to 'vector_indexing' then 'completed'
  6. Result Delivery (Backend → Frontend)

    • Frontend polls GET /documents/{id} to check completion
    • When status = 'completed', fetches summary and analysis data
    • DocumentViewer displays results, allows regeneration with feedback
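
The polling in step 6 could look roughly like the loop below. It is a sketch under assumptions: the status fetcher is injected (in the frontend it would presumably wrap GET /documents/{id}), and the function name, interval, and attempt limit are illustrative rather than taken from the codebase.

```typescript
// Injected fetcher; assumed to resolve with the document's current status.
type StatusFetcher = (documentId: string) => Promise<{ status: string }>;

interface PollOptions {
  intervalMs?: number;
  maxAttempts?: number;
  sleep?: (ms: number) => Promise<void>; // injectable so tests need no real delay
}

const defaultPollSleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

// Polls until the document reaches a terminal status or attempts run out.
async function pollUntilDone(
  documentId: string,
  fetchStatus: StatusFetcher,
  options: PollOptions = {},
): Promise<string> {
  const intervalMs = options.intervalMs ?? 2000;
  const maxAttempts = options.maxAttempts ?? 150;
  const sleep = options.sleep ?? defaultPollSleep;

  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const { status } = await fetchStatus(documentId);
    if (status === "completed" || status === "failed") return status;
    if (attempt < maxAttempts) await sleep(intervalMs);
  }
  throw new Error(`Document ${documentId} did not finish after ${maxAttempts} polls`);
}
```

A bounded attempt count keeps an abandoned browser tab from polling indefinitely if a job stalls.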

State Management:

  • Backend: Document status progresses through uploading → extracting_text → processing_llm → quality_validation → generating_pdf → vector_indexing → completed, and can move to failed at any step
  • Frontend: AuthContext manages user/token, component state tracks selected document and loading states
  • Job Queue: In-memory queue with EventEmitter for state transitions
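
The backend status progression can be written down as a small transition check. This sketch includes the quality_validation status set during Pass 2 of the processing flow, and it assumes that completed and failed are terminal; the regeneration path from DocumentViewer is not modeled. The names PIPELINE and canTransition are hypothetical.

```typescript
// Ordered happy-path statuses, as listed in the data flow.
const PIPELINE = [
  "uploading",
  "extracting_text",
  "processing_llm",
  "quality_validation",
  "generating_pdf",
  "vector_indexing",
  "completed",
] as const;

type DocumentStatus = (typeof PIPELINE)[number] | "failed";

// True if moving from `from` to `to` follows the pipeline order or fails out.
function canTransition(from: DocumentStatus, to: DocumentStatus): boolean {
  if (from === "completed" || from === "failed") return false; // assumed terminal
  if (to === "failed") return true; // any in-flight step may fail
  const index = PIPELINE.indexOf(from);
  return PIPELINE[index + 1] === to;
}
```

For example, canTransition("processing_llm", "quality_validation") is true, while canTransition("uploading", "generating_pdf") is false because it skips steps.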

Key Abstractions

Unified Processor:

  • Purpose: Strategy pattern for document processing (single-pass vs. agentic RAG vs. simple)
  • Examples: singlePassProcessor, simpleDocumentProcessor, optimizedAgenticRAGProcessor
  • Pattern: Pluggable strategies via ProcessingStrategy selection in config
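
A minimal sketch of the strategy selection described above. The interface, the StrategySelector class, and the strategy keys ("single_pass", "simple") are hypothetical, and the strategies are stubs rather than the real processors.

```typescript
interface ProcessingResult {
  strategy: string;
  summary: string;
}

// Common contract that each pluggable processor would satisfy.
interface ProcessingStrategy {
  readonly name: string;
  process(documentText: string): Promise<ProcessingResult>;
}

class StrategySelector {
  private readonly strategies = new Map<string, ProcessingStrategy>();

  register(strategy: ProcessingStrategy): this {
    this.strategies.set(strategy.name, strategy);
    return this;
  }

  // Dispatches to the strategy named in config; unknown names are an error.
  async processDocument(documentText: string, strategyName: string): Promise<ProcessingResult> {
    const strategy = this.strategies.get(strategyName);
    if (!strategy) throw new Error(`Unknown processing strategy: ${strategyName}`);
    return strategy.process(documentText);
  }
}

// Stub strategies standing in for singlePassProcessor and simpleDocumentProcessor.
const singlePassStub: ProcessingStrategy = {
  name: "single_pass",
  async process(text) {
    return { strategy: "single_pass", summary: `extracted ${text.length} chars` };
  },
};
const simpleStub: ProcessingStrategy = {
  name: "simple",
  async process(text) {
    return { strategy: "simple", summary: text.slice(0, 10) };
  },
};
```

Callers depend only on the ProcessingStrategy contract, so a new processor can be added by registering it without touching the dispatch code.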

LLM Service:

  • Purpose: Unified interface for multiple LLM providers with retry logic
  • Examples: backend/src/services/llmService.ts (Anthropic, OpenAI, OpenRouter)
  • Pattern: Provider-agnostic API with processCIMDocument() returning structured CIMReview

Vector Database Abstraction:

  • Purpose: PostgreSQL pgvector operations via Supabase for semantic search
  • Examples: backend/src/services/vectorDatabaseService.ts
  • Pattern: Embedding + chunking → vector search via cosine similarity
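
For reference, cosine similarity search over chunk embeddings can be sketched in memory as below. In the actual system this ranking is presumably performed inside PostgreSQL by pgvector; the Chunk shape and function names here are illustrative only.

```typescript
// cos(a, b) = (a · b) / (|a| * |b|); returns 0 if either vector is all zeros.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("Vectors must have the same dimension");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Hypothetical chunk record: text plus its embedding vector.
interface Chunk {
  id: string;
  text: string;
  embedding: number[];
}

// Ranks chunks by similarity to the query embedding and keeps the top k.
function searchChunks(
  query: number[],
  chunks: Chunk[],
  k: number,
): Array<{ chunk: Chunk; score: number }> {
  return chunks
    .map((chunk) => ({ chunk, score: cosineSimilarity(query, chunk.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}
```

The same function works for the 3072-dimensional embeddings mentioned in the processing flow; only the vector length changes.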

File Storage Abstraction:

  • Purpose: Google Cloud Storage operations with signed URLs
  • Examples: backend/src/services/fileStorageService.ts
  • Pattern: Signed upload/download URLs grant temporary access without giving clients IAM credentials

Job Queue Pattern:

  • Purpose: Async processing with retry and priority handling
  • Examples: backend/src/services/jobQueueService.ts (EventEmitter-based)
  • Pattern: Priority queue with exponential backoff retry
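
The two halves of this pattern, priority ordering and capped exponential backoff, can be sketched as pure functions. The 5 s base and 300 s cap follow the job queue retry range listed under Error Handling; the doubling factor, the "higher number wins" priority convention, and all names are assumptions.

```typescript
interface QueuedJob {
  id: string;
  priority: number; // assumption: higher number means more urgent
  attempts: number; // attempts made so far
  enqueuedAt: number; // epoch milliseconds, used as FIFO tie-breaker
}

// Delay before retry number `attempt` (1-based): 5s, 10s, 20s, ... capped at 300s.
function retryDelayMs(attempt: number, baseMs = 5000, maxMs = 300000): number {
  return Math.min(baseMs * 2 ** (attempt - 1), maxMs);
}

// Whether a failed job is still eligible for another attempt.
function shouldRetry(job: QueuedJob, maxAttempts = 3): boolean {
  return job.attempts < maxAttempts;
}

// Removes and returns the most urgent job; ties go to the oldest job.
function takeNextJob(queue: QueuedJob[]): QueuedJob | undefined {
  if (queue.length === 0) return undefined;
  let best = 0;
  for (let i = 1; i < queue.length; i++) {
    const candidate = queue[i];
    const current = queue[best];
    const moreUrgent = candidate.priority > current.priority;
    const olderTie =
      candidate.priority === current.priority && candidate.enqueuedAt < current.enqueuedAt;
    if (moreUrgent || olderTie) best = i;
  }
  return queue.splice(best, 1)[0];
}
```

A linear scan is adequate for a small in-memory queue; a heap would only matter at much larger queue sizes.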

Entry Points

API Entry Point:

  • Location: backend/src/index.ts
  • Triggers: Process startup or Firebase Functions invocation
  • Responsibilities:
    • Initialize Express app
    • Set up middleware (CORS, helmet, rate limiting, authentication)
    • Register routes (/documents, /vector, /monitoring, /api/audit)
    • Start job queue service
    • Export Firebase Functions v2 handlers (api, processDocumentJobs)

Scheduled Job Processing:

  • Location: backend/src/index.ts (line 252: processDocumentJobs function export)
  • Triggers: Firebase Cloud Scheduler every 1 minute
  • Responsibilities:
    • Health check database connection
    • Detect stuck jobs (processing > 15 min, pending > 2 min)
    • Call jobProcessorService.processJobs()
    • Log metrics and errors
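
The stuck-job thresholds above can be expressed as a simple filter. This is a sketch under assumptions: the JobRecord shape and the statusSince field are hypothetical, and what the scheduled function does with the stuck jobs (reset, fail, or just log) is not specified here.

```typescript
type JobStatus = "pending" | "processing" | "completed" | "failed";

interface JobRecord {
  id: string;
  status: JobStatus;
  statusSince: Date; // hypothetical field: when the job entered its current status
}

// Thresholds from the responsibilities list: processing > 15 min, pending > 2 min.
const STUCK_THRESHOLD_MS: Partial<Record<JobStatus, number>> = {
  processing: 15 * 60 * 1000,
  pending: 2 * 60 * 1000,
};

// Returns jobs that have sat in a non-terminal status longer than allowed.
function findStuckJobs(jobs: JobRecord[], now: Date = new Date()): JobRecord[] {
  return jobs.filter((job) => {
    const threshold = STUCK_THRESHOLD_MS[job.status];
    if (threshold === undefined) return false; // terminal statuses are never stuck
    return now.getTime() - job.statusSince.getTime() > threshold;
  });
}
```

Passing the current time as a parameter makes the check deterministic in tests.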

Frontend Entry Point:

  • Location: frontend/src/main.tsx
  • Triggers: Browser navigation
  • Responsibilities:
    • Initialize React app with AuthProvider
    • Set up Firebase client
    • Render routing structure (Login → Dashboard)

Document Processing Controller:

  • Location: backend/src/controllers/documentController.ts
  • Route: POST /documents/{id}/process
  • Responsibilities:
    • Validate user authentication
    • Enqueue processing job
    • Return job ID to client

Error Handling

Strategy: Multi-layer error recovery with structured logging and graceful degradation

Patterns:

  • Retry Logic: DocumentModel uses exponential backoff (1s → 2s → 4s) for network errors
  • LLM Retry: llmService retries API calls 3 times with exponential backoff
  • Firebase Auth Recovery: firebaseAuth.ts attempts session recovery on token verify failure
  • Job Queue Retry: Jobs retry up to 3 times with configurable backoff (5s → 300s max)
  • Structured Error Logging: All errors include correlation ID, stack trace, and context metadata
  • Circuit Breaker Pattern: Database health check in processDocumentJobs prevents cascading failures
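
The exponential backoff used by the retry patterns above can be sketched as a generic helper. This is an illustration, not the code in DocumentModel or llmService: the helper name, option names, and the injectable sleep function are assumptions. With the defaults it makes 3 attempts and waits 1 s and then 2 s between them; a fourth attempt would wait 4 s.

```typescript
interface RetryOptions {
  maxAttempts?: number;
  baseDelayMs?: number;
  sleep?: (ms: number) => Promise<void>; // injectable so tests need no real delay
}

const defaultRetrySleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

// Runs `operation`, retrying on rejection with delays of base, 2x base, 4x base, ...
async function withRetry<T>(
  operation: () => Promise<T>,
  options: RetryOptions = {},
): Promise<T> {
  const maxAttempts = options.maxAttempts ?? 3;
  const baseDelayMs = options.baseDelayMs ?? 1000;
  const sleep = options.sleep ?? defaultRetrySleep;

  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await operation();
    } catch (err) {
      lastError = err;
      if (attempt === maxAttempts) break; // no wait after the final attempt
      await sleep(baseDelayMs * 2 ** (attempt - 1));
    }
  }
  throw lastError;
}
```

A production version would likely retry only on errors known to be transient (network failures, rate limits) and might add jitter; both are omitted here for brevity.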

Error Boundaries:

  • Global error handler at end of Express middleware chain (errorHandler)
  • Try/catch in all async functions with context-aware logging
  • Unhandled rejection listener at process level (line 24 of index.ts)

Cross-Cutting Concerns

Logging:

  • Framework: Winston (json + console in dev)
  • Approach: Structured logger with correlation IDs, Winston transports for error/upload logs
  • Location: backend/src/utils/logger.ts
  • Pattern: logger.info(), logger.error(), StructuredLogger for operations

Validation:

  • Approach: Joi schema for environment config; Zod schemas for API request/response types and structured LLM output
  • Location: backend/src/config/env.ts, backend/src/services/llmSchemas.ts
  • Pattern: Joi for config, Zod for runtime validation

Authentication:

  • Approach: Firebase ID tokens verified via verifyFirebaseToken middleware
  • Location: backend/src/middleware/firebaseAuth.ts
  • Pattern: Bearer token in Authorization header; decoded user attached to req.user

Correlation Tracking:

  • Approach: UUID correlation ID added to all requests, propagated through job processing
  • Location: backend/src/middleware/validation.ts (addCorrelationId)
  • Pattern: X-Correlation-ID header or generated UUID, included in all logs
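
A minimal sketch of the header-or-generated-UUID pattern. The function names and the log entry shape are hypothetical; the real addCorrelationId middleware and Winston logger setup may differ. Node lowercases incoming header names, so the lookup uses x-correlation-id.

```typescript
import { randomUUID } from "node:crypto";

type HeaderMap = Record<string, string | undefined>;

// Reuses the caller's X-Correlation-ID if present, otherwise generates a UUID.
function resolveCorrelationId(headers: HeaderMap): string {
  const incoming = headers["x-correlation-id"]?.trim();
  return incoming ? incoming : randomUUID();
}

// Builds a structured log line that always carries the correlation ID.
function logWithCorrelation(
  correlationId: string,
  message: string,
  context: Record<string, unknown> = {},
): string {
  return JSON.stringify({ correlationId, message, ...context });
}
```

Passing the resolved ID into the job payload when enqueueing is what would let log lines from the background worker be tied back to the original HTTP request.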

Architecture analysis: 2026-02-24