admin/cim_summary

admin e6e1b1fa6f docs: map existing codebase

2026-02-24 10:28:22 -05:00

Architecture

Analysis Date: 2026-02-24

Pattern Overview

Overall: Full-stack distributed system combining an Express.js backend with a React frontend. It implements a multi-stage document processing pipeline with queued background jobs and real-time monitoring.

Key Characteristics:

Server-side PDF generation (Puppeteer) with single-pass LLM processing (one extraction call plus one quality-check call)
Asynchronous job queue for background document processing (max 3 concurrent)
Firebase authentication with Supabase PostgreSQL + pgvector for embeddings
Multi-provider LLM support (Anthropic, OpenAI, OpenRouter)
Structured schema extraction using Zod and LLM-driven analysis
Google Document AI for OCR and text extraction
Real-time upload progress tracking via SSE/polling
Correlation ID tracking throughout distributed pipeline

Layers

API Layer (Express + TypeScript):

Purpose: HTTP request routing, authentication, and response handling
Location: backend/src/index.ts, backend/src/routes/, backend/src/controllers/
Contains: Route definitions, request validation, error handling
Depends on: Middleware (auth, validation), Services
Used by: Frontend and external clients

Authentication Layer:

Purpose: Firebase ID token verification and user identity validation
Location: backend/src/middleware/firebaseAuth.ts, backend/src/config/firebase.ts
Contains: Token verification, service account initialization, session recovery
Depends on: Firebase Admin SDK, configuration
Used by: All protected routes via verifyFirebaseToken middleware

Controller Layer:

Purpose: Request handling, input validation, service orchestration
Location: backend/src/controllers/documentController.ts, backend/src/controllers/authController.ts
Contains: getUploadUrl(), processDocument(), getDocumentStatus() handlers
Depends on: Models, Services, Middleware
Used by: Routes

Service Layer:

Purpose: Business logic, external API integration, document processing orchestration
Location: backend/src/services/
Contains:
- unifiedDocumentProcessor.ts - Main orchestrator, strategy selection
- singlePassProcessor.ts - Two-call LLM extraction (pass 1: extraction, pass 2: quality check)
- documentAiProcessor.ts - Google Document AI text extraction
- llmService.ts - LLM API calls with retry logic (3 attempts, exponential backoff)
- jobQueueService.ts - Background job processing (EventEmitter-based)
- fileStorageService.ts - Google Cloud Storage signed URLs and uploads
- vectorDatabaseService.ts - Supabase vector embeddings and search
- pdfGenerationService.ts - Puppeteer-based PDF rendering
- csvExportService.ts - Financial data export
Depends on: Models, Config, Utilities
Used by: Controllers, Job Queue

Model Layer (Data Access):

Purpose: Database interactions, query execution, schema validation
Location: backend/src/models/
Contains: DocumentModel.ts, ProcessingJobModel.ts, UserModel.ts, VectorDatabaseModel.ts
Depends on: Supabase client, configuration
Used by: Services, Controllers

Job Queue Layer:

Purpose: Asynchronous background processing with priority and retry handling
Location: backend/src/services/jobQueueService.ts, backend/src/services/jobProcessorService.ts
Contains: In-memory queue, worker pool (max 3 concurrent), Firebase scheduled function trigger
Depends on: Services (document processor), Models
Used by: Controllers (to enqueue work), Scheduled functions (to trigger processing)
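
A minimal sketch of how an in-memory queue might cap concurrency at three workers. The names BoundedQueue, enqueue, runningCount, and waitingCount are hypothetical and are not taken from jobQueueService.ts; priority and retry handling are left out here.

```typescript
// Hypothetical sketch: an in-memory queue that runs at most `maxConcurrent` jobs at once.
type AsyncJob<T> = () => Promise<T>;

class BoundedQueue {
  private readonly maxConcurrent: number;
  private readonly waiting: Array<() => void> = [];
  private running = 0;

  constructor(maxConcurrent: number = 3) {
    this.maxConcurrent = maxConcurrent;
  }

  // Starts the job immediately if a slot is free, otherwise parks it until one is released.
  enqueue<T>(job: AsyncJob<T>): Promise<T> {
    return new Promise<T>((resolve, reject) => {
      const start = (): void => {
        this.running++;
        job().then(
          (value) => { this.release(); resolve(value); },
          (error) => { this.release(); reject(error); },
        );
      };
      if (this.running < this.maxConcurrent) start();
      else this.waiting.push(start);
    });
  }

  // Frees a slot and starts the next waiting job, if any.
  private release(): void {
    this.running--;
    const next = this.waiting.shift();
    if (next) next();
  }

  get runningCount(): number { return this.running; }
  get waitingCount(): number { return this.waiting.length; }
}
```

Because the queue lives in process memory, any jobs still waiting are lost on restart, which is presumably why the scheduled function also re-scans persisted jobs.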

Frontend Layer (React + TypeScript):

Purpose: User interface for document upload, processing monitoring, and review
Location: frontend/src/
Contains: Components (Upload, List, Viewer, Analytics), Services, Contexts
Depends on: Backend API, Firebase Auth, Axios
Used by: Web browsers

Data Flow

Document Upload & Processing Flow:

1. Upload Initiation (Frontend)
   - User selects PDF file via DocumentUpload component
   - Calls documentService.getUploadUrl() → Backend /documents/upload-url endpoint
   - Backend creates document record (status: 'uploading') and generates signed GCS URL
2. File Upload (Frontend → GCS)
   - Frontend uploads file directly to Google Cloud Storage via signed URL
   - Frontend polls documentService.getDocumentStatus() for upload completion
   - UploadMonitoringDashboard displays real-time progress
3. Processing Trigger (Frontend → Backend)
   - Frontend calls POST /documents/{id}/process once upload complete
   - Controller creates processing job and enqueues to jobQueueService
   - Controller immediately returns job ID
4. Background Job Execution (Job Queue)
   - Scheduled Firebase function (processDocumentJobs) runs every 1 minute
   - Calls jobProcessorService.processJobs() to dequeue and execute
   - For each queued document:
     - Fetch file from GCS
     - Update status to 'extracting_text'
     - Call unifiedDocumentProcessor.processDocument()
5. Document Processing (Single-Pass Strategy)
   - Pass 1 - LLM Extraction:
     - documentAiProcessor.extractText() (if needed) - Google Document AI OCR
     - llmService.processCIMDocument() - Claude/OpenAI structured extraction
     - Produces CIMReview object with financial, market, management data
     - Updates document status to 'processing_llm'
   - Pass 2 - Quality Check:
     - llmService.validateCIMReview() - Verify completeness and accuracy
     - Updates status to 'quality_validation'
   - PDF Generation:
     - pdfGenerationService.generatePDF() - Puppeteer renders HTML template
     - Uploads PDF to GCS
     - Updates status to 'generating_pdf'
   - Vector Indexing (Background):
     - vectorDatabaseService.createDocumentEmbedding() - Generate 3072-dim embeddings
     - Chunk document semantically, store in Supabase with vector index
     - Status moves to 'vector_indexing' then 'completed'
6. Result Delivery (Backend → Frontend)
   - Frontend polls GET /documents/{id} to check completion
   - When status = 'completed', fetches summary and analysis data
   - DocumentViewer displays results, allows regeneration with feedback
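
The result-delivery polling in the last step could look roughly like the sketch below. The helper name pollUntilFinished, the 2-second interval, and the poll limit are assumptions for illustration; fetchStatus stands in for whatever the frontend uses to call GET /documents/{id}.

```typescript
// Hypothetical polling helper; `fetchStatus` stands in for a call such as GET /documents/{id}.
type FinalStatus = "completed" | "failed";

const pause = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

async function pollUntilFinished(
  fetchStatus: () => Promise<string>,
  intervalMs: number = 2000, // assumed interval, not taken from the source
  maxPolls: number = 300,    // assumed upper bound so the loop cannot run forever
): Promise<FinalStatus> {
  for (let poll = 1; poll <= maxPolls; poll++) {
    const current = await fetchStatus();
    // 'completed' and 'failed' are the two terminal statuses named in this document.
    if (current === "completed" || current === "failed") return current;
    if (poll < maxPolls) await pause(intervalMs);
  }
  throw new Error(`Document not finished after ${maxPolls} polls`);
}
```

Bounding the number of polls keeps a stuck job from leaving the UI spinning indefinitely; the caller can surface the thrown error as a timeout.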

State Management:

Backend: Document status progresses through uploading → extracting_text → processing_llm → quality_validation → generating_pdf → vector_indexing → completed; any step can instead end in failed
Frontend: AuthContext manages user/token, component state tracks selected document and loading states
Job Queue: In-memory queue with EventEmitter for state transitions
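
One way to make the backend status progression explicit is a small transition check. This is a sketch under assumptions: the status names come from this document (including 'quality_validation' from the processing flow), but the strictly sequential ordering and the function name canTransition are illustrative, not confirmed by the codebase.

```typescript
// Status names are taken from this document; the strict ordering is an assumption.
const STATUS_ORDER = [
  "uploading",
  "extracting_text",
  "processing_llm",
  "quality_validation",
  "generating_pdf",
  "vector_indexing",
  "completed",
] as const;

type PipelineStatus = (typeof STATUS_ORDER)[number];
type DocumentStatus = PipelineStatus | "failed";

// Hypothetical guard: allow only the next step in the pipeline, or a move to 'failed'.
function canTransition(from: DocumentStatus, to: DocumentStatus): boolean {
  if (from === "completed" || from === "failed") return false; // terminal states
  if (to === "failed") return true; // any in-flight step may fail
  return STATUS_ORDER.indexOf(to) === STATUS_ORDER.indexOf(from) + 1;
}
```

If some steps can legitimately be skipped (for example, OCR when text is already available), the check would need to allow forward jumps instead of only the immediate successor.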

Key Abstractions

Unified Processor:

Purpose: Strategy pattern for document processing (single-pass vs. agentic RAG vs. simple)
Examples: singlePassProcessor, simpleDocumentProcessor, optimizedAgenticRAGProcessor
Pattern: Pluggable strategies via ProcessingStrategy selection in config
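
A minimal sketch of strategy selection as described above. The processor names come from this document; the config values ('single_pass', 'simple', 'agentic_rag'), the default, and the DocumentProcessor interface shape are assumptions, and the processors are stubs.

```typescript
type ProcessingStrategy = "single_pass" | "simple" | "agentic_rag"; // assumed config values

interface ProcessingOutcome { processedBy: string; preview: string; }

interface DocumentProcessor {
  readonly label: string;
  processDocument(text: string): Promise<ProcessingOutcome>;
}

// Stub processors: each only reports which strategy handled the text.
function makeStub(label: string): DocumentProcessor {
  return {
    label,
    processDocument: async (text: string) => ({ processedBy: label, preview: text.slice(0, 40) }),
  };
}

const registry: Record<ProcessingStrategy, DocumentProcessor> = {
  single_pass: makeStub("singlePassProcessor"),
  simple: makeStub("simpleDocumentProcessor"),
  agentic_rag: makeStub("optimizedAgenticRAGProcessor"),
};

// Falls back to single-pass when the configured value is missing or unknown (assumed default).
function parseStrategy(raw: string | undefined): ProcessingStrategy {
  if (raw === "single_pass" || raw === "simple" || raw === "agentic_rag") return raw;
  return "single_pass";
}

function selectProcessor(raw: string | undefined): DocumentProcessor {
  return registry[parseStrategy(raw)];
}
```

Keeping the registry keyed by the strategy type means the compiler flags a missing processor whenever a new strategy value is added.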

LLM Service:

Purpose: Unified interface for multiple LLM providers with retry logic
Examples: backend/src/services/llmService.ts (Anthropic, OpenAI, OpenRouter)
Pattern: Provider-agnostic API with processCIMDocument() returning structured CIMReview
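
The provider-agnostic idea can be sketched as a small interface plus a parsing step. Everything below is illustrative: the LLMProvider interface, the three-field CIMReviewSketch type, and the extractReview function are assumptions standing in for the real processCIMDocument() and CIMReview, whose actual shapes are not shown in this document.

```typescript
// Hypothetical provider interface: each provider only needs to turn a prompt into text.
interface LLMProvider {
  readonly name: "anthropic" | "openai" | "openrouter";
  complete(prompt: string): Promise<string>;
}

// Simplified stand-in for CIMReview; the real object is richer.
interface CIMReviewSketch { financial: string; market: string; management: string; }

// Validates that the raw model output is JSON with the three expected string fields.
function parseReview(raw: string): CIMReviewSketch {
  const data: unknown = JSON.parse(raw);
  if (typeof data !== "object" || data === null) throw new Error("LLM output is not an object");
  const record = data as Record<string, unknown>;
  const read = (key: string): string => {
    const value = record[key];
    if (typeof value !== "string") throw new Error(`Missing field: ${key}`);
    return value;
  };
  return { financial: read("financial"), market: read("market"), management: read("management") };
}

// The caller never needs to know which provider produced the answer.
async function extractReview(provider: LLMProvider, text: string): Promise<CIMReviewSketch> {
  const raw = await provider.complete(`Extract a CIM review as JSON:\n${text}`);
  return parseReview(raw);
}
```

In the real service the hand-written checks in parseReview would presumably be replaced by the Zod schemas mentioned elsewhere in this document.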

Vector Database Abstraction:

Purpose: PostgreSQL pgvector operations via Supabase for semantic search
Examples: backend/src/services/vectorDatabaseService.ts
Pattern: Embedding + chunking → vector search via cosine similarity
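
For reference, cosine similarity between a query embedding and stored chunk embeddings can be computed as below. This in-memory version only illustrates the metric; in the described system the comparison would run inside PostgreSQL via pgvector. The EmbeddedChunk type and topMatches function are hypothetical.

```typescript
interface EmbeddedChunk { id: string; embedding: number[]; }

// cos(a, b) = (a · b) / (|a| |b|); returns 0 for zero vectors to avoid dividing by zero.
function cosineSimilarity(a: readonly number[], b: readonly number[]): number {
  if (a.length !== b.length) throw new Error("Embedding dimensions differ");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Scores every chunk against the query and keeps the `limit` most similar ones.
function topMatches(
  query: readonly number[],
  chunks: readonly EmbeddedChunk[],
  limit: number = 3,
): Array<{ id: string; score: number }> {
  return chunks
    .map((chunk) => ({ id: chunk.id, score: cosineSimilarity(query, chunk.embedding) }))
    .sort((left, right) => right.score - left.score)
    .slice(0, limit);
}
```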

File Storage Abstraction:

Purpose: Google Cloud Storage operations with signed URLs
Examples: backend/src/services/fileStorageService.ts
Pattern: Signed upload/download URLs grant temporary access, so clients need no IAM credentials of their own

Job Queue Pattern:

Purpose: Async processing with retry and priority handling
Examples: backend/src/services/jobQueueService.ts (EventEmitter-based)
Pattern: Priority queue with exponential backoff retry
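
A sketch of the two pieces named in this pattern: picking the next job by priority and computing a capped backoff delay. The 5s base, 300s cap, and three-retry limit are the figures given under Error Handling; the doubling factor, the tie-break by enqueue time, and all type and function names are assumptions.

```typescript
interface QueuedJob { id: string; priority: number; enqueuedAt: number; attempts: number; }

// Removes and returns the next job: higher priority first, then earliest enqueue time.
// Note that this sorts the array in place.
function nextJob(queue: QueuedJob[]): QueuedJob | undefined {
  queue.sort((a, b) => b.priority - a.priority || a.enqueuedAt - b.enqueuedAt);
  return queue.shift();
}

// Delay before retry number `attempt` (1-based): doubles from baseMs and never exceeds maxMs.
function retryDelayMs(attempt: number, baseMs: number = 5000, maxMs: number = 300000): number {
  return Math.min(maxMs, baseMs * 2 ** (attempt - 1));
}

// A job is retried while it has used fewer than `maxRetries` attempts.
function shouldRetry(job: QueuedJob, maxRetries: number = 3): boolean {
  return job.attempts < maxRetries;
}
```

With these defaults the delays run 5s, 10s, 20s, and so on, reaching the 300s cap from the seventh retry onward; with a limit of three retries the cap only matters if the limit is raised.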

Entry Points

API Entry Point:

Location: backend/src/index.ts
Triggers: Process startup or Firebase Functions invocation
Responsibilities:
- Initialize Express app
- Set up middleware (CORS, helmet, rate limiting, authentication)
- Register routes (/documents, /vector, /monitoring, /api/audit)
- Start job queue service
- Export Firebase Functions v2 handlers (api, processDocumentJobs)

Scheduled Job Processing:

Location: backend/src/index.ts (line 252: processDocumentJobs function export)
Triggers: Cloud Scheduler, via a Firebase scheduled function, every 1 minute
Responsibilities:
- Health check database connection
- Detect stuck jobs (processing > 15 min, pending > 2 min)
- Call jobProcessorService.processJobs()
- Log metrics and errors
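
The stuck-job check could be expressed as a pure function over job records, as in this sketch. The thresholds (15 minutes for processing, 2 minutes for pending) come from the list above; the JobSnapshot shape and the choice to measure from a last-updated timestamp are assumptions.

```typescript
type JobState = "pending" | "processing" | "completed" | "failed";

interface JobSnapshot { id: string; state: JobState; lastUpdatedMs: number; }

const MINUTE_MS = 60 * 1000;

// Only in-flight states have a threshold; finished jobs are never considered stuck.
const STUCK_AFTER_MS: Partial<Record<JobState, number>> = {
  processing: 15 * MINUTE_MS,
  pending: 2 * MINUTE_MS,
};

function findStuckJobs(jobs: readonly JobSnapshot[], nowMs: number): JobSnapshot[] {
  return jobs.filter((job) => {
    const limit = STUCK_AFTER_MS[job.state];
    return limit !== undefined && nowMs - job.lastUpdatedMs > limit;
  });
}
```

Passing the current time in as a parameter keeps the function deterministic, which makes the thresholds easy to unit test.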

Frontend Entry Point:

Location: frontend/src/main.tsx
Triggers: Browser navigation
Responsibilities:
- Initialize React app with AuthProvider
- Set up Firebase client
- Render routing structure (Login → Dashboard)

Document Processing Controller:

Location: backend/src/controllers/documentController.ts
Route: POST /documents/{id}/process
Responsibilities:
- Validate user authentication
- Enqueue processing job
- Return job ID to client

Error Handling

Strategy: Multi-layer error recovery with structured logging and graceful degradation

Patterns:

Retry Logic: DocumentModel uses exponential backoff (1s → 2s → 4s) for network errors
LLM Retry: llmService retries API calls 3 times with exponential backoff
Firebase Auth Recovery: firebaseAuth.ts attempts session recovery on token verify failure
Job Queue Retry: Jobs retry up to 3 times with configurable backoff (5s → 300s max)
Structured Error Logging: All errors include correlation ID, stack trace, and context metadata
Circuit Breaker Pattern: Database health check in processDocumentJobs prevents cascading failures
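
The retry patterns above share one shape, sketched here as a generic wrapper. The name withRetry and the injectable sleep parameter are hypothetical. With three attempts and a 1s base, the waits are 1s and then 2s; a fourth attempt would wait 4s. Whether the real code counts attempts this way is an assumption.

```typescript
type Sleep = (ms: number) => Promise<void>;

const realSleep: Sleep = (ms) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Runs `operation` up to `maxAttempts` times, doubling the wait after each failure.
async function withRetry<T>(
  operation: () => Promise<T>,
  maxAttempts: number = 3,
  baseDelayMs: number = 1000,
  sleep: Sleep = realSleep, // injectable so tests do not have to wait in real time
): Promise<T> {
  let lastError: unknown = new Error("withRetry called with maxAttempts < 1");
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await operation();
    } catch (error) {
      lastError = error;
      // No wait after the final attempt; the error is rethrown below.
      if (attempt < maxAttempts) await sleep(baseDelayMs * 2 ** (attempt - 1));
    }
  }
  throw lastError;
}
```

A production version would usually retry only on errors known to be transient (network failures, rate limits) and fail fast on everything else.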

Error Boundaries:

Global error handler at end of Express middleware chain (errorHandler)
Try/catch in all async functions with context-aware logging
Unhandled rejection listener at process level (line 24 of index.ts)

Cross-Cutting Concerns

Logging:

Framework: Winston (json + console in dev)
Approach: Structured logger with correlation IDs, Winston transports for error/upload logs
Location: backend/src/utils/logger.ts
Pattern: logger.info(), logger.error(), StructuredLogger for operations

Validation:

Approach: Joi schema in environment config, Zod for API request/response types
Location: backend/src/config/env.ts, backend/src/services/llmSchemas.ts
Pattern: Joi for config, Zod for runtime validation

Authentication:

Approach: Firebase ID tokens verified via verifyFirebaseToken middleware
Location: backend/src/middleware/firebaseAuth.ts
Pattern: Bearer token in Authorization header; the decoded user is attached to req.user
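
A framework-free sketch of the Bearer-token step. The function names and the DecodedUser shape are hypothetical, and the verifier is injected; in the real middleware it would presumably be the Firebase Admin SDK's verifyIdToken().

```typescript
interface DecodedUser { uid: string; email?: string; }

// Stand-in for a real verifier such as the Firebase Admin SDK's verifyIdToken().
type TokenVerifier = (idToken: string) => Promise<DecodedUser>;

// Returns the token from an "Authorization: Bearer <token>" header value, or null.
function extractBearerToken(header: string | undefined): string | null {
  if (!header) return null;
  const match = /^Bearer\s+(\S+)$/i.exec(header.trim());
  return match ? match[1] : null;
}

// Resolves to the decoded user, or null when the header is missing or the token is rejected.
async function authenticate(
  header: string | undefined,
  verify: TokenVerifier,
): Promise<DecodedUser | null> {
  const token = extractBearerToken(header);
  if (!token) return null;
  try {
    return await verify(token);
  } catch {
    return null; // a middleware would respond with 401 here
  }
}
```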

Correlation Tracking:

Approach: UUID correlation ID added to all requests, propagated through job processing
Location: backend/src/middleware/validation.ts (addCorrelationId)
Pattern: X-Correlation-ID header or generated UUID, included in all logs
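
The header-or-generate rule can be sketched as a pure function. The names resolveCorrelationId and HeaderMap are hypothetical, not the actual addCorrelationId implementation; the ID generator is passed in, and a UUID generator would be the natural choice in the real middleware.

```typescript
// Node lowercases incoming header names, so the lookup key is lowercase.
type HeaderMap = Record<string, string | string[] | undefined>;

// Reuses the caller's X-Correlation-ID when present, otherwise generates a new one.
function resolveCorrelationId(headers: HeaderMap, generateId: () => string): string {
  const incoming = headers["x-correlation-id"];
  const value = Array.isArray(incoming) ? incoming[0] : incoming;
  const trimmed = value ? value.trim() : "";
  return trimmed !== "" ? trimmed : generateId();
}

// Builds a structured log entry that always carries the correlation ID.
function logEntry(correlationId: string, message: string): string {
  return JSON.stringify({ correlationId, message });
}
```

Reusing an incoming ID lets a single upload be traced from the browser request through the queued job, as long as the job record stores the same ID.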

Architecture analysis: 2026-02-24
