Architecture
Analysis Date: 2026-02-24
Pattern Overview
Overall: Full-stack distributed system combining an Express.js backend with a React frontend. It implements a multi-stage document processing pipeline with queued background jobs and real-time monitoring.
Key Characteristics:
- Server-rendered PDF generation with single-pass LLM processing
- Asynchronous job queue for background document processing (max 3 concurrent)
- Firebase authentication with Supabase PostgreSQL + pgvector for embeddings
- Multi-provider LLM support (Anthropic, OpenAI, OpenRouter)
- Structured schema extraction using Zod and LLM-driven analysis
- Google Document AI for OCR and text extraction
- Real-time upload progress tracking via SSE/polling
- Correlation ID tracking throughout distributed pipeline
Layers
API Layer (Express + TypeScript):
- Purpose: HTTP request routing, authentication, and response handling
- Location: backend/src/index.ts, backend/src/routes/, backend/src/controllers/
- Contains: Route definitions, request validation, error handling
- Depends on: Middleware (auth, validation), Services
- Used by: Frontend and external clients
Authentication Layer:
- Purpose: Firebase ID token verification and user identity validation
- Location: backend/src/middleware/firebaseAuth.ts, backend/src/config/firebase.ts
- Contains: Token verification, service account initialization, session recovery
- Depends on: Firebase Admin SDK, configuration
- Used by: All protected routes via verifyFirebaseToken middleware
Controller Layer:
- Purpose: Request handling, input validation, service orchestration
- Location: backend/src/controllers/documentController.ts, backend/src/controllers/authController.ts
- Contains: getUploadUrl(), processDocument(), getDocumentStatus() handlers
- Depends on: Models, Services, Middleware
- Used by: Routes
Service Layer:
- Purpose: Business logic, external API integration, document processing orchestration
- Location: backend/src/services/
- Contains:
  - unifiedDocumentProcessor.ts - Main orchestrator, strategy selection
  - singlePassProcessor.ts - 2-LLM-call extraction (pass 1 + quality check)
  - documentAiProcessor.ts - Google Document AI text extraction
  - llmService.ts - LLM API calls with retry logic (3 attempts, exponential backoff)
  - jobQueueService.ts - Background job processing (EventEmitter-based)
  - fileStorageService.ts - Google Cloud Storage signed URLs and uploads
  - vectorDatabaseService.ts - Supabase vector embeddings and search
  - pdfGenerationService.ts - Puppeteer-based PDF rendering
  - csvExportService.ts - Financial data export
- Depends on: Models, Config, Utilities
- Used by: Controllers, Job Queue
Model Layer (Data Access):
- Purpose: Database interactions, query execution, schema validation
- Location: backend/src/models/
- Contains: DocumentModel.ts, ProcessingJobModel.ts, UserModel.ts, VectorDatabaseModel.ts
- Depends on: Supabase client, configuration
- Used by: Services, Controllers
Job Queue Layer:
- Purpose: Asynchronous background processing with priority and retry handling
- Location: backend/src/services/jobQueueService.ts, backend/src/services/jobProcessorService.ts
- Contains: In-memory queue, worker pool (max 3 concurrent), Firebase scheduled function trigger
- Depends on: Services (document processor), Models
- Used by: Controllers (to enqueue work), Scheduled functions (to trigger processing)
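The "max 3 concurrent" cap on the worker pool can be illustrated with a small in-memory queue. This is a minimal sketch under assumptions: BoundedQueue and enqueue are hypothetical names, and the real jobQueueService may enforce its limit differently.

```typescript
// Hypothetical sketch of a bounded worker pool; not the actual jobQueueService code.
type Job<T> = () => Promise<T>;

class BoundedQueue {
  private running = 0;
  private readonly limit: number;
  private readonly waiting: Array<() => void> = [];

  constructor(limit = 3) {
    this.limit = limit;
  }

  // Runs the job as soon as a slot is free and resolves with its result.
  async enqueue<T>(job: Job<T>): Promise<T> {
    if (this.running < this.limit) {
      this.running++;
    } else {
      // Wait until a finishing job hands its slot over.
      await new Promise<void>((resolve) => this.waiting.push(resolve));
    }
    try {
      return await job();
    } finally {
      const next = this.waiting.shift();
      if (next) {
        next(); // pass the slot on without changing the running count
      } else {
        this.running--;
      }
    }
  }
}
```

Handing the slot directly to the next waiter, rather than decrementing and re-incrementing the counter, keeps a newly enqueued job from slipping in ahead of the waiter and pushing the pool past its limit.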
Frontend Layer (React + TypeScript):
- Purpose: User interface for document upload, processing monitoring, and review
- Location: frontend/src/
- Contains: Components (Upload, List, Viewer, Analytics), Services, Contexts
- Depends on: Backend API, Firebase Auth, Axios
- Used by: Web browsers
Data Flow
Document Upload & Processing Flow:
- Upload Initiation (Frontend)
  - User selects PDF file via DocumentUpload component
  - Calls documentService.getUploadUrl() → Backend /documents/upload-url endpoint
  - Backend creates document record (status: 'uploading') and generates signed GCS URL
- File Upload (Frontend → GCS)
  - Frontend uploads file directly to Google Cloud Storage via signed URL
  - Frontend polls documentService.getDocumentStatus() for upload completion
  - UploadMonitoringDashboard displays real-time progress
- Processing Trigger (Frontend → Backend)
  - Frontend calls POST /documents/{id}/process once upload complete
  - Controller creates processing job and enqueues to jobQueueService
  - Controller immediately returns job ID
- Background Job Execution (Job Queue)
  - Scheduled Firebase function (processDocumentJobs) runs every 1 minute
  - Calls jobProcessorService.processJobs() to dequeue and execute
  - For each queued document:
    - Fetch file from GCS
    - Update status to 'extracting_text'
    - Call unifiedDocumentProcessor.processDocument()
- Document Processing (Single-Pass Strategy)
  - Pass 1 - LLM Extraction:
    - documentAiProcessor.extractText() (if needed) - Google Document AI OCR
    - llmService.processCIMDocument() - Claude/OpenAI structured extraction
    - Produces CIMReview object with financial, market, management data
    - Updates document status to 'processing_llm'
  - Pass 2 - Quality Check:
    - llmService.validateCIMReview() - Verify completeness and accuracy
    - Updates status to 'quality_validation'
  - PDF Generation:
    - pdfGenerationService.generatePDF() - Puppeteer renders HTML template
    - Uploads PDF to GCS
    - Updates status to 'generating_pdf'
  - Vector Indexing (Background):
    - vectorDatabaseService.createDocumentEmbedding() - Generate 3072-dim embeddings
    - Chunk document semantically, store in Supabase with vector index
    - Status moves to 'vector_indexing' then 'completed'
- Result Delivery (Backend → Frontend)
  - Frontend polls GET /documents/{id} to check completion
  - When status = 'completed', fetches summary and analysis data
  - DocumentViewer displays results, allows regeneration with feedback
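The Result Delivery step relies on the frontend polling until the document reaches a terminal status. A minimal sketch of such a loop follows; pollUntilDone, the 2-second interval, and the poll limit are assumptions for illustration, and fetchStatus stands in for whatever call wraps GET /documents/{id}.

```typescript
type TerminalStatus = "completed" | "failed";

// Hypothetical helper: polls a status source until the document finishes or the budget runs out.
async function pollUntilDone(
  fetchStatus: () => Promise<string>, // assumed to wrap GET /documents/{id}
  intervalMs = 2000, // assumed polling interval
  maxPolls = 150, // assumed upper bound on polls
): Promise<TerminalStatus> {
  for (let i = 0; i < maxPolls; i++) {
    const status = await fetchStatus();
    if (status === "completed" || status === "failed") return status;
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error("Timed out waiting for document processing");
}
```

Passing the status source in as a function keeps the loop independent of the HTTP client, so it can be exercised without a running backend.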
State Management:
- Backend: Document status progresses through uploading → extracting_text → processing_llm → generating_pdf → vector_indexing → completed, or failed at any step
- Frontend: AuthContext manages user/token, component state tracks selected document and loading states
- Job Queue: In-memory queue with EventEmitter for state transitions
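The backend status progression can be written down as a small state machine. The sketch below follows the sequence listed under State Management only; canTransition is a hypothetical helper, and whether the real code enforces transitions this strictly is an assumption.

```typescript
// Statuses in the listed order; 'failed' can be reached from any non-terminal step.
const STATUS_ORDER = [
  "uploading",
  "extracting_text",
  "processing_llm",
  "generating_pdf",
  "vector_indexing",
  "completed",
] as const;

type DocumentStatus = (typeof STATUS_ORDER)[number] | "failed";

// Hypothetical helper: true if moving from `from` to `to` follows the listed progression.
function canTransition(from: DocumentStatus, to: DocumentStatus): boolean {
  if (from === "completed" || from === "failed") return false; // terminal states
  if (to === "failed") return true;
  const index = STATUS_ORDER.indexOf(from);
  return STATUS_ORDER[index + 1] === to;
}
```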
Key Abstractions
Unified Processor:
- Purpose: Strategy pattern for document processing (single-pass vs. agentic RAG vs. simple)
- Examples: singlePassProcessor, simpleDocumentProcessor, optimizedAgenticRAGProcessor
- Pattern: Pluggable strategies via ProcessingStrategy selection in config
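A strategy registry of this kind might look like the following sketch. The interface shapes, the strategy keys, and selectProcessor are all hypothetical; only the idea of choosing a processor from a configured strategy comes from the description above.

```typescript
// Hypothetical shapes; the real processor interfaces are not shown in this document.
interface ProcessingResult {
  strategy: string;
  summary: string;
}

interface DocumentProcessor {
  process(text: string): Promise<ProcessingResult>;
}

type ProcessingStrategy = "single_pass" | "simple" | "agentic_rag"; // assumed keys

// Stub processors standing in for singlePassProcessor and the other strategies.
const makeStub = (strategy: ProcessingStrategy): DocumentProcessor => ({
  process: async (text) => ({ strategy, summary: text.slice(0, 40) }),
});

const processors: Record<ProcessingStrategy, DocumentProcessor> = {
  single_pass: makeStub("single_pass"),
  simple: makeStub("simple"),
  agentic_rag: makeStub("agentic_rag"),
};

// Resolves the configured strategy name to a processor, rejecting unknown names.
function selectProcessor(strategy: string): DocumentProcessor {
  if (!Object.prototype.hasOwnProperty.call(processors, strategy)) {
    throw new Error(`Unknown processing strategy: ${strategy}`);
  }
  return processors[strategy as ProcessingStrategy];
}
```

Callers depend only on the DocumentProcessor interface, so adding a strategy means adding one registry entry rather than touching the orchestrator.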
LLM Service:
- Purpose: Unified interface for multiple LLM providers with retry logic
- Examples: backend/src/services/llmService.ts (Anthropic, OpenAI, OpenRouter)
- Pattern: Provider-agnostic API with processCIMDocument() returning structured CIMReview
Vector Database Abstraction:
- Purpose: PostgreSQL pgvector operations via Supabase for semantic search
- Examples: backend/src/services/vectorDatabaseService.ts
- Pattern: Embedding + chunking → vector search via cosine similarity
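In production the similarity search runs inside PostgreSQL through pgvector. For intuition, the same ranking can be sketched in memory; cosineSimilarity, topK, and the Chunk shape below are hypothetical and are not part of vectorDatabaseService.

```typescript
// Hypothetical chunk shape for illustration.
interface Chunk {
  id: string;
  embedding: number[];
}

// Cosine similarity: dot product divided by the product of the vector lengths.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("Vectors must have the same dimension");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // treat zero vectors as dissimilar
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Returns the k chunks most similar to the query embedding, best first.
function topK(query: number[], chunks: Chunk[], k: number): Chunk[] {
  return chunks
    .map((chunk) => ({ chunk, score: cosineSimilarity(query, chunk.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map((entry) => entry.chunk);
}
```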
File Storage Abstraction:
- Purpose: Google Cloud Storage operations with signed URLs
- Examples: backend/src/services/fileStorageService.ts
- Pattern: Signed upload/download URLs for temporary access without IAM burden
Job Queue Pattern:
- Purpose: Async processing with retry and priority handling
- Examples: backend/src/services/jobQueueService.ts (EventEmitter-based)
- Pattern: Priority queue with exponential backoff retry
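Priority selection and capped backoff can be sketched as below. The QueuedJob fields, the doubling factor, and the "higher priority runs first" convention are assumptions; the 5s base delay, 300s cap, and 3-retry limit are the figures quoted in this document's error handling notes.

```typescript
// Hypothetical job shape; the real queue's fields are not shown in this document.
interface QueuedJob {
  id: string;
  priority: number; // higher runs first (assumed convention)
  retries: number; // retries already scheduled
  readyAt: number; // epoch ms before which the job must not run
}

// Delay before retry number `retry` (1-based): 5s, 10s, 20s, ... capped at 300s.
function retryDelayMs(retry: number, baseMs = 5000, maxMs = 300000): number {
  return Math.min(baseMs * 2 ** (retry - 1), maxMs);
}

// Returns the rescheduled job, or null once the retry budget is spent.
function scheduleRetry(job: QueuedJob, now: number, maxRetries = 3): QueuedJob | null {
  if (job.retries >= maxRetries) return null;
  const retries = job.retries + 1;
  return { ...job, retries, readyAt: now + retryDelayMs(retries) };
}

// Picks the highest-priority ready job; ties go to the job that became ready first.
function nextJob(queue: QueuedJob[], now: number): QueuedJob | undefined {
  return queue
    .filter((job) => job.readyAt <= now)
    .sort((a, b) => b.priority - a.priority || a.readyAt - b.readyAt)[0];
}
```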
Entry Points
API Entry Point:
- Location: backend/src/index.ts
- Triggers: Process startup or Firebase Functions invocation
- Responsibilities:
  - Initialize Express app
  - Set up middleware (CORS, helmet, rate limiting, authentication)
  - Register routes (/documents, /vector, /monitoring, /api/audit)
  - Start job queue service
  - Export Firebase Functions v2 handlers (api, processDocumentJobs)
Scheduled Job Processing:
- Location: backend/src/index.ts (line 252: processDocumentJobs function export)
- Triggers: Firebase Cloud Scheduler every 1 minute
- Responsibilities:
  - Health check database connection
  - Detect stuck jobs (processing > 15 min, pending > 2 min)
  - Call jobProcessorService.processJobs()
  - Log metrics and errors
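The stuck-job thresholds translate into a simple age check per status. In this sketch JobRecord, its updatedAt field, and findStuckJobs are hypothetical; the real check may read different columns or run as a database query.

```typescript
// Hypothetical record shape; field names are assumptions.
interface JobRecord {
  id: string;
  status: "pending" | "processing" | "completed" | "failed";
  updatedAt: Date;
}

// Thresholds from the responsibilities list: processing > 15 min, pending > 2 min.
const STUCK_THRESHOLD_MS = {
  processing: 15 * 60 * 1000,
  pending: 2 * 60 * 1000,
};

// Returns jobs that have sat in 'pending' or 'processing' longer than their threshold.
function findStuckJobs(jobs: JobRecord[], now: Date = new Date()): JobRecord[] {
  return jobs.filter((job) => {
    if (job.status !== "pending" && job.status !== "processing") return false;
    const ageMs = now.getTime() - job.updatedAt.getTime();
    return ageMs > STUCK_THRESHOLD_MS[job.status];
  });
}
```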
Frontend Entry Point:
- Location: frontend/src/main.tsx
- Triggers: Browser navigation
- Responsibilities:
  - Initialize React app with AuthProvider
  - Set up Firebase client
  - Render routing structure (Login → Dashboard)
Document Processing Controller:
- Location: backend/src/controllers/documentController.ts
- Route: POST /documents/{id}/process
- Responsibilities:
  - Validate user authentication
  - Enqueue processing job
  - Return job ID to client
Error Handling
Strategy: Multi-layer error recovery with structured logging and graceful degradation
Patterns:
- Retry Logic: DocumentModel uses exponential backoff (1s → 2s → 4s) for network errors
- LLM Retry: llmService retries API calls 3 times with exponential backoff
- Firebase Auth Recovery: firebaseAuth.ts attempts session recovery on token verify failure
- Job Queue Retry: Jobs retry up to 3 times with configurable backoff (5s → 300s max)
- Structured Error Logging: All errors include correlation ID, stack trace, and context metadata
- Circuit Breaker Pattern: Database health check in processDocumentJobs prevents cascading failures
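The 1s → 2s → 4s pattern corresponds to a retry wrapper that doubles its delay after each failure. The sketch below is an illustration, not the DocumentModel or llmService code: withRetry is a hypothetical name, and the document does not say whether "3 times" counts the first attempt, so here it means three retries after the initial call.

```typescript
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Calls fn, retrying on failure with delays of baseDelayMs * 2^attempt.
async function withRetry<T>(
  fn: () => Promise<T>,
  maxRetries = 3, // assumed: retries after the first attempt
  baseDelayMs = 1000,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= maxRetries) throw err; // retry budget spent, surface the last error
      await sleep(baseDelayMs * 2 ** attempt); // 1s, 2s, 4s with the defaults
    }
  }
}
```

A production version would usually retry only on errors known to be transient (network failures, rate limits) and rethrow everything else immediately.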
Error Boundaries:
- Global error handler at end of Express middleware chain (errorHandler)
- Try/catch in all async functions with context-aware logging
- Unhandled rejection listener at process level (line 24 of index.ts)
Cross-Cutting Concerns
Logging:
- Framework: Winston (json + console in dev)
- Approach: Structured logger with correlation IDs, Winston transports for error/upload logs
- Location: backend/src/utils/logger.ts
- Pattern: logger.info(), logger.error(), StructuredLogger for operations
Validation:
- Approach: Joi schema in environment config, Zod for API request/response types
- Location: backend/src/config/env.ts, backend/src/services/llmSchemas.ts
- Pattern: Joi for config, Zod for runtime validation
Authentication:
- Approach: Firebase ID tokens verified via verifyFirebaseToken middleware
- Location: backend/src/middleware/firebaseAuth.ts
- Pattern: Bearer token in Authorization header, cached in req.user
Correlation Tracking:
- Approach: UUID correlation ID added to all requests, propagated through job processing
- Location: backend/src/middleware/validation.ts (addCorrelationId)
- Pattern: X-Correlation-ID header or generated UUID, included in all logs
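The "header or generated UUID" pattern can be sketched without Express as follows. The body of addCorrelationId here is an assumption (only its name appears in this document), RequestLike and ResponseLike are minimal stand-ins for the Express types, and echoing the ID on the response is an illustrative choice.

```typescript
import { randomUUID } from "node:crypto";

// Minimal stand-ins for the Express request/response types (assumed shapes).
interface RequestLike {
  headers: Record<string, string | undefined>;
  correlationId?: string;
}

interface ResponseLike {
  setHeader(name: string, value: string): void;
}

// Reuses the caller's X-Correlation-ID if present, otherwise generates a UUID.
function addCorrelationId(req: RequestLike, res: ResponseLike, next: () => void): void {
  // Node lowercases incoming header names.
  const incoming = req.headers["x-correlation-id"]?.trim();
  const id = incoming ? incoming : randomUUID();
  req.correlationId = id;
  res.setHeader("X-Correlation-ID", id); // assumed: echo the ID back to the client
  next();
}
```

Downstream code, including enqueued jobs, can then carry req.correlationId along so that log lines from the request and from background processing share one identifier.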
Architecture analysis: 2026-02-24