cim_summary/APP_DESIGN_DOCUMENTATION.md
2025-08-01 15:46:43 -04:00

CIM Document Processor - Application Design Documentation

Overview

The CIM Document Processor is a web application that processes Confidential Information Memorandums (CIMs) using AI to extract key business information and generate structured analysis reports. The system uses Google Document AI for text extraction and an optimized Agentic RAG (Retrieval-Augmented Generation) approach for intelligent document analysis.

Architecture Overview

┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   Frontend      │    │   Backend       │    │   External      │
│   (React)       │◄──►│   (Node.js)     │◄──►│   Services      │
└─────────────────┘    └─────────────────┘    └─────────────────┘
                              │                        │
                              ▼                        ▼
                       ┌─────────────────┐    ┌─────────────────┐
                       │   Database      │    │   Google Cloud  │
                       │   (Supabase)    │    │   Services      │
                       └─────────────────┘    └─────────────────┘

Core Components

1. Frontend (React + TypeScript)

Location: frontend/src/

Key Components:

  • App.tsx: Main application with tabbed interface
  • DocumentUpload: File upload with Firebase Storage integration
  • DocumentList: Display and manage uploaded documents
  • DocumentViewer: View processed documents and analysis
  • Analytics: Dashboard for processing statistics
  • UploadMonitoringDashboard: Real-time upload monitoring

Authentication: Firebase Authentication with protected routes
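
On the wire, authentication amounts to attaching the user's Firebase ID token to every API request. The sketch below is illustrative, not the actual frontend code: `getToken` and `fetchImpl` are injected stand-ins for Firebase's `getIdToken()` and the browser `fetch`.

```typescript
// Hypothetical sketch of an authenticated API client; the real frontend
// presumably obtains the token via Firebase's getIdToken().
type TokenGetter = () => Promise<string | null>;
type FetchLike = (
  url: string,
  init: { headers: Record<string, string> }
) => Promise<unknown>;

function makeAuthedFetch(getToken: TokenGetter, fetchImpl: FetchLike) {
  return async (url: string): Promise<unknown> => {
    const token = await getToken();
    if (!token) throw new Error("Not authenticated"); // protected route: no token, no call
    return fetchImpl(url, { headers: { Authorization: `Bearer ${token}` } });
  };
}
```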

2. Backend (Node.js + Express + TypeScript)

Location: backend/src/

Key Services:

  • unifiedDocumentProcessor: Main orchestrator for document processing
  • optimizedAgenticRAGProcessor: Core AI processing engine
  • llmService: LLM interaction service (Claude AI/OpenAI)
  • pdfGenerationService: PDF report generation using Puppeteer
  • fileStorageService: Google Cloud Storage operations
  • uploadMonitoringService: Real-time upload tracking
  • agenticRAGDatabaseService: Analytics and session management
  • sessionService: User session management
  • jobQueueService: Background job processing
  • uploadProgressService: Upload progress tracking

Data Flow

1. Document Upload Process

User Uploads PDF
       │
       ▼
┌─────────────────┐
│ 1. Get Upload   │ ──► Generate signed URL from Google Cloud Storage
│    URL          │
└─────────┬───────┘
          │
          ▼
┌─────────────────┐
│ 2. Upload to    │ ──► Direct upload to GCS bucket
│    GCS          │
└─────────┬───────┘
          │
          ▼
┌─────────────────┐
│ 3. Confirm      │ ──► Update database, create processing job
│    Upload       │
└─────────────────┘
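
The three steps above can be sketched as a single client-side function. The endpoint names mirror the API section later in this document; the HTTP helpers are injected stand-ins, so this is a sketch of the flow rather than the shipped code.

```typescript
// Sketch of the three-step upload flow; `api` and `put` are injected
// stand-ins for the real HTTP client and the raw PUT to GCS.
interface UploadApi {
  getUploadUrl(fileName: string): Promise<{ documentId: string; signedUrl: string }>;
  confirmUpload(documentId: string): Promise<void>;
}
type PutFn = (signedUrl: string, bytes: Uint8Array) => Promise<void>;

async function uploadDocument(
  api: UploadApi,
  put: PutFn,
  fileName: string,
  bytes: Uint8Array
): Promise<string> {
  const { documentId, signedUrl } = await api.getUploadUrl(fileName); // 1. get signed URL
  await put(signedUrl, bytes);                                        // 2. direct upload to GCS
  await api.confirmUpload(documentId);                                // 3. confirm, start processing
  return documentId;
}
```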

2. Document Processing Pipeline

Document Uploaded
       │
       ▼
┌─────────────────┐
│ 1. Text         │ ──► Google Document AI extracts text from PDF
│ Extraction      │    (documentAiProcessor or direct Document AI)
└─────────┬───────┘
          │
          ▼
┌─────────────────┐
│ 2. Intelligent  │ ──► Split text into semantic chunks (4000 chars)
│ Chunking        │    with 200 char overlap
└─────────┬───────┘
          │
          ▼
┌─────────────────┐
│ 3. Vector       │ ──► Generate embeddings for each chunk
│ Embedding       │    (rate-limited to 5 concurrent calls)
└─────────┬───────┘
          │
          ▼
┌─────────────────┐
│ 4. LLM Analysis │ ──► llmService → Claude AI analyzes chunks
│                 │    and generates structured CIM review data
└─────────┬───────┘
          │
          ▼
┌─────────────────┐
│ 5. PDF          │ ──► pdfGenerationService generates summary PDF
│ Generation      │    using Puppeteer
└─────────┬───────┘
          │
          ▼
┌─────────────────┐
│ 6. Database     │ ──► Store analysis data, update document status
│ Storage         │
└─────────┬───────┘
          │
          ▼
┌─────────────────┐
│ 7. Complete     │ ──► Update session, notify user, cleanup
│ Processing      │
└─────────────────┘
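
The pipeline above can be sketched as an orchestrator over injected stage functions (stand-ins for documentAiProcessor, the RAG processor, llmService, pdfGenerationService, and the database layer). The stage names are illustrative; what matters is the strict ordering.

```typescript
// Sketch of the seven-stage pipeline; each stage is injected so the
// sequencing is explicit. Stage signatures are assumptions.
interface Stages {
  extractText(docId: string): Promise<string>;
  chunk(text: string): Promise<string[]>;
  embed(chunks: string[]): Promise<number[][]>;
  analyze(chunks: string[]): Promise<object>;
  renderPdf(analysis: object): Promise<Uint8Array>;
  store(docId: string, analysis: object, pdf: Uint8Array): Promise<void>;
  complete(docId: string): Promise<void>;
}

async function runPipeline(s: Stages, docId: string): Promise<void> {
  const text = await s.extractText(docId);   // 1. Document AI
  const chunks = await s.chunk(text);        // 2. intelligent chunking
  await s.embed(chunks);                     // 3. vector embeddings
  const analysis = await s.analyze(chunks);  // 4. LLM analysis
  const pdf = await s.renderPdf(analysis);   // 5. PDF generation
  await s.store(docId, analysis, pdf);       // 6. database storage
  await s.complete(docId);                   // 7. finish & notify
}
```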

3. Error Handling Flow

Processing Error
       │
       ▼
┌─────────────────┐
│ Error Logging   │ ──► Log error with correlation ID
└─────────┬───────┘
          │
          ▼
┌─────────────────┐
│ Retry Logic     │ ──► Retry failed operation (up to 3 times)
└─────────┬───────┘
          │
          ▼
┌─────────────────┐
│ Graceful        │ ──► Return partial results or error message
│ Degradation     │
└─────────────────┘

Key Services Explained

1. Unified Document Processor (unifiedDocumentProcessor.ts)

Purpose: Main orchestrator that routes documents to the appropriate processing strategy.

Current Strategy: optimized_agentic_rag (only active strategy)

Methods:

  • processDocument(): Main processing entry point
  • processWithOptimizedAgenticRAG(): Current active processing method
  • getProcessingStats(): Returns processing statistics

2. Optimized Agentic RAG Processor (optimizedAgenticRAGProcessor.ts)

Purpose: Core AI processing engine that handles large documents efficiently.

Key Features:

  • Intelligent Chunking: Splits text at semantic boundaries (sections, paragraphs)
  • Batch Processing: Processes chunks in batches of 10 to manage memory
  • Rate Limiting: Limits concurrent API calls to 5
  • Memory Optimization: Tracks memory usage and processes efficiently

Processing Steps:

  1. Create Intelligent Chunks: Split text into 4000-char chunks with semantic boundaries
  2. Process Chunks in Batches: Generate embeddings and metadata for each chunk
  3. Store Chunks Optimized: Save to vector database with batching
  4. Generate LLM Analysis: Use llmService to analyze and create structured data
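
Step 1 (4000-char chunks with 200-char overlap, preferring semantic boundaries) might look like the sketch below. This is an illustrative reconstruction from the numbers in this document, not the processor's actual code; the "prefer a paragraph break in the second half of the window" heuristic is an assumption.

```typescript
// Sketch of intelligent chunking: fill up to maxLen, prefer to break at a
// paragraph boundary ("\n\n") in the second half of the window, and carry
// `overlap` characters into the next chunk for context.
function createChunks(text: string, maxLen = 4000, overlap = 200): string[] {
  const chunks: string[] = [];
  let start = 0;
  while (start < text.length) {
    let end = Math.min(start + maxLen, text.length);
    if (end < text.length) {
      const para = text.lastIndexOf("\n\n", end);
      if (para > start + maxLen / 2) end = para; // semantic boundary found
    }
    chunks.push(text.slice(start, end));
    if (end >= text.length) break;
    start = end - overlap; // 200-char overlap between adjacent chunks
  }
  return chunks;
}
```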

3. LLM Service (llmService.ts)

Purpose: Handles all LLM interactions with Claude AI and OpenAI.

Key Features:

  • Model Selection: Automatically selects optimal model based on task complexity
  • Retry Logic: Implements retry mechanism for failed API calls
  • Cost Tracking: Tracks token usage and API costs
  • Error Handling: Graceful error handling with fallback options

Methods:

  • processCIMDocument(): Main CIM analysis method
  • callLLM(): Generic LLM call method
  • callAnthropic(): Claude AI specific calls
  • callOpenAI(): OpenAI specific calls
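
Model selection by task complexity might reduce to a rule like the one below. Only `claude-3-opus-20240229` appears in this document's configuration; the cheaper fallback model and the 50k-token threshold are assumptions for illustration.

```typescript
// Illustrative model-selection rule: route heavy or complex tasks to the
// configured default model, everything else to a cheaper one. The haiku
// model name and the token threshold are assumptions.
type TaskProfile = { complexity: "simple" | "complex"; estimatedTokens: number };

function selectModel(task: TaskProfile): string {
  if (task.complexity === "complex" || task.estimatedTokens > 50_000) {
    return "claude-3-opus-20240229"; // matches LLM_MODEL in the config below
  }
  return "claude-3-haiku-20240307";
}
```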

4. PDF Generation Service (pdfGenerationService.ts)

Purpose: Generates PDF reports from analysis data using Puppeteer.

Key Features:

  • HTML to PDF: Converts HTML content to PDF using Puppeteer
  • Markdown Support: Converts markdown to HTML then to PDF
  • Custom Styling: Professional PDF formatting with CSS
  • CIM Review Templates: Specialized templates for CIM analysis reports

Methods:

  • generateCIMReviewPDF(): Generate CIM review PDF from analysis data
  • generatePDFFromMarkdown(): Convert markdown to PDF
  • generatePDFBuffer(): Generate PDF as buffer for immediate download
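
`generatePDFFromMarkdown` implies a markdown-to-HTML step before Puppeteer renders the HTML to PDF. The Puppeteer rendering itself can't be shown compactly, so the sketch below covers only a minimal version of that intermediate conversion (headings and paragraphs); the real service presumably uses a full markdown library.

```typescript
// Minimal markdown-to-HTML pass (headings and paragraphs only); a toy
// stand-in for the converter that feeds Puppeteer's HTML-to-PDF rendering.
function markdownToHtml(md: string): string {
  return md
    .split(/\n{2,}/)
    .map((block) => {
      const m = block.match(/^(#{1,6})\s+(.*)$/);
      if (m) return `<h${m[1].length}>${m[2]}</h${m[1].length}>`;
      return `<p>${block.trim()}</p>`;
    })
    .join("\n");
}
```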

5. File Storage Service (fileStorageService.ts)

Purpose: Handles all Google Cloud Storage operations.

Key Operations:

  • generateSignedUploadUrl(): Creates secure upload URLs
  • getFile(): Downloads files from GCS
  • uploadFile(): Uploads files to GCS
  • deleteFile(): Removes files from GCS
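
Two small pieces of this service's logic can be sketched directly: the user-specific storage path and the upload validation rules (PDF only, 50MB max, per the File Security section later). The path layout and helper names here are illustrative assumptions.

```typescript
// Sketch of user-scoped path building and upload validation; the actual
// bucket layout in fileStorageService may differ.
const MAX_FILE_SIZE = 50 * 1024 * 1024; // 50MB limit from the security rules

function buildStoragePath(userId: string, documentId: string, fileName: string): string {
  return `users/${userId}/documents/${documentId}/${fileName}`;
}

function validateUpload(fileName: string, sizeBytes: number): string | null {
  if (!fileName.toLowerCase().endsWith(".pdf")) return "Only PDF files are accepted";
  if (sizeBytes > MAX_FILE_SIZE) return "File exceeds the 50MB limit";
  return null; // valid
}
```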

6. Upload Monitoring Service (uploadMonitoringService.ts)

Purpose: Tracks upload progress and provides real-time monitoring.

Key Features:

  • Real-time upload tracking
  • Error analysis and reporting
  • Performance metrics
  • Health status monitoring

7. Session Service (sessionService.ts)

Purpose: Manages user sessions and authentication state.

Key Features:

  • Session storage and retrieval
  • Token management
  • Session cleanup
  • Security token blacklisting

8. Job Queue Service (jobQueueService.ts)

Purpose: Manages background job processing and queuing.

Key Features:

  • Job queuing and scheduling
  • Background processing
  • Job status tracking
  • Error recovery

Service Dependencies

unifiedDocumentProcessor
├── optimizedAgenticRAGProcessor
│   ├── llmService (for AI processing)
│   ├── vectorDatabaseService (for embeddings)
│   └── fileStorageService (for file operations)
├── pdfGenerationService (for PDF creation)
├── uploadMonitoringService (for tracking)
├── sessionService (for session management)
└── jobQueueService (for background processing)

Database Schema

Core Tables

1. Documents Table

CREATE TABLE documents (
  id UUID PRIMARY KEY,
  user_id TEXT NOT NULL,
  original_file_name TEXT NOT NULL,
  file_path TEXT NOT NULL,
  file_size INTEGER NOT NULL,
  status TEXT NOT NULL,
  extracted_text TEXT,
  generated_summary TEXT,
  summary_pdf_path TEXT,
  analysis_data JSONB,
  created_at TIMESTAMP DEFAULT NOW(),
  updated_at TIMESTAMP DEFAULT NOW()
);

2. Agentic RAG Sessions Table

CREATE TABLE agentic_rag_sessions (
  id UUID PRIMARY KEY,
  document_id UUID REFERENCES documents(id),
  strategy TEXT NOT NULL,
  status TEXT NOT NULL,
  total_agents INTEGER,
  completed_agents INTEGER,
  failed_agents INTEGER,
  overall_validation_score DECIMAL,
  processing_time_ms INTEGER,
  api_calls_count INTEGER,
  total_cost DECIMAL,
  created_at TIMESTAMP DEFAULT NOW(),
  completed_at TIMESTAMP
);

3. Vector Database Tables

CREATE TABLE document_chunks (
  id UUID PRIMARY KEY,
  document_id UUID REFERENCES documents(id),
  content TEXT NOT NULL,
  embedding VECTOR(1536),
  chunk_index INTEGER,
  metadata JSONB,
  created_at TIMESTAMP DEFAULT NOW()
);

API Endpoints

Active Endpoints

Document Management

  • POST /documents/upload-url - Get signed upload URL
  • POST /documents/:id/confirm-upload - Confirm upload and start processing
  • POST /documents/:id/process-optimized-agentic-rag - Trigger AI processing
  • GET /documents/:id/download - Download processed PDF
  • DELETE /documents/:id - Delete document
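
The `:id` segments in these routes are path parameters that Express binds at dispatch time. A toy matcher (not the framework's implementation) shows how a request path binds to a pattern:

```typescript
// Toy path-parameter matcher illustrating how /documents/:id/download
// binds `id`; Express does this internally, this is only a sketch.
function matchRoute(pattern: string, path: string): Record<string, string> | null {
  const pSegs = pattern.split("/").filter(Boolean);
  const aSegs = path.split("/").filter(Boolean);
  if (pSegs.length !== aSegs.length) return null;
  const params: Record<string, string> = {};
  for (let i = 0; i < pSegs.length; i++) {
    if (pSegs[i].startsWith(":")) params[pSegs[i].slice(1)] = aSegs[i];
    else if (pSegs[i] !== aSegs[i]) return null;
  }
  return params;
}
```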

Analytics & Monitoring

  • GET /documents/analytics - Get processing analytics
  • GET /documents/:id/agentic-rag-sessions - Get processing sessions
  • GET /monitoring/dashboard - Get monitoring dashboard
  • GET /vector/stats - Get vector database statistics

Legacy Endpoints (Kept for Backward Compatibility)

  • POST /documents/upload - Multipart file upload (legacy)
  • GET /documents - List documents (basic CRUD)

Configuration

Environment Variables

Backend (backend/src/config/env.ts):

// Google Cloud
GOOGLE_CLOUD_PROJECT_ID
GOOGLE_CLOUD_STORAGE_BUCKET
GOOGLE_APPLICATION_CREDENTIALS

// Document AI
GOOGLE_DOCUMENT_AI_LOCATION
GOOGLE_DOCUMENT_AI_PROCESSOR_ID

// Database
DATABASE_URL
SUPABASE_URL
SUPABASE_ANON_KEY

// AI Services
ANTHROPIC_API_KEY
OPENAI_API_KEY

// Processing
AGENTIC_RAG_ENABLED=true
PROCESSING_STRATEGY=optimized_agentic_rag

// LLM Configuration
LLM_PROVIDER=anthropic
LLM_MODEL=claude-3-opus-20240229
LLM_MAX_TOKENS=4000
LLM_TEMPERATURE=0.1
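
Validating these variables at startup keeps misconfiguration failures early and loud. The helper below is a sketch of what `backend/src/config/env.ts` might do, not its actual contents.

```typescript
// Sketch of startup-time validation for required environment variables;
// a stand-in for whatever backend/src/config/env.ts actually does.
function requireEnv(
  env: Record<string, string | undefined>,
  keys: string[]
): Record<string, string> {
  const missing = keys.filter((k) => !env[k]);
  if (missing.length > 0) {
    throw new Error(`Missing required environment variables: ${missing.join(", ")}`);
  }
  return Object.fromEntries(keys.map((k) => [k, env[k] as string]));
}
```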

Frontend (frontend/src/config/env.ts):

// API
VITE_API_BASE_URL
VITE_FIREBASE_API_KEY
VITE_FIREBASE_AUTH_DOMAIN

Processing Strategy Details

Current Strategy: Optimized Agentic RAG

Why This Strategy:

  • Handles large documents efficiently
  • Provides structured analysis output
  • Optimizes memory usage and API costs
  • Generates high-quality summaries

How It Works:

  1. Text Extraction: Google Document AI extracts text from PDF
  2. Semantic Chunking: Splits text at natural boundaries (sections, paragraphs)
  3. Vector Embedding: Creates embeddings for each chunk
  4. LLM Analysis: llmService calls Claude AI to analyze chunks and generate structured data
  5. PDF Generation: pdfGenerationService creates summary PDF with analysis results

Output Format: Structured CIM Review data including:

  • Deal Overview
  • Business Description
  • Market Analysis
  • Financial Summary
  • Management Team
  • Investment Thesis
  • Key Questions & Next Steps

Error Handling

Frontend Error Handling

  • Network Errors: Automatic retry with exponential backoff
  • Authentication Errors: Automatic token refresh or redirect to login
  • Upload Errors: User-friendly error messages with retry options
  • Processing Errors: Real-time error display with retry functionality

Backend Error Handling

  • Validation Errors: Input validation with detailed error messages
  • Processing Errors: Graceful degradation with error logging
  • Storage Errors: Retry logic for transient failures
  • Database Errors: Connection pooling and retry mechanisms
  • LLM API Errors: Retry logic with exponential backoff
  • PDF Generation Errors: Fallback to text-only output

Error Recovery Mechanisms

  • LLM API Failures: Up to 3 retry attempts with different models
  • Processing Timeouts: Graceful timeout handling with partial results
  • Memory Issues: Automatic garbage collection and memory cleanup
  • File Storage Errors: Retry with exponential backoff

Monitoring & Analytics

Real-time Monitoring

  • Upload progress tracking
  • Processing status updates
  • Error rate monitoring
  • Performance metrics
  • API usage tracking
  • Cost monitoring

Analytics Dashboard

  • Processing success rates
  • Average processing times
  • API usage statistics
  • Cost tracking
  • User activity metrics
  • Error analysis reports

Security

Authentication

  • Firebase Authentication
  • JWT token validation
  • Protected API endpoints
  • User-specific data isolation
  • Session management with secure token handling

File Security

  • Signed URLs for secure uploads
  • File type validation (PDF only)
  • File size limits (50MB max)
  • User-specific file storage paths
  • Secure file deletion

API Security

  • Rate limiting (1000 requests per 15 minutes)
  • CORS configuration
  • Input validation
  • SQL injection prevention
  • Request correlation IDs for tracking

Performance Optimization

Memory Management

  • Batch processing to limit memory usage
  • Garbage collection optimization
  • Connection pooling for database
  • Efficient chunking to minimize memory footprint

API Optimization

  • Rate limiting to prevent API quota exhaustion
  • Caching for frequently accessed data
  • Efficient chunking to minimize API calls
  • Model selection based on task complexity

Processing Optimization

  • Concurrent processing with limits
  • Intelligent chunking for optimal processing
  • Background job processing
  • Progress tracking for user feedback
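
The concurrency cap mentioned throughout (e.g. at most 5 concurrent embedding calls) can be sketched as a bounded worker pool. A minimal sketch under those assumptions, not the processor's actual implementation:

```typescript
// Run `fn` over `items` with at most `limit` tasks in flight at once.
// Results come back in input order.
async function mapWithLimit<T, R>(
  items: T[],
  limit: number,
  fn: (item: T) => Promise<R>
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0; // shared cursor; safe because JS is single-threaded
  async function worker(): Promise<void> {
    while (next < items.length) {
      const i = next++;
      results[i] = await fn(items[i]);
    }
  }
  await Promise.all(Array.from({ length: Math.min(limit, items.length) }, worker));
  return results;
}
```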

Deployment

Backend Deployment

  • Firebase Functions: Serverless deployment
  • Google Cloud Run: Containerized deployment
  • Docker: Container support

Frontend Deployment

  • Firebase Hosting: Static hosting
  • Vite: Build tool
  • TypeScript: Type safety

Development Workflow

Local Development

  1. Backend: npm run dev (runs on port 5001)
  2. Frontend: npm run dev (runs on port 5173)
  3. Database: Supabase local development
  4. Storage: Google Cloud Storage (development bucket)

Testing

  • Unit Tests: Jest for backend, Vitest for frontend
  • Integration Tests: End-to-end testing
  • API Tests: Supertest for backend endpoints

Troubleshooting

Common Issues

  1. Upload Failures: Check GCS permissions and bucket configuration
  2. Processing Timeouts: Increase timeout limits for large documents
  3. Memory Issues: Monitor memory usage and adjust batch sizes
  4. API Quotas: Check API usage and implement rate limiting
  5. PDF Generation Failures: Check Puppeteer installation and memory
  6. LLM API Errors: Verify API keys and check rate limits

Debug Tools

  • Real-time logging with correlation IDs
  • Upload monitoring dashboard
  • Processing session details
  • Error analysis reports
  • Performance metrics dashboard

This documentation provides a comprehensive overview of the CIM Document Processor architecture, helping junior programmers understand the system's design, data flow, and key components.