cim_summary/APP_DESIGN_DOCUMENTATION.md
2025-08-01 15:46:43 -04:00

CIM Document Processor - Application Design Documentation

Overview

The CIM Document Processor is a web application that processes Confidential Information Memorandums (CIMs) using AI to extract key business information and generate structured analysis reports. The system uses Google Document AI for text extraction and an optimized Agentic RAG (Retrieval-Augmented Generation) approach for intelligent document analysis.

Architecture Overview

┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   Frontend      │    │   Backend       │    │   External      │
│   (React)       │◄──►│   (Node.js)     │◄──►│   Services      │
└─────────────────┘    └─────────────────┘    └─────────────────┘
                              │                        │
                              ▼                        ▼
                       ┌─────────────────┐    ┌─────────────────┐
                       │   Database      │    │   Google Cloud  │
                       │   (Supabase)    │    │   Services      │
                       └─────────────────┘    └─────────────────┘

Core Components

1. Frontend (React + TypeScript)

Location: frontend/src/

Key Components:

  • App.tsx: Main application with tabbed interface
  • DocumentUpload: File upload with Firebase Storage integration
  • DocumentList: Display and manage uploaded documents
  • DocumentViewer: View processed documents and analysis
  • Analytics: Dashboard for processing statistics
  • UploadMonitoringDashboard: Real-time upload monitoring

Authentication: Firebase Authentication with protected routes
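
On the wire, authentication amounts to attaching the user's Firebase ID token to every API request. The sketch below is illustrative, not the actual frontend code: `getToken` and `fetchImpl` are injected stand-ins for Firebase's `getIdToken()` and the browser `fetch`.

```typescript
// Hypothetical sketch of an authenticated API client; the real frontend
// presumably obtains the token via Firebase's getIdToken().
type TokenGetter = () => Promise<string | null>;
type FetchLike = (
  url: string,
  init: { headers: Record<string, string> }
) => Promise<unknown>;

function makeAuthedFetch(getToken: TokenGetter, fetchImpl: FetchLike) {
  return async (url: string): Promise<unknown> => {
    const token = await getToken();
    if (!token) throw new Error("Not authenticated"); // protected route: no token, no call
    return fetchImpl(url, { headers: { Authorization: `Bearer ${token}` } });
  };
}
```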

2. Backend (Node.js + Express + TypeScript)

Location: backend/src/

Key Services:

  • unifiedDocumentProcessor: Main orchestrator for document processing
  • optimizedAgenticRAGProcessor: Core AI processing engine
  • llmService: LLM interaction service (Claude AI/OpenAI)
  • pdfGenerationService: PDF report generation using Puppeteer
  • fileStorageService: Google Cloud Storage operations
  • uploadMonitoringService: Real-time upload tracking
  • agenticRAGDatabaseService: Analytics and session management
  • sessionService: User session management
  • jobQueueService: Background job processing
  • uploadProgressService: Upload progress tracking

Data Flow

1. Document Upload Process

User Uploads PDF
       │
       ▼
┌─────────────────┐
│ 1. Get Upload   │ ──► Generate signed URL from Google Cloud Storage
│    URL          │
└─────────┬───────┘
          │
          ▼
┌─────────────────┐
│ 2. Upload to    │ ──► Direct upload to GCS bucket
│    GCS          │
└─────────┬───────┘
          │
          ▼
┌─────────────────┐
│ 3. Confirm      │ ──► Update database, create processing job
│    Upload       │
└─────────────────┘
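
The three steps above can be sketched as a single client-side function. The endpoint names mirror the API section later in this document; the HTTP helpers are injected stand-ins, so this is a sketch of the flow rather than the shipped code.

```typescript
// Sketch of the three-step upload flow; `api` and `put` are injected
// stand-ins for the real HTTP client and the raw PUT to GCS.
interface UploadApi {
  getUploadUrl(fileName: string): Promise<{ documentId: string; signedUrl: string }>;
  confirmUpload(documentId: string): Promise<void>;
}
type PutFn = (signedUrl: string, bytes: Uint8Array) => Promise<void>;

async function uploadDocument(
  api: UploadApi,
  put: PutFn,
  fileName: string,
  bytes: Uint8Array
): Promise<string> {
  const { documentId, signedUrl } = await api.getUploadUrl(fileName); // 1. get signed URL
  await put(signedUrl, bytes);                                        // 2. direct upload to GCS
  await api.confirmUpload(documentId);                                // 3. confirm, start processing
  return documentId;
}
```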

2. Document Processing Pipeline

Document Uploaded
       │
       ▼
┌─────────────────┐
│ 1. Text         │ ──► Google Document AI extracts text from PDF
│ Extraction      │    (documentAiProcessor or direct Document AI)
└─────────┬───────┘
          │
          ▼
┌─────────────────┐
│ 2. Intelligent  │ ──► Split text into semantic chunks (4000 chars)
│ Chunking        │    with 200 char overlap
└─────────┬───────┘
          │
          ▼
┌─────────────────┐
│ 3. Vector       │ ──► Generate embeddings for each chunk
│ Embedding       │    (rate-limited to 5 concurrent calls)
└─────────┬───────┘
          │
          ▼
┌─────────────────┐
│ 4. LLM Analysis │ ──► llmService → Claude AI analyzes chunks
│                 │    and generates structured CIM review data
└─────────┬───────┘
          │
          ▼
┌─────────────────┐
│ 5. PDF          │ ──► pdfGenerationService generates summary PDF
│ Generation      │    using Puppeteer
└─────────┬───────┘
          │
          ▼
┌─────────────────┐
│ 6. Database     │ ──► Store analysis data, update document status
│ Storage         │
└─────────┬───────┘
          │
          ▼
┌─────────────────┐
│ 7. Complete     │ ──► Update session, notify user, cleanup
│ Processing      │
└─────────────────┘
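
The pipeline above can be sketched as an orchestrator over injected stage functions (stand-ins for documentAiProcessor, the RAG processor, llmService, pdfGenerationService, and the database layer). The stage names are illustrative; what matters is the strict ordering.

```typescript
// Sketch of the seven-stage pipeline; each stage is injected so the
// sequencing is explicit. Stage signatures are assumptions.
interface Stages {
  extractText(docId: string): Promise<string>;
  chunk(text: string): Promise<string[]>;
  embed(chunks: string[]): Promise<number[][]>;
  analyze(chunks: string[]): Promise<object>;
  renderPdf(analysis: object): Promise<Uint8Array>;
  store(docId: string, analysis: object, pdf: Uint8Array): Promise<void>;
  complete(docId: string): Promise<void>;
}

async function runPipeline(s: Stages, docId: string): Promise<void> {
  const text = await s.extractText(docId);   // 1. Document AI
  const chunks = await s.chunk(text);        // 2. intelligent chunking
  await s.embed(chunks);                     // 3. vector embeddings
  const analysis = await s.analyze(chunks);  // 4. LLM analysis
  const pdf = await s.renderPdf(analysis);   // 5. PDF generation
  await s.store(docId, analysis, pdf);       // 6. database storage
  await s.complete(docId);                   // 7. finish & notify
}
```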

3. Error Handling Flow

Processing Error
       │
       ▼
┌─────────────────┐
│ Error Logging   │ ──► Log error with correlation ID
└─────────┬───────┘
          │
          ▼
┌─────────────────┐
│ Retry Logic     │ ──► Retry failed operation (up to 3 times)
└─────────┬───────┘
          │
          ▼
┌─────────────────┐
│ Graceful        │ ──► Return partial results or error message
│ Degradation     │
└─────────────────┘

Key Services Explained

1. Unified Document Processor (unifiedDocumentProcessor.ts)

Purpose: Main orchestrator that routes documents to the appropriate processing strategy.

Current Strategy: optimized_agentic_rag (only active strategy)

Methods:

  • processDocument(): Main processing entry point
  • processWithOptimizedAgenticRAG(): Current active processing method
  • getProcessingStats(): Returns processing statistics

2. Optimized Agentic RAG Processor (optimizedAgenticRAGProcessor.ts)

Purpose: Core AI processing engine that handles large documents efficiently.

Key Features:

  • Intelligent Chunking: Splits text at semantic boundaries (sections, paragraphs)
  • Batch Processing: Processes chunks in batches of 10 to manage memory
  • Rate Limiting: Limits concurrent API calls to 5
  • Memory Optimization: Tracks memory usage and processes efficiently

Processing Steps:

  1. Create Intelligent Chunks: Split text into 4000-char chunks with semantic boundaries
  2. Process Chunks in Batches: Generate embeddings and metadata for each chunk
  3. Store Chunks Optimized: Save to vector database with batching
  4. Generate LLM Analysis: Use llmService to analyze and create structured data
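
Step 1 (4000-char chunks with 200-char overlap, preferring semantic boundaries) might look like the sketch below. This is an illustrative reconstruction from the numbers in this document, not the processor's actual code; the "prefer a paragraph break in the second half of the window" heuristic is an assumption.

```typescript
// Sketch of intelligent chunking: fill up to maxLen, prefer to break at a
// paragraph boundary ("\n\n") in the second half of the window, and carry
// `overlap` characters into the next chunk for context.
function createChunks(text: string, maxLen = 4000, overlap = 200): string[] {
  const chunks: string[] = [];
  let start = 0;
  while (start < text.length) {
    let end = Math.min(start + maxLen, text.length);
    if (end < text.length) {
      const para = text.lastIndexOf("\n\n", end);
      if (para > start + maxLen / 2) end = para; // semantic boundary found
    }
    chunks.push(text.slice(start, end));
    if (end >= text.length) break;
    start = end - overlap; // 200-char overlap between adjacent chunks
  }
  return chunks;
}
```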

3. LLM Service (llmService.ts)

Purpose: Handles all LLM interactions with Claude AI and OpenAI.

Key Features:

  • Model Selection: Automatically selects optimal model based on task complexity
  • Retry Logic: Implements retry mechanism for failed API calls
  • Cost Tracking: Tracks token usage and API costs
  • Error Handling: Graceful error handling with fallback options

Methods:

  • processCIMDocument(): Main CIM analysis method
  • callLLM(): Generic LLM call method
  • callAnthropic(): Claude AI specific calls
  • callOpenAI(): OpenAI specific calls
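
Model selection by task complexity might reduce to a rule like the one below. Only `claude-3-opus-20240229` appears in this document's configuration; the cheaper fallback model and the 50k-token threshold are assumptions for illustration.

```typescript
// Illustrative model-selection rule: route heavy or complex tasks to the
// configured default model, everything else to a cheaper one. The haiku
// model name and the token threshold are assumptions.
type TaskProfile = { complexity: "simple" | "complex"; estimatedTokens: number };

function selectModel(task: TaskProfile): string {
  if (task.complexity === "complex" || task.estimatedTokens > 50_000) {
    return "claude-3-opus-20240229"; // matches LLM_MODEL in the config below
  }
  return "claude-3-haiku-20240307";
}
```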

4. PDF Generation Service (pdfGenerationService.ts)

Purpose: Generates PDF reports from analysis data using Puppeteer.

Key Features:

  • HTML to PDF: Converts HTML content to PDF using Puppeteer
  • Markdown Support: Converts markdown to HTML then to PDF
  • Custom Styling: Professional PDF formatting with CSS
  • CIM Review Templates: Specialized templates for CIM analysis reports

Methods:

  • generateCIMReviewPDF(): Generate CIM review PDF from analysis data
  • generatePDFFromMarkdown(): Convert markdown to PDF
  • generatePDFBuffer(): Generate PDF as buffer for immediate download
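
`generatePDFFromMarkdown` implies a markdown-to-HTML step before Puppeteer renders the HTML to PDF. The Puppeteer rendering itself can't be shown compactly, so the sketch below covers only a minimal version of that intermediate conversion (headings and paragraphs); the real service presumably uses a full markdown library.

```typescript
// Minimal markdown-to-HTML pass (headings and paragraphs only); a toy
// stand-in for the converter that feeds Puppeteer's HTML-to-PDF rendering.
function markdownToHtml(md: string): string {
  return md
    .split(/\n{2,}/)
    .map((block) => {
      const m = block.match(/^(#{1,6})\s+(.*)$/);
      if (m) return `<h${m[1].length}>${m[2]}</h${m[1].length}>`;
      return `<p>${block.trim()}</p>`;
    })
    .join("\n");
}
```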

5. File Storage Service (fileStorageService.ts)

Purpose: Handles all Google Cloud Storage operations.

Key Operations:

  • generateSignedUploadUrl(): Creates secure upload URLs
  • getFile(): Downloads files from GCS
  • uploadFile(): Uploads files to GCS
  • deleteFile(): Removes files from GCS
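
Two small pieces of this service's logic can be sketched directly: the user-specific storage path and the upload validation rules (PDF only, 50MB max, per the File Security section later). The path layout and helper names here are illustrative assumptions.

```typescript
// Sketch of user-scoped path building and upload validation; the actual
// bucket layout in fileStorageService may differ.
const MAX_FILE_SIZE = 50 * 1024 * 1024; // 50MB limit from the security rules

function buildStoragePath(userId: string, documentId: string, fileName: string): string {
  return `users/${userId}/documents/${documentId}/${fileName}`;
}

function validateUpload(fileName: string, sizeBytes: number): string | null {
  if (!fileName.toLowerCase().endsWith(".pdf")) return "Only PDF files are accepted";
  if (sizeBytes > MAX_FILE_SIZE) return "File exceeds the 50MB limit";
  return null; // valid
}
```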

6. Upload Monitoring Service (uploadMonitoringService.ts)

Purpose: Tracks upload progress and provides real-time monitoring.

Key Features:

  • Real-time upload tracking
  • Error analysis and reporting
  • Performance metrics
  • Health status monitoring

7. Session Service (sessionService.ts)

Purpose: Manages user sessions and authentication state.

Key Features:

  • Session storage and retrieval
  • Token management
  • Session cleanup
  • Security token blacklisting

8. Job Queue Service (jobQueueService.ts)

Purpose: Manages background job processing and queuing.

Key Features:

  • Job queuing and scheduling
  • Background processing
  • Job status tracking
  • Error recovery

Service Dependencies

unifiedDocumentProcessor
├── optimizedAgenticRAGProcessor
│   ├── llmService (for AI processing)
│   ├── vectorDatabaseService (for embeddings)
│   └── fileStorageService (for file operations)
├── pdfGenerationService (for PDF creation)
├── uploadMonitoringService (for tracking)
├── sessionService (for session management)
└── jobQueueService (for background processing)

Database Schema

Core Tables

1. Documents Table

CREATE TABLE documents (
  id UUID PRIMARY KEY,
  user_id TEXT NOT NULL,
  original_file_name TEXT NOT NULL,
  file_path TEXT NOT NULL,
  file_size INTEGER NOT NULL,
  status TEXT NOT NULL,
  extracted_text TEXT,
  generated_summary TEXT,
  summary_pdf_path TEXT,
  analysis_data JSONB,
  created_at TIMESTAMP DEFAULT NOW(),
  updated_at TIMESTAMP DEFAULT NOW()
);

2. Agentic RAG Sessions Table

CREATE TABLE agentic_rag_sessions (
  id UUID PRIMARY KEY,
  document_id UUID REFERENCES documents(id),
  strategy TEXT NOT NULL,
  status TEXT NOT NULL,
  total_agents INTEGER,
  completed_agents INTEGER,
  failed_agents INTEGER,
  overall_validation_score DECIMAL,
  processing_time_ms INTEGER,
  api_calls_count INTEGER,
  total_cost DECIMAL,
  created_at TIMESTAMP DEFAULT NOW(),
  completed_at TIMESTAMP
);

3. Vector Database Tables

CREATE TABLE document_chunks (
  id UUID PRIMARY KEY,
  document_id UUID REFERENCES documents(id),
  content TEXT NOT NULL,
  embedding VECTOR(1536),
  chunk_index INTEGER,
  metadata JSONB,
  created_at TIMESTAMP DEFAULT NOW()
);

API Endpoints

Active Endpoints

Document Management

  • POST /documents/upload-url - Get signed upload URL
  • POST /documents/:id/confirm-upload - Confirm upload and start processing
  • POST /documents/:id/process-optimized-agentic-rag - Trigger AI processing
  • GET /documents/:id/download - Download processed PDF
  • DELETE /documents/:id - Delete document
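
The `:id` segments in these routes are path parameters that Express binds at dispatch time. A toy matcher (not the framework's implementation) shows how a request path binds to a pattern:

```typescript
// Toy path-parameter matcher illustrating how /documents/:id/download
// binds `id`; Express does this internally, this is only a sketch.
function matchRoute(pattern: string, path: string): Record<string, string> | null {
  const pSegs = pattern.split("/").filter(Boolean);
  const aSegs = path.split("/").filter(Boolean);
  if (pSegs.length !== aSegs.length) return null;
  const params: Record<string, string> = {};
  for (let i = 0; i < pSegs.length; i++) {
    if (pSegs[i].startsWith(":")) params[pSegs[i].slice(1)] = aSegs[i];
    else if (pSegs[i] !== aSegs[i]) return null;
  }
  return params;
}
```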

Analytics & Monitoring

  • GET /documents/analytics - Get processing analytics
  • GET /documents/:id/agentic-rag-sessions - Get processing sessions
  • GET /monitoring/dashboard - Get monitoring dashboard
  • GET /vector/stats - Get vector database statistics

Legacy Endpoints (Kept for Backward Compatibility)

  • POST /documents/upload - Multipart file upload (legacy)
  • GET /documents - List documents (basic CRUD)

Configuration

Environment Variables

Backend (backend/src/config/env.ts):

// Google Cloud
GOOGLE_CLOUD_PROJECT_ID
GOOGLE_CLOUD_STORAGE_BUCKET
GOOGLE_APPLICATION_CREDENTIALS

// Document AI
GOOGLE_DOCUMENT_AI_LOCATION
GOOGLE_DOCUMENT_AI_PROCESSOR_ID

// Database
DATABASE_URL
SUPABASE_URL
SUPABASE_ANON_KEY

// AI Services
ANTHROPIC_API_KEY
OPENAI_API_KEY

// Processing
AGENTIC_RAG_ENABLED=true
PROCESSING_STRATEGY=optimized_agentic_rag

// LLM Configuration
LLM_PROVIDER=anthropic
LLM_MODEL=claude-3-opus-20240229
LLM_MAX_TOKENS=4000
LLM_TEMPERATURE=0.1
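
Validating these variables at startup keeps misconfiguration failures early and loud. The helper below is a sketch of what `backend/src/config/env.ts` might do, not its actual contents.

```typescript
// Sketch of startup-time validation for required environment variables;
// a stand-in for whatever backend/src/config/env.ts actually does.
function requireEnv(
  env: Record<string, string | undefined>,
  keys: string[]
): Record<string, string> {
  const missing = keys.filter((k) => !env[k]);
  if (missing.length > 0) {
    throw new Error(`Missing required environment variables: ${missing.join(", ")}`);
  }
  return Object.fromEntries(keys.map((k) => [k, env[k] as string]));
}
```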

Frontend (frontend/src/config/env.ts):

// API
VITE_API_BASE_URL
VITE_FIREBASE_API_KEY
VITE_FIREBASE_AUTH_DOMAIN

Processing Strategy Details

Current Strategy: Optimized Agentic RAG

Why This Strategy:

  • Handles large documents efficiently
  • Provides structured analysis output
  • Optimizes memory usage and API costs
  • Generates high-quality summaries

How It Works:

  1. Text Extraction: Google Document AI extracts text from PDF
  2. Semantic Chunking: Splits text at natural boundaries (sections, paragraphs)
  3. Vector Embedding: Creates embeddings for each chunk
  4. LLM Analysis: llmService calls Claude AI to analyze chunks and generate structured data
  5. PDF Generation: pdfGenerationService creates summary PDF with analysis results

Output Format: Structured CIM Review data including:

  • Deal Overview
  • Business Description
  • Market Analysis
  • Financial Summary
  • Management Team
  • Investment Thesis
  • Key Questions & Next Steps

Error Handling

Frontend Error Handling

  • Network Errors: Automatic retry with exponential backoff
  • Authentication Errors: Automatic token refresh or redirect to login
  • Upload Errors: User-friendly error messages with retry options
  • Processing Errors: Real-time error display with retry functionality

Backend Error Handling

  • Validation Errors: Input validation with detailed error messages
  • Processing Errors: Graceful degradation with error logging
  • Storage Errors: Retry logic for transient failures
  • Database Errors: Connection pooling and retry mechanisms
  • LLM API Errors: Retry logic with exponential backoff
  • PDF Generation Errors: Fallback to text-only output

Error Recovery Mechanisms

  • LLM API Failures: Up to 3 retry attempts with different models
  • Processing Timeouts: Graceful timeout handling with partial results
  • Memory Issues: Automatic garbage collection and memory cleanup
  • File Storage Errors: Retry with exponential backoff

Monitoring & Analytics

Real-time Monitoring

  • Upload progress tracking
  • Processing status updates
  • Error rate monitoring
  • Performance metrics
  • API usage tracking
  • Cost monitoring

Analytics Dashboard

  • Processing success rates
  • Average processing times
  • API usage statistics
  • Cost tracking
  • User activity metrics
  • Error analysis reports

Security

Authentication

  • Firebase Authentication
  • JWT token validation
  • Protected API endpoints
  • User-specific data isolation
  • Session management with secure token handling

File Security

  • Signed URLs for secure uploads
  • File type validation (PDF only)
  • File size limits (50MB max)
  • User-specific file storage paths
  • Secure file deletion

API Security

  • Rate limiting (1000 requests per 15 minutes)
  • CORS configuration
  • Input validation
  • SQL injection prevention
  • Request correlation IDs for tracking

Performance Optimization

Memory Management

  • Batch processing to limit memory usage
  • Garbage collection optimization
  • Connection pooling for database
  • Efficient chunking to minimize memory footprint

API Optimization

  • Rate limiting to prevent API quota exhaustion
  • Caching for frequently accessed data
  • Efficient chunking to minimize API calls
  • Model selection based on task complexity

Processing Optimization

  • Concurrent processing with limits
  • Intelligent chunking for optimal processing
  • Background job processing
  • Progress tracking for user feedback
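
The concurrency cap mentioned throughout (e.g. at most 5 concurrent embedding calls) can be sketched as a bounded worker pool. A minimal sketch under those assumptions, not the processor's actual implementation:

```typescript
// Run `fn` over `items` with at most `limit` tasks in flight at once.
// Results come back in input order.
async function mapWithLimit<T, R>(
  items: T[],
  limit: number,
  fn: (item: T) => Promise<R>
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0; // shared cursor; safe because JS is single-threaded
  async function worker(): Promise<void> {
    while (next < items.length) {
      const i = next++;
      results[i] = await fn(items[i]);
    }
  }
  await Promise.all(Array.from({ length: Math.min(limit, items.length) }, worker));
  return results;
}
```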

Deployment

Backend Deployment

  • Firebase Functions: Serverless deployment
  • Google Cloud Run: Containerized deployment
  • Docker: Container support

Frontend Deployment

  • Firebase Hosting: Static hosting
  • Vite: Build tool
  • TypeScript: Type safety

Development Workflow

Local Development

  1. Backend: npm run dev (runs on port 5001)
  2. Frontend: npm run dev (runs on port 5173)
  3. Database: Supabase local development
  4. Storage: Google Cloud Storage (development bucket)

Testing

  • Unit Tests: Jest for backend, Vitest for frontend
  • Integration Tests: End-to-end testing
  • API Tests: Supertest for backend endpoints

Troubleshooting

Common Issues

  1. Upload Failures: Check GCS permissions and bucket configuration
  2. Processing Timeouts: Increase timeout limits for large documents
  3. Memory Issues: Monitor memory usage and adjust batch sizes
  4. API Quotas: Check API usage and implement rate limiting
  5. PDF Generation Failures: Check Puppeteer installation and memory
  6. LLM API Errors: Verify API keys and check rate limits

Debug Tools

  • Real-time logging with correlation IDs
  • Upload monitoring dashboard
  • Processing session details
  • Error analysis reports
  • Performance metrics dashboard

This documentation provides a comprehensive overview of the CIM Document Processor architecture, helping junior programmers understand the system's design, data flow, and key components.