Pre-cleanup commit: Current state before service layer consolidation
# CIM Document Processor - Application Design Documentation

## Overview

The CIM Document Processor is a web application that processes Confidential Information Memorandums (CIMs) using AI to extract key business information and generate structured analysis reports. The system uses Google Document AI for text extraction and an optimized Agentic RAG (Retrieval-Augmented Generation) approach for intelligent document analysis.

## Architecture Overview

```
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│    Frontend     │    │     Backend     │    │    External     │
│     (React)     │◄──►│    (Node.js)    │◄──►│    Services     │
└─────────────────┘    └────────┬────────┘    └────────┬────────┘
                                │                      │
                                ▼                      ▼
                       ┌─────────────────┐    ┌─────────────────┐
                       │    Database     │    │  Google Cloud   │
                       │   (Supabase)    │    │    Services     │
                       └─────────────────┘    └─────────────────┘
```

## Core Components

### 1. Frontend (React + TypeScript)

**Location**: `frontend/src/`

**Key Components**:
- **App.tsx**: Main application with tabbed interface
- **DocumentUpload**: File upload with Firebase Storage integration
- **DocumentList**: Display and manage uploaded documents
- **DocumentViewer**: View processed documents and analysis
- **Analytics**: Dashboard for processing statistics
- **UploadMonitoringDashboard**: Real-time upload monitoring

**Authentication**: Firebase Authentication with protected routes

### 2. Backend (Node.js + Express + TypeScript)

**Location**: `backend/src/`

**Key Services**:
- **unifiedDocumentProcessor**: Main orchestrator for document processing
- **optimizedAgenticRAGProcessor**: Core AI processing engine
- **llmService**: LLM interaction service (Claude AI/OpenAI)
- **pdfGenerationService**: PDF report generation using Puppeteer
- **fileStorageService**: Google Cloud Storage operations
- **uploadMonitoringService**: Real-time upload tracking
- **agenticRAGDatabaseService**: Analytics and session management
- **sessionService**: User session management
- **jobQueueService**: Background job processing
- **uploadProgressService**: Upload progress tracking

## Data Flow

### 1. Document Upload Process

```
User Uploads PDF
        │
        ▼
┌─────────────────┐
│ 1. Get Upload   │ ──► Generate signed URL from Google Cloud Storage
│    URL          │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ 2. Upload to    │ ──► Direct upload to GCS bucket
│    GCS          │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ 3. Confirm      │ ──► Update database, create processing job
│    Upload       │
└─────────────────┘
```
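The three steps above can be sketched as a small client-side function. This is a minimal sketch, not the actual frontend code: the endpoint paths match the API Endpoints section, but the injected `http` signature and the `uploadUrl`/`documentId` response fields are assumptions made so the sequence is easy to test.

```typescript
// Sketch of the three-step upload. The `Http` abstraction is injected so
// the flow can be unit-tested; response shapes are assumptions.
type Http = (method: string, url: string, body?: unknown) => Promise<any>;

async function uploadDocument(
  http: Http,
  fileName: string,
  fileBytes: Uint8Array
): Promise<string> {
  // 1. Ask the backend for a signed upload URL.
  const { uploadUrl, documentId } = await http("POST", "/documents/upload-url", {
    fileName,
    fileSize: fileBytes.length,
  });

  // 2. Upload the bytes directly to the GCS bucket via the signed URL.
  await http("PUT", uploadUrl, fileBytes);

  // 3. Confirm the upload so the backend creates a processing job.
  await http("POST", `/documents/${documentId}/confirm-upload`, {});

  return documentId;
}
```

Because the browser uploads straight to GCS in step 2, large PDFs never pass through the backend process.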

### 2. Document Processing Pipeline

```
Document Uploaded
        │
        ▼
┌─────────────────┐
│ 1. Text         │ ──► Google Document AI extracts text from PDF
│    Extraction   │     (documentAiGenkitProcessor or direct Document AI)
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ 2. Intelligent  │ ──► Split text into semantic chunks (4000 chars)
│    Chunking     │     with 200 char overlap
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ 3. Vector       │ ──► Generate embeddings for each chunk
│    Embedding    │     (rate-limited to 5 concurrent calls)
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ 4. LLM Analysis │ ──► llmService → Claude AI analyzes chunks
│                 │     and generates structured CIM review data
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ 5. PDF          │ ──► pdfGenerationService generates summary PDF
│    Generation   │     using Puppeteer
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ 6. Database     │ ──► Store analysis data, update document status
│    Storage      │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ 7. Complete     │ ──► Update session, notify user, cleanup
│    Processing   │
└─────────────────┘
```

### 3. Error Handling Flow

```
Processing Error
        │
        ▼
┌─────────────────┐
│ Error Logging   │ ──► Log error with correlation ID
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Retry Logic     │ ──► Retry failed operation (up to 3 times)
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Graceful        │ ──► Return partial results or error message
│   Degradation   │
└─────────────────┘
```

## Key Services Explained

### 1. Unified Document Processor (`unifiedDocumentProcessor.ts`)

**Purpose**: Main orchestrator that routes documents to the appropriate processing strategy.

**Current Strategy**: `optimized_agentic_rag` (only active strategy)

**Methods**:
- `processDocument()`: Main processing entry point
- `processWithOptimizedAgenticRAG()`: Current active processing method
- `getProcessingStats()`: Returns processing statistics

### 2. Optimized Agentic RAG Processor (`optimizedAgenticRAGProcessor.ts`)

**Purpose**: Core AI processing engine that handles large documents efficiently.

**Key Features**:
- **Intelligent Chunking**: Splits text at semantic boundaries (sections, paragraphs)
- **Batch Processing**: Processes chunks in batches of 10 to manage memory
- **Rate Limiting**: Limits concurrent API calls to 5
- **Memory Optimization**: Tracks memory usage and processes efficiently

**Processing Steps**:
1. **Create Intelligent Chunks**: Split text into 4000-char chunks with semantic boundaries
2. **Process Chunks in Batches**: Generate embeddings and metadata for each chunk
3. **Store Chunks Optimized**: Save to vector database with batching
4. **Generate LLM Analysis**: Use llmService to analyze and create structured data
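Step 1 can be sketched as follows. The 4000-character size and 200-character overlap come from the pipeline description; treating a paragraph break (`\n\n`) near the end of the window as the "semantic boundary" is an assumption about how the real chunker behaves, and `createChunks` is an illustrative name.

```typescript
// Split text into ~4000-char chunks with 200-char overlap, preferring to
// break at a paragraph boundary when one falls late in the window.
function createChunks(text: string, size = 4000, overlap = 200): string[] {
  const chunks: string[] = [];
  let start = 0;
  while (start < text.length) {
    let end = Math.min(start + size, text.length);
    if (end < text.length) {
      // Break at the last paragraph boundary in the window, if it is
      // far enough along to keep chunks reasonably full.
      const window = text.slice(start, end);
      const brk = window.lastIndexOf("\n\n");
      if (brk > size * 0.75) end = start + brk;
    }
    chunks.push(text.slice(start, end));
    if (end === text.length) break;
    start = end - overlap; // carry overlap into the next chunk
  }
  return chunks;
}
```

The overlap means a sentence cut by a chunk boundary still appears whole in the neighbouring chunk, which keeps retrieval from missing facts that straddle boundaries.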

### 3. LLM Service (`llmService.ts`)

**Purpose**: Handles all LLM interactions with Claude AI and OpenAI.

**Key Features**:
- **Model Selection**: Automatically selects optimal model based on task complexity
- **Retry Logic**: Implements retry mechanism for failed API calls
- **Cost Tracking**: Tracks token usage and API costs
- **Error Handling**: Graceful error handling with fallback options

**Methods**:
- `processCIMDocument()`: Main CIM analysis method
- `callLLM()`: Generic LLM call method
- `callAnthropic()`: Claude AI specific calls
- `callOpenAI()`: OpenAI specific calls
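"Model selection based on task complexity" might look like the sketch below. The tiers and thresholds are illustrative assumptions rather than the service's actual rules; the Claude model IDs follow the same naming style as the `LLM_MODEL` value in the configuration section.

```typescript
// Illustrative model picker: a cheap model for short extraction tasks,
// larger models as input size grows. Thresholds are assumptions.
type Task = { kind: "extract" | "analyze"; inputChars: number };

function selectModel(task: Task): string {
  if (task.kind === "extract" && task.inputChars < 20_000) {
    return "claude-3-haiku-20240307"; // fast/cheap tier
  }
  if (task.inputChars < 100_000) {
    return "claude-3-sonnet-20240229"; // mid tier
  }
  return "claude-3-opus-20240229"; // most capable, for full CIM analysis
}
```

Routing cheap tasks to cheap models is what makes the cost-tracking feature above actionable: the tracked numbers feed back into threshold tuning.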

### 4. PDF Generation Service (`pdfGenerationService.ts`)

**Purpose**: Generates PDF reports from analysis data using Puppeteer.

**Key Features**:
- **HTML to PDF**: Converts HTML content to PDF using Puppeteer
- **Markdown Support**: Converts markdown to HTML then to PDF
- **Custom Styling**: Professional PDF formatting with CSS
- **CIM Review Templates**: Specialized templates for CIM analysis reports

**Methods**:
- `generateCIMReviewPDF()`: Generate CIM review PDF from analysis data
- `generatePDFFromMarkdown()`: Convert markdown to PDF
- `generatePDFBuffer()`: Generate PDF as buffer for immediate download

### 5. File Storage Service (`fileStorageService.ts`)

**Purpose**: Handles all Google Cloud Storage operations.

**Key Operations**:
- `generateSignedUploadUrl()`: Creates secure upload URLs
- `getFile()`: Downloads files from GCS
- `uploadFile()`: Uploads files to GCS
- `deleteFile()`: Removes files from GCS

### 6. Upload Monitoring Service (`uploadMonitoringService.ts`)

**Purpose**: Tracks upload progress and provides real-time monitoring.

**Key Features**:
- Real-time upload tracking
- Error analysis and reporting
- Performance metrics
- Health status monitoring

### 7. Session Service (`sessionService.ts`)

**Purpose**: Manages user sessions and authentication state.

**Key Features**:
- Session storage and retrieval
- Token management
- Session cleanup
- Security token blacklisting

### 8. Job Queue Service (`jobQueueService.ts`)

**Purpose**: Manages background job processing and queuing.

**Key Features**:
- Job queuing and scheduling
- Background processing
- Job status tracking
- Error recovery

## Service Dependencies

```
unifiedDocumentProcessor
├── optimizedAgenticRAGProcessor
│   ├── llmService (for AI processing)
│   ├── vectorDatabaseService (for embeddings)
│   └── fileStorageService (for file operations)
├── pdfGenerationService (for PDF creation)
├── uploadMonitoringService (for tracking)
├── sessionService (for session management)
└── jobQueueService (for background processing)
```

## Database Schema

### Core Tables

#### 1. Documents Table
```sql
CREATE TABLE documents (
  id UUID PRIMARY KEY,
  user_id TEXT NOT NULL,
  original_file_name TEXT NOT NULL,
  file_path TEXT NOT NULL,
  file_size INTEGER NOT NULL,
  status TEXT NOT NULL,
  extracted_text TEXT,
  generated_summary TEXT,
  summary_pdf_path TEXT,
  analysis_data JSONB,
  created_at TIMESTAMP DEFAULT NOW(),
  updated_at TIMESTAMP DEFAULT NOW()
);
```
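On the backend this row would typically be mirrored by a TypeScript type. A sketch is below; the concrete `status` values are assumptions, since the schema only constrains the column to `TEXT`.

```typescript
// Typed view of the documents table. The status values are assumed
// examples, not taken from the codebase.
type DocumentStatus = "uploaded" | "processing" | "completed" | "failed";

interface DocumentRow {
  id: string;
  user_id: string;
  original_file_name: string;
  file_path: string;
  file_size: number;
  status: DocumentStatus;
  extracted_text: string | null;
  generated_summary: string | null;
  summary_pdf_path: string | null;
  analysis_data: Record<string, unknown> | null;
  created_at: string;
  updated_at: string;
}

// A document in a terminal state needs no further pipeline work.
const isTerminal = (s: DocumentStatus): boolean =>
  s === "completed" || s === "failed";
```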

#### 2. Agentic RAG Sessions Table
```sql
CREATE TABLE agentic_rag_sessions (
  id UUID PRIMARY KEY,
  document_id UUID REFERENCES documents(id),
  strategy TEXT NOT NULL,
  status TEXT NOT NULL,
  total_agents INTEGER,
  completed_agents INTEGER,
  failed_agents INTEGER,
  overall_validation_score DECIMAL,
  processing_time_ms INTEGER,
  api_calls_count INTEGER,
  total_cost DECIMAL,
  created_at TIMESTAMP DEFAULT NOW(),
  completed_at TIMESTAMP
);
```

#### 3. Vector Database Tables
```sql
CREATE TABLE document_chunks (
  id UUID PRIMARY KEY,
  document_id UUID REFERENCES documents(id),
  content TEXT NOT NULL,
  embedding VECTOR(1536),
  chunk_index INTEGER,
  metadata JSONB,
  created_at TIMESTAMP DEFAULT NOW()
);
```
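Retrieval over `document_chunks` is a nearest-neighbour search on the `embedding` column (in Postgres, typically via pgvector's distance operators). The in-memory equivalent of that ranking, useful for understanding what the query does, is a cosine-similarity top-k; this sketch assumes cosine is the metric in use.

```typescript
// Cosine similarity between two equal-length vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Return the contents of the k chunks most similar to the query vector --
// the in-memory analogue of a pgvector ORDER BY ... LIMIT k query.
function topK(
  query: number[],
  chunks: { content: string; embedding: number[] }[],
  k: number
): string[] {
  return [...chunks]
    .sort((x, y) => cosine(query, y.embedding) - cosine(query, x.embedding))
    .slice(0, k)
    .map((c) => c.content);
}
```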

## API Endpoints

### Active Endpoints

#### Document Management
- `POST /documents/upload-url` - Get signed upload URL
- `POST /documents/:id/confirm-upload` - Confirm upload and start processing
- `POST /documents/:id/process-optimized-agentic-rag` - Trigger AI processing
- `GET /documents/:id/download` - Download processed PDF
- `DELETE /documents/:id` - Delete document

#### Analytics & Monitoring
- `GET /documents/analytics` - Get processing analytics
- `GET /documents/:id/agentic-rag-sessions` - Get processing sessions
- `GET /monitoring/dashboard` - Get monitoring dashboard
- `GET /vector/stats` - Get vector database statistics

### Legacy Endpoints (Kept for Backward Compatibility)
- `POST /documents/upload` - Multipart file upload (legacy)
- `GET /documents` - List documents (basic CRUD)

## Configuration

### Environment Variables

**Backend** (`backend/src/config/env.ts`):
```typescript
// Google Cloud
GOOGLE_CLOUD_PROJECT_ID
GOOGLE_CLOUD_STORAGE_BUCKET
GOOGLE_APPLICATION_CREDENTIALS

// Document AI
GOOGLE_DOCUMENT_AI_LOCATION
GOOGLE_DOCUMENT_AI_PROCESSOR_ID

// Database
DATABASE_URL
SUPABASE_URL
SUPABASE_ANON_KEY

// AI Services
ANTHROPIC_API_KEY
OPENAI_API_KEY

// Processing
AGENTIC_RAG_ENABLED=true
PROCESSING_STRATEGY=optimized_agentic_rag

// LLM Configuration
LLM_PROVIDER=anthropic
LLM_MODEL=claude-3-opus-20240229
LLM_MAX_TOKENS=4000
LLM_TEMPERATURE=0.1
```
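A config module like `env.ts` usually validates these at startup so missing credentials fail fast rather than mid-pipeline. A sketch follows; which variables are required versus defaulted is an assumption, and `loadEnv` is an illustrative name.

```typescript
// Fail fast on missing required variables; fall back to the documented
// defaults for optional ones. The required/optional split is assumed.
function loadEnv(source: Record<string, string | undefined>) {
  const required = ["GOOGLE_CLOUD_PROJECT_ID", "DATABASE_URL", "ANTHROPIC_API_KEY"];
  const missing = required.filter((k) => !source[k]);
  if (missing.length > 0) {
    throw new Error(`Missing required env vars: ${missing.join(", ")}`);
  }
  return {
    projectId: source.GOOGLE_CLOUD_PROJECT_ID!,
    databaseUrl: source.DATABASE_URL!,
    anthropicKey: source.ANTHROPIC_API_KEY!,
    llmModel: source.LLM_MODEL ?? "claude-3-opus-20240229",
    llmMaxTokens: Number(source.LLM_MAX_TOKENS ?? 4000),
  };
}
```

In the real module the `source` would be `process.env`; taking it as a parameter keeps the sketch testable.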

**Frontend** (`frontend/src/config/env.ts`):
```typescript
// API
VITE_API_BASE_URL
VITE_FIREBASE_API_KEY
VITE_FIREBASE_AUTH_DOMAIN
```

## Processing Strategy Details

### Current Strategy: Optimized Agentic RAG

**Why This Strategy**:
- Handles large documents efficiently
- Provides structured analysis output
- Optimizes memory usage and API costs
- Generates high-quality summaries

**How It Works**:
1. **Text Extraction**: Google Document AI extracts text from PDF
2. **Semantic Chunking**: Splits text at natural boundaries (sections, paragraphs)
3. **Vector Embedding**: Creates embeddings for each chunk
4. **LLM Analysis**: llmService calls Claude AI to analyze chunks and generate structured data
5. **PDF Generation**: pdfGenerationService creates summary PDF with analysis results

**Output Format**: Structured CIM Review data including:
- Deal Overview
- Business Description
- Market Analysis
- Financial Summary
- Management Team
- Investment Thesis
- Key Questions & Next Steps

## Error Handling

### Frontend Error Handling
- **Network Errors**: Automatic retry with exponential backoff
- **Authentication Errors**: Automatic token refresh or redirect to login
- **Upload Errors**: User-friendly error messages with retry options
- **Processing Errors**: Real-time error display with retry functionality

### Backend Error Handling
- **Validation Errors**: Input validation with detailed error messages
- **Processing Errors**: Graceful degradation with error logging
- **Storage Errors**: Retry logic for transient failures
- **Database Errors**: Connection pooling and retry mechanisms
- **LLM API Errors**: Retry logic with exponential backoff
- **PDF Generation Errors**: Fallback to text-only output

### Error Recovery Mechanisms
- **LLM API Failures**: Up to 3 retry attempts with different models
- **Processing Timeouts**: Graceful timeout handling with partial results
- **Memory Issues**: Automatic garbage collection and memory cleanup
- **File Storage Errors**: Retry with exponential backoff
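The retry-with-exponential-backoff pattern used across these recovery paths can be sketched generically. The 3-attempt cap matches the error handling flow above; the 1-second base delay and doubling schedule are assumptions, and `withRetry` is a hypothetical name rather than a function from the codebase.

```typescript
// Exponential backoff schedule: 1000ms, 2000ms, 4000ms for attempts 0..2.
const backoffMs = (attempt: number, baseMs = 1000): number =>
  baseMs * 2 ** attempt;

// Retry an async operation up to `maxAttempts` times, sleeping between
// tries. `sleep` is injectable so tests need not wait in real time.
async function withRetry<T>(
  op: () => Promise<T>,
  maxAttempts = 3,
  sleep: (ms: number) => Promise<void> = (ms) => new Promise((r) => setTimeout(r, ms))
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await op();
    } catch (err) {
      lastError = err;
      if (attempt < maxAttempts - 1) await sleep(backoffMs(attempt));
    }
  }
  throw lastError;
}
```

Wrapping an LLM or GCS call is then one line: `withRetry(() => callAnthropic(prompt))`.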

## Monitoring & Analytics

### Real-time Monitoring
- Upload progress tracking
- Processing status updates
- Error rate monitoring
- Performance metrics
- API usage tracking
- Cost monitoring

### Analytics Dashboard
- Processing success rates
- Average processing times
- API usage statistics
- Cost tracking
- User activity metrics
- Error analysis reports

## Security

### Authentication
- Firebase Authentication
- JWT token validation
- Protected API endpoints
- User-specific data isolation
- Session management with secure token handling

### File Security
- Signed URLs for secure uploads
- File type validation (PDF only)
- File size limits (50MB max)
- User-specific file storage paths
- Secure file deletion

### API Security
- Rate limiting (1000 requests per 15 minutes)
- CORS configuration
- Input validation
- SQL injection prevention
- Request correlation IDs for tracking
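Correlation IDs are usually attached by a small middleware at the edge of the request pipeline. A self-contained sketch follows, with minimal request/response shapes standing in for Express's; the `x-correlation-id` header name is a common convention assumed here, not confirmed from the codebase.

```typescript
import { randomUUID } from "node:crypto";

// Minimal shapes so the middleware sketch is self-contained.
interface Req { headers: Record<string, string | undefined>; correlationId?: string; }
interface Res { setHeader(name: string, value: string): void; }

// Reuse an incoming x-correlation-id or mint a fresh one, and echo it
// back so logs on both sides of the call can be joined later.
function correlationId(req: Req, res: Res, next: () => void): void {
  const id = req.headers["x-correlation-id"] ?? randomUUID();
  req.correlationId = id;
  res.setHeader("x-correlation-id", id);
  next();
}
```

Every log line and error report downstream then carries `req.correlationId`, which is what makes the "log error with correlation ID" step in the error handling flow traceable end to end.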

## Performance Optimization

### Memory Management
- Batch processing to limit memory usage
- Garbage collection optimization
- Connection pooling for database
- Efficient chunking to minimize memory footprint

### API Optimization
- Rate limiting to prevent API quota exhaustion
- Caching for frequently accessed data
- Efficient chunking to minimize API calls
- Model selection based on task complexity

### Processing Optimization
- Concurrent processing with limits
- Intelligent chunking for optimal processing
- Background job processing
- Progress tracking for user feedback
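"Concurrent processing with limits" typically comes down to a helper like the one below, which is how the 5-concurrent-embedding-call cap mentioned earlier would be enforced. This is a generic sketch under that assumption, not the service's actual code.

```typescript
// Map over items with at most `limit` operations in flight at once,
// preserving result order.
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  fn: (item: T) => Promise<R>
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;
  // Each worker repeatedly claims the next unprocessed index; JS's
  // single-threaded event loop makes `next++` safe between awaits.
  const worker = async (): Promise<void> => {
    while (next < items.length) {
      const i = next++;
      results[i] = await fn(items[i]);
    }
  };
  await Promise.all(Array.from({ length: Math.min(limit, items.length) }, worker));
  return results;
}
```

Embedding a document's chunks would then be `mapWithConcurrency(chunks, 5, embed)`: memory and API pressure stay bounded regardless of document size.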

## Deployment

### Backend Deployment
- **Firebase Functions**: Serverless deployment
- **Google Cloud Run**: Containerized deployment
- **Docker**: Container support

### Frontend Deployment
- **Firebase Hosting**: Static hosting
- **Vite**: Build tool
- **TypeScript**: Type safety

## Development Workflow

### Local Development
1. **Backend**: `npm run dev` (runs on port 5001)
2. **Frontend**: `npm run dev` (runs on port 5173)
3. **Database**: Supabase local development
4. **Storage**: Google Cloud Storage (development bucket)

### Testing
- **Unit Tests**: Jest for backend, Vitest for frontend
- **Integration Tests**: End-to-end testing
- **API Tests**: Supertest for backend endpoints

## Troubleshooting

### Common Issues
1. **Upload Failures**: Check GCS permissions and bucket configuration
2. **Processing Timeouts**: Increase timeout limits for large documents
3. **Memory Issues**: Monitor memory usage and adjust batch sizes
4. **API Quotas**: Check API usage and implement rate limiting
5. **PDF Generation Failures**: Check Puppeteer installation and memory
6. **LLM API Errors**: Verify API keys and check rate limits

### Debug Tools
- Real-time logging with correlation IDs
- Upload monitoring dashboard
- Processing session details
- Error analysis reports
- Performance metrics dashboard

This documentation provides a comprehensive overview of the CIM Document Processor architecture, helping junior programmers understand the system's design, data flow, and key components.