# CIM Document Processor - AI-Powered CIM Analysis System

## 🎯 Project Overview

**Purpose**: Automated processing and analysis of Confidential Information Memorandums (CIMs) using AI-powered document understanding and structured data extraction.

**Core Technology Stack**:
- **Frontend**: React + TypeScript + Vite
- **Backend**: Node.js + Express + TypeScript
- **Database**: Supabase (PostgreSQL) + Vector Database
- **AI Services**: Google Document AI + Claude AI + OpenAI
- **Storage**: Google Cloud Storage
- **Authentication**: Firebase Auth

## 🏗️ Architecture Summary

```
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   Frontend      │    │    Backend      │    │   External      │
│   (React)       │◄──►│   (Node.js)     │◄──►│   Services      │
└─────────────────┘    └─────────────────┘    └─────────────────┘
                                │                      │
                                ▼                      ▼
                       ┌─────────────────┐    ┌─────────────────┐
                       │   Database      │    │  Google Cloud   │
                       │   (Supabase)    │    │   Services      │
                       └─────────────────┘    └─────────────────┘
```

## 📁 Key Directories & Files

### Core Application
- `frontend/src/` - React frontend application
- `backend/src/` - Node.js backend services
- `backend/src/services/` - Core business logic services
- `backend/src/models/` - Database models and types
- `backend/src/routes/` - API route definitions

### Documentation
- `APP_DESIGN_DOCUMENTATION.md` - Complete system architecture
- `PDF_GENERATION_ANALYSIS.md` - PDF generation optimization
- `DEPLOYMENT_GUIDE.md` - Deployment instructions
- `ARCHITECTURE_DIAGRAMS.md` - Visual architecture documentation
- `QUICK_START.md` - Quick start guide
- `TESTING_STRATEGY_DOCUMENTATION.md` - Testing guidelines
- `TROUBLESHOOTING_GUIDE.md` - Troubleshooting guide

### Configuration
- `backend/src/config/` - Environment and service configuration
- `frontend/src/config/` - Frontend configuration
- `backend/scripts/` - Setup and utility scripts

## 🚀 Quick Start

### Prerequisites
- Node.js 18+
- Google Cloud Platform account
- Supabase account
- Firebase project

### Environment Setup
```bash
# Backend
cd backend
npm install
cp .env.example .env
# Configure environment variables

# Frontend
cd frontend
npm install
cp .env.example .env
# Configure environment variables
```

### Development
```bash
# Backend (port 5001)
cd backend && npm run dev

# Frontend (port 5173)
cd frontend && npm run dev
```

## 🔧 Core Services

### 1. Document Processing Pipeline
- **unifiedDocumentProcessor.ts** - Main orchestrator
- **optimizedAgenticRAGProcessor.ts** - AI-powered analysis
- **documentAiProcessor.ts** - Google Document AI integration
- **llmService.ts** - LLM interactions (Claude AI/OpenAI)

### 2. File Management
- **fileStorageService.ts** - Google Cloud Storage operations
- **pdfGenerationService.ts** - PDF report generation
- **uploadMonitoringService.ts** - Real-time upload tracking

### 3. Data Management
- **vectorDatabaseService.ts** - Vector embeddings and search
- **jobQueueService.ts** - Background job processing
- **jobProcessorService.ts** - Job execution logic

## 📊 Processing Strategies

### Current Active Strategy: Optimized Agentic RAG
1. **Text Extraction** - Google Document AI extracts text from PDF
2. **Semantic Chunking** - Split text into 4000-char chunks with overlap
3. **Vector Embedding** - Generate embeddings for each chunk
4. **LLM Analysis** - Claude AI analyzes chunks and generates structured data
5. **PDF Generation** - Create summary PDF with analysis results

### Output Format
Structured CIM Review data including:
- Deal Overview
- Business Description
- Market Analysis
- Financial Summary
- Management Team
- Investment Thesis
- Key Questions & Next Steps

## 🔌 API Endpoints

### Document Management
- `POST /documents/upload-url` - Get signed upload URL
- `POST /documents/:id/confirm-upload` - Confirm upload and start processing
- `POST /documents/:id/process-optimized-agentic-rag` - Trigger AI processing
- `GET /documents/:id/download` - Download processed PDF
- `DELETE /documents/:id` - Delete document

### Analytics & Monitoring
- `GET /documents/analytics` - Get processing analytics
- `GET /documents/processing-stats` - Get processing statistics
- `GET /documents/:id/agentic-rag-sessions` - Get processing sessions
- `GET /monitoring/upload-metrics` - Get upload metrics
- `GET /monitoring/upload-health` - Get upload health status
- `GET /monitoring/real-time-stats` - Get real-time statistics
- `GET /vector/stats` - Get vector database statistics

## 🗄️ Database Schema

### Core Tables
- **documents** - Document metadata and processing status
- **agentic_rag_sessions** - AI processing session tracking
- **document_chunks** - Vector embeddings and chunk data
- **processing_jobs** - Background job management
- **users** - User authentication and profiles

## 🔐 Security

- Firebase Authentication with JWT validation
- Protected API endpoints with user-specific data isolation
- Signed URLs for secure file uploads
- Rate limiting and input validation
- CORS configuration for cross-origin requests

## 📈 Performance & Monitoring

### Real-time Monitoring
- Upload progress tracking
- Processing status updates
- Error rate monitoring
- Performance metrics
- API usage tracking
- Cost monitoring

### Analytics Dashboard
- Processing success rates
- Average processing times
- API usage statistics
- Cost tracking
- User activity metrics
- Error analysis reports

## 🚨 Error Handling

### Frontend Error Handling
- Network errors with automatic retry
- Authentication errors with token refresh
- Upload errors with user-friendly messages
- Processing errors with real-time display

### Backend Error Handling
- Validation errors with detailed messages
- Processing errors with graceful degradation
- Storage errors with retry logic
- Database errors with connection pooling
- LLM API errors with exponential backoff

## 🧪 Testing

### Test Structure
- **Unit Tests**: Vitest for backend and frontend
- **Integration Tests**: End-to-end testing
- **API Tests**: Supertest for backend endpoints

### Test Coverage
- Service layer testing
- API endpoint testing
- Error handling scenarios
- Performance testing
- Security testing

## 📚 Documentation Index

### Technical Documentation
- [Application Design Documentation](APP_DESIGN_DOCUMENTATION.md) - Complete system architecture
- [PDF Generation Analysis](PDF_GENERATION_ANALYSIS.md) - PDF optimization details
- [Architecture Diagrams](ARCHITECTURE_DIAGRAMS.md) - Visual system design
- [Deployment Guide](DEPLOYMENT_GUIDE.md) - Deployment instructions
- [Quick Start Guide](QUICK_START.md) - Getting started
- [Testing Strategy](TESTING_STRATEGY_DOCUMENTATION.md) - Testing guidelines
- [Troubleshooting Guide](TROUBLESHOOTING_GUIDE.md) - Common issues and solutions

## 🤝 Contributing

### Development Workflow
1. Create feature branch from main
2. Implement changes with tests
3. Update documentation
4. Submit pull request
5. Code review and approval
6. Merge to main

### Code Standards
- TypeScript for type safety
- ESLint for code quality
- Prettier for formatting
- Jest for testing
- Conventional commits for version control

## 📞 Support

### Common Issues
1. **Upload Failures** - Check GCS permissions and bucket configuration
2. **Processing Timeouts** - Increase timeout limits for large documents
3. **Memory Issues** - Monitor memory usage and adjust batch sizes
4. **API Quotas** - Check API usage and implement rate limiting
5. **PDF Generation Failures** - Check Puppeteer installation and memory
6. **LLM API Errors** - Verify API keys and check rate limits

### Debug Tools
- Real-time logging with correlation IDs
- Upload monitoring dashboard
- Processing session details
- Error analysis reports
- Performance metrics dashboard

## 📄 License

This project is proprietary software developed for BPCP. All rights reserved.

---

**Last Updated**: December 2024
**Version**: 1.0.0
**Status**: Production Ready
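
## 📎 Appendix: Chunking Sketch

The chunking step of the Optimized Agentic RAG strategy ("split text into 4000-char chunks with overlap") can be illustrated with a minimal sketch. This is not the code in `optimizedAgenticRAGProcessor.ts`: the `chunkText` function, the `TextChunk` shape, and the 200-character overlap are hypothetical names and values chosen for illustration, since this README states only the 4000-character chunk size. The sketch also uses plain fixed-size windows rather than semantic boundary detection, which this README does not describe.

```typescript
// Hypothetical helper for illustration only; not taken from the project source.
interface TextChunk {
  index: number; // position of the chunk in the sequence
  start: number; // inclusive character offset into the source text
  end: number; // exclusive character offset into the source text
  content: string;
}

// Splits text into fixed-size windows where each window repeats the last
// `overlap` characters of the previous one.
// chunkSize = 4000 comes from the README; overlap = 200 is an assumed value.
function chunkText(text: string, chunkSize = 4000, overlap = 200): TextChunk[] {
  if (chunkSize <= 0) {
    throw new RangeError("chunkSize must be positive");
  }
  if (overlap < 0 || overlap >= chunkSize) {
    throw new RangeError("overlap must be in the range [0, chunkSize)");
  }

  const chunks: TextChunk[] = [];
  const step = chunkSize - overlap; // how far the window advances each time

  for (let start = 0; start < text.length; start += step) {
    const end = Math.min(start + chunkSize, text.length);
    chunks.push({
      index: chunks.length,
      start,
      end,
      content: text.slice(start, end),
    });
    if (end === text.length) {
      break; // the final window reached the end; avoid a redundant tail chunk
    }
  }
  return chunks;
}
```

Under these assumptions, a 10,000-character document yields three chunks starting at offsets 0, 3800, and 7600. The overlap keeps sentences that straddle a window boundary visible in both neighbouring chunks, at the cost of embedding some text twice.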