# CIM Document Processor - AI-Powered CIM Analysis System
## 🎯 Project Overview
**Purpose**: Automated processing and analysis of Confidential Information Memorandums (CIMs) using AI-powered document understanding and structured data extraction.

**Core Technology Stack**:
- **Frontend**: React + TypeScript + Vite
- **Backend**: Node.js + Express + TypeScript
- **Database**: Supabase (PostgreSQL) + Vector Database
- **AI Services**: Google Document AI + Claude AI + OpenAI
- **Storage**: Google Cloud Storage
- **Authentication**: Firebase Auth
## 🏗️ Architecture Summary
```
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│    Frontend     │    │     Backend     │    │    External     │
│     (React)     │◄──►│    (Node.js)    │◄──►│    Services     │
└─────────────────┘    └─────────────────┘    └─────────────────┘
                                │                      │
                                ▼                      ▼
                       ┌─────────────────┐    ┌─────────────────┐
                       │    Database     │    │  Google Cloud   │
                       │   (Supabase)    │    │    Services     │
                       └─────────────────┘    └─────────────────┘
```
## 📁 Key Directories & Files
### Core Application
- `frontend/src/` - React frontend application
- `backend/src/` - Node.js backend services
- `backend/src/services/` - Core business logic services
- `backend/src/models/` - Database models and types
- `backend/src/routes/` - API route definitions
### Documentation
- `APP_DESIGN_DOCUMENTATION.md` - Complete system architecture
- `AGENTIC_RAG_IMPLEMENTATION_PLAN.md` - AI processing strategy
- `PDF_GENERATION_ANALYSIS.md` - PDF generation optimization
- `DEPLOYMENT_GUIDE.md` - Deployment instructions
- `ARCHITECTURE_DIAGRAMS.md` - Visual architecture documentation
### Configuration
- `backend/src/config/` - Environment and service configuration
- `frontend/src/config/` - Frontend configuration
- `backend/scripts/` - Setup and utility scripts
## 🚀 Quick Start
### Prerequisites
- Node.js 18+
- Google Cloud Platform account
- Supabase account
- Firebase project
### Environment Setup
```bash
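# Run from the repository root.

# Backend
cd backend
npm install
cp .env.example .env
# Edit .env and fill in the backend environment variables

# Frontend (step back out of backend/ first)
cd ../frontend
npm install
cp .env.example .env
# Edit .env and fill in the frontend environment variables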
```
### Development
```bash
# Run each dev server in its own terminal, starting from the repository root.

# Backend (port 5001)
cd backend && npm run dev

# Frontend (port 5173)
cd frontend && npm run dev
```
## 🔧 Core Services
### 1. Document Processing Pipeline
- **unifiedDocumentProcessor.ts** - Main orchestrator
- **optimizedAgenticRAGProcessor.ts** - AI-powered analysis
- **documentAiProcessor.ts** - Google Document AI integration
- **llmService.ts** - LLM interactions (Claude AI/OpenAI)
### 2. File Management
- **fileStorageService.ts** - Google Cloud Storage operations
- **pdfGenerationService.ts** - PDF report generation
- **uploadMonitoringService.ts** - Real-time upload tracking
### 3. Data Management
- **agenticRAGDatabaseService.ts** - Analytics and session management
- **vectorDatabaseService.ts** - Vector embeddings and search
- **sessionService.ts** - User session management
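
The "embeddings and search" role of `vectorDatabaseService.ts` comes down to ranking stored chunks by similarity to a query embedding. In a Supabase/PostgreSQL setup this ranking would normally run inside the database, so the in-memory sketch below only illustrates the idea. The names `EmbeddedChunk`, `cosineSimilarity`, and `topKChunks` are hypothetical and are not taken from the codebase.

```typescript
// Hypothetical types and helpers for illustration only.
interface EmbeddedChunk {
  id: string;
  embedding: number[];
}

// Cosine similarity: 1 for identical direction, 0 for orthogonal vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("vectors must have equal length");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denominator = Math.sqrt(normA) * Math.sqrt(normB);
  return denominator === 0 ? 0 : dot / denominator;
}

// Returns the k chunks most similar to the query, best match first.
function topKChunks(
  query: number[],
  chunks: EmbeddedChunk[],
  k: number
): { id: string; score: number }[] {
  return chunks
    .map((chunk) => ({ id: chunk.id, score: cosineSimilarity(query, chunk.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}
```

The top-ranked chunks would then be passed to the LLM as context for a given review section.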
## 📊 Processing Strategies
### Current Active Strategy: Optimized Agentic RAG
1. **Text Extraction** - Google Document AI extracts text from PDF
2. **Semantic Chunking** - Split text into 4000-char chunks with overlap
3. **Vector Embedding** - Generate embeddings for each chunk
4. **LLM Analysis** - Claude AI analyzes chunks and generates structured data
5. **PDF Generation** - Create summary PDF with analysis results
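
Step 2 can be sketched as a fixed-size sliding window. This is a minimal illustration, not the processor's actual code: the helper name `chunkText`, the 200-character overlap, and the purely character-based split are assumptions. The README specifies only the 4000-character chunk size, and a truly semantic chunker would also respect sentence or section boundaries.

```typescript
// Hypothetical helper: splits text into fixed-size windows that share
// `overlap` characters with the previous window.
interface TextChunk {
  index: number; // position of the chunk within the document
  start: number; // inclusive character offset
  end: number; // exclusive character offset
  text: string;
}

function chunkText(text: string, chunkSize = 4000, overlap = 200): TextChunk[] {
  if (chunkSize <= 0) throw new Error("chunkSize must be positive");
  if (overlap < 0 || overlap >= chunkSize) {
    throw new Error("overlap must be in [0, chunkSize)");
  }
  const chunks: TextChunk[] = [];
  const step = chunkSize - overlap; // how far each window advances
  for (let start = 0; start < text.length; start += step) {
    const end = Math.min(start + chunkSize, text.length);
    chunks.push({ index: chunks.length, start, end, text: text.slice(start, end) });
    if (end === text.length) break; // this window already reached the end
  }
  return chunks;
}
```

With these defaults, a 10,000-character document yields three chunks starting at offsets 0, 3800, and 7600, so text cut at one chunk boundary still appears intact in the neighbouring chunk.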
### Output Format
Structured CIM Review data including:
- Deal Overview
- Business Description
- Market Analysis
- Financial Summary
- Management Team
- Investment Thesis
- Key Questions & Next Steps
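
As a rough sketch, the sections above could be typed as follows. The field names and the choice of one free-text string per section are assumptions made for illustration. The real schema is likely richer (nested fields, numeric financials), so treat `CimReview` and `missingSections` as hypothetical.

```typescript
// Hypothetical shape: one free-text field per section listed above.
interface CimReview {
  dealOverview: string;
  businessDescription: string;
  marketAnalysis: string;
  financialSummary: string;
  managementTeam: string;
  investmentThesis: string;
  keyQuestionsAndNextSteps: string;
}

const CIM_REVIEW_SECTIONS: (keyof CimReview)[] = [
  "dealOverview",
  "businessDescription",
  "marketAnalysis",
  "financialSummary",
  "managementTeam",
  "investmentThesis",
  "keyQuestionsAndNextSteps",
];

// Returns the sections that are absent or blank, for example to flag an
// incomplete LLM response before a PDF is generated.
function missingSections(review: Partial<CimReview>): (keyof CimReview)[] {
  return CIM_REVIEW_SECTIONS.filter((key) => {
    const value = review[key];
    return value === undefined || value.trim() === "";
  });
}
```

A check like this makes it possible to retry or flag a document instead of producing a summary PDF with empty sections.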
## 🔌 API Endpoints
### Document Management
- `POST /documents/upload-url` - Get signed upload URL
- `POST /documents/:id/confirm-upload` - Confirm upload and start processing
- `POST /documents/:id/process-optimized-agentic-rag` - Trigger AI processing
- `GET /documents/:id/download` - Download processed PDF
- `DELETE /documents/:id` - Delete document
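
The first two endpoints suggest a three-step upload: request a signed URL, send the file to storage, then confirm. The sketch below shows that flow under stated assumptions. The request and response field names (`fileName`, `documentId`, `uploadUrl`), the `Bearer` header, the `application/pdf` content type, and the injected `HttpFn` wrapper are all hypothetical. Check the route handlers for the real contract.

```typescript
// Minimal fetch-like signature so the flow can be exercised without a network.
interface HttpResponse {
  ok: boolean;
  status: number;
  json(): Promise<unknown>;
}

type HttpFn = (
  url: string,
  init: { method: string; headers?: Record<string, string>; body?: string | Uint8Array }
) => Promise<HttpResponse>;

async function uploadDocument(
  http: HttpFn,
  apiBase: string,
  idToken: string,
  fileName: string,
  fileBytes: Uint8Array
): Promise<string> {
  const apiHeaders = {
    Authorization: `Bearer ${idToken}`,
    "Content-Type": "application/json",
  };

  // 1. Ask the backend for a signed upload URL.
  const urlRes = await http(`${apiBase}/documents/upload-url`, {
    method: "POST",
    headers: apiHeaders,
    body: JSON.stringify({ fileName }), // assumed request body
  });
  if (!urlRes.ok) throw new Error(`upload-url failed: ${urlRes.status}`);
  const { documentId, uploadUrl } = (await urlRes.json()) as {
    documentId: string; // assumed response fields
    uploadUrl: string;
  };

  // 2. Send the file bytes straight to storage using the signed URL.
  const putRes = await http(uploadUrl, {
    method: "PUT",
    headers: { "Content-Type": "application/pdf" },
    body: fileBytes,
  });
  if (!putRes.ok) throw new Error(`storage upload failed: ${putRes.status}`);

  // 3. Tell the backend the upload finished so processing can start.
  const confirmRes = await http(`${apiBase}/documents/${documentId}/confirm-upload`, {
    method: "POST",
    headers: apiHeaders,
  });
  if (!confirmRes.ok) throw new Error(`confirm-upload failed: ${confirmRes.status}`);
  return documentId;
}
```

In the application, `http` would be a thin wrapper around `fetch`. Injecting it keeps the flow testable with a fake transport.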
### Analytics & Monitoring
- `GET /documents/analytics` - Get processing analytics
- `GET /documents/processing-stats` - Get processing statistics
- `GET /documents/:id/agentic-rag-sessions` - Get processing sessions
- `GET /monitoring/upload-metrics` - Get upload metrics
- `GET /monitoring/upload-health` - Get upload health status
- `GET /monitoring/real-time-stats` - Get real-time statistics
- `GET /vector/stats` - Get vector database statistics
## 🗄️ Database Schema
### Core Tables
- **documents** - Document metadata and processing status
- **agentic_rag_sessions** - AI processing session tracking
- **document_chunks** - Vector embeddings and chunk data
- **processing_jobs** - Background job management
- **users** - User authentication and profiles
## 🔐 Security
- Firebase Authentication with JWT validation
- Protected API endpoints with user-specific data isolation
- Signed URLs for secure file uploads
- Rate limiting and input validation
- CORS configuration for cross-origin requests
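
The first two points can be sketched as a pair of checks: resolve the caller from the `Authorization` header, then allow access only to that caller's documents. The helper names and the `OwnedDocument` shape are hypothetical. Token verification is injected so the sketch runs without Firebase. With the Firebase Admin SDK, the `verify` argument would typically wrap `verifyIdToken`.

```typescript
// Hypothetical helpers; the project's real middleware may differ.
function extractBearerToken(authorizationHeader: string | undefined): string | null {
  if (!authorizationHeader) return null;
  const match = /^Bearer\s+(\S+)$/i.exec(authorizationHeader.trim());
  return match?.[1] ?? null;
}

// Verifier is passed in; it should reject for invalid or expired tokens.
type VerifyFn = (token: string) => Promise<{ uid: string }>;

interface OwnedDocument {
  id: string;
  userId: string; // assumed ownership column
}

// Resolves the caller's uid, or null when the header or token is invalid.
async function authenticate(
  header: string | undefined,
  verify: VerifyFn
): Promise<string | null> {
  const token = extractBearerToken(header);
  if (!token) return null;
  try {
    return (await verify(token)).uid;
  } catch {
    return null; // verification failed, treat the request as unauthenticated
  }
}

// User-specific data isolation: a caller may only access documents they own.
function canAccess(uid: string, doc: OwnedDocument): boolean {
  return doc.userId === uid;
}
```

A route handler would respond with 401 when `authenticate` returns `null` and with 403 or 404 when `canAccess` is false.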
## 📈 Performance & Monitoring
### Real-time Monitoring
- Upload progress tracking
- Processing status updates
- Error rate monitoring
- Performance metrics
- API usage tracking
- Cost monitoring
### Analytics Dashboard
- Processing success rates
- Average processing times
- API usage statistics
- Cost tracking
- User activity metrics
- Error analysis reports
## 🚨 Error Handling
### Frontend Error Handling
- Network errors with automatic retry
- Authentication errors with token refresh
- Upload errors with user-friendly messages
- Processing errors with real-time display
### Backend Error Handling
- Validation errors with detailed messages
- Processing errors with graceful degradation
- Storage errors with retry logic
- Database errors with connection pooling
- LLM API errors with exponential backoff
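
Exponential backoff for LLM calls can be sketched as below. The helper names, the 500 ms base delay, the 30 s cap, the five-attempt limit, and the use of random jitter are assumptions for illustration, not values taken from `llmService.ts`.

```typescript
// Hypothetical helpers and defaults; tune them to the provider's rate limits.
function backoffDelayMs(attempt: number, baseMs = 500, maxMs = 30_000): number {
  // attempt 0 -> baseMs, attempt 1 -> 2 * baseMs, and so on, capped at maxMs
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

async function withRetry<T>(
  operation: () => Promise<T>,
  maxAttempts = 5,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise((resolve) => setTimeout(resolve, ms))
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await operation();
    } catch (error) {
      lastError = error;
      if (attempt < maxAttempts - 1) {
        // Jitter: wait a random time up to the exponential bound so that
        // concurrent jobs do not all retry at the same instant.
        await sleep(Math.random() * backoffDelayMs(attempt));
      }
    }
  }
  throw lastError; // every attempt failed
}
```

This version retries on any error. A production wrapper would usually retry only transient failures such as rate-limit or server errors, and fail fast on invalid requests or bad API keys.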
## 🧪 Testing
### Test Structure
- **Unit Tests**: Jest for backend, Vitest for frontend
- **Integration Tests**: End-to-end testing
- **API Tests**: Supertest for backend endpoints
### Test Coverage
- Service layer testing
- API endpoint testing
- Error handling scenarios
- Performance testing
- Security testing
## 📚 Documentation Index
### Technical Documentation
- [Application Design Documentation](APP_DESIGN_DOCUMENTATION.md) - Complete system architecture
- [Agentic RAG Implementation Plan](AGENTIC_RAG_IMPLEMENTATION_PLAN.md) - AI processing strategy
- [PDF Generation Analysis](PDF_GENERATION_ANALYSIS.md) - PDF optimization details
- [Architecture Diagrams](ARCHITECTURE_DIAGRAMS.md) - Visual system design
- [Deployment Guide](DEPLOYMENT_GUIDE.md) - Deployment instructions
### Analysis Reports
- [Codebase Audit Report](codebase-audit-report.md) - Code quality analysis
- [Dependency Analysis Report](DEPENDENCY_ANALYSIS_REPORT.md) - Dependency management
- [Document AI Integration Summary](DOCUMENT_AI_INTEGRATION_SUMMARY.md) - Google Document AI setup
## 🤝 Contributing
### Development Workflow
1. Create feature branch from main
2. Implement changes with tests
3. Update documentation
4. Submit pull request
5. Code review and approval
6. Merge to main
### Code Standards
- TypeScript for type safety
- ESLint for code quality
- Prettier for formatting
- Jest (backend) and Vitest (frontend) for testing
- Conventional commits for version control
## 📞 Support
### Common Issues
1. **Upload Failures** - Check GCS permissions and bucket configuration
2. **Processing Timeouts** - Increase timeout limits for large documents
3. **Memory Issues** - Monitor memory usage and adjust batch sizes
4. **API Quotas** - Check API usage and implement rate limiting
5. **PDF Generation Failures** - Check Puppeteer installation and memory
6. **LLM API Errors** - Verify API keys and check rate limits
### Debug Tools
- Real-time logging with correlation IDs
- Upload monitoring dashboard
- Processing session details
- Error analysis reports
- Performance metrics dashboard
## 📄 License
This project is proprietary software developed for BPCP. All rights reserved.

---

- **Last Updated**: December 2024
- **Version**: 1.0.0
- **Status**: Production Ready