# CIM Document Processor - AI-Powered CIM Analysis System

## 🎯 Project Overview

**Purpose**: Automated processing and analysis of Confidential Information Memorandums (CIMs) using AI-powered document understanding and structured data extraction.

**Core Technology Stack**:
- **Frontend**: React + TypeScript + Vite
- **Backend**: Node.js + Express + TypeScript
- **Database**: Supabase (PostgreSQL) + Vector Database
- **AI Services**: Google Document AI + Claude AI + OpenAI
- **Storage**: Google Cloud Storage
- **Authentication**: Firebase Auth

## 🏗️ Architecture Summary

```
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│    Frontend     │    │    Backend      │    │    External     │
│    (React)      │◄──►│   (Node.js)     │◄──►│    Services     │
└─────────────────┘    └─────────────────┘    └─────────────────┘
                                │                      │
                                ▼                      ▼
                       ┌─────────────────┐    ┌─────────────────┐
                       │    Database     │    │  Google Cloud   │
                       │   (Supabase)    │    │    Services     │
                       └─────────────────┘    └─────────────────┘
```

## 📁 Key Directories & Files

### Core Application
- `frontend/src/` - React frontend application
- `backend/src/` - Node.js backend services
- `backend/src/services/` - Core business logic services
- `backend/src/models/` - Database models and types
- `backend/src/routes/` - API route definitions

### Documentation
- `APP_DESIGN_DOCUMENTATION.md` - Complete system architecture
- `AGENTIC_RAG_IMPLEMENTATION_PLAN.md` - AI processing strategy
- `PDF_GENERATION_ANALYSIS.md` - PDF generation optimization
- `DEPLOYMENT_GUIDE.md` - Deployment instructions
- `ARCHITECTURE_DIAGRAMS.md` - Visual architecture documentation

### Configuration
- `backend/src/config/` - Environment and service configuration
- `frontend/src/config/` - Frontend configuration
- `backend/scripts/` - Setup and utility scripts

## 🚀 Quick Start

### Prerequisites
- Node.js 18+
- Google Cloud Platform account
- Supabase account
- Firebase project

### Environment Setup
```bash
# Backend
cd backend
npm install
cp .env.example .env
# Configure environment variables

# Frontend (from the backend directory, step back to the repository root first)
cd ../frontend
npm install
cp .env.example .env
# Configure environment variables
```

### Development
```bash
# Run each command in its own terminal, from the repository root

# Backend (port 5001)
cd backend && npm run dev

# Frontend (port 5173)
cd frontend && npm run dev
```

## 🔧 Core Services

### 1. Document Processing Pipeline
- **unifiedDocumentProcessor.ts** - Main orchestrator
- **optimizedAgenticRAGProcessor.ts** - AI-powered analysis
- **documentAiProcessor.ts** - Google Document AI integration
- **llmService.ts** - LLM interactions (Claude AI/OpenAI)

### 2. File Management
- **fileStorageService.ts** - Google Cloud Storage operations
- **pdfGenerationService.ts** - PDF report generation
- **uploadMonitoringService.ts** - Real-time upload tracking

### 3. Data Management
- **agenticRAGDatabaseService.ts** - Analytics and session management
- **vectorDatabaseService.ts** - Vector embeddings and search
- **sessionService.ts** - User session management

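To illustrate what "vector embeddings and search" means for `vectorDatabaseService.ts`, here is a minimal in-memory sketch of cosine-similarity ranking. It is not the service's actual code: the real implementation may delegate search to the database, and `cosineSimilarity`, `topK`, and the `embedding` field are hypothetical names chosen for this example.

```typescript
// Hypothetical illustration only; not taken from vectorDatabaseService.ts.

// Cosine similarity between two equal-length vectors.
function cosineSimilarity(a: readonly number[], b: readonly number[]): number {
  if (a.length === 0 || a.length !== b.length) {
    throw new Error("vectors must be non-empty and of equal length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  // A zero vector has no direction; treat it as dissimilar to everything.
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Return the k items whose embeddings are most similar to the query.
function topK<T extends { embedding: readonly number[] }>(
  query: readonly number[],
  items: readonly T[],
  k: number
): T[] {
  return [...items]
    .map((item) => ({ item, score: cosineSimilarity(query, item.embedding) }))
    .sort((left, right) => right.score - left.score)
    .slice(0, k)
    .map((entry) => entry.item);
}
```

With real embeddings, `topK(queryEmbedding, chunks, 5)` would return the five chunks closest to the query.
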
## 📊 Processing Strategies

### Current Active Strategy: Optimized Agentic RAG
1. **Text Extraction** - Google Document AI extracts text from PDF
2. **Semantic Chunking** - Split text into 4000-char chunks with overlap
3. **Vector Embedding** - Generate embeddings for each chunk
4. **LLM Analysis** - Claude AI analyzes chunks and generates structured data
5. **PDF Generation** - Create summary PDF with analysis results

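The chunking step can be sketched as follows. This is a simplified illustration, not the project's code: it splits on fixed character offsets only and does not attempt the semantic boundary detection the real processor may perform. The 4000-character size comes from the list above; the 200-character overlap and the names `TextChunk` and `chunkText` are assumptions made for this example.

```typescript
// Hypothetical helper: fixed-size chunking with overlap.
interface TextChunk {
  index: number;
  start: number; // inclusive character offset
  end: number;   // exclusive character offset
  text: string;
}

function chunkText(text: string, chunkSize = 4000, overlap = 200): TextChunk[] {
  if (chunkSize <= 0) throw new Error("chunkSize must be positive");
  if (overlap < 0 || overlap >= chunkSize) {
    throw new Error("overlap must be in the range [0, chunkSize)");
  }
  const chunks: TextChunk[] = [];
  const step = chunkSize - overlap; // each chunk starts `step` chars after the previous one
  for (let start = 0; start < text.length; start += step) {
    const end = Math.min(start + chunkSize, text.length);
    chunks.push({ index: chunks.length, start, end, text: text.slice(start, end) });
    if (end === text.length) break; // the last chunk reached the end of the text
  }
  return chunks;
}
```

Under these assumptions a 10,000-character document yields three chunks starting at offsets 0, 3800, and 7600, so each chunk repeats the last 200 characters of the one before it.
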
### Output Format
Structured CIM Review data including:
- Deal Overview
- Business Description
- Market Analysis
- Financial Summary
- Management Team
- Investment Thesis
- Key Questions & Next Steps

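One possible way to type this output is sketched below. The section list mirrors the bullets above, but the field names, the use of plain strings, and the `missingSections` helper are assumptions for illustration; the project's actual schema may be richer.

```typescript
// Hypothetical shape for the structured CIM review; field names are illustrative.
interface CimReview {
  dealOverview: string;
  businessDescription: string;
  marketAnalysis: string;
  financialSummary: string;
  managementTeam: string;
  investmentThesis: string;
  keyQuestionsAndNextSteps: string;
}

const CIM_REVIEW_SECTIONS: readonly (keyof CimReview)[] = [
  "dealOverview",
  "businessDescription",
  "marketAnalysis",
  "financialSummary",
  "managementTeam",
  "investmentThesis",
  "keyQuestionsAndNextSteps",
];

// Lists the sections that are absent or blank in a (possibly partial) LLM result.
function missingSections(review: Partial<CimReview>): (keyof CimReview)[] {
  return CIM_REVIEW_SECTIONS.filter((key) => {
    const value = review[key];
    return typeof value !== "string" || value.trim() === "";
  });
}
```

A check like this could run before PDF generation so that incomplete analyses are flagged rather than rendered with empty sections.
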
## 🔌 API Endpoints

### Document Management
- `POST /documents/upload-url` - Get signed upload URL
- `POST /documents/:id/confirm-upload` - Confirm upload and start processing
- `POST /documents/:id/process-optimized-agentic-rag` - Trigger AI processing
- `GET /documents/:id/download` - Download processed PDF
- `DELETE /documents/:id` - Delete document

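A client might chain the first two endpoints roughly as shown below. This is a sketch under stated assumptions: the request and response field names (`fileName`, `documentId`, `uploadUrl`), the use of `PUT` against the signed URL, and the `HttpClient` abstraction (for example, a thin adapter around `fetch`) are hypothetical and should be checked against the actual route handlers.

```typescript
// Minimal HTTP abstraction so the flow can be exercised without a network.
interface HttpResponse {
  ok: boolean;
  status: number;
  json(): Promise<unknown>;
}
interface HttpRequest {
  method: "POST" | "PUT";
  headers?: Record<string, string>;
  body?: string | Uint8Array;
}
type HttpClient = (url: string, request: HttpRequest) => Promise<HttpResponse>;

// Assumed response shape for POST /documents/upload-url.
interface UploadUrlResponse {
  documentId: string;
  uploadUrl: string;
}

async function uploadDocument(
  http: HttpClient,
  apiBase: string,
  idToken: string,
  fileName: string,
  pdf: Uint8Array
): Promise<string> {
  const headers = { Authorization: `Bearer ${idToken}`, "Content-Type": "application/json" };

  // 1. Ask the backend for a signed upload URL.
  const urlRes = await http(`${apiBase}/documents/upload-url`, {
    method: "POST",
    headers,
    body: JSON.stringify({ fileName }),
  });
  if (!urlRes.ok) throw new Error(`upload-url failed with status ${urlRes.status}`);
  const { documentId, uploadUrl } = (await urlRes.json()) as UploadUrlResponse;

  // 2. Send the file bytes directly to storage via the signed URL.
  const putRes = await http(uploadUrl, {
    method: "PUT",
    headers: { "Content-Type": "application/pdf" },
    body: pdf,
  });
  if (!putRes.ok) throw new Error(`file upload failed with status ${putRes.status}`);

  // 3. Tell the backend the upload finished so processing can start.
  const confirmRes = await http(`${apiBase}/documents/${documentId}/confirm-upload`, {
    method: "POST",
    headers,
  });
  if (!confirmRes.ok) throw new Error(`confirm-upload failed with status ${confirmRes.status}`);
  return documentId;
}
```

Passing the HTTP client in as a parameter keeps the three-step sequence easy to unit test with a fake client.
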
### Analytics & Monitoring
- `GET /documents/analytics` - Get processing analytics
- `GET /documents/processing-stats` - Get processing statistics
- `GET /documents/:id/agentic-rag-sessions` - Get processing sessions
- `GET /monitoring/upload-metrics` - Get upload metrics
- `GET /monitoring/upload-health` - Get upload health status
- `GET /monitoring/real-time-stats` - Get real-time statistics
- `GET /vector/stats` - Get vector database statistics

## 🗄️ Database Schema

### Core Tables
- **documents** - Document metadata and processing status
- **agentic_rag_sessions** - AI processing session tracking
- **document_chunks** - Vector embeddings and chunk data
- **processing_jobs** - Background job management
- **users** - User authentication and profiles

## 🔐 Security

- Firebase Authentication with JWT validation
- Protected API endpoints with user-specific data isolation
- Signed URLs for secure file uploads
- Rate limiting and input validation
- CORS configuration for cross-origin requests

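The JWT validation step can be sketched as a framework-agnostic helper. The names `TokenVerifier`, `extractBearerToken`, and `authenticate` are hypothetical. In a real backend the `verify` function would likely wrap `getAuth().verifyIdToken(token)` from `firebase-admin/auth`, and the result would be applied inside an Express middleware; neither detail is confirmed by this README.

```typescript
// Hypothetical verifier signature; a real one could wrap Firebase Admin's verifyIdToken.
type TokenVerifier = (idToken: string) => Promise<{ uid: string }>;

type AuthResult =
  | { status: 200; uid: string }
  | { status: 401; error: string };

// Pulls the token out of an "Authorization: Bearer <token>" header value.
function extractBearerToken(header: string | undefined): string | null {
  if (!header) return null;
  const match = /^Bearer\s+(\S+)$/i.exec(header.trim());
  return match?.[1] ?? null;
}

// Validates the header and returns either the caller's uid or a 401 result.
async function authenticate(
  header: string | undefined,
  verify: TokenVerifier
): Promise<AuthResult> {
  const token = extractBearerToken(header);
  if (!token) return { status: 401, error: "Missing bearer token" };
  try {
    const { uid } = await verify(token);
    return { status: 200, uid };
  } catch {
    // Expired, malformed, or revoked tokens all end up here.
    return { status: 401, error: "Invalid or expired token" };
  }
}
```

The returned `uid` is what route handlers would use to scope queries to the caller's own documents, which is one way to achieve the user-specific data isolation listed above.
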
## 📈 Performance & Monitoring

### Real-time Monitoring
- Upload progress tracking
- Processing status updates
- Error rate monitoring
- Performance metrics
- API usage tracking
- Cost monitoring

### Analytics Dashboard
- Processing success rates
- Average processing times
- API usage statistics
- Cost tracking
- User activity metrics
- Error analysis reports

## 🚨 Error Handling

### Frontend Error Handling
- Network errors with automatic retry
- Authentication errors with token refresh
- Upload errors with user-friendly messages
- Processing errors with real-time display

### Backend Error Handling
- Validation errors with detailed messages
- Processing errors with graceful degradation
- Storage errors with retry logic
- Database errors with connection pooling
- LLM API errors with exponential backoff

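Exponential backoff for LLM API errors could look roughly like the sketch below. The helper names, the default values in the usage note, and the decision to retry on every error are assumptions for illustration; a production version would usually add jitter and retry only on transient failures such as rate limits.

```typescript
// Hypothetical retry helper; not taken from llmService.ts.
interface RetryOptions {
  maxAttempts: number;
  baseDelayMs: number;
  maxDelayMs: number;
  // Injectable for tests; defaults to a real timer.
  sleep?: (ms: number) => Promise<void>;
}

// Delay before retry number `attempt` (0-based): base * 2^attempt, capped at maxDelayMs.
function backoffDelay(attempt: number, baseDelayMs: number, maxDelayMs: number): number {
  return Math.min(maxDelayMs, baseDelayMs * 2 ** attempt);
}

async function withBackoff<T>(fn: () => Promise<T>, options: RetryOptions): Promise<T> {
  if (options.maxAttempts < 1) throw new Error("maxAttempts must be at least 1");
  const sleep =
    options.sleep ?? ((ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms)));
  let lastError: unknown;
  for (let attempt = 0; attempt < options.maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (error) {
      lastError = error;
      const isLastAttempt = attempt === options.maxAttempts - 1;
      if (!isLastAttempt) {
        await sleep(backoffDelay(attempt, options.baseDelayMs, options.maxDelayMs));
      }
    }
  }
  throw lastError; // every attempt failed
}
```

With `baseDelayMs: 500` and `maxDelayMs: 8000`, the waits between attempts are 500 ms, 1000 ms, 2000 ms, 4000 ms, and then 8000 ms for every later retry.
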
## 🧪 Testing

### Test Structure
- **Unit Tests**: Jest for backend, Vitest for frontend
- **Integration Tests**: End-to-end testing
- **API Tests**: Supertest for backend endpoints

### Test Coverage
- Service layer testing
- API endpoint testing
- Error handling scenarios
- Performance testing
- Security testing

## 📚 Documentation Index

### Technical Documentation
- [Application Design Documentation](APP_DESIGN_DOCUMENTATION.md) - Complete system architecture
- [Agentic RAG Implementation Plan](AGENTIC_RAG_IMPLEMENTATION_PLAN.md) - AI processing strategy
- [PDF Generation Analysis](PDF_GENERATION_ANALYSIS.md) - PDF optimization details
- [Architecture Diagrams](ARCHITECTURE_DIAGRAMS.md) - Visual system design
- [Deployment Guide](DEPLOYMENT_GUIDE.md) - Deployment instructions

### Analysis Reports
- [Codebase Audit Report](codebase-audit-report.md) - Code quality analysis
- [Dependency Analysis Report](DEPENDENCY_ANALYSIS_REPORT.md) - Dependency management
- [Document AI Integration Summary](DOCUMENT_AI_INTEGRATION_SUMMARY.md) - Google Document AI setup

## 🤝 Contributing

### Development Workflow
1. Create feature branch from main
2. Implement changes with tests
3. Update documentation
4. Submit pull request
5. Code review and approval
6. Merge to main

### Code Standards
- TypeScript for type safety
- ESLint for code quality
- Prettier for formatting
- Jest for testing
- Conventional commits for version control

## 📞 Support

### Common Issues
1. **Upload Failures** - Check GCS permissions and bucket configuration
2. **Processing Timeouts** - Increase timeout limits for large documents
3. **Memory Issues** - Monitor memory usage and adjust batch sizes
4. **API Quotas** - Check API usage and implement rate limiting
5. **PDF Generation Failures** - Check Puppeteer installation and memory
6. **LLM API Errors** - Verify API keys and check rate limits

### Debug Tools
- Real-time logging with correlation IDs
- Upload monitoring dashboard
- Processing session details
- Error analysis reports
- Performance metrics dashboard

## 📄 License

This project is proprietary software developed for BPCP. All rights reserved.

---

**Last Updated**: December 2024
**Version**: 1.0.0
**Status**: Production Ready