- Add `pre-deploy-check.sh` script to validate `.env` doesn't contain secrets
- Add `clean-env-secrets.sh` script to remove secrets from `.env` before deployment
- Update `deploy:firebase` script to run validation automatically
- Add `sync-secrets` npm script for local development
- Add `deploy:firebase:force` for deployments that skip validation

This prevents "Secret environment variable overlaps non secret environment variable" errors by ensuring that secrets defined via `defineSecret()` are not also present in the `.env` file.

## Completed Todos

- ✅ Test financial extraction with Stax Holding Company CIM - all values correct (FY-3: $64M, FY-2: $71M, FY-1: $71M, LTM: $76M)
- ✅ Implement deterministic parser fallback - integrated into `simpleDocumentProcessor`
- ✅ Implement few-shot examples - added comprehensive examples for PRIMARY table identification
- ✅ Fix primary table identification - financial extraction now correctly identifies the PRIMARY table (millions) vs. subsidiary tables (thousands)

## Pending Todos

1. Review older commits (1-2 months ago) to see how financial extraction was working then
   - Check commits: 185c780 (Claude 3.7), 5b3b1bf (Document AI fixes), 0ec3d14 (multi-pass extraction)
   - Compare prompt simplicity - older versions may have had simpler, more effective prompts
   - Check if the deterministic parser was being used more effectively
2. Review best practices for structured financial data extraction from PDFs/CIMs
   - Research LLM prompt engineering for tabular data (few-shot examples, chain-of-thought)
   - Period identification strategies, validation techniques, and hybrid approaches (deterministic + LLM)
   - Error handling patterns
   - Check academic papers and industry case studies
3. Determine how to reduce processing time without sacrificing accuracy. Options:
   1. Use Claude Haiku 4.5 for initial extraction, Sonnet 4.5 for validation
   2. Parallel extraction of different sections
   3. Caching common patterns
   4. Streaming responses
   5. Incremental processing with early validation
   6. Reduce prompt verbosity while maintaining clarity
4. Add unit tests for financial extraction validation logic
   - Test invalid-value rejection, cross-period validation, and numeric extraction
   - Test period identification from various formats (years, FY-X, mixed)
   - Include edge cases: missing periods, projections mixed with historical data, inconsistent formatting
5. Monitor production financial extraction accuracy
   - Track extraction success rate, validation rejection rate, and common error patterns
   - Collect user feedback on extracted financial data
   - Set up alerts for validation failures and extraction inconsistencies
6. Optimize prompt size for financial extraction
   - Current prompts may be too verbose; test shorter, more focused prompts that maintain accuracy
   - Consider removing redundant instructions, using more concise examples, and focusing on critical rules only
7. Add financial data visualization
   - Consider adding a financial data preview/validation step in the UI
   - Allow users to verify and correct extracted values if needed
   - Provides human-in-the-loop validation for critical financial data
8. Document extraction strategies
   - Document the different financial table formats found in CIMs
   - Create a reference guide for common patterns (years format, FY-X format, mixed format, etc.)
   - This will help with prompt engineering and parser improvements
9. Compare RAG-based extraction vs. simple full-document extraction for financial accuracy
   - Determine which approach produces more accurate financial data and why
   - May need a hybrid approach
10. Add confidence scores to financial extraction results
    - Flag low-confidence extractions for manual review
    - Helps identify when extraction may be incorrect and needs human validation
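The secret-overlap check described in the commit notes above can be sketched as a small validation step run before deploy. The secret names below are hypothetical examples for illustration, not the project's actual `defineSecret()` names.

```typescript
// Sketch of the pre-deploy validation: fail the deploy if any name registered
// via defineSecret() also appears as a key in the .env file.
// The secret names here are hypothetical examples.
const DEFINED_SECRETS = ["ANTHROPIC_API_KEY", "OPENAI_API_KEY"];

function findOverlaps(envContent: string, definedSecrets: string[]): string[] {
  const envKeys = envContent
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0 && !line.startsWith("#"))
    .map((line) => line.split("=")[0].trim());
  return definedSecrets.filter((secret) => envKeys.includes(secret));
}

// A real check would read .env from disk and exit non-zero on any overlap, e.g.:
// const overlaps = findOverlaps(readFileSync(".env", "utf8"), DEFINED_SECRETS);
// if (overlaps.length > 0) process.exit(1);
```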
CIM Document Processor - AI-Powered CIM Analysis System
🎯 Project Overview
Purpose: Automated processing and analysis of Confidential Information Memorandums (CIMs) using AI-powered document understanding and structured data extraction.
Core Technology Stack:
- Frontend: React + TypeScript + Vite
- Backend: Node.js + Express + TypeScript
- Database: Supabase (PostgreSQL) + Vector Database
- AI Services: Google Document AI + Claude AI + OpenAI
- Storage: Google Cloud Storage
- Authentication: Firebase Auth
🏗️ Architecture Summary
```
┌─────────────────┐      ┌─────────────────┐      ┌─────────────────┐
│    Frontend     │◄───►│     Backend     │◄───►│    External     │
│     (React)     │      │    (Node.js)    │      │    Services     │
└─────────────────┘      └─────────────────┘      └─────────────────┘
                                  │                        │
                                  ▼                        ▼
                         ┌─────────────────┐      ┌─────────────────┐
                         │    Database     │      │  Google Cloud   │
                         │   (Supabase)    │      │    Services     │
                         └─────────────────┘      └─────────────────┘
```
📁 Key Directories & Files
Core Application
- `frontend/src/` - React frontend application
- `backend/src/` - Node.js backend services
- `backend/src/services/` - Core business logic services
- `backend/src/models/` - Database models and types
- `backend/src/routes/` - API route definitions
Documentation
- `APP_DESIGN_DOCUMENTATION.md` - Complete system architecture
- `AGENTIC_RAG_IMPLEMENTATION_PLAN.md` - AI processing strategy
- `PDF_GENERATION_ANALYSIS.md` - PDF generation optimization
- `DEPLOYMENT_GUIDE.md` - Deployment instructions
- `ARCHITECTURE_DIAGRAMS.md` - Visual architecture documentation
Configuration
- `backend/src/config/` - Environment and service configuration
- `frontend/src/config/` - Frontend configuration
- `backend/scripts/` - Setup and utility scripts
🚀 Quick Start
Prerequisites
- Node.js 18+
- Google Cloud Platform account
- Supabase account
- Firebase project
Environment Setup
```bash
# Backend
cd backend
npm install
cp .env.example .env
# Configure environment variables

# Frontend
cd frontend
npm install
cp .env.example .env
# Configure environment variables
```
Development
```bash
# Backend (port 5001)
cd backend && npm run dev

# Frontend (port 5173)
cd frontend && npm run dev
```
🔧 Core Services
1. Document Processing Pipeline
- unifiedDocumentProcessor.ts - Main orchestrator
- optimizedAgenticRAGProcessor.ts - AI-powered analysis
- documentAiProcessor.ts - Google Document AI integration
- llmService.ts - LLM interactions (Claude AI/OpenAI)
2. File Management
- fileStorageService.ts - Google Cloud Storage operations
- pdfGenerationService.ts - PDF report generation
- uploadMonitoringService.ts - Real-time upload tracking
3. Data Management
- agenticRAGDatabaseService.ts - Analytics and session management
- vectorDatabaseService.ts - Vector embeddings and search
- sessionService.ts - User session management
📊 Processing Strategies
Current Active Strategy: Optimized Agentic RAG
1. Text Extraction - Google Document AI extracts text from the PDF
2. Semantic Chunking - Split text into 4000-char chunks with overlap
3. Vector Embedding - Generate embeddings for each chunk
4. LLM Analysis - Claude AI analyzes chunks and generates structured data
5. PDF Generation - Create a summary PDF with the analysis results
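Step 2 above (semantic chunking into 4000-character chunks with overlap) can be sketched as follows; the 200-character overlap value is an assumption for illustration, since the README states only the chunk size.

```typescript
// Minimal sketch of overlapping fixed-size chunking. The 4000-char chunk size
// matches the pipeline description above; the overlap value is an assumption.
function chunkText(text: string, chunkSize = 4000, overlap = 200): string[] {
  const chunks: string[] = [];
  let start = 0;
  while (start < text.length) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break;
    start += chunkSize - overlap; // step back by `overlap` so adjacent chunks share context
  }
  return chunks;
}
```

Each chunk would then be embedded (step 3) and passed to the LLM with retrieval context (step 4).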
Output Format
Structured CIM Review data including:
- Deal Overview
- Business Description
- Market Analysis
- Financial Summary
- Management Team
- Investment Thesis
- Key Questions & Next Steps
🔌 API Endpoints
Document Management
- `POST /documents/upload-url` - Get signed upload URL
- `POST /documents/:id/confirm-upload` - Confirm upload and start processing
- `POST /documents/:id/process-optimized-agentic-rag` - Trigger AI processing
- `GET /documents/:id/download` - Download processed PDF
- `DELETE /documents/:id` - Delete document
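A hypothetical client-side sketch of the upload flow implied by these endpoints. The response field names (`uploadUrl`, `documentId`) and request bodies are assumptions, not the actual API contract.

```typescript
// Hypothetical sketch: request a signed URL, upload straight to Cloud Storage,
// then confirm to start processing. Field names are assumptions.
async function uploadDocument(
  file: Blob,
  fileName: string,
  token: string
): Promise<string> {
  // 1. POST /documents/upload-url - get a signed GCS upload URL
  const res = await fetch("/documents/upload-url", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${token}`,
    },
    body: JSON.stringify({ fileName }),
  });
  const { uploadUrl, documentId } = await res.json();

  // 2. Upload the file directly to Google Cloud Storage via the signed URL
  await fetch(uploadUrl, { method: "PUT", body: file });

  // 3. POST /documents/:id/confirm-upload - confirm and start processing
  await fetch(`/documents/${documentId}/confirm-upload`, {
    method: "POST",
    headers: { Authorization: `Bearer ${token}` },
  });
  return documentId;
}
```

Uploading via the signed URL keeps large PDFs off the backend; the server only issues the URL and reacts to the confirmation.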
Analytics & Monitoring
- `GET /documents/analytics` - Get processing analytics
- `GET /documents/processing-stats` - Get processing statistics
- `GET /documents/:id/agentic-rag-sessions` - Get processing sessions
- `GET /monitoring/upload-metrics` - Get upload metrics
- `GET /monitoring/upload-health` - Get upload health status
- `GET /monitoring/real-time-stats` - Get real-time statistics
- `GET /vector/stats` - Get vector database statistics
🗄️ Database Schema
Core Tables
- documents - Document metadata and processing status
- agentic_rag_sessions - AI processing session tracking
- document_chunks - Vector embeddings and chunk data
- processing_jobs - Background job management
- users - User authentication and profiles
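For reference, a plausible TypeScript shape for a `documents` row based on the summary above. The exact column names are assumptions; only "metadata and processing status" is stated in the schema description.

```typescript
// Hypothetical row shape for the `documents` table described above.
// Column names are assumptions based on the schema summary.
type ProcessingStatus = "uploaded" | "processing" | "completed" | "failed";

interface DocumentRow {
  id: string;
  user_id: string;       // owner, enforcing user-specific data isolation
  file_name: string;
  storage_path: string;  // object path in Google Cloud Storage
  status: ProcessingStatus;
  created_at: string;    // ISO-8601 timestamp
}

const example: DocumentRow = {
  id: "doc-1",
  user_id: "user-1",
  file_name: "cim.pdf",
  storage_path: "uploads/user-1/doc-1.pdf",
  status: "processing",
  created_at: new Date().toISOString(),
};
```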
🔐 Security
- Firebase Authentication with JWT validation
- Protected API endpoints with user-specific data isolation
- Signed URLs for secure file uploads
- Rate limiting and input validation
- CORS configuration for cross-origin requests
📈 Performance & Monitoring
Real-time Monitoring
- Upload progress tracking
- Processing status updates
- Error rate monitoring
- Performance metrics
- API usage tracking
- Cost monitoring
Analytics Dashboard
- Processing success rates
- Average processing times
- API usage statistics
- Cost tracking
- User activity metrics
- Error analysis reports
🚨 Error Handling
Frontend Error Handling
- Network errors with automatic retry
- Authentication errors with token refresh
- Upload errors with user-friendly messages
- Processing errors with real-time display
Backend Error Handling
- Validation errors with detailed messages
- Processing errors with graceful degradation
- Storage errors with retry logic
- Database errors with connection pooling
- LLM API errors with exponential backoff
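The exponential backoff mentioned for LLM API errors can be sketched generically; the retry count and base delay below are illustrative assumptions, not the service's actual configuration.

```typescript
// Generic sketch of retry-with-exponential-backoff for flaky API calls.
// maxRetries and baseDelayMs are illustrative assumptions.
async function withBackoff<T>(
  fn: () => Promise<T>,
  maxRetries = 3,
  baseDelayMs = 500
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= maxRetries) throw err; // retries exhausted, propagate
      const delay = baseDelayMs * 2 ** attempt; // 500ms, 1s, 2s, ...
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}
```

Wrapping each LLM call (e.g. `withBackoff(() => callClaude(prompt))`) absorbs transient rate-limit and network errors without failing the whole processing session.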
🧪 Testing
Test Structure
- Unit Tests: Jest for backend, Vitest for frontend
- Integration Tests: End-to-end testing
- API Tests: Supertest for backend endpoints
Test Coverage
- Service layer testing
- API endpoint testing
- Error handling scenarios
- Performance testing
- Security testing
📚 Documentation Index
Technical Documentation
- Application Design Documentation - Complete system architecture
- Agentic RAG Implementation Plan - AI processing strategy
- PDF Generation Analysis - PDF optimization details
- Architecture Diagrams - Visual system design
- Deployment Guide - Deployment instructions
Analysis Reports
- Codebase Audit Report - Code quality analysis
- Dependency Analysis Report - Dependency management
- Document AI Integration Summary - Google Document AI setup
🤝 Contributing
Development Workflow
- Create feature branch from main
- Implement changes with tests
- Update documentation
- Submit pull request
- Code review and approval
- Merge to main
Code Standards
- TypeScript for type safety
- ESLint for code quality
- Prettier for formatting
- Jest for testing
- Conventional commits for version control
📞 Support
Common Issues
- Upload Failures - Check GCS permissions and bucket configuration
- Processing Timeouts - Increase timeout limits for large documents
- Memory Issues - Monitor memory usage and adjust batch sizes
- API Quotas - Check API usage and implement rate limiting
- PDF Generation Failures - Check Puppeteer installation and memory
- LLM API Errors - Verify API keys and check rate limits
Debug Tools
- Real-time logging with correlation IDs
- Upload monitoring dashboard
- Processing session details
- Error analysis reports
- Performance metrics dashboard
📄 License
This project is proprietary software developed for BPCP. All rights reserved.
Last Updated: December 2024
Version: 1.0.0
Status: Production Ready