admin 8b15732a98 feat: Add pre-deployment validation and deployment automation
- Add pre-deploy-check.sh script to validate .env doesn't contain secrets
- Add clean-env-secrets.sh script to remove secrets from .env before deployment
- Update deploy:firebase script to run validation automatically
- Add sync-secrets npm script for local development
- Add deploy:firebase:force for deployments that skip validation

This prevents 'Secret environment variable overlaps non secret environment variable' errors
by ensuring secrets defined via defineSecret() are not also present in the .env file.
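In the repo's TypeScript terms, the overlap check amounts to something like the following sketch (the secret names and helper names here are illustrative, not the actual contents of pre-deploy-check.sh):

```typescript
// Hypothetical sketch of the pre-deploy validation: a key declared via
// defineSecret() must not also appear as a plain KEY=VALUE entry in .env.
const DECLARED_SECRETS = ["ANTHROPIC_API_KEY", "OPENAI_API_KEY", "SUPABASE_SERVICE_KEY"];

/** Parse KEY=VALUE lines from a .env file body, ignoring comments and blanks. */
function parseEnvKeys(envBody: string): string[] {
  return envBody
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line && !line.startsWith("#") && line.includes("="))
    .map((line) => line.slice(0, line.indexOf("=")).trim());
}

/** Return the keys that would trigger the overlap error on deploy. */
function findSecretOverlaps(envBody: string, secrets: string[] = DECLARED_SECRETS): string[] {
  const keys = new Set(parseEnvKeys(envBody));
  return secrets.filter((s) => keys.has(s));
}
```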

## Completed Todos
- Test financial extraction with Stax Holding Company CIM - All values correct (FY-3: $64M, FY-2: $71M, FY-1: $71M, LTM: $76M)
- Implement deterministic parser fallback - Integrated into simpleDocumentProcessor
- Implement few-shot examples - Added comprehensive examples for PRIMARY table identification
- Fix primary table identification - Financial extraction now correctly identifies the PRIMARY table (millions) vs. subsidiary tables (thousands)

## Pending Todos
1. Review older commits (1-2 months ago) to see how financial extraction was working then
   - Check commits: 185c780 (Claude 3.7), 5b3b1bf (Document AI fixes), 0ec3d14 (multi-pass extraction)
   - Compare prompt simplicity - older versions may have had simpler, more effective prompts
   - Check if deterministic parser was being used more effectively

2. Review best practices for structured financial data extraction from PDFs/CIMs
   - Research: LLM prompt engineering for tabular data (few-shot examples, chain-of-thought)
   - Period identification strategies
   - Validation techniques
   - Hybrid approaches (deterministic + LLM)
   - Error handling patterns
   - Check academic papers and industry case studies

3. Determine how to reduce processing time without sacrificing accuracy. Options:
   1) Use Claude Haiku 4.5 for initial extraction, Sonnet 4.5 for validation
   2) Parallel extraction of different sections
   3) Caching common patterns
   4) Streaming responses
   5) Incremental processing with early validation
   6) Reducing prompt verbosity while maintaining clarity
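Of these, parallel extraction of independent sections is straightforward to sketch with `Promise.all`; the per-section extractor below is a stand-in for the real LLM call, not the pipeline's actual function:

```typescript
type SectionResult = { section: string; data: string };

// Hypothetical per-section extractor; in the real pipeline this would be an LLM call.
async function extractSection(section: string, text: string): Promise<SectionResult> {
  return { section, data: `extracted:${section}:${text.length}` };
}

/** Extract independent CIM sections concurrently instead of sequentially.
 *  Promise.all preserves input order, so results line up with `sections`. */
async function extractAllSections(text: string, sections: string[]): Promise<SectionResult[]> {
  return Promise.all(sections.map((s) => extractSection(s, text)));
}
```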

4. Add unit tests for financial extraction validation logic
   - Test: invalid value rejection, cross-period validation, numeric extraction
   - Period identification from various formats (years, FY-X, mixed)
   - Include edge cases: missing periods, projections mixed with historical, inconsistent formatting
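As one example of the cross-period validation these tests would cover, a check like the following flags the thousands-vs-millions mixups described above (the ratio threshold is illustrative, not the app's actual rule):

```typescript
/** Hypothetical cross-period sanity check: flag a period whose value jumps
 *  more than `maxRatio`x against its neighbor, which usually signals a
 *  thousands-vs-millions unit mixup rather than real growth. Returns the
 *  indices of suspect periods. */
function flagUnitMismatches(values: number[], maxRatio = 50): number[] {
  const flagged: number[] = [];
  for (let i = 1; i < values.length; i++) {
    const [a, b] = [values[i - 1], values[i]];
    if (a > 0 && b > 0 && (a / b > maxRatio || b / a > maxRatio)) flagged.push(i);
  }
  return flagged;
}
```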

5. Monitor production financial extraction accuracy
   - Track: extraction success rate, validation rejection rate, common error patterns
   - User feedback on extracted financial data
   - Set up alerts for validation failures and extraction inconsistencies

6. Optimize prompt size for financial extraction
   - Current prompts may be too verbose
   - Test shorter, more focused prompts that maintain accuracy
   - Consider: removing redundant instructions, using more concise examples, focusing on critical rules only

7. Add financial data visualization
   - Consider adding a financial data preview/validation step in the UI
   - Allow users to verify/correct extracted values if needed
   - Provides human-in-the-loop validation for critical financial data

8. Document extraction strategies
   - Document the different financial table formats found in CIMs
   - Create a reference guide for common patterns (years format, FY-X format, mixed format, etc.)
   - This will help with prompt engineering and parser improvements
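Such a reference guide pairs naturally with a small label normalizer. The sketch below handles the three formats listed and is only illustrative; it deliberately leaves hyphenated absolute years like "FY-2022" unparsed to avoid the ambiguity with relative "FY-X" labels:

```typescript
type Period = { kind: "fiscal-year" | "relative" | "ltm"; value?: number };

/** Normalize the period-label formats commonly seen in CIM tables:
 *  "LTM"/"TTM", relative "FY-1".."FY-99", and absolute years like
 *  "2022", "FY2022", "FY 2022", "2023A"/"2023E". */
function parsePeriodLabel(label: string): Period | null {
  const t = label.trim().toUpperCase();
  if (t === "LTM" || t === "TTM") return { kind: "ltm" };
  const rel = t.match(/^FY-(\d{1,2})$/);
  if (rel) return { kind: "relative", value: -Number(rel[1]) };
  const fy = t.match(/^(?:FY\s?)?(\d{4})[AE]?$/);
  if (fy) return { kind: "fiscal-year", value: Number(fy[1]) };
  return null;
}
```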

9. Compare RAG-based extraction vs simple full-document extraction for financial accuracy
   - Determine which approach produces more accurate financial data and why
   - May need a hybrid approach

10. Add confidence scores to financial extraction results
    - Flag low-confidence extractions for manual review
    - Helps identify when extraction may be incorrect and needs human validation
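The triage step for confidence scores could be as simple as the following sketch (the `Extraction` shape and the 0.8 threshold are assumptions for illustration):

```typescript
type Extraction = { field: string; value: number; confidence: number };

/** Partition extraction results so low-confidence fields go to manual review
 *  while high-confidence fields flow straight through. */
function triageByConfidence(results: Extraction[], threshold = 0.8) {
  const accepted = results.filter((r) => r.confidence >= threshold);
  const needsReview = results.filter((r) => r.confidence < threshold);
  return { accepted, needsReview };
}
```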

CIM Document Processor - AI-Powered CIM Analysis System

🎯 Project Overview

Purpose: Automated processing and analysis of Confidential Information Memorandums (CIMs) using AI-powered document understanding and structured data extraction.

Core Technology Stack:

  • Frontend: React + TypeScript + Vite
  • Backend: Node.js + Express + TypeScript
  • Database: Supabase (PostgreSQL) + Vector Database
  • AI Services: Google Document AI + Claude AI + OpenAI
  • Storage: Google Cloud Storage
  • Authentication: Firebase Auth

🏗️ Architecture Summary

┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   Frontend      │    │   Backend       │    │   External      │
│   (React)       │◄──►│   (Node.js)     │◄──►│   Services      │
└─────────────────┘    └─────────────────┘    └─────────────────┘
                              │                        │
                              ▼                        ▼
                       ┌─────────────────┐    ┌─────────────────┐
                       │   Database      │    │   Google Cloud  │
                       │   (Supabase)    │    │   Services      │
                       └─────────────────┘    └─────────────────┘

📁 Key Directories & Files

Core Application

  • frontend/src/ - React frontend application
  • backend/src/ - Node.js backend services
  • backend/src/services/ - Core business logic services
  • backend/src/models/ - Database models and types
  • backend/src/routes/ - API route definitions

Documentation

  • APP_DESIGN_DOCUMENTATION.md - Complete system architecture
  • AGENTIC_RAG_IMPLEMENTATION_PLAN.md - AI processing strategy
  • PDF_GENERATION_ANALYSIS.md - PDF generation optimization
  • DEPLOYMENT_GUIDE.md - Deployment instructions
  • ARCHITECTURE_DIAGRAMS.md - Visual architecture documentation

Configuration

  • backend/src/config/ - Environment and service configuration
  • frontend/src/config/ - Frontend configuration
  • backend/scripts/ - Setup and utility scripts

🚀 Quick Start

Prerequisites

  • Node.js 18+
  • Google Cloud Platform account
  • Supabase account
  • Firebase project

Environment Setup

# Backend
cd backend
npm install
cp .env.example .env
# Configure environment variables

# Frontend
cd frontend
npm install
cp .env.example .env
# Configure environment variables

Development

# Backend (port 5001)
cd backend && npm run dev

# Frontend (port 5173)
cd frontend && npm run dev

🔧 Core Services

1. Document Processing Pipeline

  • unifiedDocumentProcessor.ts - Main orchestrator
  • optimizedAgenticRAGProcessor.ts - AI-powered analysis
  • documentAiProcessor.ts - Google Document AI integration
  • llmService.ts - LLM interactions (Claude AI/OpenAI)

2. File Management

  • fileStorageService.ts - Google Cloud Storage operations
  • pdfGenerationService.ts - PDF report generation
  • uploadMonitoringService.ts - Real-time upload tracking

3. Data Management

  • agenticRAGDatabaseService.ts - Analytics and session management
  • vectorDatabaseService.ts - Vector embeddings and search
  • sessionService.ts - User session management

📊 Processing Strategies

Current Active Strategy: Optimized Agentic RAG

  1. Text Extraction - Google Document AI extracts text from PDF
  2. Semantic Chunking - Split text into 4000-char chunks with overlap
  3. Vector Embedding - Generate embeddings for each chunk
  4. LLM Analysis - Claude AI analyzes chunks and generates structured data
  5. PDF Generation - Create summary PDF with analysis results
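The chunking in step 2 can be sketched as follows; the pipeline specifies 4000-char chunks with overlap, and the 400-char overlap value here is an assumption for illustration:

```typescript
/** Split text into fixed-size chunks with overlap between neighbors, so
 *  sentences cut at a chunk boundary still appear whole in the next chunk. */
function chunkText(text: string, size = 4000, overlap = 400): string[] {
  if (size <= overlap) throw new Error("chunk size must exceed overlap");
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += size - overlap) {
    chunks.push(text.slice(start, start + size));
    if (start + size >= text.length) break; // final chunk reached end of text
  }
  return chunks;
}
```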

Output Format

Structured CIM Review data including:

  • Deal Overview
  • Business Description
  • Market Analysis
  • Financial Summary
  • Management Team
  • Investment Thesis
  • Key Questions & Next Steps

🔌 API Endpoints

Document Management

  • POST /documents/upload-url - Get signed upload URL
  • POST /documents/:id/confirm-upload - Confirm upload and start processing
  • POST /documents/:id/process-optimized-agentic-rag - Trigger AI processing
  • GET /documents/:id/download - Download processed PDF
  • DELETE /documents/:id - Delete document
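A typical client flow over the first two endpoints might look like this sketch; the response shape `{ documentId, signedUrl }` and the omitted auth headers are assumptions for illustration, not the actual API contract:

```typescript
type FetchLike = (
  url: string,
  init?: { method?: string; body?: string }
) => Promise<{ ok: boolean; json(): Promise<any> }>;

/** Two-step upload: request a signed URL, PUT the file bytes to GCS,
 *  then confirm the upload to kick off processing. */
async function uploadDocument(api: FetchLike, file: { name: string; body: string }): Promise<string> {
  const res = await api("/documents/upload-url", { method: "POST", body: JSON.stringify({ filename: file.name }) });
  const { documentId, signedUrl } = await res.json();
  await api(signedUrl, { method: "PUT", body: file.body });
  await api(`/documents/${documentId}/confirm-upload`, { method: "POST" });
  return documentId;
}
```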

Analytics & Monitoring

  • GET /documents/analytics - Get processing analytics
  • GET /documents/processing-stats - Get processing statistics
  • GET /documents/:id/agentic-rag-sessions - Get processing sessions
  • GET /monitoring/upload-metrics - Get upload metrics
  • GET /monitoring/upload-health - Get upload health status
  • GET /monitoring/real-time-stats - Get real-time statistics
  • GET /vector/stats - Get vector database statistics

🗄️ Database Schema

Core Tables

  • documents - Document metadata and processing status
  • agentic_rag_sessions - AI processing session tracking
  • document_chunks - Vector embeddings and chunk data
  • processing_jobs - Background job management
  • users - User authentication and profiles

🔐 Security

  • Firebase Authentication with JWT validation
  • Protected API endpoints with user-specific data isolation
  • Signed URLs for secure file uploads
  • Rate limiting and input validation
  • CORS configuration for cross-origin requests
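The JWT gate above can be sketched as a small pure helper; token verification itself is injected here (in the real backend it would be firebase-admin's `verifyIdToken`), and the helper only decides whether a request should get a 401:

```typescript
type Verifier = (token: string) => { uid: string } | null;

/** Extract the Bearer token from an Authorization header and resolve it to a
 *  user, returning null when the request should be rejected with 401. */
function authenticate(authHeader: string | undefined, verify: Verifier): { uid: string } | null {
  if (!authHeader?.startsWith("Bearer ")) return null;
  const token = authHeader.slice("Bearer ".length).trim();
  return token ? verify(token) : null;
}
```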

📈 Performance & Monitoring

Real-time Monitoring

  • Upload progress tracking
  • Processing status updates
  • Error rate monitoring
  • Performance metrics
  • API usage tracking
  • Cost monitoring

Analytics Dashboard

  • Processing success rates
  • Average processing times
  • API usage statistics
  • Cost tracking
  • User activity metrics
  • Error analysis reports

🚨 Error Handling

Frontend Error Handling

  • Network errors with automatic retry
  • Authentication errors with token refresh
  • Upload errors with user-friendly messages
  • Processing errors with real-time display

Backend Error Handling

  • Validation errors with detailed messages
  • Processing errors with graceful degradation
  • Storage errors with retry logic
  • Database errors with connection pooling
  • LLM API errors with exponential backoff
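The exponential-backoff schedule for LLM API retries can be sketched as follows (base delay and cap values are illustrative, not the backend's actual settings):

```typescript
/** Delay schedule for retrying a failed LLM API call: each attempt doubles
 *  the wait, capped so long outages don't produce unbounded sleeps. */
function backoffDelaysMs(attempts: number, baseMs = 500, capMs = 30000): number[] {
  return Array.from({ length: attempts }, (_, i) => Math.min(baseMs * 2 ** i, capMs));
}
```

A retry loop would sleep for `backoffDelaysMs(maxAttempts)[attempt]` milliseconds after each failure, optionally adding jitter to avoid synchronized retries.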

🧪 Testing

Test Structure

  • Unit Tests: Jest for backend, Vitest for frontend
  • Integration Tests: End-to-end testing
  • API Tests: Supertest for backend endpoints

Test Coverage

  • Service layer testing
  • API endpoint testing
  • Error handling scenarios
  • Performance testing
  • Security testing


🤝 Contributing

Development Workflow

  1. Create feature branch from main
  2. Implement changes with tests
  3. Update documentation
  4. Submit pull request
  5. Code review and approval
  6. Merge to main

Code Standards

  • TypeScript for type safety
  • ESLint for code quality
  • Prettier for formatting
  • Jest for testing
  • Conventional commits for version control

📞 Support

Common Issues

  1. Upload Failures - Check GCS permissions and bucket configuration
  2. Processing Timeouts - Increase timeout limits for large documents
  3. Memory Issues - Monitor memory usage and adjust batch sizes
  4. API Quotas - Check API usage and implement rate limiting
  5. PDF Generation Failures - Check Puppeteer installation and memory
  6. LLM API Errors - Verify API keys and check rate limits

Debug Tools

  • Real-time logging with correlation IDs
  • Upload monitoring dashboard
  • Processing session details
  • Error analysis reports
  • Performance metrics dashboard

📄 License

This project is proprietary software developed for BPCP. All rights reserved.


Last Updated: December 2024 · Version: 1.0.0 · Status: Production Ready
