admin 7acd1297bb feat: Implement separate financial extraction with few-shot examples
- Add processFinancialsOnly() method for focused financial extraction
- Integrate deterministic parser into simpleDocumentProcessor
- Add comprehensive few-shot examples showing PRIMARY vs subsidiary tables
- Enhance prompt with explicit PRIMARY table identification rules
- Fix maxTokens default from 3500 to 16000 to prevent truncation
- Add test script for Stax Holding Company CIM validation

Test Results:
 FY-3: $64M revenue, $19M EBITDA (correct)
 FY-2: $71M revenue, $24M EBITDA (correct)
 FY-1: $71M revenue, $24M EBITDA (correct)
 LTM: $76M revenue, $27M EBITDA (correct)

All financial values now correctly extracted from PRIMARY table (millions format)
instead of subsidiary tables (thousands format).
2025-11-10 02:17:40 -05:00

CIM Document Processor - AI-Powered CIM Analysis System

🎯 Project Overview

Purpose: Automated processing and analysis of Confidential Information Memoranda (CIMs) using AI-powered document understanding and structured data extraction.

Core Technology Stack:

  • Frontend: React + TypeScript + Vite
  • Backend: Node.js + Express + TypeScript
  • Database: Supabase (PostgreSQL) + Vector Database
  • AI Services: Google Document AI + Claude AI + OpenAI
  • Storage: Google Cloud Storage
  • Authentication: Firebase Auth

🏗️ Architecture Summary

┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   Frontend      │    │   Backend       │    │   External      │
│   (React)       │◄──►│   (Node.js)     │◄──►│   Services      │
└─────────────────┘    └─────────────────┘    └─────────────────┘
                              │                        │
                              ▼                        ▼
                       ┌─────────────────┐    ┌─────────────────┐
                       │   Database      │    │   Google Cloud  │
                       │   (Supabase)    │    │   Services      │
                       └─────────────────┘    └─────────────────┘

📁 Key Directories & Files

Core Application

  • frontend/src/ - React frontend application
  • backend/src/ - Node.js backend services
  • backend/src/services/ - Core business logic services
  • backend/src/models/ - Database models and types
  • backend/src/routes/ - API route definitions

Documentation

  • APP_DESIGN_DOCUMENTATION.md - Complete system architecture
  • AGENTIC_RAG_IMPLEMENTATION_PLAN.md - AI processing strategy
  • PDF_GENERATION_ANALYSIS.md - PDF generation optimization
  • DEPLOYMENT_GUIDE.md - Deployment instructions
  • ARCHITECTURE_DIAGRAMS.md - Visual architecture documentation

Configuration

  • backend/src/config/ - Environment and service configuration
  • frontend/src/config/ - Frontend configuration
  • backend/scripts/ - Setup and utility scripts

🚀 Quick Start

Prerequisites

  • Node.js 18+
  • Google Cloud Platform account
  • Supabase account
  • Firebase project

Environment Setup

# Backend
cd backend
npm install
cp .env.example .env
# Configure environment variables

# Frontend
cd frontend
npm install
cp .env.example .env
# Configure environment variables

Development

# Backend (port 5001)
cd backend && npm run dev

# Frontend (port 5173)
cd frontend && npm run dev

🔧 Core Services

1. Document Processing Pipeline

  • unifiedDocumentProcessor.ts - Main orchestrator
  • optimizedAgenticRAGProcessor.ts - AI-powered analysis
  • documentAiProcessor.ts - Google Document AI integration
  • llmService.ts - LLM interactions (Claude AI/OpenAI)

2. File Management

  • fileStorageService.ts - Google Cloud Storage operations
  • pdfGenerationService.ts - PDF report generation
  • uploadMonitoringService.ts - Real-time upload tracking

3. Data Management

  • agenticRAGDatabaseService.ts - Analytics and session management
  • vectorDatabaseService.ts - Vector embeddings and search
  • sessionService.ts - User session management

📊 Processing Strategies

Current Active Strategy: Optimized Agentic RAG

  1. Text Extraction - Google Document AI extracts text from PDF
  2. Semantic Chunking - Split text into 4000-char chunks with overlap
  3. Vector Embedding - Generate embeddings for each chunk
  4. LLM Analysis - Claude AI analyzes chunks and generates structured data
  5. PDF Generation - Create summary PDF with analysis results
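The chunking step above (4000-character chunks with overlap) can be sketched as follows. This is an illustrative sketch only: the function name and the 400-character overlap are assumptions, and the real implementation lives in optimizedAgenticRAGProcessor.ts.

```typescript
// Illustrative sketch of step 2 (semantic chunking). The overlap keeps some
// shared context between adjacent chunks so sentences are not cut off blind.
function chunkText(text: string, chunkSize = 4000, overlap = 400): string[] {
  const chunks: string[] = [];
  let start = 0;
  while (start < text.length) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break;
    start += chunkSize - overlap; // step forward, retaining `overlap` chars of context
  }
  return chunks;
}
```

Each chunk then flows into step 3 (vector embedding) independently.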

Output Format

Structured CIM Review data including:

  • Deal Overview
  • Business Description
  • Market Analysis
  • Financial Summary
  • Management Team
  • Investment Thesis
  • Key Questions & Next Steps

🔌 API Endpoints

Document Management

  • POST /documents/upload-url - Get signed upload URL
  • POST /documents/:id/confirm-upload - Confirm upload and start processing
  • POST /documents/:id/process-optimized-agentic-rag - Trigger AI processing
  • GET /documents/:id/download - Download processed PDF
  • DELETE /documents/:id - Delete document
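As a rough sketch, a client might build requests against these routes as shown below. The base URL, port, and Bearer-token header are assumptions, not confirmed by the source; only the endpoint paths come from the list above.

```typescript
// Hypothetical client-side helper for the document endpoints listed above.
// API_BASE is an assumption (backend dev server runs on port 5001).
const API_BASE = "http://localhost:5001";

function documentEndpoint(id: string, action?: string): string {
  return action
    ? `${API_BASE}/documents/${encodeURIComponent(id)}/${action}`
    : `${API_BASE}/documents/${encodeURIComponent(id)}`;
}

// Builds the request descriptor for triggering AI processing; pass the
// result to fetch/axios. The Authorization header shape is an assumption.
function processingRequest(id: string, token: string) {
  return {
    url: documentEndpoint(id, "process-optimized-agentic-rag"),
    method: "POST" as const,
    headers: { Authorization: `Bearer ${token}` },
  };
}
```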

Analytics & Monitoring

  • GET /documents/analytics - Get processing analytics
  • GET /documents/processing-stats - Get processing statistics
  • GET /documents/:id/agentic-rag-sessions - Get processing sessions
  • GET /monitoring/upload-metrics - Get upload metrics
  • GET /monitoring/upload-health - Get upload health status
  • GET /monitoring/real-time-stats - Get real-time statistics
  • GET /vector/stats - Get vector database statistics

🗄️ Database Schema

Core Tables

  • documents - Document metadata and processing status
  • agentic_rag_sessions - AI processing session tracking
  • document_chunks - Vector embeddings and chunk data
  • processing_jobs - Background job management
  • users - User authentication and profiles
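For illustration, a `documents` row might carry a shape like the one below. Only the table name comes from the schema above; the individual column names and status values are guesses, not the actual schema.

```typescript
// Hypothetical shape of a `documents` row; columns are illustrative only.
interface DocumentRow {
  id: string;
  user_id: string; // owner, enabling user-specific data isolation
  filename: string;
  status: "uploaded" | "processing" | "completed" | "failed";
  created_at: string; // ISO timestamp
}

const example: DocumentRow = {
  id: "doc_123",
  user_id: "user_456",
  filename: "example-cim.pdf",
  status: "completed",
  created_at: new Date(0).toISOString(),
};
```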

🔐 Security

  • Firebase Authentication with JWT validation
  • Protected API endpoints with user-specific data isolation
  • Signed URLs for secure file uploads
  • Rate limiting and input validation
  • CORS configuration for cross-origin requests

📈 Performance & Monitoring

Real-time Monitoring

  • Upload progress tracking
  • Processing status updates
  • Error rate monitoring
  • Performance metrics
  • API usage tracking
  • Cost monitoring

Analytics Dashboard

  • Processing success rates
  • Average processing times
  • API usage statistics
  • Cost tracking
  • User activity metrics
  • Error analysis reports

🚨 Error Handling

Frontend Error Handling

  • Network errors with automatic retry
  • Authentication errors with token refresh
  • Upload errors with user-friendly messages
  • Processing errors with real-time display

Backend Error Handling

  • Validation errors with detailed messages
  • Processing errors with graceful degradation
  • Storage errors with retry logic
  • Database errors with connection pooling
  • LLM API errors with exponential backoff
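The exponential-backoff pattern for LLM API errors can be sketched as below; the attempt count and base delay are illustrative values, not the actual configuration.

```typescript
// Retries fn with doubling delays (e.g. 500ms, 1s, 2s, ...) and rethrows the
// last error once attempts are exhausted.
async function withBackoff<T>(
  fn: () => Promise<T>,
  maxAttempts = 4,
  baseDelayMs = 500,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (attempt < maxAttempts - 1) {
        await new Promise((r) => setTimeout(r, baseDelayMs * 2 ** attempt));
      }
    }
  }
  throw lastError;
}
```

In practice the wrapper would surround each Claude AI/OpenAI call so transient rate-limit errors degrade gracefully rather than failing the whole processing session.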

🧪 Testing

Test Structure

  • Unit Tests: Jest for backend, Vitest for frontend
  • Integration Tests: End-to-end testing
  • API Tests: Supertest for backend endpoints

Test Coverage

  • Service layer testing
  • API endpoint testing
  • Error handling scenarios
  • Performance testing
  • Security testing


🤝 Contributing

Development Workflow

  1. Create feature branch from main
  2. Implement changes with tests
  3. Update documentation
  4. Submit pull request
  5. Code review and approval
  6. Merge to main

Code Standards

  • TypeScript for type safety
  • ESLint for code quality
  • Prettier for formatting
  • Jest for testing
  • Conventional commits for version control

📞 Support

Common Issues

  1. Upload Failures - Check GCS permissions and bucket configuration
  2. Processing Timeouts - Increase timeout limits for large documents
  3. Memory Issues - Monitor memory usage and adjust batch sizes
  4. API Quotas - Check API usage and implement rate limiting
  5. PDF Generation Failures - Check Puppeteer installation and memory
  6. LLM API Errors - Verify API keys and check rate limits

Debug Tools

  • Real-time logging with correlation IDs
  • Upload monitoring dashboard
  • Processing session details
  • Error analysis reports
  • Performance metrics dashboard

📄 License

This project is proprietary software developed for BPCP. All rights reserved.


Last Updated: December 2024 · Version: 1.0.0 · Status: Production Ready
