- Add `pre-deploy-check.sh` script to validate `.env` doesn't contain secrets
- Add `clean-env-secrets.sh` script to remove secrets from `.env` before deployment
- Update `deploy:firebase` script to run validation automatically
- Add `sync-secrets` npm script for local development
- Add `deploy:firebase:force` for deployments that skip validation

This prevents "Secret environment variable overlaps non secret environment variable" errors by ensuring that secrets defined via `defineSecret()` are not also present in the `.env` file.

## Completed Todos

- ✅ Test financial extraction with Stax Holding Company CIM - all values correct (FY-3: $64M, FY-2: $71M, FY-1: $71M, LTM: $76M)
- ✅ Implement deterministic parser fallback - integrated into `simpleDocumentProcessor`
- ✅ Implement few-shot examples - added comprehensive examples for PRIMARY table identification
- ✅ Fix primary table identification - financial extraction now correctly identifies the PRIMARY table (millions) vs. subsidiary tables (thousands)

## Pending Todos

1. Review older commits (1-2 months ago) to see how financial extraction was working then
   - Check commits: 185c780 (Claude 3.7), 5b3b1bf (Document AI fixes), 0ec3d14 (multi-pass extraction)
   - Compare prompt simplicity - older versions may have had simpler, more effective prompts
   - Check if the deterministic parser was being used more effectively
2. Review best practices for structured financial data extraction from PDFs/CIMs
   - Research LLM prompt engineering for tabular data (few-shot examples, chain-of-thought)
   - Period identification strategies, validation techniques, and hybrid approaches (deterministic + LLM)
   - Error handling patterns
   - Check academic papers and industry case studies
3. Determine how to reduce processing time without sacrificing accuracy. Options:
   1. Use Claude Haiku 4.5 for initial extraction, Sonnet 4.5 for validation
   2. Parallel extraction of different sections
   3. Caching common patterns
   4. Streaming responses
   5. Incremental processing with early validation
   6. Reduce prompt verbosity while maintaining clarity
4. Add unit tests for financial extraction validation logic
   - Test invalid-value rejection, cross-period validation, and numeric extraction
   - Test period identification from various formats (years, FY-X, mixed)
   - Include edge cases: missing periods, projections mixed with historical data, inconsistent formatting
5. Monitor production financial extraction accuracy
   - Track extraction success rate, validation rejection rate, and common error patterns
   - Collect user feedback on extracted financial data
   - Set up alerts for validation failures and extraction inconsistencies
6. Optimize prompt size for financial extraction
   - Current prompts may be too verbose; test shorter, more focused prompts that maintain accuracy
   - Consider removing redundant instructions, using more concise examples, and focusing on critical rules only
7. Add financial data visualization
   - Consider adding a financial data preview/validation step in the UI
   - Allow users to verify and correct extracted values if needed
   - Provides human-in-the-loop validation for critical financial data
8. Document extraction strategies
   - Document the different financial table formats found in CIMs
   - Create a reference guide for common patterns (years format, FY-X format, mixed format, etc.)
   - This will help with prompt engineering and parser improvements
9. Compare RAG-based extraction vs. simple full-document extraction for financial accuracy
   - Determine which approach produces more accurate financial data and why
   - May need a hybrid approach
10. Add confidence scores to financial extraction results
    - Flag low-confidence extractions for manual review
    - Helps identify when extraction may be incorrect and needs human validation
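The secret-overlap check described in the commit notes above can be sketched as a small validation step run before deploy. The secret names below are hypothetical examples for illustration, not the project's actual `defineSecret()` names.

```typescript
// Sketch of the pre-deploy validation: fail the deploy if any name registered
// via defineSecret() also appears as a key in the .env file.
// The secret names here are hypothetical examples.
const DEFINED_SECRETS = ["ANTHROPIC_API_KEY", "OPENAI_API_KEY"];

function findOverlaps(envContent: string, definedSecrets: string[]): string[] {
  const envKeys = envContent
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0 && !line.startsWith("#"))
    .map((line) => line.split("=")[0].trim());
  return definedSecrets.filter((secret) => envKeys.includes(secret));
}

// A real check would read .env from disk and exit non-zero on any overlap, e.g.:
// const overlaps = findOverlaps(readFileSync(".env", "utf8"), DEFINED_SECRETS);
// if (overlaps.length > 0) process.exit(1);
```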
CIM Document Processor - AI-Powered CIM Analysis System
🎯 Project Overview
Purpose: Automated processing and analysis of Confidential Information Memorandums (CIMs) using AI-powered document understanding and structured data extraction.
Core Technology Stack:
- Frontend: React + TypeScript + Vite
- Backend: Node.js + Express + TypeScript
- Database: Supabase (PostgreSQL) + Vector Database
- AI Services: Google Document AI + Claude AI + OpenAI
- Storage: Google Cloud Storage
- Authentication: Firebase Auth
🏗️ Architecture Summary
```
┌─────────────────┐      ┌─────────────────┐      ┌─────────────────┐
│    Frontend     │◄───►│     Backend     │◄───►│    External     │
│     (React)     │      │    (Node.js)    │      │    Services     │
└─────────────────┘      └─────────────────┘      └─────────────────┘
                                  │                        │
                                  ▼                        ▼
                         ┌─────────────────┐      ┌─────────────────┐
                         │    Database     │      │  Google Cloud   │
                         │   (Supabase)    │      │    Services     │
                         └─────────────────┘      └─────────────────┘
```
📁 Key Directories & Files
Core Application
- `frontend/src/` - React frontend application
- `backend/src/` - Node.js backend services
- `backend/src/services/` - Core business logic services
- `backend/src/models/` - Database models and types
- `backend/src/routes/` - API route definitions
Documentation
- `APP_DESIGN_DOCUMENTATION.md` - Complete system architecture
- `AGENTIC_RAG_IMPLEMENTATION_PLAN.md` - AI processing strategy
- `PDF_GENERATION_ANALYSIS.md` - PDF generation optimization
- `DEPLOYMENT_GUIDE.md` - Deployment instructions
- `ARCHITECTURE_DIAGRAMS.md` - Visual architecture documentation
Configuration
- `backend/src/config/` - Environment and service configuration
- `frontend/src/config/` - Frontend configuration
- `backend/scripts/` - Setup and utility scripts
🚀 Quick Start
Prerequisites
- Node.js 18+
- Google Cloud Platform account
- Supabase account
- Firebase project
Environment Setup
```bash
# Backend
cd backend
npm install
cp .env.example .env
# Configure environment variables

# Frontend
cd frontend
npm install
cp .env.example .env
# Configure environment variables
```
Development
```bash
# Backend (port 5001)
cd backend && npm run dev

# Frontend (port 5173)
cd frontend && npm run dev
```
🔧 Core Services
1. Document Processing Pipeline
- unifiedDocumentProcessor.ts - Main orchestrator
- optimizedAgenticRAGProcessor.ts - AI-powered analysis
- documentAiProcessor.ts - Google Document AI integration
- llmService.ts - LLM interactions (Claude AI/OpenAI)
2. File Management
- fileStorageService.ts - Google Cloud Storage operations
- pdfGenerationService.ts - PDF report generation
- uploadMonitoringService.ts - Real-time upload tracking
3. Data Management
- agenticRAGDatabaseService.ts - Analytics and session management
- vectorDatabaseService.ts - Vector embeddings and search
- sessionService.ts - User session management
📊 Processing Strategies
Current Active Strategy: Optimized Agentic RAG
1. Text Extraction - Google Document AI extracts text from the PDF
2. Semantic Chunking - Split text into 4000-char chunks with overlap
3. Vector Embedding - Generate embeddings for each chunk
4. LLM Analysis - Claude AI analyzes chunks and generates structured data
5. PDF Generation - Create a summary PDF with the analysis results
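Step 2 above (semantic chunking into 4000-character chunks with overlap) can be sketched as follows; the 200-character overlap value is an assumption for illustration, since the README states only the chunk size.

```typescript
// Minimal sketch of overlapping fixed-size chunking. The 4000-char chunk size
// matches the pipeline description above; the overlap value is an assumption.
function chunkText(text: string, chunkSize = 4000, overlap = 200): string[] {
  const chunks: string[] = [];
  let start = 0;
  while (start < text.length) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break;
    start += chunkSize - overlap; // step back by `overlap` so adjacent chunks share context
  }
  return chunks;
}
```

Each chunk would then be embedded (step 3) and passed to the LLM with retrieval context (step 4).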
Output Format
Structured CIM Review data including:
- Deal Overview
- Business Description
- Market Analysis
- Financial Summary
- Management Team
- Investment Thesis
- Key Questions & Next Steps
🔌 API Endpoints
Document Management
- `POST /documents/upload-url` - Get signed upload URL
- `POST /documents/:id/confirm-upload` - Confirm upload and start processing
- `POST /documents/:id/process-optimized-agentic-rag` - Trigger AI processing
- `GET /documents/:id/download` - Download processed PDF
- `DELETE /documents/:id` - Delete document
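A hypothetical client-side sketch of the upload flow implied by these endpoints. The response field names (`uploadUrl`, `documentId`) and request bodies are assumptions, not the actual API contract.

```typescript
// Hypothetical sketch: request a signed URL, upload straight to Cloud Storage,
// then confirm to start processing. Field names are assumptions.
async function uploadDocument(
  file: Blob,
  fileName: string,
  token: string
): Promise<string> {
  // 1. POST /documents/upload-url - get a signed GCS upload URL
  const res = await fetch("/documents/upload-url", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${token}`,
    },
    body: JSON.stringify({ fileName }),
  });
  const { uploadUrl, documentId } = await res.json();

  // 2. Upload the file directly to Google Cloud Storage via the signed URL
  await fetch(uploadUrl, { method: "PUT", body: file });

  // 3. POST /documents/:id/confirm-upload - confirm and start processing
  await fetch(`/documents/${documentId}/confirm-upload`, {
    method: "POST",
    headers: { Authorization: `Bearer ${token}` },
  });
  return documentId;
}
```

Uploading via the signed URL keeps large PDFs off the backend; the server only issues the URL and reacts to the confirmation.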
Analytics & Monitoring
- `GET /documents/analytics` - Get processing analytics
- `GET /documents/processing-stats` - Get processing statistics
- `GET /documents/:id/agentic-rag-sessions` - Get processing sessions
- `GET /monitoring/upload-metrics` - Get upload metrics
- `GET /monitoring/upload-health` - Get upload health status
- `GET /monitoring/real-time-stats` - Get real-time statistics
- `GET /vector/stats` - Get vector database statistics
🗄️ Database Schema
Core Tables
- documents - Document metadata and processing status
- agentic_rag_sessions - AI processing session tracking
- document_chunks - Vector embeddings and chunk data
- processing_jobs - Background job management
- users - User authentication and profiles
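For reference, a plausible TypeScript shape for a `documents` row based on the summary above. The exact column names are assumptions; only "metadata and processing status" is stated in the schema description.

```typescript
// Hypothetical row shape for the `documents` table described above.
// Column names are assumptions based on the schema summary.
type ProcessingStatus = "uploaded" | "processing" | "completed" | "failed";

interface DocumentRow {
  id: string;
  user_id: string;       // owner, enforcing user-specific data isolation
  file_name: string;
  storage_path: string;  // object path in Google Cloud Storage
  status: ProcessingStatus;
  created_at: string;    // ISO-8601 timestamp
}

const example: DocumentRow = {
  id: "doc-1",
  user_id: "user-1",
  file_name: "cim.pdf",
  storage_path: "uploads/user-1/doc-1.pdf",
  status: "processing",
  created_at: new Date().toISOString(),
};
```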
🔐 Security
- Firebase Authentication with JWT validation
- Protected API endpoints with user-specific data isolation
- Signed URLs for secure file uploads
- Rate limiting and input validation
- CORS configuration for cross-origin requests
📈 Performance & Monitoring
Real-time Monitoring
- Upload progress tracking
- Processing status updates
- Error rate monitoring
- Performance metrics
- API usage tracking
- Cost monitoring
Analytics Dashboard
- Processing success rates
- Average processing times
- API usage statistics
- Cost tracking
- User activity metrics
- Error analysis reports
🚨 Error Handling
Frontend Error Handling
- Network errors with automatic retry
- Authentication errors with token refresh
- Upload errors with user-friendly messages
- Processing errors with real-time display
Backend Error Handling
- Validation errors with detailed messages
- Processing errors with graceful degradation
- Storage errors with retry logic
- Database errors with connection pooling
- LLM API errors with exponential backoff
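The exponential backoff mentioned for LLM API errors can be sketched generically; the retry count and base delay below are illustrative assumptions, not the service's actual configuration.

```typescript
// Generic sketch of retry-with-exponential-backoff for flaky API calls.
// maxRetries and baseDelayMs are illustrative assumptions.
async function withBackoff<T>(
  fn: () => Promise<T>,
  maxRetries = 3,
  baseDelayMs = 500
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= maxRetries) throw err; // retries exhausted, propagate
      const delay = baseDelayMs * 2 ** attempt; // 500ms, 1s, 2s, ...
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}
```

Wrapping each LLM call (e.g. `withBackoff(() => callClaude(prompt))`) absorbs transient rate-limit and network errors without failing the whole processing session.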
🧪 Testing
Test Structure
- Unit Tests: Jest for backend, Vitest for frontend
- Integration Tests: End-to-end testing
- API Tests: Supertest for backend endpoints
Test Coverage
- Service layer testing
- API endpoint testing
- Error handling scenarios
- Performance testing
- Security testing
📚 Documentation Index
Technical Documentation
- Application Design Documentation - Complete system architecture
- Agentic RAG Implementation Plan - AI processing strategy
- PDF Generation Analysis - PDF optimization details
- Architecture Diagrams - Visual system design
- Deployment Guide - Deployment instructions
Analysis Reports
- Codebase Audit Report - Code quality analysis
- Dependency Analysis Report - Dependency management
- Document AI Integration Summary - Google Document AI setup
🤝 Contributing
Development Workflow
- Create feature branch from main
- Implement changes with tests
- Update documentation
- Submit pull request
- Code review and approval
- Merge to main
Code Standards
- TypeScript for type safety
- ESLint for code quality
- Prettier for formatting
- Jest for testing
- Conventional commits for version control
📞 Support
Common Issues
- Upload Failures - Check GCS permissions and bucket configuration
- Processing Timeouts - Increase timeout limits for large documents
- Memory Issues - Monitor memory usage and adjust batch sizes
- API Quotas - Check API usage and implement rate limiting
- PDF Generation Failures - Check Puppeteer installation and memory
- LLM API Errors - Verify API keys and check rate limits
Debug Tools
- Real-time logging with correlation IDs
- Upload monitoring dashboard
- Processing session details
- Error analysis reports
- Performance metrics dashboard
📄 License
This project is proprietary software developed for BPCP. All rights reserved.
Last Updated: December 2024
Version: 1.0.0
Status: Production Ready