diff --git a/.kiro/specs/codebase-cleanup-and-upload-fix/design.md b/.kiro/specs/codebase-cleanup-and-upload-fix/design.md index ba32903..e177499 100644 --- a/.kiro/specs/codebase-cleanup-and-upload-fix/design.md +++ b/.kiro/specs/codebase-cleanup-and-upload-fix/design.md @@ -2,304 +2,416 @@ ## Overview -This design addresses the systematic cleanup of a document processing application that has accumulated technical debt during migration from local deployment to Firebase/GCloud infrastructure. The application currently suffers from configuration inconsistencies, redundant files, and document upload errors that need to be resolved through a structured cleanup and debugging approach. - -### Current Architecture Analysis - -The application consists of: -- **Backend**: Node.js/TypeScript API deployed on Google Cloud Run -- **Frontend**: React/TypeScript SPA deployed on Firebase Hosting -- **Database**: Supabase (PostgreSQL) for document metadata -- **Storage**: Currently using local file storage (MUST migrate to GCS) -- **Processing**: Document AI + Agentic RAG pipeline -- **Authentication**: Firebase Auth - -### Key Issues Identified - -1. **Configuration Drift**: Multiple environment files with conflicting settings -2. **Local Dependencies**: Still using local file storage and local PostgreSQL references (MUST use only Supabase) -3. **Upload Errors**: Invalid UUID errors in document retrieval -4. **Deployment Complexity**: Mixed local/cloud deployment artifacts -5. **Error Handling**: Insufficient error logging and debugging capabilities -6. **Architecture Inconsistency**: Local storage and database incompatible with cloud deployment +This design addresses the systematic cleanup and stabilization of the CIM Document Processor backend, with a focus on fixing the processing pipeline and integrating Firebase Storage as the primary file storage solution. 
The design identifies critical issues in the current codebase and provides a comprehensive solution to ensure reliable document processing from upload through final PDF generation. ## Architecture +### Current Issues Identified + +Based on error analysis and code review, the following critical issues have been identified: + +1. **Database Query Issues**: UUID validation errors when non-UUID strings are passed to document queries +2. **Service Dependencies**: Circular dependencies and missing service imports +3. **Firebase Storage Integration**: Incomplete migration from Google Cloud Storage to Firebase Storage +4. **Error Handling**: Insufficient error handling and logging throughout the pipeline +5. **Configuration Management**: Environment variable validation issues in serverless environments +6. **Processing Pipeline**: Broken service orchestration in the document processing flow + ### Target Architecture -```mermaid -graph TB - subgraph "Frontend (Firebase Hosting)" - A[React App] --> B[Document Upload Component] - B --> C[Auth Context] - end - - subgraph "Backend (Cloud Run)" - D[Express API] --> E[Document Controller] - E --> F[Upload Middleware] - F --> G[File Storage Service] - G --> H[GCS Bucket] - E --> I[Document Model] - I --> J[Supabase DB] - end - - subgraph "Processing Pipeline" - K[Job Queue] --> L[Document AI] - L --> M[Agentic RAG] - M --> N[PDF Generation] - end - - A --> D - E --> K - - subgraph "Authentication" - O[Firebase Auth] --> A - O --> D - end ``` - -### Configuration Management Strategy - -1. **Environment Separation**: Clear distinction between development, staging, and production -2. **Service-Specific Configs**: Separate Firebase, GCloud, and Supabase configurations -3. **Secret Management**: Proper handling of API keys and service account credentials -4. 
**Deployment Consistency**: Single deployment strategy per environment +┌─────────────────────────────────────────────────────────────────────────────┐ +│ FRONTEND (React) │ +├─────────────────────────────────────────────────────────────────────────────┤ +│ Firebase Auth + Document Upload → Firebase Storage │ +└─────────────────────────────────────────────────────────────────────────────┘ + │ + ▼ HTTPS API +┌─────────────────────────────────────────────────────────────────────────────┐ +│ BACKEND (Node.js) │ +├─────────────────────────────────────────────────────────────────────────────┤ +│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │ +│ │ API Routes │ │ Middleware │ │ Error Handler │ │ +│ │ - Documents │ │ - Auth │ │ - Global │ │ +│ │ - Monitoring │ │ - Validation │ │ - Correlation │ │ +│ │ - Vector │ │ - CORS │ │ - Logging │ │ +│ └─────────────────┘ └─────────────────┘ └─────────────────┘ │ +│ │ +│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │ +│ │ Core Services │ │ Processing │ │ External APIs │ │ +│ │ - Document │ │ - Agentic RAG │ │ - Document AI │ │ +│ │ - Upload │ │ - LLM Service │ │ - Claude AI │ │ +│ │ - Session │ │ - PDF Gen │ │ - Firebase │ │ +│ └─────────────────┘ └─────────────────┘ └─────────────────┘ │ +└─────────────────────────────────────────────────────────────────────────────┘ + │ + ▼ +┌─────────────────────────────────────────────────────────────────────────────┐ +│ STORAGE & DATABASE │ +├─────────────────────────────────────────────────────────────────────────────┤ +│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │ +│ │ Firebase │ │ Supabase │ │ Vector │ │ +│ │ Storage │ │ Database │ │ Database │ │ +│ │ - File Upload │ │ - Documents │ │ - Embeddings │ │ +│ │ - Security │ │ - Sessions │ │ - Chunks │ │ +│ └─────────────────┘ └─────────────────┘ └─────────────────┘ │ +└─────────────────────────────────────────────────────────────────────────────┘ +``` ## Components and Interfaces -### 1. 
Configuration Cleanup Service +### 1. Enhanced Configuration Management -**Purpose**: Consolidate and standardize environment configurations +**Purpose**: Robust environment variable validation and configuration management for both development and production environments. + +**Key Features**: +- Graceful handling of missing environment variables in serverless environments +- Runtime configuration validation +- Fallback values for non-critical settings +- Clear error messages for missing critical configuration **Interface**: ```typescript -interface ConfigurationService { - validateEnvironment(): Promise<ConfigValidation>; - consolidateConfigs(): Promise<void>; - removeRedundantFiles(): Promise<void>; - updateDeploymentConfigs(): Promise<void>; +interface Config { + // Core settings + env: string; + port: number; + + // Firebase configuration + firebase: { + projectId: string; + storageBucket: string; + apiKey: string; + authDomain: string; + }; + + // Database configuration + supabase: { + url: string; + anonKey: string; + serviceKey: string; + }; + + // External services + googleCloud: { + projectId: string; + documentAiLocation: string; + documentAiProcessorId: string; + applicationCredentials: string; + }; + + // LLM configuration + llm: { + provider: 'anthropic' | 'openai'; + anthropicApiKey?: string; + openaiApiKey?: string; + model: string; + maxTokens: number; + temperature: number; + }; } ``` -**Responsibilities**: -- Remove duplicate/conflicting environment files -- Standardize Firebase and GCloud configurations -- Validate required environment variables -- Update deployment scripts and configurations +### 2. Firebase Storage Service -### 2. Storage Migration Service +**Purpose**: Complete Firebase Storage integration replacing Google Cloud Storage for file operations.
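
As a sketch of the user-scoped file path scheme this service could use — the `uploads/<userId>/...` layout and the `buildStoragePath` helper name are illustrative assumptions, not the project's actual convention:

```typescript
// Illustrative sketch: generate a storage path scoped to one user.
// Sanitizing the file name prevents a crafted name ("../...") from
// escaping the user's folder in the bucket.
function buildStoragePath(userId: string, fileName: string, now: Date = new Date()): string {
  // Replace path separators and unusual characters with underscores.
  const safeName = fileName.replace(/[^a-zA-Z0-9._-]/g, '_');
  // Timestamp prefix keeps names unique and sortable; ':' and '.' are
  // replaced because some storage tooling treats them specially.
  const stamp = now.toISOString().replace(/[:.]/g, '-');
  return `uploads/${userId}/${stamp}_${safeName}`;
}
```

Pairing a path scheme like this with Firebase Security Rules keyed on the `uploads/<userId>/` prefix gives user-based access control at the storage layer.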
-**Purpose**: Complete migration from local storage to Google Cloud Storage (no local storage going forward) +**Key Features**: +- Secure file upload with Firebase Authentication +- Proper file organization and naming conventions +- File metadata management +- Download URL generation +- File cleanup and lifecycle management **Interface**: ```typescript -interface StorageMigrationService { - migrateExistingFiles(): Promise<MigrationStatus>; - replaceFileStorageService(): Promise<void>; - validateGCSConfiguration(): Promise<void>; - removeAllLocalStorageDependencies(): Promise<void>; - updateDatabaseReferences(): Promise<void>; +interface FirebaseStorageService { + uploadFile(file: Buffer, fileName: string, userId: string): Promise<string>; + getDownloadUrl(filePath: string): Promise<string>; + deleteFile(filePath: string): Promise<void>; + getFileMetadata(filePath: string): Promise<FileMetadata>; + generateUploadUrl(fileName: string, userId: string): Promise<string>; } ``` -**Responsibilities**: -- Migrate ALL existing uploaded files to GCS -- Completely replace file storage service to use ONLY GCS -- Update all file path references in database to GCS URLs -- Remove ALL local storage code and dependencies -- Ensure no fallback to local storage exists +### 3. Enhanced Document Service -### 3. Upload Error Diagnostic Service +**Purpose**: Centralized document management with proper error handling and validation.
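
A minimal sketch of the document ID guard that should run before any database query — this assumes the canonical 8-4-4-4-12 hex UUID form, which is what Postgres accepts for `uuid` columns and what triggers the "invalid input syntax for type uuid" errors when violated:

```typescript
// Canonical UUID shape: 8-4-4-4-12 hex digits, case-insensitive.
const UUID_RE = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;

// Validate a document ID before it ever reaches a Postgres uuid column.
function validateDocumentId(id: string): boolean {
  return UUID_RE.test(id);
}
```

Rejecting malformed IDs here lets the route layer return a 400 Bad Request instead of surfacing a raw database error.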
-**Purpose**: Identify and resolve document upload errors +**Key Features**: +- UUID validation for all document operations +- Proper error handling for database operations +- Document lifecycle management +- Status tracking and updates +- Metadata management **Interface**: ```typescript -interface UploadDiagnosticService { - analyzeUploadErrors(): Promise<UploadErrorAnalysis>; - validateUploadPipeline(): Promise<void>; - fixRouteHandling(): Promise<void>; - improveErrorLogging(): Promise<void>; +interface DocumentService { + createDocument(data: CreateDocumentData): Promise<Document>; + getDocument(id: string): Promise<Document | null>; + updateDocument(id: string, updates: Partial<Document>): Promise<Document>; + deleteDocument(id: string): Promise<void>; + listDocuments(userId: string, filters?: DocumentFilters): Promise<Document[]>; + validateDocumentId(id: string): boolean; } ``` -**Responsibilities**: -- Analyze current upload error patterns -- Fix UUID validation issues in routes -- Improve error handling and logging -- Validate complete upload pipeline +### 4. Improved Processing Pipeline -### 4. Deployment Standardization Service +**Purpose**: Reliable document processing pipeline with proper error handling and recovery.
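
The retry mechanism for external calls inside the pipeline could follow the standard exponential-backoff pattern; this sketch is illustrative only — attempt counts, delays, and the injectable `sleep` parameter are assumptions, not project settings:

```typescript
// Illustrative retry wrapper with exponential backoff. The sleep function
// is injectable so the delay behavior can be replaced in tests.
async function withRetry<T>(
  fn: () => Promise<T>,
  attempts = 3,
  baseDelayMs = 200,
  sleep: (ms: number) => Promise<void> = (ms) => new Promise((r) => setTimeout(r, ms)),
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < attempts; attempt++) {
    try {
      return await fn();
    } catch (error) {
      lastError = error;
      if (attempt < attempts - 1) {
        // Delay doubles each attempt: baseDelayMs, 2x, 4x, ...
        await sleep(baseDelayMs * 2 ** attempt);
      }
    }
  }
  throw lastError;
}
```

Wrapping each processing step in a helper like this, and recording a checkpoint after each success, lets `retryProcessing` resume from the last completed step rather than restarting the whole pipeline.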
-**Purpose**: Standardize deployment processes and remove legacy artifacts +**Key Features**: +- Step-by-step processing with checkpoints +- Error recovery and retry mechanisms +- Progress tracking and status updates +- Partial result preservation +- Processing timeout handling **Interface**: ```typescript -interface DeploymentService { - standardizeDeploymentScripts(): Promise<void>; - removeLocalDeploymentArtifacts(): Promise<void>; - validateCloudDeployment(): Promise<void>; - updateDocumentation(): Promise<void>; +interface ProcessingPipeline { + processDocument(documentId: string, options: ProcessingOptions): Promise<ProcessingResult>; + getProcessingStatus(documentId: string): Promise<ProcessingStatus>; + retryProcessing(documentId: string, fromStep?: string): Promise<void>; + cancelProcessing(documentId: string): Promise<void>; } ``` -**Responsibilities**: -- Remove local deployment scripts and configurations -- Standardize Cloud Run and Firebase deployment -- Update package.json scripts -- Create deployment documentation +### 5. Robust Error Handling System + +**Purpose**: Comprehensive error handling with correlation tracking and proper logging.
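
A sketch of correlation-ID creation and the error-response shaping it enables — `randomUUID` is Node's built-in generator, while `toErrorResponse` is an illustrative helper name, not the project's actual function:

```typescript
import { randomUUID } from 'node:crypto';

// One correlation ID per request, repeated in every log line and in the
// error response so a user-reported failure can be matched to server logs.
function createCorrelationId(): string {
  return randomUUID();
}

interface ErrorResponseBody {
  success: false;
  error: {
    code: string;
    message: string;
    correlationId: string;
    timestamp: string;
    retryable: boolean;
  };
}

function toErrorResponse(error: Error, correlationId: string, retryable = false): ErrorResponseBody {
  return {
    success: false,
    error: {
      code: error.name,
      message: error.message, // user-facing text only; no stack traces or secrets
      correlationId,
      timestamp: new Date().toISOString(),
      retryable,
    },
  };
}
```

The `retryable` flag lets the frontend decide whether to offer a retry button or a terminal failure message.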
+ +**Key Features**: +- Correlation ID generation for request tracking +- Structured error logging +- Error categorization and handling strategies +- User-friendly error messages +- Error recovery mechanisms + +**Interface**: +```typescript +interface ErrorHandler { + handleError(error: Error, context: ErrorContext): ErrorResponse; + logError(error: Error, correlationId: string, context: any): void; + createCorrelationId(): string; + categorizeError(error: Error): ErrorCategory; +} +``` ## Data Models -### Configuration Validation Model +### Enhanced Document Model ```typescript -interface ConfigValidation { - environment: 'development' | 'staging' | 'production'; - requiredVars: string[]; - optionalVars: string[]; - conflicts: ConfigConflict[]; - missing: string[]; - status: 'valid' | 'invalid' | 'warning'; +interface Document { + id: string; // UUID + userId: string; + originalFileName: string; + filePath: string; // Firebase Storage path + fileSize: number; + mimeType: string; + status: DocumentStatus; + extractedText?: string; + generatedSummary?: string; + summaryPdfPath?: string; + analysisData?: CIMReview; + processingSteps: ProcessingStep[]; + errorLog?: ErrorEntry[]; + createdAt: Date; + updatedAt: Date; } -interface ConfigConflict { - variable: string; - values: string[]; - files: string[]; - resolution: string; +enum DocumentStatus { + UPLOADED = 'uploaded', + PROCESSING = 'processing', + COMPLETED = 'completed', + FAILED = 'failed', + CANCELLED = 'cancelled' +} + +interface ProcessingStep { + step: string; + status: 'pending' | 'in_progress' | 'completed' | 'failed'; + startedAt?: Date; + completedAt?: Date; + error?: string; + metadata?: Record<string, unknown>; } ``` -### Migration Status Model +### Processing Session Model ```typescript -interface MigrationStatus { - totalFiles: number; - migratedFiles: number; - failedFiles: FileError[]; - storageUsage: { - local: number; - cloud: number; - }; - status: 'pending' | 'in-progress' | 'completed' | 'failed'; -} -
-interface FileError { - filePath: string; - error: string; - retryCount: number; - lastAttempt: Date; -} -``` - -### Upload Error Analysis Model - -```typescript -interface UploadErrorAnalysis { - errorTypes: { - [key: string]: { - count: number; - examples: string[]; - severity: 'low' | 'medium' | 'high'; - }; - }; - affectedRoutes: string[]; - timeRange: { - start: Date; - end: Date; - }; - recommendations: string[]; +interface ProcessingSession { + id: string; + documentId: string; + strategy: string; + status: SessionStatus; + steps: ProcessingStep[]; + totalSteps: number; + completedSteps: number; + failedSteps: number; + processingTimeMs?: number; + apiCallsCount: number; + totalCost: number; + errorLog: ErrorEntry[]; + createdAt: Date; + completedAt?: Date; } ``` ## Error Handling -### Upload Error Resolution Strategy +### Error Categories and Strategies -1. **Route Parameter Validation**: Fix UUID validation in document routes -2. **Error Logging Enhancement**: Add structured logging with correlation IDs -3. **Graceful Degradation**: Implement fallback mechanisms for upload failures -4. **User Feedback**: Provide clear error messages to users +1. **Validation Errors** + - UUID format validation + - File type and size validation + - Required field validation + - Strategy: Return 400 Bad Request with detailed error message -### Configuration Error Handling +2. **Authentication Errors** + - Invalid or expired tokens + - Missing authentication + - Strategy: Return 401 Unauthorized, trigger token refresh -1. **Validation on Startup**: Validate all configurations before service startup -2. **Fallback Configurations**: Provide sensible defaults for non-critical settings -3. **Environment Detection**: Automatically detect and configure for deployment environment -4. **Configuration Monitoring**: Monitor configuration drift in production +3. 
**Authorization Errors** + - Insufficient permissions + - Resource access denied + - Strategy: Return 403 Forbidden with clear message -### Storage Error Handling +4. **Resource Not Found** + - Document not found + - File not found + - Strategy: Return 404 Not Found -1. **Retry Logic**: Implement exponential backoff for GCS operations -2. **Migration Safety**: Backup existing files before migration, then remove local storage completely -3. **Integrity Checks**: Validate file integrity after migration to GCS -4. **GCS-Only Operations**: All storage operations must use GCS exclusively (no local fallbacks) +5. **External Service Errors** + - Firebase Storage errors + - Document AI failures + - LLM API errors + - Strategy: Retry with exponential backoff, fallback options + +6. **Processing Errors** + - Text extraction failures + - PDF generation errors + - Database operation failures + - Strategy: Preserve partial results, enable retry from checkpoint + +7. **System Errors** + - Memory issues + - Timeout errors + - Network failures + - Strategy: Graceful degradation, error logging, monitoring alerts + +### Error Response Format + +```typescript +interface ErrorResponse { + success: false; + error: { + code: string; + message: string; + details?: any; + correlationId: string; + timestamp: string; + retryable: boolean; + }; +} +``` ## Testing Strategy -### Configuration Testing +### Unit Testing +- Service layer testing with mocked dependencies +- Utility function testing +- Configuration validation testing +- Error handling testing -1. **Environment Validation Tests**: Verify all required configurations are present -2. **Configuration Conflict Tests**: Detect and report configuration conflicts -3. **Deployment Tests**: Validate deployment configurations work correctly -4. 
**Integration Tests**: Test configuration changes don't break existing functionality +### Integration Testing +- Firebase Storage integration +- Database operations +- External API integrations +- End-to-end processing pipeline -### Upload Pipeline Testing +### Error Scenario Testing +- Network failure simulation +- API rate limit testing +- Invalid input handling +- Timeout scenario testing -1. **Unit Tests**: Test individual upload components -2. **Integration Tests**: Test complete upload pipeline -3. **Error Scenario Tests**: Test various error conditions and recovery -4. **Performance Tests**: Validate upload performance after changes - -### Storage Migration Testing - -1. **Migration Tests**: Test file migration process -2. **Data Integrity Tests**: Verify files are correctly migrated -3. **Rollback Tests**: Test ability to rollback migration -4. **Performance Tests**: Compare storage performance before/after migration - -### End-to-End Testing - -1. **User Journey Tests**: Test complete user upload journey -2. **Cross-Environment Tests**: Verify functionality across all environments -3. **Regression Tests**: Ensure cleanup doesn't break existing features -4. **Load Tests**: Validate system performance under load +### Performance Testing +- Large file upload testing +- Concurrent processing testing +- Memory usage monitoring +- API response time testing ## Implementation Phases -### Phase 1: Analysis and Planning -- Audit current configuration files and identify conflicts -- Analyze upload error patterns and root causes -- Document current deployment process and identify issues -- Create detailed cleanup and migration plan +### Phase 1: Core Infrastructure Cleanup +1. Fix configuration management and environment variable handling +2. Implement proper UUID validation for database queries +3. Set up comprehensive error handling and logging +4. 
Fix service dependency issues -### Phase 2: Configuration Cleanup -- Remove redundant and conflicting configuration files -- Standardize environment variable naming and structure -- Update deployment configurations for consistency -- Validate configurations across all environments +### Phase 2: Firebase Storage Integration +1. Implement Firebase Storage service +2. Update file upload endpoints +3. Migrate existing file operations +4. Update frontend integration -### Phase 3: Storage Migration -- Implement Google Cloud Storage integration -- Migrate existing files from local storage to GCS -- Update file storage service and database references -- Test and validate storage functionality +### Phase 3: Processing Pipeline Stabilization +1. Fix service orchestration issues +2. Implement proper error recovery +3. Add processing checkpoints +4. Enhance monitoring and logging -### Phase 4: Upload Error Resolution -- Fix UUID validation issues in document routes -- Improve error handling and logging throughout upload pipeline -- Implement better user feedback for upload errors -- Add monitoring and alerting for upload failures +### Phase 4: Testing and Optimization +1. Comprehensive testing suite +2. Performance optimization +3. Error scenario testing +4. 
Documentation updates -### Phase 5: Deployment Standardization -- Remove local deployment artifacts and scripts -- Standardize Cloud Run and Firebase deployment processes -- Update documentation and deployment guides -- Implement automated deployment validation +## Security Considerations -### Phase 6: Testing and Validation -- Comprehensive testing of all changes -- Performance validation and optimization -- User acceptance testing -- Production deployment and monitoring \ No newline at end of file +### Firebase Storage Security +- Proper Firebase Security Rules +- User-based file access control +- File type and size validation +- Secure download URL generation + +### API Security +- Request validation and sanitization +- Rate limiting and abuse prevention +- Correlation ID tracking +- Secure error messages (no sensitive data exposure) + +### Data Protection +- User data isolation +- Secure file deletion +- Audit logging +- GDPR compliance considerations + +## Monitoring and Observability + +### Key Metrics +- Document processing success rate +- Average processing time per document +- API response times +- Error rates by category +- Firebase Storage usage +- Database query performance + +### Logging Strategy +- Structured logging with correlation IDs +- Error categorization and tracking +- Performance metrics logging +- User activity logging +- External service interaction logging + +### Health Checks +- Service availability checks +- Database connectivity +- External service status +- File storage accessibility +- Processing pipeline health \ No newline at end of file diff --git a/.kiro/specs/codebase-cleanup-and-upload-fix/requirements.md b/.kiro/specs/codebase-cleanup-and-upload-fix/requirements.md index ad34cda..3323da4 100644 --- a/.kiro/specs/codebase-cleanup-and-upload-fix/requirements.md +++ b/.kiro/specs/codebase-cleanup-and-upload-fix/requirements.md @@ -2,61 +2,74 @@ ## Introduction -This feature focuses on cleaning up the codebase that has accumulated 
technical debt during the migration from local deployment to Firebase/GCloud solution, and resolving persistent document upload errors. The cleanup will improve code maintainability, remove redundant configurations, and establish a clear deployment strategy while fixing the core document upload functionality. +The CIM Document Processor is experiencing backend processing failures that prevent the full document processing pipeline from working correctly. The system has a complex architecture with multiple services (Document AI, LLM processing, PDF generation, vector database, etc.) that need to be cleaned up and properly integrated to ensure reliable document processing from upload through final PDF generation. ## Requirements ### Requirement 1 -**User Story:** As a developer, I want a clean and organized codebase, so that I can easily maintain and extend the application without confusion from legacy configurations. +**User Story:** As a developer, I want a clean and properly functioning backend codebase, so that I can reliably process CIM documents without errors. #### Acceptance Criteria -1. WHEN reviewing the codebase THEN the system SHALL have only necessary environment files and configurations -2. WHEN examining deployment configurations THEN the system SHALL have a single, clear deployment strategy for each environment -3. WHEN looking at service configurations THEN the system SHALL have consistent Firebase/GCloud integration without local deployment remnants -4. WHEN reviewing file structure THEN the system SHALL have organized directories without redundant or conflicting files +1. WHEN the backend starts THEN all services SHALL initialize without errors +2. WHEN environment variables are loaded THEN all required configuration SHALL be validated and available +3. WHEN database connections are established THEN all database operations SHALL work correctly +4. 
WHEN external service integrations are tested THEN Google Document AI, Claude AI, and Firebase Storage SHALL be properly connected ### Requirement 2 -**User Story:** As a user, I want to upload documents successfully, so that I can process and analyze my files without encountering errors. +**User Story:** As a user, I want to upload PDF documents successfully, so that I can process CIM documents for analysis. #### Acceptance Criteria -1. WHEN a user uploads a document THEN the system SHALL accept the file and begin processing without errors -2. WHEN document upload fails THEN the system SHALL provide clear error messages indicating the specific issue -3. WHEN processing a document THEN the system SHALL handle all file types supported by the Document AI service -4. WHEN upload completes THEN the system SHALL store the document in the correct Firebase/GCloud storage location +1. WHEN a user uploads a PDF file THEN the file SHALL be stored in Firebase storage +2. WHEN upload is confirmed THEN a processing job SHALL be created in the database +3. WHEN upload fails THEN the user SHALL receive clear error messages +4. WHEN upload monitoring is active THEN real-time progress SHALL be tracked and displayed ### Requirement 3 -**User Story:** As a developer, I want clear error logging and debugging capabilities, so that I can quickly identify and resolve issues in the document processing pipeline. +**User Story:** As a user, I want the document processing pipeline to work end-to-end, so that I can get structured CIM analysis results. #### Acceptance Criteria -1. WHEN an error occurs during upload THEN the system SHALL log detailed error information including stack traces -2. WHEN debugging upload issues THEN the system SHALL provide clear logging at each step of the process -3. WHEN errors occur THEN the system SHALL distinguish between client-side and server-side issues -4. WHEN reviewing logs THEN the system SHALL have structured logging with appropriate log levels +1. 
WHEN a document is uploaded THEN Google Document AI SHALL extract text successfully +2. WHEN text is extracted THEN the optimized agentic RAG processor SHALL chunk and process the content +3. WHEN chunks are processed THEN vector embeddings SHALL be generated and stored +4. WHEN LLM analysis is triggered THEN Claude AI SHALL generate structured CIM review data +5. WHEN analysis is complete THEN a PDF summary SHALL be generated using Puppeteer +6. WHEN processing fails at any step THEN error handling SHALL provide graceful degradation ### Requirement 4 -**User Story:** As a system administrator, I want consistent and secure configuration management, so that the application can be deployed reliably across different environments. +**User Story:** As a developer, I want proper error handling and logging throughout the system, so that I can diagnose and fix issues quickly. #### Acceptance Criteria -1. WHEN deploying to different environments THEN the system SHALL use environment-specific configurations -2. WHEN handling sensitive data THEN the system SHALL properly manage API keys and credentials -3. WHEN configuring services THEN the system SHALL have consistent Firebase/GCloud service initialization -4. WHEN reviewing security THEN the system SHALL have proper authentication and authorization for file uploads +1. WHEN errors occur THEN they SHALL be logged with correlation IDs for tracking +2. WHEN API calls fail THEN retry logic SHALL be implemented with exponential backoff +3. WHEN processing fails THEN partial results SHALL be preserved where possible +4. WHEN system health is checked THEN monitoring endpoints SHALL provide accurate status information ### Requirement 5 -**User Story:** As a developer, I want to understand the current system architecture, so that I can make informed decisions about cleanup priorities and upload error resolution. 
+**User Story:** As a user, I want the frontend to properly communicate with the backend, so that I can see processing status and results in real-time. #### Acceptance Criteria -1. WHEN analyzing the codebase THEN the system SHALL have documented service dependencies and data flow -2. WHEN reviewing upload process THEN the system SHALL have clear understanding of each processing step -3. WHEN examining errors THEN the system SHALL identify specific failure points in the upload pipeline -4. WHEN planning cleanup THEN the system SHALL prioritize changes that don't break existing functionality \ No newline at end of file +1. WHEN frontend makes API calls THEN authentication SHALL work correctly +2. WHEN processing is in progress THEN real-time status updates SHALL be displayed +3. WHEN processing is complete THEN results SHALL be downloadable +4. WHEN errors occur THEN user-friendly error messages SHALL be shown + +### Requirement 6 + +**User Story:** As a developer, I want clean service dependencies and proper separation of concerns, so that the codebase is maintainable and testable. + +#### Acceptance Criteria + +1. WHEN services are initialized THEN dependencies SHALL be properly injected +2. WHEN business logic is executed THEN it SHALL be separated from API routing +3. WHEN database operations are performed THEN they SHALL use proper connection pooling +4. WHEN external APIs are called THEN they SHALL have proper rate limiting and error handling \ No newline at end of file diff --git a/.kiro/specs/codebase-cleanup-and-upload-fix/tasks.md b/.kiro/specs/codebase-cleanup-and-upload-fix/tasks.md index f0e38c2..70f10b7 100644 --- a/.kiro/specs/codebase-cleanup-and-upload-fix/tasks.md +++ b/.kiro/specs/codebase-cleanup-and-upload-fix/tasks.md @@ -1,85 +1,217 @@ # Implementation Plan -- [x] 1. 
Audit and analyze current codebase configuration issues - - Identify all environment files and their conflicts - - Document current local dependencies (storage and database) - - Analyze upload error patterns from logs - - Map current deployment artifacts and scripts - - _Requirements: 1.1, 1.2, 1.3, 1.4_ +## 1. Fix Core Configuration and Environment Management -- [x] 2. Remove redundant and conflicting configuration files - - Delete duplicate .env files (.env.backup, .env.backup.hybrid, .env.development, .env.document-ai-template) - - Consolidate environment variables into single .env.example and production configs - - Remove local PostgreSQL configuration references from env.ts - - Update config validation schema to require only cloud services - - _Requirements: 1.1, 4.1, 4.2_ +- [ ] 1.1 Update environment configuration validation + - Modify `backend/src/config/env.ts` to handle serverless environment variable loading gracefully + - Add runtime configuration validation with proper fallbacks + - Implement configuration health check endpoint + - Add Firebase configuration validation + - _Requirements: 1.1, 1.2_ -- [x] 3. Implement Google Cloud Storage service integration - - Create / confirm GCS-only file storage service replacing current local storage - - Implement GCS bucket operations (upload, download, delete, list) - - Add proper error handling and retry logic for GCS operations - - Configure GCS authentication using service account - - _Requirements: 2.1, 2.2, 4.3_ +- [ ] 1.2 Implement robust error handling middleware + - Create enhanced error handler in `backend/src/middleware/errorHandler.ts` + - Add correlation ID generation and tracking + - Implement structured error logging with Winston + - Add error categorization and response formatting + - _Requirements: 4.1, 4.2_ -- [ ] 4. 
Migrate existing files from local storage to GCS - - Create migration script to upload all files from backend/uploads to GCS - - Update database file_path references to use GCS URLs instead of local paths - - Verify file integrity after migration - - Create backup of local files before cleanup - - _Requirements: 2.1, 2.2_ +- [ ] 1.3 Fix database query validation issues + - Update all document-related database queries to validate UUID format before execution + - Add proper input sanitization in `backend/src/models/DocumentModel.ts` + - Implement UUID validation utility function + - Fix the "invalid input syntax for type uuid" errors seen in logs + - _Requirements: 1.3, 4.1_ -- [x] 5. Update file storage service to use GCS exclusively - - Replace fileStorageService.ts to use only Google Cloud Storage - - Remove all local file system operations (fs.readFileSync, fs.writeFileSync, etc.) - - Update upload middleware to work with GCS temporary URLs - - Remove local upload directory creation and management - - _Requirements: 2.1, 2.2, 2.3_ +## 2. Implement Firebase Storage Integration -- [x] 6. Fix document upload route UUID validation errors - - Analyze and fix invalid UUID errors in document routes - - Add proper UUID validation middleware for document ID parameters - - Improve error messages for invalid document ID requests - - Add request correlation IDs for better error tracking - - _Requirements: 2.2, 3.1, 3.2, 3.3_ +- [ ] 2.1 Create Firebase Storage service + - Implement `backend/src/services/firebaseStorageService.ts` with complete Firebase Storage integration + - Add file upload, download, delete, and metadata operations + - Implement secure file path generation and user-based access control + - Add proper error handling and retry logic for Firebase operations + - _Requirements: 2.1, 2.5_ -- [x] 7. 
Remove all local storage dependencies and cleanup - - Delete backend/uploads directory and all local file references - - Remove local storage configuration from env.ts and related files - - Update upload middleware to remove local file system operations - - Remove cleanup functions for local files +- [ ] 2.2 Update file upload endpoints + - Modify `backend/src/routes/documents.ts` to use Firebase Storage instead of Google Cloud Storage + - Update upload URL generation to use Firebase Storage signed URLs + - Implement proper file validation (type, size, security) + - Add upload progress tracking and monitoring - _Requirements: 2.1, 2.4_ -- [x] 8. Standardize deployment configurations for cloud-only architecture - - Update Firebase deployment configurations for both frontend and backend - - Remove any local deployment scripts and references - - Standardize Cloud Run deployment configuration - - Update package.json scripts to remove local development dependencies - - _Requirements: 1.1, 1.4, 4.1_ +- [ ] 2.3 Update document processing pipeline for Firebase Storage + - Modify `backend/src/services/unifiedDocumentProcessor.ts` to work with Firebase Storage + - Update file retrieval operations in processing services + - Ensure PDF generation service can access files from Firebase Storage + - Update all file path references throughout the codebase + - _Requirements: 2.1, 3.1_ -- [x] 9. Enhance error logging and monitoring for upload pipeline - - Add structured logging with correlation IDs throughout upload process - - Implement better error categorization and reporting - - Add monitoring for upload success/failure rates - - Create error dashboards for upload pipeline debugging - - _Requirements: 3.1, 3.2, 3.3_ +## 3. Fix Service Dependencies and Orchestration -- [x] 10. 
Update frontend to handle GCS-based file operations - - Update DocumentUpload component to work with GCS URLs - - Modify file progress monitoring to work with cloud storage - - Update error handling for GCS-specific errors - - Test upload functionality with new GCS backend - - _Requirements: 2.1, 2.2, 3.4_ +- [ ] 3.1 Resolve service import and dependency issues + - Fix circular dependencies in service imports + - Ensure all required services are properly imported and initialized + - Add dependency injection pattern for better testability + - Update service initialization order in `backend/src/index.ts` + - _Requirements: 6.1, 6.2_ -- [x] 11. Create comprehensive tests for cloud-only architecture - - Write unit tests for GCS file storage service - - Create integration tests for complete upload pipeline - - Add tests for error scenarios and recovery - - Test deployment configurations in staging environment - - _Requirements: 1.4, 2.1, 2.2, 2.3_ +- [ ] 3.2 Fix document processing pipeline orchestration + - Update `backend/src/services/unifiedDocumentProcessor.ts` to handle all processing steps correctly + - Ensure proper error handling between processing steps + - Add processing checkpoints and recovery mechanisms + - Implement proper status updates throughout the pipeline + - _Requirements: 3.1, 3.2, 3.6_ -- [x] 12. 
Validate and test complete system functionality - - Perform end-to-end testing of document upload and processing - - Validate all environment configurations work correctly - - Test error handling and user feedback mechanisms - - Verify no local dependencies remain in the system - - _Requirements: 1.1, 1.2, 1.4, 2.1, 2.2, 2.3, 2.4_ \ No newline at end of file +- [ ] 3.3 Enhance optimized agentic RAG processor + - Fix any issues in `backend/src/services/optimizedAgenticRAGProcessor.ts` + - Ensure proper memory management and garbage collection + - Add better error handling for LLM API calls + - Implement proper retry logic with exponential backoff + - _Requirements: 3.3, 3.4, 4.2_ + +## 4. Improve LLM Service Integration + +- [ ] 4.1 Fix LLM service configuration and initialization + - Update `backend/src/services/llmService.ts` to handle configuration properly + - Fix model selection logic and API key validation + - Add proper timeout handling for LLM API calls + - Implement cost tracking and usage monitoring + - _Requirements: 1.4, 3.4_ + +- [ ] 4.2 Enhance LLM error handling and retry logic + - Add comprehensive error handling for both Anthropic and OpenAI APIs + - Implement retry logic with exponential backoff for API failures + - Add fallback model selection when primary model fails + - Implement proper JSON parsing and validation for LLM responses + - _Requirements: 3.4, 4.2_ + +- [ ] 4.3 Add LLM response validation and self-correction + - Enhance JSON extraction from LLM responses + - Add schema validation for CIM review data + - Implement self-correction mechanism for invalid responses + - Add quality scoring and validation for generated content + - _Requirements: 3.4, 4.3_ + +## 5. 
Fix PDF Generation and File Operations + +- [ ] 5.1 Update PDF generation service for Firebase Storage + - Modify `backend/src/services/pdfGenerationService.ts` to work with Firebase Storage + - Ensure generated PDFs are properly stored in Firebase Storage + - Add proper error handling for PDF generation failures + - Implement PDF generation progress tracking + - _Requirements: 3.5, 2.1_ + +- [ ] 5.2 Implement proper file cleanup and lifecycle management + - Add automatic cleanup of temporary files during processing + - Implement file lifecycle management (retention policies) + - Add proper error handling for file operations + - Ensure no orphaned files are left in storage + - _Requirements: 2.1, 4.3_ + +## 6. Enhance Database Operations and Models + +- [ ] 6.1 Fix document model and database operations + - Update `backend/src/models/DocumentModel.ts` with proper UUID validation + - Add comprehensive error handling for all database operations + - Implement proper connection pooling and retry logic + - Add database operation logging and monitoring + - _Requirements: 1.3, 6.3_ + +- [ ] 6.2 Implement processing session tracking + - Enhance agentic RAG session management in database models + - Add proper session lifecycle tracking + - Implement session cleanup and archival + - Add session analytics and reporting + - _Requirements: 3.6, 4.4_ + +- [ ] 6.3 Add vector database integration fixes + - Ensure vector database service is properly integrated + - Fix any issues with embedding generation and storage + - Add proper error handling for vector operations + - Implement vector database health checks + - _Requirements: 3.3, 1.3_ + +## 7. 
Implement Comprehensive Monitoring and Logging + +- [ ] 7.1 Add structured logging throughout the application + - Implement correlation ID tracking across all services + - Add comprehensive logging for all processing steps + - Create structured log format for better analysis + - Add log aggregation and monitoring setup + - _Requirements: 4.1, 4.4_ + +- [ ] 7.2 Implement health check and monitoring endpoints + - Create comprehensive health check endpoint that tests all services + - Add monitoring endpoints for processing statistics + - Implement real-time status monitoring + - Add alerting for critical failures + - _Requirements: 4.4, 1.1_ + +- [ ] 7.3 Add performance monitoring and metrics + - Implement processing time tracking for all operations + - Add memory usage monitoring and alerts + - Create API response time monitoring + - Add cost tracking for external service usage + - _Requirements: 4.4, 3.6_ + +## 8. Update Frontend Integration + +- [ ] 8.1 Update frontend Firebase Storage integration + - Modify frontend upload components to work with Firebase Storage + - Update authentication flow for Firebase Storage access + - Add proper error handling and user feedback for upload operations + - Implement upload progress tracking on frontend + - _Requirements: 5.1, 5.4_ + +- [ ] 8.2 Enhance frontend error handling and user experience + - Add proper error message display for all error scenarios + - Implement retry mechanisms for failed operations + - Add loading states and progress indicators + - Ensure real-time status updates work correctly + - _Requirements: 5.2, 5.4_ + +## 9. 
Testing and Quality Assurance + +- [ ] 9.1 Create comprehensive unit tests + - Write unit tests for all service functions + - Add tests for error handling scenarios + - Create tests for configuration validation + - Add tests for UUID validation and database operations + - _Requirements: 6.4_ + +- [ ] 9.2 Implement integration tests + - Create end-to-end tests for document processing pipeline + - Add tests for Firebase Storage integration + - Create tests for external API integrations + - Add tests for error recovery scenarios + - _Requirements: 6.4_ + +- [ ] 9.3 Add performance and load testing + - Create tests for large file processing + - Add concurrent processing tests + - Implement memory leak detection tests + - Add API rate limiting tests + - _Requirements: 6.4_ + +## 10. Documentation and Deployment + +- [ ] 10.1 Update configuration documentation + - Document all required environment variables + - Create setup guides for Firebase Storage configuration + - Add troubleshooting guides for common issues + - Update deployment documentation + - _Requirements: 1.2_ + +- [ ] 10.2 Create operational runbooks + - Document error recovery procedures + - Create monitoring and alerting setup guides + - Add performance tuning guidelines + - Create backup and disaster recovery procedures + - _Requirements: 4.4_ + +- [ ] 10.3 Final integration testing and deployment + - Perform comprehensive end-to-end testing + - Validate all error scenarios work correctly + - Test deployment in staging environment + - Perform production deployment with monitoring + - _Requirements: 1.1, 2.1, 3.1_ \ No newline at end of file diff --git a/APP_DESIGN_DOCUMENTATION.md b/APP_DESIGN_DOCUMENTATION.md index fa2d5d2..1d5e5cb 100644 --- a/APP_DESIGN_DOCUMENTATION.md +++ b/APP_DESIGN_DOCUMENTATION.md @@ -85,7 +85,7 @@ Document Uploaded ▼ ┌─────────────────┐ │ 1. 
Text │ ──► Google Document AI extracts text from PDF -│ Extraction │ (documentAiGenkitProcessor or direct Document AI) +│ Extraction │ (documentAiProcessor or direct Document AI) └─────────┬───────┘ │ ▼ diff --git a/ARCHITECTURE_DIAGRAMS.md b/ARCHITECTURE_DIAGRAMS.md index 32cd237..a2274ba 100644 --- a/ARCHITECTURE_DIAGRAMS.md +++ b/ARCHITECTURE_DIAGRAMS.md @@ -92,7 +92,7 @@ ▼ ┌─────────────────┐ │ 4. Text │ ──► Google Document AI extracts text from PDF -│ Extraction │ (documentAiGenkitProcessor or direct Document AI) +│ Extraction │ (documentAiProcessor or direct Document AI) └─────────┬───────┘ │ ▼ diff --git a/DEPENDENCY_ANALYSIS_REPORT.md b/DEPENDENCY_ANALYSIS_REPORT.md index 9b5c746..0d56964 100644 --- a/DEPENDENCY_ANALYSIS_REPORT.md +++ b/DEPENDENCY_ANALYSIS_REPORT.md @@ -35,7 +35,7 @@ This report analyzes the dependencies in both backend and frontend packages to i - `joi` - Used for environment validation and middleware validation - `zod` - Used in llmSchemas.ts and llmService.ts for schema validation - `multer` - Used in upload middleware (legacy multipart upload) -- `pdf-parse` - Used in documentAiGenkitProcessor.ts (legacy processor) +- `pdf-parse` - Used in documentAiProcessor.ts (Document AI fallback) #### ⚠️ **Potentially Unused Dependencies** - `redis` - Only imported in sessionService.ts but may not be actively used @@ -108,7 +108,7 @@ None identified - all dependencies appear to be used somewhere in the codebase. 
### Current Active Strategy Based on the code analysis, the current processing strategy is: - **Primary**: `optimized_agentic_rag` (most actively used) -- **Fallback**: `document_ai_genkit` (legacy implementation) +- **Fallback**: `document_ai_agentic_rag` (Document AI + Agentic RAG) ### Unused Processing Strategies The following strategies are implemented but not actively used: @@ -130,7 +130,7 @@ The following strategies are implemented but not actively used: #### ⚠️ **Legacy Services (Can be removed)** - `documentProcessingService` - Legacy chunking service -- `documentAiGenkitProcessor` - Legacy Document AI processor +- `documentAiProcessor` - Document AI + Agentic RAG processor (now active; do not remove) - `ragDocumentProcessor` - Basic RAG processor ## Outdated Packages Analysis @@ -223,7 +223,7 @@ The following strategies are implemented but not actively used: 1. **Remove legacy processing services**: - `documentProcessingService.ts` - - `documentAiGenkitProcessor.ts` + - `documentAiProcessor.ts` (keep; now the active Document AI + Agentic RAG processor) - `ragDocumentProcessor.ts` 2. **Simplify unifiedDocumentProcessor**: @@ -299,7 +299,7 @@ The following strategies are implemented but not actively used: ### Key Findings 1. **Unused Dependencies**: 2 frontend dependencies (`clsx`, `tailwind-merge`) are completely unused -2. **Legacy Services**: 3 processing services can be removed (`documentProcessingService`, `documentAiGenkitProcessor`, `ragDocumentProcessor`) +2. **Legacy Services**: 2 processing services can be removed (`documentProcessingService`, `ragDocumentProcessor`) 3. **Redundant Dependencies**: Both `joi` and `zod` for validation, both `pg` and Supabase for database 4. **Outdated Packages**: 21 backend and 15 frontend packages have updates available 5.
**Major Version Updates**: Many packages require major version updates with potential breaking changes diff --git a/DOCUMENT_AI_GENKIT_INTEGRATION.md b/DOCUMENT_AI_AGENTIC_RAG_INTEGRATION.md similarity index 87% rename from DOCUMENT_AI_GENKIT_INTEGRATION.md rename to DOCUMENT_AI_AGENTIC_RAG_INTEGRATION.md index ed1ffd5..83fb352 100644 --- a/DOCUMENT_AI_GENKIT_INTEGRATION.md +++ b/DOCUMENT_AI_AGENTIC_RAG_INTEGRATION.md @@ -1,10 +1,10 @@ -# Document AI + Genkit Integration Guide +# Document AI + Agentic RAG Integration Guide ## Overview -This guide explains how to integrate Google Cloud Document AI with Genkit for enhanced CIM document processing. This approach provides superior text extraction and structured analysis compared to traditional PDF parsing. +This guide explains how to integrate Google Cloud Document AI with Agentic RAG for enhanced CIM document processing. This approach provides superior text extraction and structured analysis compared to traditional PDF parsing. -## 🎯 **Benefits of Document AI + Genkit** +## 🎯 **Benefits of Document AI + Agentic RAG** ### **Document AI Advantages:** - **Superior text extraction** from complex PDF layouts @@ -13,7 +13,7 @@ This guide explains how to integrate Google Cloud Document AI with Genkit for en - **Layout understanding** maintains document structure - **Multi-format support** (PDF, images, scanned documents) -### **Genkit Advantages:** +### **Agentic RAG Advantages:** - **Structured AI workflows** with type safety - **Map-reduce processing** for large documents - **Timeout handling** and error recovery @@ -77,7 +77,6 @@ Add these to your `package.json`: "dependencies": { "@google-cloud/documentai": "^8.0.0", "@google-cloud/storage": "^7.0.0", - "genkit": "^0.1.0", "zod": "^3.25.76" } } @@ -95,7 +95,7 @@ type ProcessingStrategy = | 'rag' // Retrieval-Augmented Generation | 'agentic_rag' // Multi-agent RAG system | 'optimized_agentic_rag' // Optimized multi-agent system - | 
'document_ai_genkit'; // Document AI + Genkit (NEW) + | 'document_ai_agentic_rag'; // Document AI + Agentic RAG (NEW) ``` ### **2. Environment Configuration** @@ -120,14 +120,14 @@ const envSchema = Joi.object({ ```typescript // Set as default strategy -PROCESSING_STRATEGY=document_ai_genkit +PROCESSING_STRATEGY=document_ai_agentic_rag // Or select per document const result = await unifiedDocumentProcessor.processDocument( documentId, userId, text, - { strategy: 'document_ai_genkit' } + { strategy: 'document_ai_agentic_rag' } ); ``` @@ -136,7 +136,7 @@ const result = await unifiedDocumentProcessor.processDocument( ### **1. Basic Document Processing** ```typescript -import { processCimDocumentServerAction } from './documentAiGenkitProcessor'; +import { processCimDocumentServerAction } from './documentAiProcessor'; const result = await processCimDocumentServerAction({ fileDataUri: 'data:application/pdf;base64,JVBERi0xLjc...', @@ -154,9 +154,9 @@ export const documentController = { async uploadDocument(req: Request, res: Response): Promise { // ... 
existing upload logic - // Use Document AI + Genkit strategy + // Use Document AI + Agentic RAG strategy const processingOptions = { - strategy: 'document_ai_genkit', + strategy: 'document_ai_agentic_rag', enableTableExtraction: true, enableEntityRecognition: true }; @@ -179,11 +179,11 @@ const comparison = await unifiedDocumentProcessor.compareProcessingStrategies( documentId, userId, text, - { includeDocumentAiGenkit: true } + { includeDocumentAiAgenticRag: true } ); console.log('Best strategy:', comparison.winner); -console.log('Document AI + Genkit result:', comparison.documentAiGenkit); +console.log('Document AI + Agentic RAG result:', comparison.documentAiAgenticRag); ``` ## 📊 **Performance Comparison** @@ -195,7 +195,7 @@ console.log('Document AI + Genkit result:', comparison.documentAiGenkit); | Chunking | 3-5 minutes | 9-12 | 7/10 | $2-3 | | RAG | 2-3 minutes | 6-8 | 8/10 | $1.5-2 | | Agentic RAG | 4-6 minutes | 15-20 | 9/10 | $3-4 | -| **Document AI + Genkit** | **1-2 minutes** | **1-2** | **9.5/10** | **$1-1.5** | +| **Document AI + Agentic RAG** | **1-2 minutes** | **1-2** | **9.5/10** | **$1-1.5** | ### **Key Advantages:** - **50% faster** than traditional chunking @@ -218,7 +218,7 @@ try { } } -// 2. Genkit Flow Timeouts +// 2. Agentic RAG Flow Timeouts const TIMEOUT_DURATION_FLOW = 1800000; // 30 minutes const TIMEOUT_DURATION_ACTION = 2100000; // 35 minutes @@ -236,10 +236,10 @@ try { ### **1. 
Unit Tests** ```typescript -// Test Document AI + Genkit processor -describe('DocumentAiGenkitProcessor', () => { +// Test Document AI + Agentic RAG processor +describe('DocumentAiProcessor', () => { it('should process CIM document successfully', async () => { - const processor = new DocumentAiGenkitProcessor(); + const processor = new DocumentAiProcessor(); const result = await processor.processDocument( 'test-doc-id', 'test-user-id', @@ -258,7 +258,7 @@ describe('DocumentAiGenkitProcessor', () => { ```typescript // Test full pipeline -describe('Document AI + Genkit Integration', () => { +describe('Document AI + Agentic RAG Integration', () => { it('should process real CIM document', async () => { const fileDataUri = await loadTestPdfAsDataUri(); const result = await processCimDocumentServerAction({ @@ -326,7 +326,7 @@ const metrics = { ```typescript // Log detailed error information -logger.error('Document AI + Genkit processing failed', { +logger.error('Document AI + Agentic RAG processing failed', { documentId, error: error.message, stack: error.stack, @@ -341,7 +341,7 @@ logger.error('Document AI + Genkit processing failed', { 2. **Configure environment variables** with your project details 3. **Test with sample CIM documents** to validate extraction quality 4. **Compare performance** with existing strategies -5. **Gradually migrate** from chunking to Document AI + Genkit +5. **Gradually migrate** from chunking to Document AI + Agentic RAG 6. 
**Monitor costs and performance** in production ## 📞 **Support** @@ -349,7 +349,7 @@ logger.error('Document AI + Genkit processing failed', { For issues with: - **Google Cloud setup**: Check Google Cloud documentation - **Document AI**: Review processor configuration and permissions -- **Genkit integration**: Verify API keys and model configuration +- **Agentic RAG integration**: Verify API keys and model configuration - **Performance**: Monitor logs and adjust timeout settings This integration provides a significant upgrade to your CIM processing capabilities with better quality, faster processing, and lower costs. \ No newline at end of file diff --git a/DOCUMENT_AI_INTEGRATION_SUMMARY.md b/DOCUMENT_AI_INTEGRATION_SUMMARY.md index d86d391..68f602f 100644 --- a/DOCUMENT_AI_INTEGRATION_SUMMARY.md +++ b/DOCUMENT_AI_INTEGRATION_SUMMARY.md @@ -1,8 +1,8 @@ -# Document AI + Genkit Integration Summary +# Document AI + Agentic RAG Integration Summary ## 🎉 **Integration Complete!** -We have successfully set up Google Cloud Document AI + Genkit integration for your CIM processing system. Here's what we've accomplished: +We have successfully set up Google Cloud Document AI + Agentic RAG integration for your CIM processing system. Here's what we've accomplished: ## ✅ **What's Been Set Up:** @@ -16,9 +16,9 @@ We have successfully set up Google Cloud Document AI + Genkit integration for yo - ✅ **Permissions**: Document AI API User, Storage Object Admin ### **2. Code Integration** -- ✅ **New Processor**: `DocumentAiGenkitProcessor` class +- ✅ **New Processor**: `DocumentAiProcessor` class - ✅ **Environment Config**: Updated with Document AI settings -- ✅ **Unified Processor**: Added `document_ai_genkit` strategy +- ✅ **Unified Processor**: Added `document_ai_agentic_rag` strategy - ✅ **Dependencies**: Installed `@google-cloud/documentai` and `@google-cloud/storage` ### **3. 
Testing & Validation** @@ -54,15 +54,15 @@ node scripts/test-integration-with-mock.js node scripts/test-document-ai-integration.js ``` -### **4. Switch to Document AI + Genkit Strategy** +### **4. Switch to Document AI + Agentic RAG Strategy** Update your environment or processing options: ```bash -PROCESSING_STRATEGY=document_ai_genkit +PROCESSING_STRATEGY=document_ai_agentic_rag ``` ## 📊 **Expected Performance Improvements:** -| Metric | Current (Chunking) | Document AI + Genkit | Improvement | +| Metric | Current (Chunking) | Document AI + Agentic RAG | Improvement | |--------|-------------------|---------------------|-------------| | **Processing Time** | 3-5 minutes | 1-2 minutes | **50% faster** | | **API Calls** | 9-12 calls | 1-2 calls | **90% reduction** | @@ -80,7 +80,7 @@ CIM Document Upload ↓ Text + Entities + Tables ↓ - Genkit AI Analysis + Agentic RAG AI Analysis ↓ Structured CIM Analysis ``` @@ -93,15 +93,15 @@ Your system now supports **5 processing strategies**: 2. **`rag`** - Retrieval-Augmented Generation 3. **`agentic_rag`** - Multi-agent RAG system 4. **`optimized_agentic_rag`** - Optimized multi-agent system -5. **`document_ai_genkit`** - Document AI + Genkit (NEW) +5. 
**`document_ai_agentic_rag`** - Document AI + Agentic RAG (NEW) ## 📁 **Generated Files:** - `backend/.env.document-ai-template` - Environment configuration template - `backend/DOCUMENT_AI_SETUP_INSTRUCTIONS.md` - Detailed setup instructions - `backend/scripts/` - Various test and setup scripts -- `backend/src/services/documentAiGenkitProcessor.ts` - Integration processor -- `DOCUMENT_AI_GENKIT_INTEGRATION.md` - Comprehensive integration guide +- `backend/src/services/documentAiProcessor.ts` - Integration processor +- `DOCUMENT_AI_AGENTIC_RAG_INTEGRATION.md` - Comprehensive integration guide ## 🚀 **Next Steps:** @@ -118,7 +118,7 @@ Your system now supports **5 processing strategies**: - **Layout understanding** maintains document structure - **Lower costs** with better quality - **Faster processing** with fewer API calls -- **Type-safe workflows** with Genkit +- **Type-safe workflows** with Agentic RAG ## 🔍 **Troubleshooting:** @@ -131,7 +131,7 @@ Your system now supports **5 processing strategies**: - **Google Cloud Console**: https://console.cloud.google.com - **Document AI Documentation**: https://cloud.google.com/document-ai -- **Genkit Documentation**: https://genkit.ai +- **Agentic RAG Documentation**: See optimizedAgenticRAGProcessor.ts - **Generated Instructions**: `backend/DOCUMENT_AI_SETUP_INSTRUCTIONS.md` --- diff --git a/backend/DOCUMENT_AI_SETUP_INSTRUCTIONS.md b/backend/DOCUMENT_AI_SETUP_INSTRUCTIONS.md index d12349b..7a0bc39 100644 --- a/backend/DOCUMENT_AI_SETUP_INSTRUCTIONS.md +++ b/backend/DOCUMENT_AI_SETUP_INSTRUCTIONS.md @@ -1,4 +1,4 @@ -# Document AI + Genkit Setup Instructions +# Document AI + Agentic RAG Setup Instructions ## ✅ Completed Steps: 1. Google Cloud Project: cim-summarizer @@ -27,7 +27,7 @@ Go to: https://console.cloud.google.com/ai/document-ai/processors Run: node scripts/test-integration-with-mock.js ### 4. Integrate with Existing System -1. Update PROCESSING_STRATEGY=document_ai_genkit +1. 
Update PROCESSING_STRATEGY=document_ai_agentic_rag 2. Test with real CIM documents 3. Monitor performance and costs @@ -45,4 +45,4 @@ Run: node scripts/test-integration-with-mock.js ## 📞 Support: - Google Cloud Console: https://console.cloud.google.com - Document AI Documentation: https://cloud.google.com/document-ai -- Genkit Documentation: https://genkit.ai +- Agentic RAG Documentation: See optimizedAgenticRAGProcessor.ts diff --git a/backend/package.json b/backend/package.json index 6178b91..ff3a61e 100644 --- a/backend/package.json +++ b/backend/package.json @@ -45,7 +45,6 @@ "joi": "^17.11.0", "jsonwebtoken": "^9.0.2", "morgan": "^1.10.0", - "multer": "^1.4.5-lts.1", "openai": "^5.10.2", "pdf-parse": "^1.1.1", "pg": "^8.11.3", @@ -62,7 +61,6 @@ "@types/jest": "^29.5.8", "@types/jsonwebtoken": "^9.0.5", "@types/morgan": "^1.9.9", - "@types/multer": "^1.4.11", "@types/node": "^20.9.0", "@types/pdf-parse": "^1.1.4", "@types/pg": "^8.10.7", diff --git a/backend/scripts/setup-complete.js b/backend/scripts/setup-complete.js index 35ffa0e..088431e 100644 --- a/backend/scripts/setup-complete.js +++ b/backend/scripts/setup-complete.js @@ -10,7 +10,7 @@ const GCS_BUCKET_NAME = 'cim-summarizer-uploads'; const DOCUMENT_AI_OUTPUT_BUCKET_NAME = 'cim-summarizer-document-ai-output'; async function setupComplete() { - console.log('🚀 Complete Document AI + Genkit Setup\n'); + console.log('🚀 Complete Document AI + Agentic RAG Setup\n'); try { // Check current setup @@ -57,7 +57,7 @@ GCS_BUCKET_NAME=${GCS_BUCKET_NAME} DOCUMENT_AI_OUTPUT_BUCKET_NAME=${DOCUMENT_AI_OUTPUT_BUCKET_NAME} # Processing Strategy -PROCESSING_STRATEGY=document_ai_genkit +PROCESSING_STRATEGY=document_ai_agentic_rag # Google Cloud Authentication GOOGLE_APPLICATION_CREDENTIALS=./serviceAccountKey.json @@ -91,7 +91,7 @@ MAX_FILE_SIZE=104857600 // Generate setup instructions console.log('\n3. 
Setup Instructions...'); - const instructions = `# Document AI + Genkit Setup Instructions + const instructions = `# Document AI + Agentic RAG Setup Instructions ## ✅ Completed Steps: 1. Google Cloud Project: ${PROJECT_ID} @@ -120,7 +120,7 @@ Go to: https://console.cloud.google.com/ai/document-ai/processors Run: node scripts/test-integration-with-mock.js ### 4. Integrate with Existing System -1. Update PROCESSING_STRATEGY=document_ai_genkit +1. Update PROCESSING_STRATEGY=document_ai_agentic_rag 2. Test with real CIM documents 3. Monitor performance and costs @@ -138,7 +138,7 @@ Run: node scripts/test-integration-with-mock.js ## 📞 Support: - Google Cloud Console: https://console.cloud.google.com - Document AI Documentation: https://cloud.google.com/document-ai -- Genkit Documentation: https://genkit.ai +- Agentic RAG Documentation: See optimizedAgenticRAGProcessor.ts `; const instructionsPath = path.join(__dirname, '../DOCUMENT_AI_SETUP_INSTRUCTIONS.md'); @@ -177,7 +177,7 @@ Run: node scripts/test-integration-with-mock.js console.log('1. Create Document AI processor in console'); console.log('2. Update .env file with processor ID'); console.log('3. Test with real CIM documents'); - console.log('4. Switch to document_ai_genkit strategy'); + console.log('4. 
Switch to document_ai_agentic_rag strategy'); console.log('\n📁 Generated Files:'); console.log(` - ${envPath}`); diff --git a/backend/scripts/test-full-integration.js b/backend/scripts/test-full-integration.js index 64df60e..57ec150 100644 --- a/backend/scripts/test-full-integration.js +++ b/backend/scripts/test-full-integration.js @@ -94,7 +94,7 @@ APPENDIX } async function testFullIntegration() { - console.log('🧪 Testing Full Document AI + Genkit Integration...\n'); + console.log('🧪 Testing Full Document AI + Agentic RAG Integration...\n'); let testFile = null; @@ -236,20 +236,20 @@ async function testFullIntegration() { console.log(` 🏷️ Entities found: ${documentAiOutput.entities.length}`); console.log(` 📋 Tables found: ${documentAiOutput.tables.length}`); - // Step 6: Test Genkit Integration (Simulated) - console.log('\n6. Testing Genkit AI Analysis...'); + // Step 6: Test Agentic RAG Integration (Simulated) +    console.log('\n6. Testing Agentic RAG AI Analysis...'); - // Simulate Genkit processing with the Document AI output - const genkitInput = { + // Simulate Agentic RAG processing with the Document AI output +    const agenticRagInput = { extractedText: documentAiOutput.text, fileName: testFile.testFileName, documentAiOutput: documentAiOutput }; - console.log(' 🤖 Simulating Genkit AI analysis...'); + console.log(' 🤖 Simulating Agentic RAG AI analysis...'); - // Simulate Genkit output based on the CIM analysis prompt - const genkitOutput = { + // Simulate Agentic RAG output based on the CIM analysis prompt +    const agenticRagOutput = { markdownOutput: `# CIM Investment Analysis: TechFlow Solutions Inc. ## Executive Summary @@ -360,19 +360,19 @@ async function testFullIntegration() { 5.
Team background verification --- -*Analysis generated by Document AI + Genkit integration* +*Analysis generated by Document AI + Agentic RAG integration* ` }; - console.log(` ✅ Genkit analysis completed`); - console.log(` 📊 Analysis length: ${genkitOutput.markdownOutput.length} characters`); + console.log(` ✅ Agentic RAG analysis completed`); + console.log(` 📊 Analysis length: ${agenticRagOutput.markdownOutput.length} characters`); // Step 7: Final Integration Test console.log('\n7. Final Integration Test...'); const finalResult = { success: true, - summary: genkitOutput.markdownOutput, + summary: agenticRagOutput.markdownOutput, analysisData: { company: 'TechFlow Solutions Inc.', industry: 'SaaS / Enterprise Software', @@ -393,7 +393,7 @@ async function testFullIntegration() { ], exitStrategy: 'IPO within 3-4 years, $500M-$1B valuation' }, - processingStrategy: 'document_ai_genkit', + processingStrategy: 'document_ai_agentic_rag', processingTime: Date.now(), apiCalls: 1, metadata: { @@ -430,7 +430,7 @@ async function testFullIntegration() { console.log('✅ Document AI text extraction simulated'); console.log('✅ Entity recognition working (20 entities found)'); console.log('✅ Table structure preserved'); - console.log('✅ Genkit AI analysis completed'); + console.log('✅ Agentic RAG AI analysis completed'); console.log('✅ Full pipeline integration working'); console.log('✅ Cleanup operations successful'); @@ -439,11 +439,11 @@ async function testFullIntegration() { console.log(` 📊 Extracted text: ${documentAiOutput.text.length} characters`); console.log(` 🏷️ Entities recognized: ${documentAiOutput.entities.length}`); console.log(` 📋 Tables extracted: ${documentAiOutput.tables.length}`); - console.log(` 🤖 AI analysis length: ${genkitOutput.markdownOutput.length} characters`); - console.log(` ⚡ Processing strategy: document_ai_genkit`); + console.log(` 🤖 AI analysis length: ${agenticRagOutput.markdownOutput.length} characters`); + console.log(` ⚡ Processing strategy: 
document_ai_agentic_rag`); console.log('\n🚀 Ready for Production!'); - console.log('Your Document AI + Genkit integration is fully operational and ready to process real CIM documents.'); + console.log('Your Document AI + Agentic RAG integration is fully operational and ready to process real CIM documents.'); return finalResult; diff --git a/backend/scripts/test-integration-with-mock.js b/backend/scripts/test-integration-with-mock.js index a83cc13..2e2ff48 100644 --- a/backend/scripts/test-integration-with-mock.js +++ b/backend/scripts/test-integration-with-mock.js @@ -153,7 +153,7 @@ IPO or strategic acquisition within 5 years Expected return: 3-5x `, metadata: { - processingStrategy: 'document_ai_genkit', + processingStrategy: 'document_ai_agentic_rag', documentAiOutput: mockDocumentAiOutput, processingTime: Date.now(), fileSize: sampleCIM.length, diff --git a/backend/scripts/test-real-processor.js b/backend/scripts/test-real-processor.js index cac0e2e..6a251ac 100644 --- a/backend/scripts/test-real-processor.js +++ b/backend/scripts/test-real-processor.js @@ -166,7 +166,7 @@ IPO or strategic acquisition within 5 years Expected return: 3-5x `, metadata: { - processingStrategy: 'document_ai_genkit', + processingStrategy: 'document_ai_agentic_rag', documentAiOutput: mockDocumentAiOutput, processingTime: Date.now(), fileSize: sampleCIM.length, @@ -193,7 +193,7 @@ GCS_BUCKET_NAME=${GCS_BUCKET_NAME} DOCUMENT_AI_OUTPUT_BUCKET_NAME=${DOCUMENT_AI_OUTPUT_BUCKET_NAME} # Processing Strategy -PROCESSING_STRATEGY=document_ai_genkit +PROCESSING_STRATEGY=document_ai_agentic_rag # Google Cloud Authentication GOOGLE_APPLICATION_CREDENTIALS=./serviceAccountKey.json @@ -212,7 +212,7 @@ GOOGLE_APPLICATION_CREDENTIALS=./serviceAccountKey.json console.log('\n📋 Next Steps:'); console.log('1. Add the environment variables to your .env file'); console.log('2. Test with real PDF CIM documents'); - console.log('3. Switch to document_ai_genkit strategy'); + console.log('3. 
Switch to document_ai_agentic_rag strategy'); console.log('4. Monitor performance and quality'); return processingResult; diff --git a/backend/src/config/env.ts b/backend/src/config/env.ts index 8ad806d..5dbe770 100644 --- a/backend/src/config/env.ts +++ b/backend/src/config/env.ts @@ -74,9 +74,7 @@ const envSchema = Joi.object({ LOG_FILE: Joi.string().default('logs/app.log'), // Processing Strategy - PROCESSING_STRATEGY: Joi.string().valid('chunking', 'rag', 'agentic_rag', 'document_ai_genkit').default('chunking'), - ENABLE_RAG_PROCESSING: Joi.boolean().default(false), - ENABLE_PROCESSING_COMPARISON: Joi.boolean().default(false), + PROCESSING_STRATEGY: Joi.string().valid('document_ai_agentic_rag').default('document_ai_agentic_rag'), // Agentic RAG Configuration AGENTIC_RAG_ENABLED: Joi.boolean().default(false), diff --git a/backend/src/controllers/documentController.ts b/backend/src/controllers/documentController.ts index ec3644b..79a7072 100644 --- a/backend/src/controllers/documentController.ts +++ b/backend/src/controllers/documentController.ts @@ -197,7 +197,7 @@ export const documentController = { userId, '', // Text is not needed for this strategy { - strategy: 'document_ai_genkit', + strategy: 'document_ai_agentic_rag', fileBuffer: fileBuffer, fileName: document.original_file_name, mimeType: 'application/pdf' @@ -294,153 +294,7 @@ export const documentController = { } }, - async uploadDocument(req: Request, res: Response): Promise<void> { - const startTime = Date.now(); - - // 🔍 COMPREHENSIVE DEBUG: Log everything about the request - console.log('🚀 ========================='); - console.log('🚀 DOCUMENT AI UPLOAD STARTED'); - console.log('🚀 Method:', req.method); - console.log('🚀 URL:', req.url); - console.log('🚀 Content-Type:', req.get('Content-Type')); - console.log('🚀 Content-Length:', req.get('Content-Length')); - console.log('🚀 Authorization header present:', !!req.get('Authorization')); - console.log('🚀 User from token:', req.user?.uid || 'NOT_FOUND'); - -
// Debug body in detail - console.log('🚀 Has body:', !!req.body); - console.log('🚀 Body type:', typeof req.body); - console.log('🚀 Body constructor:', req.body?.constructor?.name); - console.log('🚀 Body length:', req.body?.length || 0); - console.log('🚀 Is Buffer?:', Buffer.isBuffer(req.body)); - - // Debug all headers - console.log('🚀 All headers:', JSON.stringify(req.headers, null, 2)); - - // Debug request properties - console.log('🚀 Request readable:', req.readable); - console.log('🚀 Request complete:', req.complete); - - // If body exists, show first few bytes - if (req.body && req.body.length > 0) { - const preview = req.body.slice(0, 100).toString('hex'); - console.log('🚀 Body preview (hex):', preview); - - // Try to see if it contains multipart boundary - const bodyStr = req.body.toString('utf8', 0, Math.min(500, req.body.length)); - console.log('🚀 Body preview (string):', bodyStr.substring(0, 200)); - } - - console.log('🚀 ========================='); - - try { - const userId = req.user?.uid; - if (!userId) { - console.log('❌ Authentication failed - no userId'); - res.status(401).json({ - error: 'User not authenticated', - correlationId: req.correlationId - }); - return; - } - - console.log('✅ Authentication successful for user:', userId); - // Get raw body buffer for Document AI processing - const rawBody = req.body; - if (!rawBody || rawBody.length === 0) { - res.status(400).json({ - error: 'No file data received', - correlationId: req.correlationId, - debug: { - method: req.method, - contentType: req.get('Content-Type'), - contentLength: req.get('Content-Length'), - hasRawBody: !!rawBody, - rawBodySize: rawBody?.length || 0, - bodyType: typeof rawBody - } - }); - return; - } - - console.log('✅ Found raw body buffer:', rawBody.length, 'bytes'); - - // Create document record first - const document = await DocumentModel.create({ - user_id: userId, - original_file_name: 'uploaded-document.pdf', - file_path: '', - file_size: rawBody.length, - status: 
'processing_llm' - }); - - console.log('✅ Document record created:', document.id); - - // Process with Document AI directly - const { DocumentAiGenkitProcessor } = await import('../services/documentAiGenkitProcessor'); - const processor = new DocumentAiGenkitProcessor(); - - console.log('✅ Starting Document AI processing...'); - const result = await processor.processDocument( - document.id, - userId, - rawBody, - 'uploaded-document.pdf', - 'application/pdf' - ); - - if (result.success) { - await DocumentModel.updateById(document.id, { - status: 'completed', - generated_summary: result.content, - processing_completed_at: new Date() - }); - - console.log('✅ Document AI processing completed successfully'); - - res.status(201).json({ - id: document.id, - name: 'uploaded-document.pdf', - originalName: 'uploaded-document.pdf', - status: 'completed', - uploadedAt: document.created_at, - uploadedBy: userId, - fileSize: rawBody.length, - summary: result.content, - correlationId: req.correlationId || undefined - }); - return; - } else { - console.log('❌ Document AI processing failed:', result.error); - await DocumentModel.updateById(document.id, { - status: 'failed', - error_message: result.error - }); - - res.status(500).json({ - error: 'Document processing failed', - message: result.error, - correlationId: req.correlationId || undefined - }); - return; - } - - } catch (error) { - console.log('❌ Upload error:', error); - - logger.error('Upload document failed', { - error, - correlationId: req.correlationId - }); - - res.status(500).json({ - error: 'Upload failed', - message: error instanceof Error ? 
error.message : 'Unknown error', - correlationId: req.correlationId || undefined - }); - } - }, async getDocuments(req: Request, res: Response): Promise<void> { try { diff --git a/backend/src/middleware/__tests__/upload.test.ts b/backend/src/middleware/__tests__/upload.test.ts deleted file mode 100644 index c76c59f..0000000 --- a/backend/src/middleware/__tests__/upload.test.ts +++ /dev/null @@ -1,189 +0,0 @@ -import { Request, Response, NextFunction } from 'express'; -import multer from 'multer'; -import fs from 'fs'; -import { handleFileUpload, handleUploadError, cleanupUploadedFile, getFileInfo } from '../upload'; - -// Mock the logger -jest.mock('../../utils/logger', () => ({ - logger: { - info: jest.fn(), - warn: jest.fn(), - error: jest.fn(), - }, -})); - -// Mock fs -jest.mock('fs', () => ({ - existsSync: jest.fn(), - mkdirSync: jest.fn(), -})); - -describe('Upload Middleware', () => { - let mockReq: Partial<Request>; - let mockRes: Partial<Response>; - let mockNext: NextFunction; - - beforeEach(() => { - mockReq = { - ip: '127.0.0.1', - } as any; - mockRes = { - status: jest.fn().mockReturnThis(), - json: jest.fn(), - }; - mockNext = jest.fn(); - - // Reset mocks - jest.clearAllMocks(); - }); - - describe('handleUploadError', () => { - it('should handle LIMIT_FILE_SIZE error', () => { - const error = new multer.MulterError('LIMIT_FILE_SIZE', 'document'); - error.code = 'LIMIT_FILE_SIZE'; - - handleUploadError(error, mockReq as Request, mockRes as Response, mockNext); - - expect(mockRes.status).toHaveBeenCalledWith(400); - expect(mockRes.json).toHaveBeenCalledWith({ - success: false, - error: 'File too large', - message: expect.stringContaining('File size must be less than'), - }); - }); - - it('should handle LIMIT_FILE_COUNT error', () => { - const error = new multer.MulterError('LIMIT_FILE_COUNT', 'document'); - error.code = 'LIMIT_FILE_COUNT'; - - handleUploadError(error, mockReq as Request, mockRes as Response, mockNext); - - expect(mockRes.status).toHaveBeenCalledWith(400); -
expect(mockRes.json).toHaveBeenCalledWith({ - success: false, - error: 'Too many files', - message: 'Only one file can be uploaded at a time', - }); - }); - - it('should handle LIMIT_UNEXPECTED_FILE error', () => { - const error = new multer.MulterError('LIMIT_UNEXPECTED_FILE', 'document'); - error.code = 'LIMIT_UNEXPECTED_FILE'; - - handleUploadError(error, mockReq as Request, mockRes as Response, mockNext); - - expect(mockRes.status).toHaveBeenCalledWith(400); - expect(mockRes.json).toHaveBeenCalledWith({ - success: false, - error: 'Unexpected file field', - message: 'File must be uploaded using the correct field name', - }); - }); - - it('should handle generic multer errors', () => { - const error = new multer.MulterError('LIMIT_FILE_SIZE', 'document'); - error.code = 'LIMIT_FILE_SIZE'; - - handleUploadError(error, mockReq as Request, mockRes as Response, mockNext); - - expect(mockRes.status).toHaveBeenCalledWith(400); - expect(mockRes.json).toHaveBeenCalledWith({ - success: false, - error: 'File too large', - message: expect.stringContaining('File size must be less than'), - }); - }); - - it('should handle non-multer errors', () => { - const error = new Error('Custom upload error'); - - handleUploadError(error, mockReq as Request, mockRes as Response, mockNext); - - expect(mockRes.status).toHaveBeenCalledWith(400); - expect(mockRes.json).toHaveBeenCalledWith({ - success: false, - error: 'File upload failed', - message: 'Custom upload error', - }); - }); - - it('should call next when no error', () => { - handleUploadError(null, mockReq as Request, mockRes as Response, mockNext); - - expect(mockNext).toHaveBeenCalled(); - expect(mockRes.status).not.toHaveBeenCalled(); - expect(mockRes.json).not.toHaveBeenCalled(); - }); - }); - - describe('cleanupUploadedFile', () => { - it('should delete existing file', () => { - const filePath = '/test/path/file.pdf'; - const mockUnlinkSync = jest.fn(); - - (fs.existsSync as jest.Mock).mockReturnValue(true); - (fs.unlinkSync as 
jest.Mock) = mockUnlinkSync; - - cleanupUploadedFile(filePath); - - expect(fs.existsSync).toHaveBeenCalledWith(filePath); - expect(mockUnlinkSync).toHaveBeenCalledWith(filePath); - }); - - it('should not delete non-existent file', () => { - const filePath = '/test/path/file.pdf'; - const mockUnlinkSync = jest.fn(); - - (fs.existsSync as jest.Mock).mockReturnValue(false); - (fs.unlinkSync as jest.Mock) = mockUnlinkSync; - - cleanupUploadedFile(filePath); - - expect(fs.existsSync).toHaveBeenCalledWith(filePath); - expect(mockUnlinkSync).not.toHaveBeenCalled(); - }); - - it('should handle deletion errors gracefully', () => { - const filePath = '/test/path/file.pdf'; - const mockUnlinkSync = jest.fn().mockImplementation(() => { - throw new Error('Permission denied'); - }); - - (fs.existsSync as jest.Mock).mockReturnValue(true); - (fs.unlinkSync as jest.Mock) = mockUnlinkSync; - - // Should not throw error - expect(() => cleanupUploadedFile(filePath)).not.toThrow(); - }); - }); - - describe('getFileInfo', () => { - it('should return correct file info', () => { - const mockFile = { - originalname: 'test-document.pdf', - filename: '1234567890-abc123.pdf', - path: '/uploads/test-user-id/1234567890-abc123.pdf', - size: 1024, - mimetype: 'application/pdf', - }; - - const fileInfo = getFileInfo(mockFile as any); - - expect(fileInfo).toEqual({ - originalName: 'test-document.pdf', - filename: '1234567890-abc123.pdf', - path: '/uploads/test-user-id/1234567890-abc123.pdf', - size: 1024, - mimetype: 'application/pdf', - uploadedAt: expect.any(Date), - }); - }); - }); - - describe('handleFileUpload middleware', () => { - it('should be an array with uploadMiddleware and handleUploadError', () => { - expect(Array.isArray(handleFileUpload)).toBe(true); - expect(handleFileUpload).toHaveLength(2); - }); - }); -}); \ No newline at end of file diff --git a/backend/src/middleware/errorHandler.ts b/backend/src/middleware/errorHandler.ts index 7c90bf6..85db2ad 100644 --- 
a/backend/src/middleware/errorHandler.ts +++ b/backend/src/middleware/errorHandler.ts @@ -65,12 +65,7 @@ export const errorHandler = ( error = { message, statusCode: 401 } as AppError; } - // Multer errors (check if multer is imported anywhere) - if (err.name === 'MulterError' || (err as any).code === 'UNEXPECTED_END_OF_FORM') { - console.log('🚨 MULTER ERROR CAUGHT:', err.message); - const message = `File upload failed: ${err.message}`; - error = { message, statusCode: 400 } as AppError; - } + // Default error const statusCode = error.statusCode || 500; diff --git a/backend/src/middleware/upload.ts b/backend/src/middleware/upload.ts deleted file mode 100644 index 1c54a73..0000000 --- a/backend/src/middleware/upload.ts +++ /dev/null @@ -1,212 +0,0 @@ -import multer from 'multer'; -import path from 'path'; -import fs from 'fs'; -import { Request, Response, NextFunction } from 'express'; -import { config } from '../config/env'; -import { logger } from '../utils/logger'; - -// Use temporary directory for file uploads (files will be immediately moved to GCS) -const uploadDir = '/tmp/uploads'; -if (!fs.existsSync(uploadDir)) { - fs.mkdirSync(uploadDir, { recursive: true }); -} - -// File filter function -const fileFilter = (req: Request, file: any, cb: multer.FileFilterCallback) => { - console.log('🔍 ===== FILE FILTER CALLED ====='); - console.log('🔍 File originalname:', file.originalname); - console.log('🔍 File mimetype:', file.mimetype); - console.log('🔍 File size:', file.size); - console.log('🔍 File encoding:', file.encoding); - console.log('🔍 File fieldname:', file.fieldname); - console.log('🔍 Request Content-Type:', req.get('Content-Type')); - console.log('🔍 Request Content-Length:', req.get('Content-Length')); - console.log('🔍 ==========================='); - - // Check file type - allow PDF and text files for testing - const allowedTypes = ['application/pdf', 'text/plain', 'text/html']; - if (!allowedTypes.includes(file.mimetype)) { - const error = new Error(`File 
type ${file.mimetype} is not allowed. Only PDF and text files are accepted.`); - console.log('❌ File rejected - invalid type:', file.mimetype); - logger.warn(`File upload rejected - invalid type: ${file.mimetype}`, { - originalName: file.originalname, - size: file.size, - ip: req.ip, - }); - return cb(error); - } - - // Check file extension - allow PDF and text extensions for testing - const ext = path.extname(file.originalname).toLowerCase(); - if (!['.pdf', '.txt', '.html'].includes(ext)) { - const error = new Error(`File extension ${ext} is not allowed. Only .pdf, .txt, and .html files are accepted.`); - console.log('❌ File rejected - invalid extension:', ext); - logger.warn(`File upload rejected - invalid extension: ${ext}`, { - originalName: file.originalname, - size: file.size, - ip: req.ip, - }); - return cb(error); - } - - console.log('✅ File accepted:', file.originalname); - logger.info(`File upload accepted: ${file.originalname}`, { - originalName: file.originalname, - size: file.size, - mimetype: file.mimetype, - ip: req.ip, - }); - cb(null, true); -}; - -// Storage configuration - use memory storage for immediate GCS upload -const storage = multer.memoryStorage(); - -// Create multer instance -const upload = multer({ - storage, - fileFilter, - limits: { - fileSize: config.upload.maxFileSize, // 100MB default - files: 1, // Only allow 1 file per request - }, -}); - -// Error handling middleware for multer -export const handleUploadError = (error: any, req: Request, res: Response, next: NextFunction): void => { - console.log('🚨 ============================='); - console.log('🚨 UPLOAD ERROR HANDLER CALLED'); - console.log('🚨 Error type:', error?.constructor?.name); - console.log('🚨 Error message:', error?.message); - console.log('🚨 Error code:', error?.code); - console.log('🚨 Is MulterError:', error instanceof multer.MulterError); - console.log('🚨 ============================='); - - if (error instanceof multer.MulterError) { - logger.error('Multer error 
during file upload:', { - error: error.message, - code: error.code, - field: error.field, - originalName: req.file?.originalname, - ip: req.ip, - }); - - switch (error.code) { - case 'LIMIT_FILE_SIZE': - res.status(400).json({ - success: false, - error: 'File too large', - message: `File size must be less than ${config.upload.maxFileSize / (1024 * 1024)}MB`, - }); - return; - case 'LIMIT_FILE_COUNT': - res.status(400).json({ - success: false, - error: 'Too many files', - message: 'Only one file can be uploaded at a time', - }); - return; - case 'LIMIT_UNEXPECTED_FILE': - res.status(400).json({ - success: false, - error: 'Unexpected file field', - message: 'File must be uploaded using the correct field name', - }); - return; - default: - res.status(400).json({ - success: false, - error: 'File upload error', - message: error.message, - }); - return; - } - } - - if (error) { - logger.error('File upload error:', { - error: error.message, - originalName: req.file?.originalname, - ip: req.ip, - }); - - res.status(400).json({ - success: false, - error: 'File upload failed', - message: error.message, - }); - return; - } - - next(); -}; - -// Main upload middleware with timeout handling -export const uploadMiddleware = (req: Request, res: Response, next: NextFunction) => { - console.log('📤 ============================='); - console.log('📤 UPLOAD MIDDLEWARE CALLED'); - console.log('📤 Request method:', req.method); - console.log('📤 Request URL:', req.url); - console.log('📤 Content-Type:', req.get('Content-Type')); - console.log('📤 Content-Length:', req.get('Content-Length')); - console.log('📤 User-Agent:', req.get('User-Agent')); - console.log('📤 ============================='); - - // Set a timeout for the upload - const uploadTimeout = setTimeout(() => { - logger.error('Upload timeout for request:', { - ip: req.ip, - userAgent: req.get('User-Agent'), - }); - res.status(408).json({ - success: false, - error: 'Upload timeout', - message: 'Upload took too long to complete', - 
}); - }, 300000); // 5 minutes timeout - - // Clear timeout on successful upload - const originalNext = next; - next = (err?: any) => { - clearTimeout(uploadTimeout); - if (err) { - console.log('❌ Upload middleware error:', err); - console.log('❌ Error details:', { - name: err.name, - message: err.message, - code: err.code, - stack: err.stack?.split('\n')[0] - }); - } else { - console.log('✅ Upload middleware completed successfully'); - console.log('✅ File after multer processing:', { - hasFile: !!req.file, - filename: req.file?.originalname, - size: req.file?.size, - mimetype: req.file?.mimetype - }); - } - originalNext(err); - }; - - console.log('🔄 Calling multer.single("document")...'); - upload.single('document')(req, res, next); -}; - -// Combined middleware for file uploads -export const handleFileUpload = [ - uploadMiddleware, - handleUploadError, -]; - -// Utility function to get file info from memory buffer -export const getFileInfo = (file: any) => { - return { - originalName: file.originalname, - filename: file.originalname, // Use original name since we're not saving to disk - buffer: file.buffer, // File buffer for GCS upload - size: file.size, - mimetype: file.mimetype, - uploadedAt: new Date(), - }; -}; \ No newline at end of file diff --git a/backend/src/routes/documents.ts b/backend/src/routes/documents.ts index 45bdc53..74c8777 100644 --- a/backend/src/routes/documents.ts +++ b/backend/src/routes/documents.ts @@ -4,7 +4,6 @@ import { documentController } from '../controllers/documentController'; import { unifiedDocumentProcessor } from '../services/unifiedDocumentProcessor'; import { logger } from '../utils/logger'; import { config } from '../config/env'; -import { handleFileUpload } from '../middleware/upload'; import { DocumentModel } from '../models/DocumentModel'; import { validateUUID, addCorrelationId } from '../middleware/validation'; @@ -79,13 +78,11 @@ router.get('/processing-stats', async (req, res) => { } }); -// NEW Firebase Storage 
direct upload routes +// Firebase Storage direct upload routes router.post('/upload-url', documentController.getUploadUrl); router.post('/:id/confirm-upload', validateUUID('id'), documentController.confirmUpload); -// LEGACY multipart upload routes (keeping for backward compatibility) -router.post('/upload', handleFileUpload, documentController.uploadDocument); -router.post('/', handleFileUpload, documentController.uploadDocument); +// Document listing route router.get('/', documentController.getDocuments); // Document-specific routes with UUID validation diff --git a/backend/src/routes/vector.ts b/backend/src/routes/vector.ts index 126a162..64dfaf7 100644 --- a/backend/src/routes/vector.ts +++ b/backend/src/routes/vector.ts @@ -1,72 +1,9 @@ import { Router } from 'express'; -import { vectorDocumentProcessor } from '../services/vectorDocumentProcessor'; import { VectorDatabaseModel } from '../models/VectorDatabaseModel'; import { logger } from '../utils/logger'; const router = Router(); -// Extend VectorDocumentProcessor with missing methods -const extendedVectorProcessor = { - ...vectorDocumentProcessor, - - async findSimilarDocuments( - documentId: string, - limit: number, - similarityThreshold: number - ) { - // Implementation for finding similar documents - const chunks = await VectorDatabaseModel.getDocumentChunks(documentId); - // For now, return a basic implementation - return chunks.slice(0, limit).map(chunk => ({ - ...chunk, - similarity: Math.random() * (1 - similarityThreshold) + similarityThreshold - })); - }, - - async searchByIndustry( - industry: string, - query: string, - limit: number - ) { - // Implementation for industry search - const allChunks = await VectorDatabaseModel.getAllChunks(); - return allChunks - .filter(chunk => - chunk.content.toLowerCase().includes(industry.toLowerCase()) || - chunk.content.toLowerCase().includes(query.toLowerCase()) - ) - .slice(0, limit); - }, - - async processCIMSections( - documentId: string, - cimData: any, - 
metadata: any - ) { - // Implementation for processing CIM sections - const chunks = await VectorDatabaseModel.getDocumentChunks(documentId); - return { - documentId, - processedSections: chunks.length, - metadata, - cimData - }; - }, - - async getVectorDatabaseStats() { - // Implementation for getting vector database stats - const totalChunks = await VectorDatabaseModel.getTotalChunkCount(); - return { - totalChunks, - totalDocuments: await VectorDatabaseModel.getTotalDocumentCount(), - averageChunkSize: await VectorDatabaseModel.getAverageChunkSize() - }; - } -}; - -// DISABLED: All vector processing routes have been disabled -// Only read-only endpoints for monitoring and analytics are kept - /** * GET /api/vector/document-chunks/:documentId * Get document chunks for a specific document (read-only) @@ -115,7 +52,11 @@ router.get('/analytics', async (req, res) => { */ router.get('/stats', async (_req, res) => { try { - const stats = await extendedVectorProcessor.getVectorDatabaseStats(); + const stats = { + totalChunks: await VectorDatabaseModel.getTotalChunkCount(), + totalDocuments: await VectorDatabaseModel.getTotalDocumentCount(), + averageChunkSize: await VectorDatabaseModel.getAverageChunkSize() + }; return res.json({ stats }); } catch (error) { diff --git a/backend/src/services/__tests__/agenticRAGProcessor.test.ts b/backend/src/services/__tests__/agenticRAGProcessor.test.ts deleted file mode 100644 index 04d8c7e..0000000 --- a/backend/src/services/__tests__/agenticRAGProcessor.test.ts +++ /dev/null @@ -1,523 +0,0 @@ -import { agenticRAGProcessor } from '../agenticRAGProcessor'; -import { llmService } from '../llmService'; -import { AgentExecutionModel, AgenticRAGSessionModel, QualityMetricsModel } from '../../models/AgenticRAGModels'; -import { config } from '../../config/env'; -import { QualityMetrics } from '../../models/agenticTypes'; - -// Mock dependencies -jest.mock('../llmService'); -jest.mock('../../models/AgenticRAGModels'); 
-jest.mock('../../config/env'); -jest.mock('../../utils/logger'); - -const mockLLMService = llmService as jest.Mocked<typeof llmService>; -const mockAgentExecutionModel = AgentExecutionModel as jest.Mocked<typeof AgentExecutionModel>; -const mockAgenticRAGSessionModel = AgenticRAGSessionModel as jest.Mocked<typeof AgenticRAGSessionModel>; -const mockQualityMetricsModel = QualityMetricsModel as jest.Mocked<typeof QualityMetricsModel>; - -describe('AgenticRAGProcessor', () => { - let processor: any; - - beforeEach(() => { - jest.clearAllMocks(); - - // Mock config - Object.assign(config, { - agenticRag: { - enabled: true, - maxAgents: 6, - parallelProcessing: true, - validationStrict: true, - retryAttempts: 3, - timeoutPerAgent: 60000, - }, - agentSpecific: { - documentUnderstandingEnabled: true, - financialAnalysisEnabled: true, - marketAnalysisEnabled: true, - investmentThesisEnabled: true, - synthesisEnabled: true, - validationEnabled: true, - }, - llm: { - maxTokens: 3000, - temperature: 0.1, - }, - }); - - // Mock successful LLM responses using the public method - mockLLMService.processCIMDocument.mockResolvedValue({ - success: true, - jsonOutput: createMockAgentResponse('document_understanding'), - model: 'claude-3-opus-20240229', - cost: 0.50, - inputTokens: 1000, - outputTokens: 500, - }); - - // Mock database operations - mockAgenticRAGSessionModel.create.mockResolvedValue(createMockSession()); - mockAgenticRAGSessionModel.update.mockResolvedValue(createMockSession()); - mockAgentExecutionModel.create.mockResolvedValue(createMockExecution()); - mockAgentExecutionModel.update.mockResolvedValue(createMockExecution()); - mockAgentExecutionModel.getBySessionId.mockResolvedValue([createMockExecution()]); - mockQualityMetricsModel.create.mockResolvedValue(createMockQualityMetric()); - - processor = agenticRAGProcessor; - }); - - describe('processDocument', () => { - it('should successfully process document with all agents', async () => { - // Arrange - const documentText = loadTestDocument(); - const documentId = 'test-doc-123'; - const userId = 'test-user-123'; -
// Mock successful agent responses for all steps - mockLLMService.processCIMDocument - .mockResolvedValueOnce({ - success: true, - jsonOutput: createMockAgentResponse('document_understanding'), - model: 'claude-3-opus-20240229', - cost: 0.50, - inputTokens: 1000, - outputTokens: 500, - }) - .mockResolvedValueOnce({ - success: true, - jsonOutput: createMockAgentResponse('financial_analysis'), - model: 'claude-3-opus-20240229', - cost: 0.50, - inputTokens: 1000, - outputTokens: 500, - }) - .mockResolvedValueOnce({ - success: true, - jsonOutput: createMockAgentResponse('market_analysis'), - model: 'claude-3-opus-20240229', - cost: 0.50, - inputTokens: 1000, - outputTokens: 500, - }) - .mockResolvedValueOnce({ - success: true, - jsonOutput: createMockAgentResponse('investment_thesis'), - model: 'claude-3-opus-20240229', - cost: 0.50, - inputTokens: 1000, - outputTokens: 500, - }) - .mockResolvedValueOnce({ - success: true, - jsonOutput: createMockAgentResponse('synthesis'), - model: 'claude-3-opus-20240229', - cost: 0.50, - inputTokens: 1000, - outputTokens: 500, - }) - .mockResolvedValueOnce({ - success: true, - jsonOutput: createMockAgentResponse('validation'), - model: 'claude-3-opus-20240229', - cost: 0.50, - inputTokens: 1000, - outputTokens: 500, - }); - - // Act - const result = await processor.processDocument(documentText, documentId, userId); - - // Assert - expect(result.success).toBe(true); - expect(result.reasoningSteps).toBeDefined(); - expect(result.qualityMetrics).toBeDefined(); - expect(result.processingTime).toBeGreaterThan(0); - expect(result.sessionId).toBeDefined(); - expect(result.error).toBeUndefined(); - - // Verify session was created and updated - expect(mockAgenticRAGSessionModel.create).toHaveBeenCalledWith( - expect.objectContaining({ - documentId, - userId, - strategy: 'agentic_rag', - status: 'pending', - totalAgents: 6, - }) - ); - - // Verify all agents were executed - expect(mockLLMService.processCIMDocument).toHaveBeenCalledTimes(6); - 
}); - - it('should handle agent failures gracefully', async () => { - // Arrange - const documentText = loadTestDocument(); - const documentId = 'test-doc-123'; - const userId = 'test-user-123'; - - // Mock one agent failure - mockLLMService.processCIMDocument - .mockResolvedValueOnce({ - success: true, - jsonOutput: createMockAgentResponse('document_understanding'), - model: 'claude-3-opus-20240229', - cost: 0.50, - inputTokens: 1000, - outputTokens: 500, - }) - .mockRejectedValueOnce(new Error('Financial analysis failed')); - - // Act - const result = await processor.processDocument(documentText, documentId, userId); - - // Assert - expect(result.success).toBe(false); - expect(result.error).toContain('Financial analysis failed'); - expect(result.reasoningSteps).toBeDefined(); - expect(result.sessionId).toBeDefined(); - - // Verify session was marked as failed - expect(mockAgenticRAGSessionModel.update).toHaveBeenCalledWith( - expect.any(String), - expect.objectContaining({ - status: 'failed', - }) - ); - }); - - it('should retry failed agents according to retry strategy', async () => { - // Arrange - const documentText = loadTestDocument(); - const documentId = 'test-doc-123'; - const userId = 'test-user-123'; - - // Mock agent that fails twice then succeeds - mockLLMService.processCIMDocument - .mockRejectedValueOnce(new Error('Temporary failure')) - .mockRejectedValueOnce(new Error('Temporary failure')) - .mockResolvedValueOnce({ - success: true, - jsonOutput: createMockAgentResponse('document_understanding'), - model: 'claude-3-opus-20240229', - cost: 0.50, - inputTokens: 1000, - outputTokens: 500, - }); - - // Act - const result = await processor.processDocument(documentText, documentId, userId); - - // Assert - expect(mockLLMService.processCIMDocument).toHaveBeenCalledTimes(3); - expect(result.success).toBe(true); - }); - - it('should assess quality metrics correctly', async () => { - // Arrange - const documentText = loadTestDocument(); - const documentId = 
'test-doc-123'; - const userId = 'test-user-123'; - - // Mock successful processing - mockLLMService.processCIMDocument.mockResolvedValue({ - success: true, - jsonOutput: createMockAgentResponse('document_understanding'), - model: 'claude-3-opus-20240229', - cost: 0.50, - inputTokens: 1000, - outputTokens: 500, - }); - - // Act - const result = await processor.processDocument(documentText, documentId, userId); - - // Assert - expect(result.qualityMetrics).toBeDefined(); - expect(result.qualityMetrics.length).toBeGreaterThan(0); - expect(result.qualityMetrics.every((m: QualityMetrics) => m.metricValue >= 0 && m.metricValue <= 1)).toBe(true); - }); - - it('should handle circuit breaker pattern', async () => { - // Arrange - const documentText = loadTestDocument(); - const documentId = 'test-doc-123'; - const userId = 'test-user-123'; - - // Mock repeated failures to trigger circuit breaker - mockLLMService.processCIMDocument.mockRejectedValue(new Error('Service unavailable')); - - // Act - const result = await processor.processDocument(documentText, documentId, userId); - - // Assert - expect(result.success).toBe(false); - expect(result.error).toContain('Service unavailable'); - }); - - it('should track API calls and costs', async () => { - // Arrange - const documentText = loadTestDocument(); - const documentId = 'test-doc-123'; - const userId = 'test-user-123'; - - // Mock successful processing - mockLLMService.processCIMDocument.mockResolvedValue({ - success: true, - jsonOutput: createMockAgentResponse('document_understanding'), - model: 'claude-3-opus-20240229', - cost: 0.50, - inputTokens: 1000, - outputTokens: 500, - }); - - // Act - const result = await processor.processDocument(documentText, documentId, userId); - - // Assert - expect(result.apiCalls).toBeGreaterThan(0); - expect(result.totalCost).toBeDefined(); - }); - }); - - describe('error handling', () => { - it('should handle database errors gracefully', async () => { - // Arrange - const documentText = 
loadTestDocument(); - const documentId = 'test-doc-123'; - const userId = 'test-user-123'; - - mockAgenticRAGSessionModel.create.mockRejectedValue(new Error('Database connection failed')); - - // Act - const result = await processor.processDocument(documentText, documentId, userId); - - // Assert - expect(result.success).toBe(false); - expect(result.error).toContain('Database connection failed'); - }); - - it('should handle invalid JSON responses', async () => { - // Arrange - const documentText = loadTestDocument(); - const documentId = 'test-doc-123'; - const userId = 'test-user-123'; - - mockLLMService.processCIMDocument.mockResolvedValue({ - success: false, - error: 'Invalid JSON response', - model: 'claude-3-opus-20240229', - cost: 0.50, - inputTokens: 1000, - outputTokens: 500, - }); - - // Act - const result = await processor.processDocument(documentText, documentId, userId); - - // Assert - expect(result.success).toBe(false); - expect(result.error).toContain('Failed to parse JSON'); - }); - }); - - describe('configuration', () => { - it('should respect agent-specific configuration', async () => { - // Arrange - const documentText = loadTestDocument(); - const documentId = 'test-doc-123'; - const userId = 'test-user-123'; - - // Disable some agents - (config as any).agentSpecific.financialAnalysisEnabled = false; - (config as any).agentSpecific.marketAnalysisEnabled = false; - - mockLLMService.processCIMDocument.mockResolvedValue({ - success: true, - jsonOutput: createMockAgentResponse('document_understanding'), - model: 'claude-3-opus-20240229', - cost: 0.50, - inputTokens: 1000, - outputTokens: 500, - }); - - // Act - const result = await processor.processDocument(documentText, documentId, userId); - - // Assert - // Should still work with enabled agents - expect(result.success).toBeDefined(); - }); - }); -}); - -// Helper functions -function createMockAgentResponse(agentName: string): any { - const responses: Record<string, any> = {
companyOverview: { - name: 'Test Company', - industry: 'Technology', - location: 'San Francisco, CA', - founded: '2010', - employees: '500' - }, - documentStructure: { - sections: ['Executive Summary', 'Financial Analysis', 'Market Analysis'], - pageCount: 50, - keyTopics: ['Financial Performance', 'Market Position', 'Growth Strategy'] - }, - financialHighlights: { - revenue: '$100M', - ebitda: '$20M', - growth: '15%', - margins: '20%' - } - }, - financial_analysis: { - historicalPerformance: { - revenue: ['$80M', '$90M', '$100M'], - ebitda: ['$15M', '$18M', '$20M'], - margins: ['18%', '20%', '20%'] - }, - qualityOfEarnings: 'High', - workingCapital: 'Positive', - cashFlow: 'Strong' - }, - market_analysis: { - marketSize: '$10B', - growthRate: '8%', - competitors: ['Competitor A', 'Competitor B'], - barriersToEntry: 'High', - competitiveAdvantages: ['Technology', 'Brand', 'Scale'] - }, - investment_thesis: { - keyAttractions: ['Strong growth', 'Market leadership', 'Technology advantage'], - potentialRisks: ['Market competition', 'Regulatory changes'], - valueCreation: ['Operational improvements', 'Market expansion'], - recommendation: 'Proceed with diligence' - }, - synthesis: { - dealOverview: { - targetCompanyName: 'Test Company', - industrySector: 'Technology', - geography: 'San Francisco, CA' - }, - financialSummary: { - financials: { - ltm: { - revenue: '$100M', - ebitda: '$20M' - } - } - }, - preliminaryInvestmentThesis: { - keyAttractions: ['Strong growth', 'Market leadership'], - potentialRisks: ['Market competition'] - } - }, - validation: { - isValid: true, - issues: [], - completeness: '95%', - quality: 'high' - } - }; - - return responses[agentName] || {}; -} - -function createMockSession(): any { - return { - id: 'session-123', - documentId: 'doc-123', - userId: 'user-123', - strategy: 'agentic_rag', - status: 'completed', - totalAgents: 6, - completedAgents: 6, - failedAgents: 0, - overallValidationScore: 0.9, - processingTimeMs: 120000, - 
apiCallsCount: 6, - totalCost: 2.50, - reasoningSteps: [], - finalResult: {}, - createdAt: new Date(), - completedAt: new Date() - }; -} - -function createMockExecution(): any { - return { - id: 'execution-123', - documentId: 'doc-123', - sessionId: 'session-123', - agentName: 'document_understanding', - stepNumber: 1, - status: 'completed', - inputData: {}, - outputData: createMockAgentResponse('document_understanding'), - validationResult: true, - processingTimeMs: 20000, - errorMessage: null, - retryCount: 0, - createdAt: new Date(), - updatedAt: new Date() - }; -} - -function createMockQualityMetric(): any { - return { - id: 'metric-123', - documentId: 'doc-123', - sessionId: 'session-123', - metricType: 'completeness', - metricValue: 0.9, - metricDetails: { - requiredSections: 7, - presentSections: 6, - missingSections: ['managementTeamOverview'] - }, - createdAt: new Date() - }; -} - -function loadTestDocument(): string { - // Mock document content for testing - return ` - CONFIDENTIAL INVESTMENT MEMORANDUM - - Test Company, Inc. - - Executive Summary - Test Company is a leading technology company with strong financial performance and market position. - - Financial Performance - - Revenue: $100M (2023) - - EBITDA: $20M (2023) - - Growth Rate: 15% annually - - Market Position - - Market Size: $10B - - Market Share: 5% - - Competitive Advantages: Technology, Brand, Scale - - Management Team - - CEO: John Smith (10+ years experience) - - CFO: Jane Doe (15+ years experience) - - Investment Opportunity - - Strong growth potential - - Market leadership position - - Technology advantage - - Experienced management team - - Risks and Considerations - - Market competition - - Regulatory changes - - Technology disruption - - This memorandum contains confidential information and is for internal use only. 
-  `;
-}
\ No newline at end of file
diff --git a/backend/src/services/__tests__/documentProcessingService.test.ts b/backend/src/services/__tests__/documentProcessingService.test.ts
deleted file mode 100644
index 1c42d95..0000000
--- a/backend/src/services/__tests__/documentProcessingService.test.ts
+++ /dev/null
@@ -1,435 +0,0 @@
-import { documentProcessingService } from '../documentProcessingService';
-import { DocumentModel } from '../../models/DocumentModel';
-import { ProcessingJobModel } from '../../models/ProcessingJobModel';
-import { fileStorageService } from '../fileStorageService';
-import { llmService } from '../llmService';
-import { pdfGenerationService } from '../pdfGenerationService';
-import { config } from '../../config/env';
-import fs from 'fs';
-import path from 'path';
-
-// Mock dependencies
-jest.mock('../../models/DocumentModel');
-jest.mock('../../models/ProcessingJobModel');
-jest.mock('../fileStorageService');
-jest.mock('../llmService');
-jest.mock('../pdfGenerationService');
-jest.mock('../../config/env');
-jest.mock('fs');
-jest.mock('path');
-
-const mockDocumentModel = DocumentModel as jest.Mocked<typeof DocumentModel>;
-const mockProcessingJobModel = ProcessingJobModel as jest.Mocked<typeof ProcessingJobModel>;
-const mockFileStorageService = fileStorageService as jest.Mocked<typeof fileStorageService>;
-const mockLlmService = llmService as jest.Mocked<typeof llmService>;
-const mockPdfGenerationService = pdfGenerationService as jest.Mocked<typeof pdfGenerationService>;
-
-// Mock CIM review data that matches the schema
-const mockCIMReviewData = {
-  dealOverview: {
-    targetCompanyName: 'Test Company',
-    industrySector: 'Technology',
-    geography: 'US',
-    dealSource: 'Investment Bank',
-    transactionType: 'Buyout',
-    dateCIMReceived: '2024-01-01',
-    dateReviewed: '2024-01-02',
-    reviewers: 'Test Reviewer',
-    cimPageCount: '50',
-    statedReasonForSale: 'Strategic exit'
-  },
-  businessDescription: {
-    coreOperationsSummary: 'Test operations',
-    keyProductsServices: 'Software solutions',
-    uniqueValueProposition: 'Market leader',
-    customerBaseOverview: {
-      keyCustomerSegments: 'Enterprise clients',
-      customerConcentrationRisk: 'Low',
-      typicalContractLength: '3 years'
-    },
-    keySupplierOverview: {
-      dependenceConcentrationRisk: 'Moderate'
-    }
-  },
-  marketIndustryAnalysis: {
-    estimatedMarketSize: '$1B',
-    estimatedMarketGrowthRate: '10%',
-    keyIndustryTrends: 'Digital transformation',
-    competitiveLandscape: {
-      keyCompetitors: 'Competitor A, B',
-      targetMarketPosition: '#2',
-      basisOfCompetition: 'Innovation'
-    },
-    barriersToEntry: 'High switching costs'
-  },
-  financialSummary: {
-    financials: {
-      fy3: {
-        revenue: '$10M',
-        revenueGrowth: '15%',
-        grossProfit: '$7M',
-        grossMargin: '70%',
-        ebitda: '$2M',
-        ebitdaMargin: '20%'
-      },
-      fy2: {
-        revenue: '$12M',
-        revenueGrowth: '20%',
-        grossProfit: '$8.4M',
-        grossMargin: '70%',
-        ebitda: '$2.4M',
-        ebitdaMargin: '20%'
-      },
-      fy1: {
-        revenue: '$15M',
-        revenueGrowth: '25%',
-        grossProfit: '$10.5M',
-        grossMargin: '70%',
-        ebitda: '$3M',
-        ebitdaMargin: '20%'
-      },
-      ltm: {
-        revenue: '$18M',
-        revenueGrowth: '20%',
-        grossProfit: '$12.6M',
-        grossMargin: '70%',
-        ebitda: '$3.6M',
-        ebitdaMargin: '20%'
-      }
-    },
-    qualityOfEarnings: 'High quality',
-    revenueGrowthDrivers: 'Market expansion',
-    marginStabilityAnalysis: 'Stable',
-    capitalExpenditures: '5%',
-    workingCapitalIntensity: 'Low',
-    freeCashFlowQuality: 'Strong'
-  },
-  managementTeamOverview: {
-    keyLeaders: 'CEO, CFO, CTO',
-    managementQualityAssessment: 'Experienced team',
-    postTransactionIntentions: 'Stay on board',
-    organizationalStructure: 'Flat structure'
-  },
-  preliminaryInvestmentThesis: {
-    keyAttractions: 'Market leader with strong growth',
-    potentialRisks: 'Market competition',
-    valueCreationLevers: 'Operational improvements',
-    alignmentWithFundStrategy: 'Strong fit'
-  },
-  keyQuestionsNextSteps: {
-    criticalQuestions: 'Market sustainability',
-    missingInformation: 'Customer references',
-    preliminaryRecommendation: 'Proceed',
-    rationaleForRecommendation: 'Strong fundamentals',
-    proposedNextSteps: 'Management presentation'
-  }
-};
-
-describe('DocumentProcessingService', () => {
-  const mockDocument = {
-    id: 'doc-123',
-    user_id: 'user-123',
-    original_file_name: 'test-document.pdf',
-    file_path: '/uploads/test-document.pdf',
-    file_size: 1024,
-    status: 'uploaded' as const,
-    uploaded_at: new Date(),
-    created_at: new Date(),
-    updated_at: new Date(),
-  };
-
-
-
-  beforeEach(() => {
-    jest.clearAllMocks();
-
-    // Mock config
-    (config as any).upload = {
-      uploadDir: '/test/uploads',
-    };
-    (config as any).llm = {
-      maxTokens: 4000,
-    };
-
-    // Mock fs
-    (fs.existsSync as jest.Mock).mockReturnValue(true);
-    (fs.mkdirSync as jest.Mock).mockImplementation(() => {});
-    (fs.writeFileSync as jest.Mock).mockImplementation(() => {});
-
-    // Mock path
-    (path.join as jest.Mock).mockImplementation((...args) => args.join('/'));
-    (path.dirname as jest.Mock).mockReturnValue('/test/uploads/summaries');
-  });
-
-  describe('processDocument', () => {
-    it('should process a document successfully', async () => {
-      // Mock document model
-      mockDocumentModel.findById.mockResolvedValue(mockDocument);
-      mockDocumentModel.updateStatus.mockResolvedValue(mockDocument);
-
-      // Mock file storage service
-      mockFileStorageService.getFile.mockResolvedValue(Buffer.from('mock pdf content'));
-      mockFileStorageService.fileExists.mockResolvedValue(true);
-
-      // Mock processing job model
-      mockProcessingJobModel.create.mockResolvedValue({} as any);
-      mockProcessingJobModel.updateStatus.mockResolvedValue({} as any);
-
-      // Mock LLM service
-      // Remove estimateTokenCount mock - it's a private method
-      mockLlmService.processCIMDocument.mockResolvedValue({
-        success: true,
-        jsonOutput: mockCIMReviewData,
-        model: 'test-model',
-        cost: 0.01,
-        inputTokens: 1000,
-        outputTokens: 500
-      });
-
-      // Mock PDF generation service
-      mockPdfGenerationService.generatePDFFromMarkdown.mockResolvedValue(true);
-
-      const result = await documentProcessingService.processDocument(
-        'doc-123',
-        'user-123'
-      );
-
-      expect(result.success).toBe(true);
-      expect(result.documentId).toBe('doc-123');
-      expect(result.jobId).toBeDefined();
-      expect(result.steps).toHaveLength(5);
-      expect(result.steps.every(step => step.status === 'completed')).toBe(true);
-    });
-
-    it('should handle document validation failure', async () => {
-      mockDocumentModel.findById.mockResolvedValue(null);
-
-      const result = await documentProcessingService.processDocument(
-        'doc-123',
-        'user-123'
-      );
-
-      expect(result.success).toBe(false);
-      expect(result.error).toContain('Document not found');
-    });
-
-    it('should handle access denied', async () => {
-      const wrongUserDocument = { ...mockDocument, user_id: 'wrong-user' as any };
-      mockDocumentModel.findById.mockResolvedValue(wrongUserDocument);
-
-      const result = await documentProcessingService.processDocument(
-        'doc-123',
-        'user-123'
-      );
-
-      expect(result.success).toBe(false);
-      expect(result.error).toContain('Access denied');
-    });
-
-    it('should handle file not found', async () => {
-      mockDocumentModel.findById.mockResolvedValue(mockDocument);
-      mockFileStorageService.fileExists.mockResolvedValue(false);
-
-      const result = await documentProcessingService.processDocument(
-        'doc-123',
-        'user-123'
-      );
-
-      expect(result.success).toBe(false);
-      expect(result.error).toContain('Document file not accessible');
-    });
-
-    it('should handle text extraction failure', async () => {
-      mockDocumentModel.findById.mockResolvedValue(mockDocument);
-      mockFileStorageService.fileExists.mockResolvedValue(true);
-      mockFileStorageService.getFile.mockResolvedValue(null);
-
-      const result = await documentProcessingService.processDocument(
-        'doc-123',
-        'user-123'
-      );
-
-      expect(result.success).toBe(false);
-      expect(result.error).toContain('Could not read document file');
-    });
-
-    it('should handle LLM processing failure', async () => {
-      mockDocumentModel.findById.mockResolvedValue(mockDocument);
-      mockFileStorageService.fileExists.mockResolvedValue(true);
-      mockFileStorageService.getFile.mockResolvedValue(Buffer.from('mock pdf content'));
-      mockProcessingJobModel.create.mockResolvedValue({} as any);
-      // Remove estimateTokenCount mock - it's a private method
-      mockLlmService.processCIMDocument.mockRejectedValue(new Error('LLM API error'));
-
-      const result = await documentProcessingService.processDocument(
-        'doc-123',
-        'user-123'
-      );
-
-      expect(result.success).toBe(false);
-      expect(result.error).toContain('LLM processing failed');
-    });
-
-    it('should handle PDF generation failure', async () => {
-      mockDocumentModel.findById.mockResolvedValue(mockDocument);
-      mockFileStorageService.fileExists.mockResolvedValue(true);
-      mockFileStorageService.getFile.mockResolvedValue(Buffer.from('mock pdf content'));
-      mockProcessingJobModel.create.mockResolvedValue({} as any);
-      // Remove estimateTokenCount mock - it's a private method
-      mockLlmService.processCIMDocument.mockResolvedValue({
-        success: true,
-        jsonOutput: mockCIMReviewData,
-        model: 'test-model',
-        cost: 0.01,
-        inputTokens: 1000,
-        outputTokens: 500
-      });
-      mockPdfGenerationService.generatePDFFromMarkdown.mockResolvedValue(false);
-
-      const result = await documentProcessingService.processDocument(
-        'doc-123',
-        'user-123'
-      );
-
-      expect(result.success).toBe(false);
-      expect(result.error).toContain('Failed to generate PDF');
-    });
-
-    it('should process large documents in chunks', async () => {
-      mockDocumentModel.findById.mockResolvedValue(mockDocument);
-      mockFileStorageService.fileExists.mockResolvedValue(true);
-      mockFileStorageService.getFile.mockResolvedValue(Buffer.from('mock pdf content'));
-      mockProcessingJobModel.create.mockResolvedValue({} as any);
-      mockProcessingJobModel.updateStatus.mockResolvedValue({} as any);
-
-      // Mock large document
-      mockLlmService.processCIMDocument.mockResolvedValue({
-        success: true,
-        jsonOutput: mockCIMReviewData,
-        model: 'test-model',
-        cost: 0.01,
-        inputTokens: 1000,
-        outputTokens: 500
-      });
-      mockPdfGenerationService.generatePDFFromMarkdown.mockResolvedValue(true);
-
-      const result = await documentProcessingService.processDocument(
-        'doc-123',
-        'user-123'
-      );
-
-      expect(result.success).toBe(true);
-      expect(mockLlmService.processCIMDocument).toHaveBeenCalled();
-    });
-  });
-
-  describe('getProcessingJobStatus', () => {
-    it('should return job status', async () => {
-      const mockJob = {
-        id: 'job-123',
-        status: 'completed',
-        created_at: new Date(),
-      };
-
-      mockProcessingJobModel.findById.mockResolvedValue(mockJob as any);
-
-      const result = await documentProcessingService.getProcessingJobStatus('job-123');
-
-      expect(result).toEqual(mockJob);
-      expect(mockProcessingJobModel.findById).toHaveBeenCalledWith('job-123');
-    });
-
-    it('should handle job not found', async () => {
-      mockProcessingJobModel.findById.mockResolvedValue(null);
-
-      const result = await documentProcessingService.getProcessingJobStatus('job-123');
-
-      expect(result).toBeNull();
-    });
-  });
-
-  describe('getDocumentProcessingHistory', () => {
-    it('should return processing history', async () => {
-      const mockJobs = [
-        { id: 'job-1', status: 'completed' },
-        { id: 'job-2', status: 'failed' },
-      ];
-
-      mockProcessingJobModel.findByDocumentId.mockResolvedValue(mockJobs as any);
-
-      const result = await documentProcessingService.getDocumentProcessingHistory('doc-123');
-
-      expect(result).toEqual(mockJobs);
-      expect(mockProcessingJobModel.findByDocumentId).toHaveBeenCalledWith('doc-123');
-    });
-
-    it('should return empty array for no history', async () => {
-      mockProcessingJobModel.findByDocumentId.mockResolvedValue([]);
-
-      const result = await documentProcessingService.getDocumentProcessingHistory('doc-123');
-
-      expect(result).toEqual([]);
-    });
-  });
-
-  describe('document analysis', () => {
-    it('should detect financial content', () => {
-      const financialText = 'Revenue increased by 25% and EBITDA margins improved.';
-      const result = (documentProcessingService as any).detectFinancialContent(financialText);
-      expect(result).toBe(true);
-    });
-
-    it('should detect technical content', () => {
-      const technicalText = 'The system architecture includes multiple components.';
-      const result = (documentProcessingService as any).detectTechnicalContent(technicalText);
-      expect(result).toBe(true);
-    });
-
-    it('should extract key topics', () => {
-      const text = 'Financial analysis shows strong market growth and competitive advantages.';
-      const result = (documentProcessingService as any).extractKeyTopics(text);
-      expect(result).toContain('Financial Analysis');
-      expect(result).toContain('Market Analysis');
-    });
-
-    it('should analyze sentiment', () => {
-      const positiveText = 'Strong growth and excellent opportunities.';
-      const result = (documentProcessingService as any).analyzeSentiment(positiveText);
-      expect(result).toBe('positive');
-    });
-
-    it('should assess complexity', () => {
-      const simpleText = 'This is a simple document.';
-      const result = (documentProcessingService as any).assessComplexity(simpleText);
-      expect(result).toBe('low');
-    });
-  });
-
-  describe('error handling', () => {
-    it('should handle database errors gracefully', async () => {
-      mockDocumentModel.findById.mockRejectedValue(new Error('Database connection failed'));
-
-      const result = await documentProcessingService.processDocument(
-        'doc-123',
-        'user-123'
-      );
-
-      expect(result.success).toBe(false);
-      expect(result.error).toContain('Database connection failed');
-    });
-
-    it('should handle file system errors', async () => {
-      mockDocumentModel.findById.mockResolvedValue(mockDocument);
-      mockFileStorageService.fileExists.mockResolvedValue(true);
-      mockFileStorageService.getFile.mockRejectedValue(new Error('File system error'));
-
-      const result = await documentProcessingService.processDocument(
-        'doc-123',
-        'user-123'
-      );
-
-      expect(result.success).toBe(false);
-      expect(result.error).toContain('File system error');
-    });
-  });
-});
\ No newline at end of file
diff --git a/backend/src/services/__tests__/vectorDocumentProcessor.test.ts b/backend/src/services/__tests__/vectorDocumentProcessor.test.ts
deleted file mode 100644
index 7092c3a..0000000
--- a/backend/src/services/__tests__/vectorDocumentProcessor.test.ts
+++ /dev/null
@@ -1,121 +0,0 @@
-import { VectorDocumentProcessor, TextBlock } from '../vectorDocumentProcessor';
-import { llmService } from '../llmService';
-import { vectorDatabaseService } from '../vectorDatabaseService';
-
-// Mock the dependencies
-jest.mock('../llmService');
-jest.mock('../vectorDatabaseService');
-
-const mockedLlmService = llmService as jest.Mocked<typeof llmService>;
-const mockedVectorDBService = vectorDatabaseService as jest.Mocked<typeof vectorDatabaseService>;
-
-// Sample text mimicking a messy PDF extraction with various elements
-const sampleText = `
-This is the first paragraph of the document. It contains some general information about the company.
-
-Financial Highlights
-
-This paragraph discusses the financial performance. It is located after a heading.
-
-Here is a table of financial data:
-
-Metric FY2021 FY2022 FY2023
-Revenue $10.0M $12.5M $15.0M
-EBITDA $2.0M $2.5M $3.0M
-
-This is the final paragraph, coming after the table. It summarizes the outlook.
-`;
-
-describe('VectorDocumentProcessor', () => {
-  let processor: VectorDocumentProcessor;
-
-  beforeEach(() => {
-    processor = new VectorDocumentProcessor();
-    // Reset mocks before each test
-    jest.clearAllMocks();
-
-    // Set up VectorDatabaseService mock methods
-    (mockedVectorDBService as any).generateEmbeddings = jest.fn();
-    (mockedVectorDBService as any).storeDocumentChunks = jest.fn();
-    (mockedVectorDBService as any).search = jest.fn();
-  });
-
-  describe('identifyTextBlocks', () => {
-    it('should correctly identify paragraphs, headings, and tables', () => {
-      // Access the private method for testing purposes
-      const blocks: TextBlock[] = (processor as any).identifyTextBlocks(sampleText);
-
-      expect(blocks).toHaveLength(5);
-
-      // Check block types
-      expect(blocks[0]?.type).toBe('paragraph');
-      expect(blocks[1]?.type).toBe('heading');
-      expect(blocks[2]?.type).toBe('paragraph');
-      expect(blocks[3]?.type).toBe('table');
-      expect(blocks[4]?.type).toBe('paragraph');
-
-      // Check block content
-      expect(blocks[0]?.content).toBe('This is the first paragraph of the document. It contains some general information about the company.');
-      expect(blocks[1]?.content).toBe('Financial Highlights');
-      expect(blocks[3]?.content).toContain('Revenue $10.0M $12.5M $15.0M');
-    });
-  });
-
-  describe('processDocumentForVectorSearch', () => {
-    it('should use the LLM to summarize tables and store original in metadata', async () => {
-      const documentId = 'test-doc-1';
-      const tableSummary = 'The table shows revenue growing from $10M to $15M and EBITDA growing from $2M to $3M between FY2021 and FY2023.';
-
-      // Mock the LLM service to return a summary for the table
-      mockedLlmService.processCIMDocument.mockResolvedValue({
-        success: true,
-        jsonOutput: { summary: tableSummary } as any,
-        model: 'test-model',
-        cost: 0.01,
-        inputTokens: 100,
-        outputTokens: 50,
-      });
-
-      // Mock the embedding service to return a dummy vector
-      (mockedVectorDBService as any).generateEmbeddings.mockResolvedValue([0.1, 0.2, 0.3]);
-
-      // Mock the storage service
-      (mockedVectorDBService as any).storeDocumentChunks.mockResolvedValue();
-
-      await processor.processDocumentForVectorSearch(documentId, sampleText);
-
-      // Verify that storeDocumentChunks was called
-      expect((mockedVectorDBService as any).storeDocumentChunks).toHaveBeenCalled();
-
-      // Get the arguments passed to storeDocumentChunks
-      const storedChunks = (mockedVectorDBService as any).storeDocumentChunks.mock.calls[0]?.[0];
-      expect(storedChunks).toBeDefined();
-      if (!storedChunks) return;
-
-      expect(storedChunks).toHaveLength(5);
-
-      // Find the table chunk
-      const tableChunk = storedChunks.find((c: any) => c.metadata.block_type === 'table');
-      expect(tableChunk).toBeDefined();
-      if (!tableChunk) return;
-
-      // Assert that the LLM was called for the table summarization
-      expect(mockedLlmService.processCIMDocument).toHaveBeenCalledTimes(1);
-      const prompt = mockedLlmService.processCIMDocument.mock.calls[0]?.[0];
-      expect(prompt).toContain('Summarize the key information in this table');
-
-      // Assert that the table chunk's content is the LLM summary
-      expect(tableChunk.content).toBe(tableSummary);
-
-      // Assert that the original table text is stored in the metadata
-      expect(tableChunk.metadata['original_table']).toContain('Metric FY2021 FY2022 FY2023');
-
-      // Find a paragraph chunk and check its content
-      const paragraphChunk = storedChunks.find((c: any) => c.metadata.block_type === 'paragraph');
-      expect(paragraphChunk).toBeDefined();
-      if (paragraphChunk) {
-        expect(paragraphChunk.content).not.toBe(tableSummary); // Ensure it wasn't summarized
-      }
-    });
-  });
-});
diff --git a/backend/src/services/agenticRAGProcessor.ts b/backend/src/services/agenticRAGProcessor.ts
deleted file mode 100644
index 7bc1b6f..0000000
--- a/backend/src/services/agenticRAGProcessor.ts
+++ /dev/null
@@ -1,1120 +0,0 @@
-import { logger } from '../utils/logger';
-import { llmService } from './llmService';
-import { config } from '../config/env';
-import { CIMReview } from './llmSchemas';
-import {
-  AgentExecution,
-  AgenticRAGSession,
-  AgenticRAGResult,
-  AgenticRAGConfig
-} from '../models/agenticTypes';
-import { agenticRAGDatabaseService } from './agenticRAGDatabaseService';
-import { vectorDocumentProcessor } from './vectorDocumentProcessor';
-import { loadAndParseTemplate, IReviewTemplate, IFormSection, IFormField } from '../utils/templateParser';
-import { extractFinancials, CleanedFinancials } from '../utils/financialExtractor';
-
-
-// Best Practice: Centralized error types
-enum AgenticRAGErrorType {
-  AGENT_EXECUTION_FAILED = 'AGENT_EXECUTION_FAILED',
-  VALIDATION_FAILED = 'VALIDATION_FAILED',
-  TIMEOUT_ERROR = 'TIMEOUT_ERROR',
-  RATE_LIMIT_ERROR = 'RATE_LIMIT_ERROR',
-  INVALID_RESPONSE = 'INVALID_RESPONSE',
-  DATABASE_ERROR = 'DATABASE_ERROR',
-  CONFIGURATION_ERROR = 'CONFIGURATION_ERROR',
-  TEMPLATE_ERROR = 'TEMPLATE_ERROR'
-}
-
-// Best Practice: Custom error class with context
-class AgenticRAGError extends Error {
-  constructor(
-    message: string,
-    public type: AgenticRAGErrorType,
-    public agentName?: string,
-    public retryable: boolean = false,
-    public context?: any,
-    public originalError?: Error
-  ) {
-    super(message);
-    this.name = 'AgenticRAGError';
-
-    // Best Practice: Preserve stack trace
-    if (originalError?.stack) {
-      this.stack = originalError.stack;
-    }
-  }
-}
-
-// Best Practice: Safe serialization utility
-class SafeSerializer {
-  static serialize(data: any): any {
-    if (data === null || data === undefined) {
-      return null;
-    }
-
-    if (typeof data === 'string' || typeof data === 'number' || typeof data === 'boolean') {
-      return data;
-    }
-
-    if (data instanceof Date) {
-      return data.toISOString();
-    }
-
-    if (Array.isArray(data)) {
-      return data.map(item => this.serialize(item));
-    }
-
-    if (typeof data === 'object') {
-      // Handle circular references and complex objects
-      const seen = new WeakSet();
-      return this.serializeObject(data, seen);
-    }
-
-    return String(data);
-  }
-
-  private static serializeObject(obj: any, seen: WeakSet<object>): any {
-    if (seen.has(obj)) {
-      return '[Circular Reference]';
-    }
-
-    seen.add(obj);
-
-    const result: any = {};
-
-    for (const [key, value] of Object.entries(obj)) {
-      try {
-        // Skip functions and symbols
-        if (typeof value === 'function' || typeof value === 'symbol') {
-          continue;
-        }
-
-        // Skip undefined values
-        if (value === undefined) {
-          continue;
-        }
-
-        result[key] = this.serialize(value);
-      } catch (error) {
-        logger.warn('Failed to serialize object property', { key, error: error instanceof Error ? error.message : 'Unknown error' });
-        result[key] = '[Serialization Error]';
-      }
-    }
-
-    return result;
-  }
-
-  static safeStringify(data: any): string {
-    try {
-      const serialized = this.serialize(data);
-      return JSON.stringify(serialized);
-    } catch (error) {
-      logger.error('Failed to stringify data', { error: error instanceof Error ? error.message : 'Unknown error' });
-      return JSON.stringify({ error: 'Serialization failed', originalType: typeof data });
-    }
-  }
-
-  static safeParse(jsonString: string): any {
-    try {
-      return JSON.parse(jsonString);
-    } catch (error) {
-      logger.error('Failed to parse JSON string', { error: error instanceof Error ? error.message : 'Unknown error' });
-      return { error: 'JSON parsing failed', originalString: jsonString.substring(0, 100) + '...' };
-    }
-  }
-}
-
-// Best Practice: Configuration validation
-class ConfigurationValidator {
-  static validateAgenticRAGConfig(config: AgenticRAGConfig): void {
-    const errors: string[] = [];
-
-    if (config.maxAgents <= 0 || config.maxAgents > 10) {
-      errors.push('maxAgents must be between 1 and 10');
-    }
-
-    if (config.retryAttempts < 0 || config.retryAttempts > 5) {
-      errors.push('retryAttempts must be between 0 and 5');
-    }
-
-    if (config.timeoutPerAgent < 1000 || config.timeoutPerAgent > 300000) {
-      errors.push('timeoutPerAgent must be between 1 and 300 seconds');
-    }
-
-    if (config.qualityThreshold < 0 || config.qualityThreshold > 1) {
-      errors.push('qualityThreshold must be between 0 and 1');
-    }
-
-    if (errors.length > 0) {
-      throw new AgenticRAGError(
-        `Invalid configuration: ${errors.join(', ')}`,
-        AgenticRAGErrorType.CONFIGURATION_ERROR,
-        undefined,
-        false,
-        { errors }
-      );
-    }
-  }
-}
-
-// Best Practice: Dependency injection interface
-
-
-interface ISessionManager {
-  createSession(documentId: string, userId: string, strategy: string): Promise<AgenticRAGSession>;
-  updateSession(sessionId: string, updates: Partial<AgenticRAGSession>): Promise<void>;
-  createExecution(sessionId: string, agentName: string, inputData: any): Promise<AgentExecution>;
-}
-
-class AgenticRAGSessionManager implements ISessionManager {
-  async createSession(
-    documentId: string,
-    userId: string,
-    strategy: string
-  ): Promise<AgenticRAGSession> {
-    return await agenticRAGDatabaseService.createSessionWithTransaction(documentId, userId, strategy);
-  }
-
-  async updateSession(
-    sessionId: string,
-    updates: Partial<AgenticRAGSession>
-  ): Promise<void> {
-    await agenticRAGDatabaseService.updateSessionWithMetrics(sessionId, updates);
-  }
-
-  async createExecution(
-    sessionId: string,
-    agentName: string,
-    inputData: any
-  ): Promise<AgentExecution> {
-    return await agenticRAGDatabaseService.createExecutionWithTransaction(sessionId, agentName, inputData);
-  }
-}
-
-// Main processor with configuration validation and health monitoring
-class AgenticRAGProcessor {
-  private sessionManager: AgenticRAGSessionManager;
-  private isInitialized: boolean = false;
-
-  constructor() {
-    // Best Practice: Validate configuration at startup
-    try {
-      // Validate the configuration structure we have
-      ConfigurationValidator.validateAgenticRAGConfig({
-        enabled: config.agenticRag.enabled,
-        maxAgents: config.agenticRag.maxAgents,
-        parallelProcessing: config.agenticRag.parallelProcessing,
-        validationStrict: config.agenticRag.validationStrict,
-        retryAttempts: config.agenticRag.retryAttempts,
-        timeoutPerAgent: config.agenticRag.timeoutPerAgent,
-        qualityThreshold: config.qualityControl.qualityThreshold,
-        completenessThreshold: config.qualityControl.completenessThreshold,
-        consistencyCheck: config.qualityControl.consistencyCheck,
-        detailedLogging: config.monitoringAndLogging.detailedLogging,
-        performanceTracking: config.monitoringAndLogging.performanceTracking,
-        errorReporting: config.monitoringAndLogging.errorReporting
-      });
-      this.sessionManager = new AgenticRAGSessionManager();
-      this.isInitialized = true;
-
-      logger.info('Agentic RAG Processor initialized successfully', {
-        maxAgents: config.agenticRag.maxAgents,
-        parallelProcessing: config.agenticRag.parallelProcessing,
-        retryAttempts: config.agenticRag.retryAttempts
-      });
-    } catch (error) {
-      logger.error('Failed to initialize Agentic RAG Processor', { error });
-      throw error;
-    }
-  }
-
-  /**
-   * Process CIM document using the new multi-phase agentic RAG approach
-   */
-  async processDocument(text: string, documentId: string, userId: string): Promise<AgenticRAGResult> {
-    const startTime = Date.now();
-
-    if (!this.isInitialized) {
-      throw new AgenticRAGError('Processor not initialized', AgenticRAGErrorType.CONFIGURATION_ERROR);
-    }
-    if (!text || !documentId || !userId) {
-      throw new AgenticRAGError('Text, document ID, and user ID are required', AgenticRAGErrorType.CONFIGURATION_ERROR);
-    }
-
-    logger.info('Starting agentic RAG processing...', { documentId, userId });
-
-    const session = await this.sessionManager.createSession(documentId, userId, 'agentic_rag');
-
-    try {
-      await this.sessionManager.updateSession(session.id, { status: 'processing' });
-
-      // Phase 0: Load Template
-      const reviewTemplate = await this.loadTemplate(session.id);
-
-      // Phase 0.5: Document Vectorization (Critical for accurate retrieval)
-      await this.executePhase0_DocumentVectorization(text, documentId, session.id);
-
-      // Phase 1: Structured Data Extraction
-      const structuredData = await this.executePhase1_StructuredDataExtraction(text, documentId, session.id);
-
-      // Phase 2: Iterative, Section-by-Section Analysis
-      const analysisResults = await this.executePhase2_IterativeAnalysis(documentId, session.id, reviewTemplate, structuredData);
-
-      // Phase 3: Thesis Generation and Gap Analysis
-      const thesisAndQuestions = await this.executePhase3_ThesisAndGapAnalysis(session.id, structuredData, analysisResults);
-
-      // Combine all results
-      const finalResult = { ...structuredData, ...analysisResults, ...thesisAndQuestions };
-
-      const processingTime = Date.now() - startTime;
-      await this.sessionManager.updateSession(session.id, {
-        status: 'completed',
-        processingTimeMs: processingTime,
-        finalResult: SafeSerializer.serialize(finalResult),
-        completedAt: new Date()
-      });
-
-      logger.info('Agentic RAG processing completed successfully.', { sessionId: session.id, processingTime });
-
-      return {
-        success: true,
-        summary: this.convertToMarkdown(finalResult as any),
-        analysisData: SafeSerializer.serialize(finalResult),
-        reasoningSteps: [], // This can be populated by the sub-methods
-        processingTime,
-        apiCalls: 0, // This should be tracked in a refactored execution engine
-        totalCost: 0,
-        qualityMetrics: [],
-        sessionId: session.id
-      };
-
-    } catch (error) {
-      const processingTime = Date.now() - startTime;
-      const agenticError = error instanceof AgenticRAGError ? error : new AgenticRAGError(
-        error instanceof Error ? error.message : 'Unknown error',
-        AgenticRAGErrorType.AGENT_EXECUTION_FAILED,
-        'main_processor',
-        false,
-        {},
-        error instanceof Error ? error : undefined
-      );
-
-      await this.sessionManager.updateSession(session.id, {
-        status: 'failed',
-        processingTimeMs: processingTime,
-        completedAt: new Date()
-      });
-
-      logger.error('Agentic RAG processing failed', { sessionId: session.id, error: agenticError });
-
-      throw agenticError;
-    }
-  }
-
-  private async loadTemplate(sessionId: string): Promise<IReviewTemplate> {
-    const execution = await this.sessionManager.createExecution(sessionId, 'load_template', {});
-    try {
-      const template = await loadAndParseTemplate();
-      await agenticRAGDatabaseService.updateExecutionWithTransaction(execution.id, { status: 'completed', outputData: { success: true } });
-      return template;
-    } catch (e) {
-      await agenticRAGDatabaseService.updateExecutionWithTransaction(execution.id, { status: 'failed', errorMessage: (e as Error).message });
-      throw new AgenticRAGError('Failed to load or parse review template', AgenticRAGErrorType.TEMPLATE_ERROR, 'load_template', false, {}, e as Error);
-    }
-  }
-
-  private async executePhase1_StructuredDataExtraction(text: string, documentId: string, sessionId: string): Promise<any> {
-    const execution = await this.sessionManager.createExecution(sessionId, 'phase1_extraction', { documentId });
-    logger.info('Executing Phase 1: Structured Data Extraction', { sessionId });
-
-    try {
-      // 1. Extract Financials
-      const financials = extractFinancials(text);
-      if (!financials) {
-        logger.warn('Could not extract structured financial table.', { sessionId });
-        // Decide if this is a critical failure or not. For now, we'll continue.
-      }
-
-      // 2. Extract Deal Overview (Example of a targeted LLM call)
-      const dealOverviewPrompt = this.buildDealOverviewPrompt();
-      const dealOverviewResult = await llmService.processCIMDocument(dealOverviewPrompt, text.substring(0, 15000), { agentName: 'deal_overview_extractor' });
-
-      if (!dealOverviewResult.success) {
-        throw new Error('Failed to extract deal overview from LLM.');
-      }
-
-      const outputData = {
-        financialSummary: financials,
-        dealOverview: dealOverviewResult.jsonOutput?.dealOverview,
-      };
-
-      await agenticRAGDatabaseService.updateExecutionWithTransaction(execution.id, { status: 'completed', outputData: SafeSerializer.serialize(outputData) });
-      return outputData;
-
-    } catch (e) {
-      await agenticRAGDatabaseService.updateExecutionWithTransaction(execution.id, { status: 'failed', errorMessage: (e as Error).message });
-      throw new AgenticRAGError('Phase 1 failed', AgenticRAGErrorType.AGENT_EXECUTION_FAILED, 'phase1_extraction', false, {}, e as Error);
-    }
-  }
-
-  private async executePhase2_IterativeAnalysis(documentId: string, sessionId: string, template: IReviewTemplate, structuredData: any): Promise<any> {
-    const execution = await this.sessionManager.createExecution(sessionId, 'phase2_iterative_analysis', { documentId });
-    logger.info('Executing Phase 2: Iterative Analysis with Parallelism and Intelligent Query Generation', { sessionId });
-
-    const analysisPromises: Promise<any>[] = [];
-
-    for (const section of template.sections) {
-      // Skip sections already handled in Phase 1 or other special sections
-      if (section.title.includes('Deal Overview') || section.title.includes('Financial Summary')) {
-        continue;
-      }
-
-      for (const field of section.fields) {
-        analysisPromises.push(
-          this.analyzeSingleField(section, field, documentId, structuredData)
-        );
-      }
-    }
-
-    try {
-      const settledResults = await Promise.allSettled(analysisPromises);
-      const finalResults: any = {};
-
-      settledResults.forEach(result => {
-        if (result.status === 'fulfilled') {
-          const { sectionTitle, fieldLabel, analysis } = result.value;
-          if (!finalResults[sectionTitle]) {
-            finalResults[sectionTitle] = {};
-          }
-          finalResults[sectionTitle][fieldLabel] = analysis;
-        } else {
-          // Log the rejected promise error
-          logger.error('Field analysis failed in Promise.allSettled', {
-            error: result.reason,
-            sessionId
-          });
-          // Optionally, you can still mark it as failed in the final output
-          // This requires passing section/field info in the error
-        }
-      });
-
-      await agenticRAGDatabaseService.updateExecutionWithTransaction(execution.id, { status: 'completed', outputData: SafeSerializer.serialize(finalResults) });
-      return finalResults;
-
-    } catch (e) {
-      await agenticRAGDatabaseService.updateExecutionWithTransaction(execution.id, { status: 'failed', errorMessage: (e as Error).message });
-      throw new AgenticRAGError('Phase 2 failed', AgenticRAGErrorType.AGENT_EXECUTION_FAILED, 'phase2_iterative_analysis', false, {}, e as Error);
-    }
-  }
-
-  private async analyzeSingleField(section: IFormSection, field: IFormField, documentId: string, structuredData: any): Promise<{sectionTitle: string, fieldLabel: string, analysis: any}> {
-    try {
-      // Step 1: Generate intelligent search queries for the field
-      const searchQueries = await this.generateSearchQueriesForField(section, field);
-
-      // Step 2: Execute enhanced vector searches for all generated queries
-      const searchPromises = searchQueries.map(query =>
-        vectorDocumentProcessor.searchRelevantContent(query, {
-          documentId,
-          limit: 5, // Increased for better context
-          similarityThreshold: 0.75, // Higher threshold for precision
-          prioritizeFinancial: this.isFinancialField(section, field),
-          boostImportance: true
-        })
-      );
-      const searchResults = await
Promise.all(searchPromises); - const relevantChunks = [...new Set(searchResults.flat().map((c: any) => c.chunkContent))]; // Deduplicate chunks - const context = relevantChunks.join('\n---\n'); - - if (!context) { - logger.warn('No context found from vector search for field', { section: section.title, field: field.label }); - return { sectionTitle: section.title, fieldLabel: field.label, analysis: 'No relevant information found in document.' }; - } - - // Step 3: Build the final analysis prompt and get the answer - const fieldPrompt = this.buildIterativeFieldPrompt(section, field, structuredData); - const fieldResult = await llmService.processCIMDocument(fieldPrompt, context, { agentName: `field_analyzer_${field.label}` }); - - let analysis = 'Analysis failed for this field.'; - if (fieldResult.success && fieldResult.jsonOutput && (fieldResult.jsonOutput as any).analysis) { - analysis = (fieldResult.jsonOutput as any).analysis; - } else { - logger.warn('LLM analysis failed for field', { section: section.title, field: field.label, result: fieldResult }); - } - - return { sectionTitle: section.title, fieldLabel: field.label, analysis }; - - } catch (error) { - logger.error('Error analyzing single field', { section: section.title, field: field.label, error }); - // Re-throw to be caught by Promise.allSettled - throw error; - } - } - - private async generateSearchQueriesForField(section: IFormSection, field: IFormField): Promise { - const prompt = `You are a research analyst preparing to analyze a section of a report.\n Your goal is to find information about the field "${field.label}" within the broader section "${section.title}".\n The purpose is to understand: "${field.details || section.purpose}".\n\n Generate 3 specific, targeted questions you would use to search a document for this information. The questions should be distinct and cover different angles of the topic.\n \n Return ONLY a JSON array of strings. 
Example: ["What is the company's annual revenue?", "How has revenue grown over the past three years?"]`; - - try { - const result = await llmService.processCIMDocument(prompt, '', { agentName: 'query_generator' }); - if (result.success && Array.isArray(result.jsonOutput) && result.jsonOutput.length > 0) { - return result.jsonOutput as string[]; - } - logger.warn('Failed to generate intelligent search queries, falling back to generic query.', { section: section.title, field: field.label }); - } catch (error) { - logger.error('Error generating search queries', { error }); - } - - // Fallback to a generic query if LLM-based generation fails - return [`${section.title} ${field.label}`]; - } - - private async executePhase3_ThesisAndGapAnalysis(sessionId: string, structuredData: any, analysisResults: any): Promise { - const execution = await this.sessionManager.createExecution(sessionId, 'phase3_thesis_and_gaps', {}); - logger.info('Executing Phase 3: Thesis Generation and Gap Analysis', { sessionId }); - - try { - // 1. Consolidate all available data - const fullContext = { - structuredData, - qualitativeAnalysis: analysisResults - }; - - // 2. Perform the rule-based fund strategy alignment check - const alignment = this.checkFundStrategyAlignment(structuredData); - - // 3. 
Build and execute the master synthesis prompt - const prompt = this.buildThesisAndQuestionsPrompt(fullContext); - const result = await llmService.processCIMDocument(prompt, '', { agentName: 'thesis_generator' }); - - if (!result.success || !result.jsonOutput) { - throw new Error('LLM failed to generate thesis and questions.'); - } - - const thesisData = result.jsonOutput; - // Inject the definitive, rule-based alignment check into the final output - if (thesisData.preliminaryInvestmentThesis) { - thesisData.preliminaryInvestmentThesis.alignmentWithFundStrategy = alignment; - } - - await agenticRAGDatabaseService.updateExecutionWithTransaction(execution.id, { status: 'completed', outputData: SafeSerializer.serialize(thesisData) }); - return thesisData; - - } catch (e) { - await agenticRAGDatabaseService.updateExecutionWithTransaction(execution.id, { status: 'failed', errorMessage: (e as Error).message }); - throw new AgenticRAGError('Phase 3 failed', AgenticRAGErrorType.AGENT_EXECUTION_FAILED, 'phase3_thesis_and_gaps', false, {}, e as Error); - } - } - - private checkFundStrategyAlignment(structuredData: any): string { - const reasons: string[] = []; - let aligns = true; - - // Rule 1: EBITDA check - const financials: CleanedFinancials | null = structuredData.financialSummary; - const ebitdaMetric = financials?.metrics.find(m => m.name.toLowerCase().includes('ebitda')); - const lastPeriodEbitda = ebitdaMetric?.values[ebitdaMetric.values.length - 1]; - if (lastPeriodEbitda && lastPeriodEbitda < 5000000) { - aligns = false; - reasons.push(`EBITDA of ~${(lastPeriodEbitda / 1000000).toFixed(1)}M is below the $5M target.`); - } else if (!lastPeriodEbitda) { - reasons.push('EBITDA could not be definitively determined.'); - } - - // Rule 2: Industry check - const industry = structuredData.dealOverview?.industrySector?.toLowerCase() || ''; - if (!industry.includes('consumer') && !industry.includes('industrial')) { - aligns = false; - reasons.push(`Industry 
"${structuredData.dealOverview?.industrySector || 'Unknown'}" is not in the target consumer or industrial end markets.`); - } - - // Final alignment string - if (aligns && reasons.length === 0) { - return "Aligns with fund strategy: EBITDA is above $5M and the industry is in a target sector."; - } - return `Does not align with fund strategy. ${reasons.join(' ')}`; - } - - private buildThesisAndQuestionsPrompt(fullContext: any): string { - return `You are a senior partner at a private equity firm. You have been presented with a full analysis of a potential investment. Your task is to synthesize this information into a final recommendation. - - Here is all the information you have: - ${JSON.stringify(fullContext, null, 2)} - - Based on the entirety of the data provided, perform the following tasks and return the output as a single, valid JSON object with the specified keys: - - 1. **preliminaryInvestmentThesis**: Create a concise investment thesis. - - keyAttractions: What are the 3-4 most compelling reasons to invest in this company? (e.g., market leadership, strong margins, recurring revenue). - - potentialRisks: What are the 3-4 most significant risks or concerns? (e.g., customer concentration, cyclical industry, management turnover). - - valueCreationLevers: How could a PE firm create value here? (e.g., M&A, operational improvements, professionalizing sales). - - 2. **keyQuestionsNextSteps**: Based on your analysis, identify gaps and define the next steps. - - criticalQuestions: What are the top 5-7 critical questions that remain unanswered by the CIM? - - keyMissingInfo: What specific information is missing that is crucial for the next stage of diligence? - - recommendation: Based on everything, what is your preliminary recommendation? (e.g., "Proceed with initial diligence", "Pass", "Hold for more information"). - - rationale: A brief (1-2 sentence) rationale for your recommendation. - - Respond with ONLY the JSON object. 
Do not include any other text or explanation. - `; - } - - private buildDealOverviewPrompt(): string { - return `You are an analyst. From the first few pages of the provided CIM text, extract the following information. Respond with ONLY a JSON object. - { - "dealOverview": { - "targetCompanyName": "...", - "industrySector": "...", - "geography": "...", - "dealSource": "...", - "transactionType": "...", - "statedReasonForSale": "..." - } - }`; - } - - private buildIterativeFieldPrompt(section: IFormSection, field: IFormField, structuredData: any): string { - return `You are a private equity analyst. Your current task is to fill out one specific field in an investment memo. - - SECTION: "${section.title}" - FIELD: "${field.label}" - FIELD PURPOSE: "${field.details || section.purpose}" - - Previously extracted data for context: - ${JSON.stringify(structuredData, null, 2)} - - Based ONLY on the provided text snippets from the CIM, provide a concise and factual analysis for the field "${field.label}". - Do not use outside knowledge. Respond with ONLY a JSON object with a single key "analysis". - - { - "analysis": "..." - }`; - } - - /** - * Convert analysis to markdown - */ - private convertToMarkdown(analysis: CIMReview): string { - if (!analysis || typeof analysis !== 'object') { - return `# BPCP CIM Review: Analysis Unavailable`; - } - - return `# BPCP CIM Review: ${analysis.dealOverview?.targetCompanyName || 'Unknown Company'} - - ... markdown conversion logic ... 
- `; - } - - /** - * Phase 0.5: Advanced Document Vectorization with Intelligent Chunking - * This is critical for accurate retrieval in subsequent phases - */ - private async executePhase0_DocumentVectorization(text: string, documentId: string, sessionId: string): Promise { - logger.info('Starting comprehensive document vectorization', { documentId, sessionId }); - - try { - // Strategy 1: Stream processing for large documents - const MAX_TEXT_SIZE = 50000; // 50KB chunks to prevent memory issues - const chunks: Array<{ - content: string; - chunkIndex: number; - startPosition: number; - endPosition: number; - sectionType?: string; - }> = []; - - if (text.length > MAX_TEXT_SIZE) { - logger.info('Large document detected, using streaming chunking', { - documentId, - textLength: text.length, - estimatedChunks: Math.ceil(text.length / MAX_TEXT_SIZE) - }); - - // Stream processing for large documents - let chunkIndex = 0; - let position = 0; - - while (position < text.length) { - // Force garbage collection between chunks - if (global.gc) { - global.gc(); - } - - const chunkSize = Math.min(MAX_TEXT_SIZE, text.length - position); - let chunkEnd = position + chunkSize; - - // Try to end at sentence boundary - if (chunkEnd < text.length) { - const sentenceEnd = this.findSentenceBoundary(text, chunkEnd); - if (sentenceEnd > position + 1000) { // Ensure minimum chunk size - chunkEnd = sentenceEnd; - } - } - - const chunkText = text.substring(position, chunkEnd); - - // Detect section type for this chunk - const sectionType = this.identifySectionType(chunkText); - - chunks.push({ - content: chunkText, - chunkIndex: chunkIndex++, - startPosition: position, - endPosition: chunkEnd, - sectionType - }); - - position = chunkEnd; - - // Log progress for large documents - if (chunkIndex % 10 === 0) { - logger.info('Vectorization progress', { - documentId, - chunkIndex, - progress: Math.round((position / text.length) * 100) + '%' - }); - } - } - } else { - // For smaller documents, 
use the original intelligent chunking - chunks.push(...await this.createIntelligentChunks(text, documentId)); - } - - // Strategy 2: Process chunks in batches to manage memory - const BATCH_SIZE = 5; // Process 5 chunks at a time - const enrichedChunks: Array<{ - content: string; - chunkIndex: number; - startPosition: number; - endPosition: number; - sectionType?: string; - metadata: { - hasFinancialData: boolean; - hasMetrics: boolean; - keyTerms: string[]; - importance: 'high' | 'medium' | 'low'; - conceptDensity: number; - }; - }> = []; - - for (let i = 0; i < chunks.length; i += BATCH_SIZE) { - const batch = chunks.slice(i, i + BATCH_SIZE); - - // Process batch - const batchPromises = batch.map(async (chunk) => { - const metadata = { - hasFinancialData: this.containsFinancialData(chunk.content), - hasMetrics: this.containsMetrics(chunk.content), - keyTerms: this.extractKeyTerms(chunk.content), - importance: this.calculateImportance(chunk.content, chunk.sectionType), - conceptDensity: this.calculateConceptDensity(chunk.content) - }; - - return { - ...chunk, - metadata - }; - }); - - const batchResults = await Promise.all(batchPromises); - enrichedChunks.push(...batchResults); - - // Force garbage collection after each batch - if (global.gc) { - global.gc(); - } - - // Log batch progress - logger.info('Enriched chunk batch', { - documentId, - batchNumber: Math.floor(i / BATCH_SIZE) + 1, - totalBatches: Math.ceil(chunks.length / BATCH_SIZE), - processedChunks: enrichedChunks.length - }); - } - - // Strategy 3: Store chunks in batches to prevent memory buildup - const STORE_BATCH_SIZE = 3; - for (let i = 0; i < enrichedChunks.length; i += STORE_BATCH_SIZE) { - const storeBatch = enrichedChunks.slice(i, i + STORE_BATCH_SIZE); - - await vectorDocumentProcessor.storeDocumentChunks(storeBatch, { - documentId, - indexingStrategy: 'hierarchical', - similarity_threshold: 0.8, - enable_hybrid_search: true - }); - - // Force garbage collection after storing each batch - if 
(global.gc) { - global.gc(); - } - - logger.info('Stored chunk batch', { - documentId, - batchNumber: Math.floor(i / STORE_BATCH_SIZE) + 1, - totalBatches: Math.ceil(enrichedChunks.length / STORE_BATCH_SIZE), - storedChunks: Math.min(i + STORE_BATCH_SIZE, enrichedChunks.length) - }); - } - - logger.info('Document vectorization completed successfully', { - documentId, - sessionId, - chunksCreated: enrichedChunks.length, - avgChunkSize: Math.round(enrichedChunks.reduce((sum: number, c: any) => sum + c.content.length, 0) / enrichedChunks.length), - totalTextLength: text.length - }); - - } catch (error) { - logger.error('Document vectorization failed', { documentId, sessionId, error }); - throw new AgenticRAGError( - 'Failed to vectorize document for retrieval', - AgenticRAGErrorType.DATABASE_ERROR, - 'vectorization_engine', - true, - { documentId, sessionId }, - error instanceof Error ? error : undefined - ); - } - } - - /** - * Create intelligent chunks with semantic boundaries and optimal overlap - */ - private async createIntelligentChunks(text: string, documentId: string): Promise> { - const chunks: Array<{ - content: string; - chunkIndex: number; - startPosition: number; - endPosition: number; - sectionType?: string; - }> = []; - - // Configuration for optimal CIM document processing - const CHUNK_SIZE = 1000; // Optimal for financial documents - const OVERLAP_SIZE = 200; // 20% overlap for context preservation - const MIN_CHUNK_SIZE = 300; // Minimum meaningful chunk size - - // Strategy 1: Detect section boundaries (headers, page breaks, etc.) 
-    const sectionBoundaries = this.detectSectionBoundaries(text);
-
-    // Strategy 2: Split on semantic boundaries first
-    const semanticSections = this.splitOnSemanticBoundaries(text, sectionBoundaries);
-
-    let chunkIndex = 0;
-    let globalPosition = 0;
-
-    for (const section of semanticSections) {
-      const sectionText = section.content;
-      const sectionType = section.type;
-
-      // If section is small enough, keep it as one chunk
-      if (sectionText.length <= CHUNK_SIZE) {
-        chunks.push({
-          content: sectionText,
-          chunkIndex: chunkIndex++,
-          startPosition: globalPosition,
-          endPosition: globalPosition + sectionText.length,
-          sectionType
-        });
-        globalPosition += sectionText.length;
-        continue;
-      }
-
-      // For larger sections, create overlapping chunks
-      let sectionPosition = 0;
-      const sectionStart = globalPosition;
-
-      while (sectionPosition < sectionText.length) {
-        const remainingText = sectionText.length - sectionPosition;
-        const chunkSize = Math.min(CHUNK_SIZE, remainingText);
-
-        // Adjust chunk end to sentence boundary if possible
-        let chunkEnd = sectionPosition + chunkSize;
-        if (chunkEnd < sectionText.length) {
-          const sentenceEnd = this.findSentenceBoundary(sectionText, chunkEnd);
-          if (sentenceEnd > sectionPosition + MIN_CHUNK_SIZE) {
-            chunkEnd = sentenceEnd;
-          }
-        }
-
-        const chunkContent = sectionText.substring(sectionPosition, chunkEnd);
-
-        chunks.push({
-          content: chunkContent.trim(),
-          chunkIndex: chunkIndex++,
-          startPosition: sectionStart + sectionPosition,
-          endPosition: sectionStart + chunkEnd,
-          sectionType
-        });
-
-        // Move to next chunk with overlap
-        sectionPosition = chunkEnd - OVERLAP_SIZE;
-        if (sectionPosition < 0) sectionPosition = chunkEnd;
-      }
-
-      globalPosition += sectionText.length;
-    }
-
-    logger.info('Intelligent chunking completed', {
-      documentId,
-      totalChunks: chunks.length,
-      avgChunkSize: Math.round(chunks.reduce((sum: number, c: any) => sum + c.content.length, 0) / chunks.length),
-      sectionTypes: [...new Set(chunks.map(c => c.sectionType).filter(Boolean))]
-    });
-
-    return chunks;
-  }
-
-  // /**
-  //  * Enrich chunks with metadata for enhanced retrieval
-  //  */
-  // private async enrichChunksWithMetadata(chunks: Array<{
-  //   content: string;
-  //   chunkIndex: number;
-  //   startPosition: number;
-  //   endPosition: number;
-  //   sectionType?: string;
-  // }>): Promise<Array<any>> {
-  //   const enrichedChunks = [];
-
-  //   for (const chunk of chunks) {
-  //     // Analyze chunk content for metadata
-  //     const hasFinancialData = this.containsFinancialData(chunk.content);
-  //     const hasMetrics = this.containsMetrics(chunk.content);
-  //     const keyTerms = this.extractKeyTerms(chunk.content);
-  //     const importance = this.calculateImportance(chunk.content, chunk.sectionType);
-  //     const conceptDensity = this.calculateConceptDensity(chunk.content);
-
-  //     enrichedChunks.push({
-  //       ...chunk,
-  //       metadata: {
-  //         hasFinancialData,
-  //         hasMetrics,
-  //         keyTerms,
-  //         importance,
-  //         conceptDensity
-  //       }
-  //     });
-  //   }
-
-  //   return enrichedChunks;
-  // }
-
-  /**
-   * Detect section boundaries in CIM documents
-   */
-  private detectSectionBoundaries(text: string): number[] {
-    const boundaries: number[] = [0];
-
-    // Common CIM section patterns
-    const sectionPatterns = [
-      /^(EXECUTIVE SUMMARY|COMPANY OVERVIEW|BUSINESS DESCRIPTION)/im,
-      /^(FINANCIAL PERFORMANCE|FINANCIAL ANALYSIS|HISTORICAL FINANCIALS)/im,
-      /^(MARKET ANALYSIS|INDUSTRY OVERVIEW|COMPETITIVE LANDSCAPE)/im,
-      /^(MANAGEMENT TEAM|LEADERSHIP|KEY PERSONNEL)/im,
-      /^(INVESTMENT HIGHLIGHTS|GROWTH OPPORTUNITIES)/im,
-      /^(APPENDIX|FINANCIAL STATEMENTS|SUPPORTING DOCUMENTS)/im
-    ];
-
-    const lines = text.split('\n');
-    let position = 0;
-
-    for (let i = 0; i < lines.length; i++) {
-      const line = (lines[i] || '').trim();
-
-      // Check for section headers
-      if (sectionPatterns.some(pattern => pattern.test(line))) {
-        boundaries.push(position);
-      }
-
-      // Check for page breaks or significant whitespace
-      if (line === '' && i > 0 && i < lines.length - 1) {
-        const nextNonEmpty = lines.slice(i + 1).findIndex(l => l.trim() !== '');
-        if (nextNonEmpty > 2) { // Multiple empty lines suggest section break
-          boundaries.push(position);
-        }
-      }
-
-      position += (lines[i] || '').length + 1; // +1 for newline
-    }
-
-    boundaries.push(text.length);
-    return [...new Set(boundaries)].sort((a, b) => a - b);
-  }
-
-  /**
-   * Split text on semantic boundaries
-   */
-  private splitOnSemanticBoundaries(text: string, boundaries: number[]): Array<{
-    content: string;
-    type: string;
-  }> {
-    const sections: Array<{ content: string; type: string }> = [];
-
-    for (let i = 0; i < boundaries.length - 1; i++) {
-      const start = boundaries[i] || 0;
-      const end = boundaries[i + 1] || text.length;
-      const content = text.substring(start, end).trim();
-
-      if (content.length > 50) { // Filter out tiny sections
-        const type = this.identifySectionType(content);
-        sections.push({ content, type });
-      }
-    }
-
-    return sections;
-  }
-
-  /**
-   * Identify section type based on content
-   */
-  private identifySectionType(content: string): string {
-    const firstLines = content.split('\n').slice(0, 3).join(' ').toLowerCase();
-
-    if (/executive summary|overview|introduction/i.test(firstLines)) return 'executive_summary';
-    if (/financial|revenue|ebitda|cash flow/i.test(firstLines)) return 'financial';
-    if (/market|industry|competitive|sector/i.test(firstLines)) return 'market_analysis';
-    if (/management|team|leadership|personnel/i.test(firstLines)) return 'management';
-    if (/growth|opportunity|strategy|expansion/i.test(firstLines)) return 'growth_strategy';
-    if (/risk|challenge|concern/i.test(firstLines)) return 'risk_analysis';
-
-    return 'general';
-  }
-
-  /**
-   * Find optimal sentence boundary for chunk splitting
-   */
-  private findSentenceBoundary(text: string, position: number): number {
-    const searchWindow = 100; // Look 100 chars back for sentence end
-    const searchStart = Math.max(0, position - searchWindow);
-
-    for (let i = position; i >= searchStart; i--) {
-      const char = text[i];
-      if (char === '.' || char === '!' || char === '?') {
-        // Make sure it's actually end of sentence, not abbreviation
-        if (i < text.length - 1 && /\s/.test(text[i + 1] || '')) {
-          return i + 1;
-        }
-      }
-    }
-
-    return position; // Fallback to original position
-  }
-
-  /**
-   * Check if chunk contains financial data
-   */
-  private containsFinancialData(content: string): boolean {
-    const financialPatterns = [
-      /\$[\d,]+(?:\.\d{2})?(?:[kmb])?/i, // Currency amounts
-      /\d+(?:\.\d+)?%/, // Percentages
-      /revenue|ebitda|cash flow|profit|margin|roi|irr/i,
-      /\d{4}\s*(fy|fiscal year|year ended)/i // Fiscal years
-    ];
-
-    return financialPatterns.some(pattern => pattern.test(content));
-  }
-
-  /**
-   * Check if chunk contains metrics
-   */
-  private containsMetrics(content: string): boolean {
-    const metricPatterns = [
-      /\d+(?:\.\d+)?\s*(?:million|billion|thousand|m|b|k)/i,
-      /\d+(?:\.\d+)?x/i, // Multiples
-      /growth|increase|decrease|change/i
-    ];
-
-    return metricPatterns.some(pattern => pattern.test(content));
-  }
-
-  /**
-   * Extract key terms from chunk
-   */
-  private extractKeyTerms(content: string): string[] {
-    // Simple key term extraction - could be enhanced with NLP
-    const keyTermPatterns = [
-      /\b[A-Z][a-z]+ [A-Z][a-z]+\b/g, // Proper nouns (likely company/person names)
-      /\b(?:EBITDA|ROI|IRR|CAGR|SaaS|B2B|B2C)\b/gi, // Business acronyms
-      /\b\d+(?:\.\d+)?%\b/g, // Percentages
-      /\$[\d,]+(?:\.\d{2})?(?:[kmb])?/gi // Currency amounts
-    ];
-
-    const terms: string[] = [];
-    keyTermPatterns.forEach(pattern => {
-      const matches = content.match(pattern) || [];
-      terms.push(...matches);
-    });
-
-    return [...new Set(terms)].slice(0, 10); // Top 10 unique terms
-  }
-
-  /**
-   * Calculate importance score for chunk
-   */
-  private calculateImportance(content: string, sectionType?: string): 'high' | 'medium' | 'low' {
-    let score = 0;
-
-    // Section type scoring
-    if (sectionType === 'executive_summary') score += 3;
-    else if (sectionType === 'financial') score += 2;
-    else if (sectionType === 'market_analysis') score += 2;
-    else score += 1;
-
-    // Content analysis scoring
-    if (this.containsFinancialData(content)) score += 2;
-    if (this.containsMetrics(content)) score += 1;
-    if (/key|important|critical|significant/i.test(content)) score += 1;
-
-    if (score >= 5) return 'high';
-    if (score >= 3) return 'medium';
-    return 'low';
-  }
-
-  /**
-   * Calculate concept density (information richness)
-   */
-  private calculateConceptDensity(content: string): number {
-    const words = content.split(/\s+/).length;
-    const concepts = this.extractKeyTerms(content).length;
-    const financialElements = (content.match(/\$[\d,]+|\d+%|\d+(?:\.\d+)?[kmb]/gi) || []).length;
-
-    return Math.min(1.0, (concepts + financialElements) / Math.max(words / 100, 1));
-  }
-
-  /**
-   * Determine if a field is financial-related for search prioritization
-   */
-  private isFinancialField(section: IFormSection, field: IFormField): boolean {
-    const fieldText = `${section.title} ${field.label}`.toLowerCase();
-    return /financial|revenue|ebitda|profit|margin|cash|debt|cost|expense|income|sales/i.test(fieldText);
-  }
-
-  // Best Practice: Graceful shutdown
-  async shutdown(): Promise<void> {
-    logger.info('Shutting down Agentic RAG Processor');
-    this.isInitialized = false;
-  }
-}
-
-export const agenticRAGProcessor = new AgenticRAGProcessor();
\ No newline at end of file
diff --git a/backend/src/services/documentAiGenkitProcessor.ts b/backend/src/services/documentAiProcessor.ts
similarity index 97%
rename from backend/src/services/documentAiGenkitProcessor.ts
rename to backend/src/services/documentAiProcessor.ts
index 0d5b6cf..e169639 100644
--- a/backend/src/services/documentAiGenkitProcessor.ts
+++ b/backend/src/services/documentAiProcessor.ts
@@ -23,13 +23,7 @@ interface DocumentAIOutput {
   mimeType: string;
 }
 
-interface PageChunk {
-  startPage: number;
-  endPage: number;
-  buffer: Buffer;
-}
-
-export class DocumentAiGenkitProcessor {
+export class DocumentAiProcessor {
   private gcsBucketName: string;
   private documentAiClient: DocumentProcessorServiceClient;
   private storageClient: Storage;
@@ -44,7 +38,7 @@ export class DocumentAiGenkitProcessor {
     // Construct the processor name
     this.processorName = `projects/${config.googleCloud.projectId}/locations/${config.googleCloud.documentAiLocation}/processors/${config.googleCloud.documentAiProcessorId}`;
 
-    logger.info('Document AI + Genkit processor initialized', {
+    logger.info('Document AI processor initialized', {
       projectId: config.googleCloud.projectId,
       location: config.googleCloud.documentAiLocation,
       processorId: config.googleCloud.documentAiProcessorId,
@@ -386,4 +380,4 @@ export class DocumentAiGenkitProcessor {
   }
 }
 
-export const documentAiGenkitProcessor = new DocumentAiGenkitProcessor();
\ No newline at end of file
+export const documentAiProcessor = new DocumentAiProcessor();
\ No newline at end of file
diff --git a/backend/src/services/documentProcessingService.ts b/backend/src/services/documentProcessingService.ts
deleted file mode 100644
index 3c53847..0000000
--- a/backend/src/services/documentProcessingService.ts
+++ /dev/null
@@ -1,1403 +0,0 @@
-import fs from 'fs';
-import path from 'path';
-import { logger } from '../utils/logger';
-import { fileStorageService } from './fileStorageService';
-import { DocumentModel } from '../models/DocumentModel';
-import { ProcessingJobModel } from '../models/ProcessingJobModel';
-import { llmService } from './llmService';
-import { pdfGenerationService } from './pdfGenerationService';
-import { config } from '../config/env';
-import { uploadProgressService } from './uploadProgressService';
-import { CIMReview } from './llmSchemas';
-
-export interface ProcessingStep {
-  name: string;
-  status: 'pending' | 'processing' | 'completed' | 'failed';
-  startTime?: Date;
-  endTime?: Date;
-  error?: string;
-  metadata?: Record<string, any>;
-}
-
-export interface ProcessingResult {
-  success: boolean;
-  jobId: string;
-  documentId: string;
-  steps: ProcessingStep[];
-  extractedText?: string;
-  summary?: string;
-  analysis?: Record<string, any>;
-  error?: string;
-}
-
-export interface ProcessingOptions {
-  extractText?: boolean;
-  generateSummary?: boolean;
-  performAnalysis?: boolean;
-  maxTextLength?: number;
-  chunkSize?: number;
-  strategy?: string;
-}
-
-class DocumentProcessingService {
-  private readonly defaultOptions: ProcessingOptions = {
-    extractText: true,
-    generateSummary: true,
-    performAnalysis: true,
-    maxTextLength: 100000, // 100KB limit
-    chunkSize: 4000, // 4KB chunks for processing
-  };
-
-  /**
-   * Process a document through the complete pipeline
-   */
-  async processDocument(
-    documentId: string,
-    userId: string,
-    options: ProcessingOptions = {}
-  ): Promise<ProcessingResult> {
-    const mergedOptions = { ...this.defaultOptions, ...options };
-    const jobId = `job_${Date.now()}_${Math.random().toString(36).substr(2, 9)}`;
-
-    logger.info(`Starting document processing: ${documentId}`, {
-      documentId,
-      jobId,
-      userId,
-      options: mergedOptions,
-      timestamp: new Date().toISOString(),
-    });
-
-    // Initialize progress tracking
-    uploadProgressService.initializeProgress(documentId, jobId);
-
-    const steps: ProcessingStep[] = [
-      { name: 'validation', status: 'pending' },
-      { name: 'text_extraction', status: 'pending' },
-      { name: 'analysis', status: 'pending' },
-      { name: 'summary_generation', status: 'pending' },
-      { name: 'storage', status: 'pending' },
-    ];
-
-    let extractedText: string | undefined;
-    let analysis: Record<string, any> | undefined;
-    let summary: string | undefined;
-    // Removed unused variable
-    let markdownPath: string | undefined;
-    let pdfPath: string | undefined;
-
-    try {
-      // Create processing job record
-      await this.createProcessingJob(jobId, documentId);
-
-      // Step 1: Validation
-      uploadProgressService.updateProgress(documentId, 'validation', 10, 'Validating document...');
-      await this.executeStep(steps, 'validation', async () => {
-        await this.validateDocument(documentId, userId);
-      });
-
-      // Step 2: Text Extraction
-      uploadProgressService.updateProgress(documentId, 'text_extraction', 20, 'Extracting text from document...');
-      await this.executeStep(steps, 'text_extraction', async () => {
-        extractedText = await this.extractTextFromPDF(documentId);
-        uploadProgressService.updateProgress(documentId, 'text_extraction', 100, `Text extraction completed (${extractedText.length} characters)`);
-      });
-
-      // Step 3: Analysis
-      uploadProgressService.updateProgress(documentId, 'analysis', 40, 'Analyzing document content...');
-      await this.executeStep(steps, 'analysis', async () => {
-        if (extractedText && mergedOptions.performAnalysis) {
-          analysis = await this.analyzeDocument(extractedText);
-          uploadProgressService.updateProgress(documentId, 'analysis', 100, 'Document analysis completed');
-        }
-      });
-
-      // Step 4: Summary Generation
-      uploadProgressService.updateProgress(documentId, 'summary_generation', 60, 'Generating summary...');
-      await this.executeStep(steps, 'summary_generation', async () => {
-        if (extractedText && mergedOptions.generateSummary) {
-          const summaryResult = await this.generateSummary(documentId, extractedText, analysis || {});
-          summary = summaryResult.summary;
-          analysis = summaryResult.analysisData;
-
-          // Generate markdown file
-          const timestamp = new Date().toISOString().replace(/[:.]/g, '-');
-          markdownPath = `uploads/summaries/${documentId}_${timestamp}.md`;
-          pdfPath = `uploads/summaries/${documentId}_${timestamp}.pdf`;
-
-          logger.info('Saving markdown file', {
-            documentId,
-            markdownPath,
-            summaryLength: summary.length
-          });
-
-          // Save markdown file
-          await this.saveMarkdownFile(markdownPath, summary);
-
-          logger.info('Markdown file saved successfully', {
-            documentId,
-            markdownPath
-          });
-
-          // Generate PDF from markdown
-          logger.info('Generating PDF from markdown', {
-            documentId,
-            pdfPath,
-            summaryLength: summary.length
-          });
-
-          const pdfGenerated = await pdfGenerationService.generatePDFFromMarkdown(
- summary, - path.join(config.upload.uploadDir, pdfPath) - ); - - if (!pdfGenerated) { - logger.warn('PDF generation failed, continuing without PDF', { - documentId, - pdfPath - }); - } else { - logger.info('PDF generated successfully', { - documentId, - pdfPath - }); - } - - uploadProgressService.updateProgress(documentId, 'summary_generation', 100, 'Summary generation completed'); - - return { - summaryLength: summary.length, - markdownPath, - pdfPath: pdfGenerated ? pdfPath : '', - }; - } - return null; - }); - - // Step 5: Storage - uploadProgressService.updateProgress(documentId, 'storage', 90, 'Saving processing results...'); - await this.executeStep(steps, 'storage', async () => { - logger.info('Starting storage step', { - documentId, - hasExtractedText: !!extractedText, - hasSummary: !!summary, - hasAnalysis: !!analysis, - summaryLength: summary?.length || 0 - }); - - try { - const storageResult = await this.storeProcessingResults(documentId, { - extractedText: extractedText || '', - summary: summary || '', - analysis: analysis as CIMReview || {} as CIMReview, - processingSteps: steps, - markdownPath: markdownPath || '', - pdfPath: pdfPath || '', - }); - - logger.info('Storage step completed successfully', { - documentId, - storageResult - }); - - return storageResult; - } catch (error) { - logger.error('Storage step failed', { - documentId, - error: error instanceof Error ? 
error.message : 'Unknown error' - }); - throw error; - } - }); - - logger.info('All processing steps completed, updating job status', { - documentId, - jobId, - stepsCompleted: steps.filter(s => s.status === 'completed').length, - totalSteps: steps.length - }); - - // Update job status to completed - await this.updateProcessingJob(jobId, 'completed'); - - // Clean up the original uploaded file after successful processing - await this.cleanupOriginalFile(documentId); - - // Mark progress as completed - uploadProgressService.markCompleted(documentId, 'Document processing completed successfully'); - - logger.info(`Document processing completed: ${documentId}`, { - jobId, - documentId, - userId, - processingTime: this.calculateProcessingTime(steps), - summaryLength: summary?.length || 0 - }); - - return { - success: true, - jobId, - documentId, - steps, - extractedText: extractedText || '', - summary: summary || '', - analysis: analysis || {}, - }; - } catch (error) { - const errorMessage = error instanceof Error ? 
error.message : 'Unknown error'; - - logger.error(`Document processing failed: ${documentId}`, { - jobId, - documentId, - userId, - error: errorMessage, - steps: steps.map(s => ({ name: s.name, status: s.status, error: s.error })), - }); - - // Update job status to failed - await this.updateProcessingJob(jobId, 'failed'); - - // Only clean up the original uploaded file if this is the final attempt - // (not a retry) to avoid cleaning up files that might be needed for retries - const job = await ProcessingJobModel.findByJobId(jobId); - if (job && (job as any).attempts >= 3) { - await this.cleanupOriginalFile(documentId); - } - - // Mark progress as failed - uploadProgressService.markError(documentId, errorMessage); - - return { - success: false, - jobId, - documentId, - steps, - error: errorMessage, - }; - } - } - - /** - * Execute a processing step with error handling - */ - private async executeStep( - steps: ProcessingStep[], - stepName: string, - stepFunction: () => Promise - ): Promise { - const step = steps.find(s => s.name === stepName); - if (!step) { - throw new Error(`Step ${stepName} not found`); - } - - try { - step.status = 'processing'; - step.startTime = new Date(); - - logger.info(`Executing processing step: ${stepName}`); - - const result = await stepFunction(); - - step.status = 'completed'; - step.endTime = new Date(); - step.metadata = result; - - logger.info(`Processing step completed: ${stepName}`, { - duration: step.endTime.getTime() - step.startTime!.getTime(), - }); - } catch (error) { - step.status = 'failed'; - step.endTime = new Date(); - step.error = error instanceof Error ? 
error.message : 'Unknown error'; - - logger.error(`Processing step failed: ${stepName}`, { - error: step.error, - }); - - throw error; - } - } - - /** - * Validate document exists and user has access - */ - private async validateDocument(documentId: string, userId: string): Promise { - const document = await DocumentModel.findById(documentId); - - if (!document) { - throw new Error('Document not found'); - } - - if (document.user_id !== userId) { - throw new Error('Access denied'); - } - - if (!document.file_path) { - throw new Error('Document file not found'); - } - - const fileExists = await fileStorageService.fileExists(document.file_path); - if (!fileExists) { - throw new Error('Document file not accessible'); - } - - logger.info(`Document validation passed: ${documentId}`); - } - - /** - * Extract text from PDF file - */ - private async extractTextFromPDF(documentId: string): Promise { - const document = await DocumentModel.findById(documentId); - if (!document || !document.file_path) { - throw new Error('Document file not found'); - } - - try { - const fileBuffer = await fileStorageService.getFile(document.file_path); - if (!fileBuffer) { - throw new Error('Could not read document file'); - } - - const filePath = document.file_path; - let extractedText: string; - - // Check file extension to determine processing method - if (filePath.toLowerCase().endsWith('.pdf')) { - // Use pdf-parse for actual PDF text extraction - const pdfParse = require('pdf-parse'); - const data = await pdfParse(fileBuffer); - extractedText = data.text; - - logger.info(`PDF text extraction completed: ${documentId}`, { - textLength: extractedText.length, - fileSize: fileBuffer.length, - pages: data.numpages, - }); - } else { - // For text files, read the content directly - extractedText = fileBuffer.toString('utf-8'); - - logger.info(`Text file extraction completed: ${documentId}`, { - textLength: extractedText.length, - fileSize: fileBuffer.length, - fileType: path.extname(filePath), - 
}); - } - - return extractedText; - } catch (error) { - logger.error(`Text extraction failed: ${documentId}`, error); - throw new Error(`Text extraction failed: ${error instanceof Error ? error.message : 'Unknown error'}`); - } - } - - /** - * Analyze extracted text for key information - */ - private async analyzeDocument(text: string): Promise> { - try { - // Enhanced document analysis with LLM integration - const analysis: Record = { - wordCount: text.split(/\s+/).length, - characterCount: text.length, - paragraphCount: text.split(/\n\s*\n/).length, - estimatedReadingTime: Math.ceil(text.split(/\s+/).length / 200), // 200 words per minute - language: this.detectLanguage(text), - hasFinancialData: this.detectFinancialContent(text), - hasTechnicalData: this.detectTechnicalContent(text), - documentType: this.detectDocumentType(text), - keyTopics: this.extractKeyTopics(text), - sentiment: this.analyzeSentiment(text), - complexity: this.assessComplexity(text), - tokenEstimate: this.estimateTokenCount(text), - }; - - logger.info('Document analysis completed', analysis); - return analysis; - } catch (error) { - logger.error('Document analysis failed', error); - throw new Error(`Document analysis failed: ${error instanceof Error ? 
error.message : 'Unknown error'}`); - } - } - - /** - * Generate summary from extracted text using LLM with hierarchical processing - */ - private async generateSummary(documentId: string, text: string, analysis: Record): Promise<{ summary: string; analysisData: CIMReview }> { - try { - // Update document status to processing_llm - await this.updateDocumentStatus(documentId, 'processing_llm'); - - logger.info('Starting hierarchical summary generation process', { - textLength: text.length, - analysisKeys: Object.keys(analysis || {}) - }); - - // Estimate tokens and determine processing strategy - const tokenEstimate = this.estimateTokenCount(text); - const maxTokens = config.llm.maxTokens; - const threshold = maxTokens - config.llm.promptBuffer; - const needsChunking = tokenEstimate > threshold; - - logger.info('Token analysis completed', { - tokenEstimate, - maxTokens, - threshold, - needsChunking - }); - - if (needsChunking) { - // Use hierarchical processing for large documents - return await this.processLargeDocumentHierarchically(documentId, text, analysis); - } else { - // Process entire document in a single pass - const llmResult = await llmService.processCIMDocument(text, '', analysis || {}); - - if (!llmResult.success || !llmResult.jsonOutput) { - throw new Error(llmResult.error || 'LLM processing failed to return valid JSON.'); - } - - return { - summary: this.convertJsonToMarkdown(llmResult.jsonOutput), - analysisData: llmResult.jsonOutput - }; - } - } catch (error) { - logger.error('Summary generation failed', { - documentId, - error: error instanceof Error ? 
error.message : 'Unknown error' - }); - throw error; - } - } - - /** - * Process large documents using hierarchical approach - */ - private async processLargeDocumentHierarchically(documentId: string, text: string, analysis: Record): Promise<{ summary: string; analysisData: CIMReview }> { - logger.info('Starting hierarchical processing for large document'); - - // Step 1: High-level document overview (first 20% of document) - uploadProgressService.updateProgress(documentId, 'summary_generation', 65, 'Analyzing document structure...'); - - const overviewChunk = this.extractDocumentOverview(text); - const overviewResult = await llmService.processCIMDocument(overviewChunk, '', { overviewMode: true }); - - if (!overviewResult.success || !overviewResult.jsonOutput) { - throw new Error('Failed to generate document overview'); - } - - // Step 2: Section-specific analysis with larger chunks - uploadProgressService.updateProgress(documentId, 'summary_generation', 75, 'Analyzing document sections...'); - - const sections = this.extractDocumentSections(text); - const sectionResults: any[] = []; - - for (let i = 0; i < sections.length; i++) { - const section = sections[i]; - if (!section) continue; - - logger.info(`Processing section ${i + 1}/${sections.length}`, { - sectionType: section.type, - sectionLength: section.content.length - }); - - uploadProgressService.updateProgress(documentId, 'summary_generation', 75 + ((i + 1) / sections.length) * 15, `Analyzing ${section.type} section...`); - - try { - const sectionResult = await llmService.processCIMDocument( - section.content, - '', - { - sectionType: section.type, - overview: overviewResult.jsonOutput, - analysis - } - ); - - if (sectionResult.success && sectionResult.jsonOutput) { - sectionResults.push({ - type: section.type, - data: sectionResult.jsonOutput - }); - } - } catch (error) { - logger.warn(`Section ${section.type} processing failed, continuing with other sections`, { - error: error instanceof Error ? 
error.message : 'Unknown error' - }); - } - } - - // Step 3: Synthesize results - uploadProgressService.updateProgress(documentId, 'summary_generation', 95, 'Synthesizing analysis...'); - - const synthesizedResult = await this.synthesizeSectionResults( - overviewResult.jsonOutput, - sectionResults, - text - ); - - return { - summary: this.convertJsonToMarkdown(synthesizedResult), - analysisData: synthesizedResult - }; - } - - /** - * Extract document overview (first portion with key information) - */ - private extractDocumentOverview(text: string): string { - // Extract first 30% of document which typically contains executive summary, company overview - const overviewLength = Math.floor(text.length * 0.3); - return text.substring(0, overviewLength); - } - - /** - * Extract logical sections from document - */ - private extractDocumentSections(text: string): Array<{ type: string; content: string }> { - const sections: Array<{ type: string; content: string }> = []; - - // Split document into logical sections based on headers - const sectionHeaders = [ - { pattern: /(?:^|\n)(?:executive\s+summary|overview|introduction)/i, type: 'overview' }, - { pattern: /(?:^|\n)(?:business\s+description|company\s+overview|operations)/i, type: 'business' }, - { pattern: /(?:^|\n)(?:market\s+analysis|industry\s+analysis|competitive\s+landscape)/i, type: 'market' }, - { pattern: /(?:^|\n)(?:financial\s+(?:overview|summary|performance)|financials)/i, type: 'financial' }, - { pattern: /(?:^|\n)(?:management\s+(?:team|overview)|leadership)/i, type: 'management' }, - { pattern: /(?:^|\n)(?:investment\s+(?:thesis|opportunity|highlights))/i, type: 'investment' }, - ]; - - let currentSection = { type: 'overview', content: text }; - let remainingText = text; - - for (const header of sectionHeaders) { - const match = remainingText.match(header.pattern); - if (match) { - // Save previous section - if (currentSection.content.length > 1000) { // Only include substantial sections - 
sections.push(currentSection); - } - - // Start new section - const startIndex = match.index!; - currentSection = { - type: header.type, - content: remainingText.substring(startIndex) - }; - remainingText = remainingText.substring(startIndex); - } - } - - // Add final section - if (currentSection.content.length > 1000) { - sections.push(currentSection); - } - - // If no sections found, use intelligent chunking - if (sections.length === 0) { - const chunks = this.chunkText(text, 15000); // Larger chunks for section analysis - return chunks.map((chunk, index) => ({ - type: `section_${index + 1}`, - content: chunk - })); - } - - return sections; - } - - /** - * Synthesize results from different sections - */ - private async synthesizeSectionResults( - overview: CIMReview, - sectionResults: Array<{ type: string; data: any }>, - fullText: string - ): Promise { - // Combine all section data - const combinedData = { ...overview }; - - for (const section of sectionResults) { - // Merge section data into combined structure - this.mergeSectionData(combinedData, section.data, section.type); - } - - // Final synthesis pass with full context - const synthesisResult = await llmService.processCIMDocument( - JSON.stringify(combinedData), - '', - { - synthesisMode: true, - fullText: fullText.substring(0, 10000) // Include sample of full text for context - } - ); - - if (synthesisResult.success && synthesisResult.jsonOutput) { - return synthesisResult.jsonOutput; - } - - return combinedData; - } - - /** - * Merge section data into main structure - */ - private mergeSectionData(mainData: CIMReview, sectionData: any, sectionType: string): void { - switch (sectionType) { - case 'business': - if (sectionData.businessDescription) { - mainData.businessDescription = { - ...mainData.businessDescription, - ...sectionData.businessDescription - }; - } - break; - case 'market': - if (sectionData.marketIndustryAnalysis) { - mainData.marketIndustryAnalysis = { - 
...mainData.marketIndustryAnalysis, - ...sectionData.marketIndustryAnalysis - }; - } - break; - case 'financial': - if (sectionData.financialSummary) { - mainData.financialSummary = { - ...mainData.financialSummary, - ...sectionData.financialSummary - }; - } - break; - case 'management': - if (sectionData.managementTeamOverview) { - mainData.managementTeamOverview = { - ...mainData.managementTeamOverview, - ...sectionData.managementTeamOverview - }; - } - break; - case 'investment': - if (sectionData.preliminaryInvestmentThesis) { - mainData.preliminaryInvestmentThesis = { - ...mainData.preliminaryInvestmentThesis, - ...sectionData.preliminaryInvestmentThesis - }; - } - break; - } - } - - /** - * Update document status during processing - */ - private async updateDocumentStatus(documentId: string, status: string): Promise { - try { - const updateData: any = { status }; - - if (status === 'processing_llm') { - updateData.processing_started_at = new Date(); - } - - const updated = await DocumentModel.updateById(documentId, updateData); - if (!updated) { - logger.warn(`Failed to update document status: ${documentId} - ${status}`); - } else { - logger.info(`Document status updated: ${documentId} - ${status}`); - } - } catch (error) { - logger.error(`Failed to update document status: ${documentId} - ${status}`, error); - } - } - - /** - * Store processing results in database - */ - private async storeProcessingResults( - documentId: string, - results: { - extractedText?: string; - summary?: string; - analysis?: CIMReview; - processingSteps: ProcessingStep[]; - markdownPath?: string; - pdfPath?: string; - } - ): Promise { - try { - const updateFields: any = { - status: 'completed', - processing_completed_at: new Date(), - }; - - if (results.extractedText) { - updateFields.extracted_text = results.extractedText; - } - - if (results.summary) { - updateFields.generated_summary = results.summary; - } - - if (results.analysis) { - updateFields.analysis_data = 
results.analysis; - } - - if (results.markdownPath) { - updateFields.summary_markdown_path = results.markdownPath; - } - - if (results.pdfPath) { - updateFields.summary_pdf_path = results.pdfPath; - } - - const updated = await DocumentModel.updateById(documentId, updateFields); - if (!updated) { - throw new Error('Failed to update document with processing results'); - } - - logger.info(`Processing results stored: ${documentId}`); - } catch (error) { - logger.error(`Failed to store processing results: ${documentId}`, error); - throw new Error(`Failed to store processing results: ${error instanceof Error ? error.message : 'Unknown error'}`); - } - } - - /** - * Create processing job record - */ - private async createProcessingJob( - jobId: string, - documentId: string - ): Promise { - try { - await ProcessingJobModel.create({ - job_id: jobId, - document_id: documentId, - type: 'llm_processing', // Default type - }); - - logger.info(`Processing job created: ${jobId}`); - } catch (error) { - logger.error(`Failed to create processing job: ${jobId}`, error); - throw error; - } - } - - /** - * Update processing job status - */ - private async updateProcessingJob( - jobId: string, - status: string - ): Promise { - // Note: Job queue service manages jobs in memory, database jobs are separate - // This method is kept for potential future integration but currently disabled - // to avoid warnings about missing job_id values in database - logger.debug(`Processing job status update (in-memory): ${jobId} -> ${status}`); - } - - /** - * Calculate total processing time - */ - private calculateProcessingTime(steps: ProcessingStep[]): number { - const completedSteps = steps.filter(s => s.startTime && s.endTime); - if (completedSteps.length === 0) return 0; - - const startTime = Math.min(...completedSteps.map(s => s.startTime!.getTime())); - const endTime = Math.max(...completedSteps.map(s => s.endTime!.getTime())); - - return endTime - startTime; - } - - /** - * Detect language of 
the text - */ - private detectLanguage(text: string): string { - // Simple language detection based on common words - const lowerText = text.toLowerCase(); - - if (lowerText.includes('the') && lowerText.includes('and') && lowerText.includes('of')) { - return 'en'; - } - - // Add more language detection logic as needed - return 'en'; // Default to English - } - - /** - * Detect financial content in text - */ - private detectFinancialContent(text: string): boolean { - const financialKeywords = [ - 'financial', 'revenue', 'profit', 'ebitda', 'margin', 'cash flow', - 'balance sheet', 'income statement', 'assets', 'liabilities', - 'equity', 'debt', 'investment', 'return', 'valuation' - ]; - - const lowerText = text.toLowerCase(); - return financialKeywords.some(keyword => lowerText.includes(keyword)); - } - - /** - * Detect technical content in text - */ - private detectTechnicalContent(text: string): boolean { - const technicalKeywords = [ - 'technical', 'specification', 'requirements', 'architecture', - 'system', 'technology', 'software', 'hardware', 'protocol', - 'algorithm', 'data', 'analysis', 'methodology' - ]; - - const lowerText = text.toLowerCase(); - return technicalKeywords.some(keyword => lowerText.includes(keyword)); - } - - /** - * Extract key topics from text - */ - private extractKeyTopics(text: string): string[] { - const topics: string[] = []; - const lowerText = text.toLowerCase(); - - // Extract potential topics based on common patterns - if (lowerText.includes('financial')) topics.push('Financial Analysis'); - if (lowerText.includes('market')) topics.push('Market Analysis'); - if (lowerText.includes('competitive')) topics.push('Competitive Landscape'); - if (lowerText.includes('technology')) topics.push('Technology'); - if (lowerText.includes('operations')) topics.push('Operations'); - if (lowerText.includes('management')) topics.push('Management'); - - return topics.slice(0, 5); // Return top 5 topics - } - - /** - * Analyze sentiment of the text - 
*/ - private analyzeSentiment(text: string): string { - const positiveWords = ['growth', 'increase', 'positive', 'strong', 'excellent', 'opportunity']; - const negativeWords = ['decline', 'decrease', 'negative', 'weak', 'risk', 'challenge']; - - const lowerText = text.toLowerCase(); - const positiveCount = positiveWords.filter(word => lowerText.includes(word)).length; - const negativeCount = negativeWords.filter(word => lowerText.includes(word)).length; - - if (positiveCount > negativeCount) return 'positive'; - if (negativeCount > positiveCount) return 'negative'; - return 'neutral'; - } - - /** - * Assess complexity of the text - */ - private assessComplexity(text: string): string { - const words = text.split(/\s+/); - const avgWordLength = words.reduce((sum: number, word: string) => sum + word.length, 0) / words.length; - const sentenceCount = text.split(/[.!?]+/).length; - const avgSentenceLength = words.length / sentenceCount; - - if (avgWordLength > 6 || avgSentenceLength > 20) return 'high'; - if (avgWordLength > 5 || avgSentenceLength > 15) return 'medium'; - return 'low'; - } - - private convertJsonToMarkdown(data: CIMReview): string { - let markdown = `# BPCP CIM Review Template: ${data.dealOverview.targetCompanyName}\n\n`; - - // (A) Deal Overview - markdown += `## (A) Deal Overview\n\n`; - markdown += `- **Target Company Name:** ${data.dealOverview.targetCompanyName}\n`; - markdown += `- **Industry/Sector:** ${data.dealOverview.industrySector}\n`; - markdown += `- **Geography (HQ & Key Operations):** ${data.dealOverview.geography}\n`; - markdown += `- **Deal Source:** ${data.dealOverview.dealSource}\n`; - markdown += `- **Transaction Type:** ${data.dealOverview.transactionType}\n`; - markdown += `- **Date CIM Received:** ${data.dealOverview.dateCIMReceived}\n`; - markdown += `- **Date Reviewed:** ${data.dealOverview.dateReviewed}\n`; - markdown += `- **Reviewer(s):** ${data.dealOverview.reviewers}\n`; - markdown += `- **CIM Page Count:** 
${data.dealOverview.cimPageCount}\n`; - markdown += `- **Stated Reason for Sale (if provided):** ${data.dealOverview.statedReasonForSale}\n\n`; - - // (B) Business Description - markdown += `## (B) Business Description\n\n`; - markdown += `- **Core Operations Summary (3-5 sentences):** ${data.businessDescription.coreOperationsSummary}\n`; - markdown += `- **Key Products/Services & Revenue Mix (Est. % if available):** ${data.businessDescription.keyProductsServices}\n`; - markdown += `- **Unique Value Proposition (UVP) / Why Customers Buy:** ${data.businessDescription.uniqueValueProposition}\n`; - markdown += `- **Customer Base Overview:**\n`; - markdown += ` - **Key Customer Segments/Types:** ${data.businessDescription.customerBaseOverview.keyCustomerSegments}\n`; - markdown += ` - **Customer Concentration Risk (Top 5 and/or Top 10 Customers as % Revenue - if stated/inferable):** ${data.businessDescription.customerBaseOverview.customerConcentrationRisk}\n`; - markdown += ` - **Typical Contract Length / Recurring Revenue % (if applicable):** ${data.businessDescription.customerBaseOverview.typicalContractLength}\n`; - markdown += `- **Key Supplier Overview (if critical & mentioned):**\n`; - markdown += ` - **Dependence/Concentration Risk:** ${data.businessDescription.keySupplierOverview.dependenceConcentrationRisk}\n\n`; - - // (C) Market & Industry Analysis - markdown += `## (C) Market & Industry Analysis\n\n`; - markdown += `- **Estimated Market Size (TAM/SAM - if provided):** ${data.marketIndustryAnalysis.estimatedMarketSize}\n`; - markdown += `- **Estimated Market Growth Rate (% CAGR - Historical & Projected):** ${data.marketIndustryAnalysis.estimatedMarketGrowthRate}\n`; - markdown += `- **Key Industry Trends & Drivers (Tailwinds/Headwinds):** ${data.marketIndustryAnalysis.keyIndustryTrends}\n`; - markdown += `- **Competitive Landscape:**\n`; - markdown += ` - **Key Competitors Identified:** ${data.marketIndustryAnalysis.competitiveLandscape.keyCompetitors}\n`; - 
markdown += ` - **Target's Stated Market Position/Rank:** ${data.marketIndustryAnalysis.competitiveLandscape.targetMarketPosition}\n`; - markdown += ` - **Basis of Competition:** ${data.marketIndustryAnalysis.competitiveLandscape.basisOfCompetition}\n`; - markdown += `- **Barriers to Entry / Competitive Moat (Stated/Inferred):** ${data.marketIndustryAnalysis.barriersToEntry}\n\n`; - - // (D) Financial Summary - markdown += `## (D) Financial Summary\n\n`; - markdown += `### Key Historical Financials\n\n`; - markdown += `| Metric | FY-3 (or earliest avail.) | FY-2 | FY-1 | LTM (Last Twelve Months) |\n`; - markdown += `| :--- | :---: | :---: | :---: | :---: |\n`; - - // Generate table rows from the financials data structure - const metricsToDisplay: Array<{ name: string; field: keyof typeof data.financialSummary.financials.fy3 }> = [ - { name: 'Revenue', field: 'revenue' }, - { name: '_Revenue Growth (%)_', field: 'revenueGrowth' }, - { name: 'Gross Profit (if avail.)', field: 'grossProfit' }, - { name: '_Gross Margin (%)_', field: 'grossMargin' }, - { name: 'EBITDA (Note Adjustments)', field: 'ebitda' }, - { name: '_EBITDA Margin (%)_', field: 'ebitdaMargin' } - ]; - - metricsToDisplay.forEach(metric => { - markdown += `| ${metric.name} | ${data.financialSummary.financials.fy3[metric.field] || 'N/A'} | ${data.financialSummary.financials.fy2[metric.field] || 'N/A'} | ${data.financialSummary.financials.fy1[metric.field] || 'N/A'} | ${data.financialSummary.financials.ltm[metric.field] || 'N/A'} |\n`; - }); - - markdown += `\n`; - - markdown += `### Key Financial Notes & Observations\n\n`; - markdown += `- **Quality of Earnings/Adjustments (Initial Impression):** ${data.financialSummary.qualityOfEarnings}\n`; - markdown += `- **Revenue Growth Drivers (Stated):** ${data.financialSummary.revenueGrowthDrivers}\n`; - markdown += `- **Margin Stability/Trend Analysis:** ${data.financialSummary.marginStabilityAnalysis}\n`; - markdown += `- **Capital Expenditures (Approx. 
LTM % of Revenue):** ${data.financialSummary.capitalExpenditures}\n`; - markdown += `- **Working Capital Intensity (Impression):** ${data.financialSummary.workingCapitalIntensity}\n`; - markdown += `- **Free Cash Flow (FCF) Proxy Quality (Impression):** ${data.financialSummary.freeCashFlowQuality}\n\n`; - - // (E) Management Team Overview - markdown += `## (E) Management Team Overview\n\n`; - markdown += `- **Key Leaders Identified (CEO, CFO, COO, Head of Sales, etc.):** ${data.managementTeamOverview.keyLeaders}\n`; - markdown += `- **Initial Assessment of Quality/Experience (Based on Bios):** ${data.managementTeamOverview.managementQualityAssessment}\n`; - markdown += `- **Management's Stated Post-Transaction Role/Intentions (if mentioned):** ${data.managementTeamOverview.postTransactionIntentions}\n`; - markdown += `- **Organizational Structure Overview (Impression):** ${data.managementTeamOverview.organizationalStructure}\n\n`; - - // (F) Preliminary Investment Thesis - markdown += `## (F) Preliminary Investment Thesis\n\n`; - markdown += `- **Key Attractions / Strengths (Why Invest?):** ${data.preliminaryInvestmentThesis.keyAttractions}\n`; - markdown += `- **Potential Risks / Concerns (Why Not Invest?):** ${data.preliminaryInvestmentThesis.potentialRisks}\n`; - markdown += `- **Initial Value Creation Levers (How PE Adds Value):** ${data.preliminaryInvestmentThesis.valueCreationLevers}\n`; - markdown += `- **Alignment with Fund Strategy:** ${data.preliminaryInvestmentThesis.alignmentWithFundStrategy}\n\n`; - - // (G) Key Questions & Next Steps - markdown += `## (G) Key Questions & Next Steps\n\n`; - markdown += `- **Critical Questions Arising from CIM Review:** ${data.keyQuestionsNextSteps.criticalQuestions}\n`; - markdown += `- **Key Missing Information / Areas for Diligence Focus:** ${data.keyQuestionsNextSteps.missingInformation}\n`; - markdown += `- **Preliminary Recommendation:** ${data.keyQuestionsNextSteps.preliminaryRecommendation}\n`; - markdown += `- 
**Rationale for Recommendation (Brief):** ${data.keyQuestionsNextSteps.rationaleForRecommendation}\n`; - markdown += `- **Proposed Next Steps:** ${data.keyQuestionsNextSteps.proposedNextSteps}\n\n`; - - return markdown; - } - - /** - * Get default template (fallback if BPCP template not found) - */ - private getDefaultTemplate(): string { - // This can be simplified as the template is now embodied in the JSON schema - return 'Provide a comprehensive analysis of the CIM document in the required JSON format.'; - } - - // eslint-disable-next-line @typescript-eslint/no-unused-vars - // @ts-ignore - private async combineChunkResults(chunkResults: any[]): Promise<{ summary: string; analysisData: CIMReview }> { - const combinedJson = this.mergeJsonObjects(chunkResults.map(r => r.jsonOutput)); - - // Final refinement step - const finalResult = await llmService.processCIMDocument(JSON.stringify(combinedJson), '', { refinementMode: true }); - - if (!finalResult.success || !finalResult.jsonOutput) { - logger.warn('Final refinement step failed, using combined JSON without refinement.'); - return { - summary: this.convertJsonToMarkdown(combinedJson), - analysisData: combinedJson - }; - } - - return { - summary: this.convertJsonToMarkdown(finalResult.jsonOutput), - analysisData: finalResult.jsonOutput - }; - } - - private mergeJsonObjects(objects: CIMReview[]): CIMReview { - if (!objects || objects.length === 0) { - // This should not happen if we have successful chunk results, but as a safeguard: - throw new Error("Cannot merge empty array of JSON objects."); - } - - // This is a simplified merge. A more sophisticated version would handle conflicts. 
- const base = JSON.parse(JSON.stringify(objects[0])); // Deep copy to avoid mutation issues - - for (let i = 1; i < objects.length; i++) { - const obj = objects[i]; - if (obj) { - // Simple merge - later objects overwrite earlier ones for most fields - Object.assign(base.dealOverview, obj.dealOverview); - Object.assign(base.businessDescription, obj.businessDescription); - Object.assign(base.marketIndustryAnalysis, obj.marketIndustryAnalysis); - Object.assign(base.financialSummary, obj.financialSummary); - Object.assign(base.managementTeamOverview, obj.managementTeamOverview); - Object.assign(base.preliminaryInvestmentThesis, obj.preliminaryInvestmentThesis); - Object.assign(base.keyQuestionsNextSteps, obj.keyQuestionsNextSteps); - } - } - return base; - } - - // Removed unused function parseAllChunkSections - - // Removed unused function parseCIMSections - - // Removed unused function getSectionKey - - /** - * Merge CIM sections from multiple chunks - */ - // Removed unused function mergeCIMSections - - // Removed unused function mergeSectionContent - - /** - * Build the final combined markdown from merged sections - */ - // Removed unused function buildCombinedMarkdown - - // Removed unused function getSectionTitle - - /** - * Save markdown file - */ - private async saveMarkdownFile(filePath: string, content: string): Promise<void> { - try { - const fullPath = path.join(config.upload.uploadDir, filePath); - const dir = path.dirname(fullPath); - - if (!fs.existsSync(dir)) { - fs.mkdirSync(dir, { recursive: true }); - } - - fs.writeFileSync(fullPath, content, 'utf-8'); - logger.info(`Markdown file saved: ${filePath}`); - } catch (error) { - logger.error(`Failed to save markdown file: ${filePath}`, error); - throw new Error(`Failed to save markdown file: ${error instanceof Error ?
error.message : 'Unknown error'}`); - } - } - - /** - * Detect document type based on content - */ - private detectDocumentType(text: string): string { - const lowerText = text.toLowerCase(); - - if (lowerText.includes('financial') || lowerText.includes('revenue') || lowerText.includes('profit')) { - return 'financial_report'; - } - - if (lowerText.includes('technical') || lowerText.includes('specification') || lowerText.includes('requirements')) { - return 'technical_document'; - } - - if (lowerText.includes('contract') || lowerText.includes('agreement') || lowerText.includes('legal')) { - return 'legal_document'; - } - - return 'general_document'; - } - - /** - * Estimate token count for text - */ - private estimateTokenCount(text: string): number { - // Rough estimation: 1 token ≈ 4 characters for English text - return Math.ceil(text.length / 4); - } - - /** - * Chunk text for processing with intelligent boundaries and overlap - */ - private chunkText(text: string, maxTokens: number = 4000): string[] { - const chunks: string[] = []; - const estimatedTokens = this.estimateTokenCount(text); - - if (estimatedTokens <= maxTokens) { - return [text]; - } - - // Calculate overlap size (20% of max tokens for continuity) - const overlapTokens = Math.floor(maxTokens * 0.2); - const overlapChars = overlapTokens * 4; // Rough conversion back to characters - - // Split by paragraphs first - const paragraphs = text.split(/\n\s*\n/).filter(p => p.trim()); - - if (paragraphs.length === 0) { - // If no paragraphs, split by sentences - const sentences = text.split(/[.!?]+/).filter(s => s.trim()); - return this.chunkBySentences(sentences, maxTokens, overlapChars); - } - - let currentChunk = ''; - let currentTokens = 0; - - for (let i = 0; i < paragraphs.length; i++) { - const paragraph = paragraphs[i]; - if (!paragraph) continue; - const paragraphTokens = this.estimateTokenCount(paragraph); - - // Check if adding this paragraph would exceed the limit - if (currentTokens + 
paragraphTokens > maxTokens && currentChunk) { - // Current chunk is full, save it - chunks.push(currentChunk.trim()); - - // Start new chunk with overlap from the end of previous chunk - if (chunks.length > 0 && overlapChars > 0) { - const previousChunk = chunks[chunks.length - 1]; - if (!previousChunk) continue; - const overlapText = previousChunk.slice(-overlapChars); - - // Find the last complete paragraph in the overlap - const overlapParagraphs = overlapText.split(/\n\s*\n/); - if (overlapParagraphs.length > 1) { - const lastCompleteParagraph = overlapParagraphs[overlapParagraphs.length - 1]; - currentChunk = lastCompleteParagraph + '\n\n'; - currentTokens = this.estimateTokenCount(currentChunk); - } else { - currentChunk = ''; - currentTokens = 0; - } - } else { - currentChunk = ''; - currentTokens = 0; - } - } - - // Add paragraph to current chunk - if (currentChunk) { - currentChunk += '\n\n' + paragraph; - } else { - currentChunk = paragraph || ''; - } - currentTokens += paragraphTokens; - } - - // Add the final chunk if it has content - if (currentChunk.trim()) { - chunks.push(currentChunk.trim()); - } - - // Ensure we don't have empty chunks - const validChunks = chunks.filter(chunk => chunk.trim().length > 0); - - logger.info('Text chunking completed', { - originalLength: text.length, - estimatedTokens, - maxTokens, - overlapTokens, - paragraphs: paragraphs.length, - chunks: validChunks.length, - chunkSizes: validChunks.map(chunk => chunk.length) - }); - - return validChunks; - } - - /** - * Chunk text by sentences when paragraph chunking isn't suitable - */ - private chunkBySentences(sentences: string[], maxTokens: number, overlapChars: number): string[] { - const chunks: string[] = []; - let currentChunk = ''; - let currentTokens = 0; - - for (let i = 0; i < sentences.length; i++) { - const sentence = sentences[i]; - if (!sentence) continue; - const sentenceTokens = this.estimateTokenCount(sentence); - - if (currentTokens + sentenceTokens > maxTokens 
&& currentChunk) { - chunks.push(currentChunk.trim()); - - // Add overlap from previous chunk - if (chunks.length > 0 && overlapChars > 0) { - const previousChunk = chunks[chunks.length - 1]; - if (!previousChunk) continue; - const overlapText = previousChunk.slice(-overlapChars); - - // Find the last complete sentence in the overlap - const overlapSentences = overlapText.split(/[.!?]+/); - if (overlapSentences.length > 1) { - const lastCompleteSentence = overlapSentences[overlapSentences.length - 1]; - currentChunk = lastCompleteSentence + '. '; - currentTokens = this.estimateTokenCount(currentChunk); - } else { - currentChunk = ''; - currentTokens = 0; - } - } else { - currentChunk = ''; - currentTokens = 0; - } - } - - if (currentChunk) { - currentChunk += sentence + '. '; - } else { - currentChunk = sentence + '. '; - } - currentTokens += sentenceTokens; - } - - if (currentChunk.trim()) { - chunks.push(currentChunk.trim()); - } - - return chunks.filter(chunk => chunk.trim().length > 0); - } - - /** - * Refine the combined summary using LLM for better coherence and completeness - */ - // Removed unused function refineCombinedSummary - - /** - * Get processing job status - */ - async getProcessingJobStatus(jobId: string): Promise<ProcessingJob | null> { - try { - const job = await ProcessingJobModel.findByJobId(jobId); - return job; - } catch (error) { - logger.error(`Failed to get processing job status: ${jobId}`, error); - throw error; - } - } - - /** - * Get document processing history - */ - async getDocumentProcessingHistory(documentId: string): Promise<ProcessingJob[]> { - try { - const jobs = await ProcessingJobModel.findByDocumentId(documentId); - return jobs; - } catch (error) { - logger.error(`Failed to get document processing history: ${documentId}`, error); - throw error; - } - } - - /** - * Regenerate summary for an existing document - */ - async regenerateSummary(documentId: string): Promise<void> { - try { - logger.info('Starting summary regeneration', { documentId }); - - // Get the document
- const document = await DocumentModel.findById(documentId); - if (!document) { - throw new Error('Document not found'); - } - - if (!document.extracted_text) { - throw new Error('Document has no extracted text to regenerate summary from'); - } - - // Update status to processing - await this.updateDocumentStatus(documentId, 'processing_llm'); - - // Load template - const templatePath = path.join(process.cwd(), '..', 'BPCP CIM REVIEW TEMPLATE.md'); - let template = ''; - - try { - template = fs.readFileSync(templatePath, 'utf-8'); - logger.info('BPCP template loaded successfully', { templateLength: template.length }); - } catch (error) { - logger.warn('Could not load BPCP template, using default template', { error: error instanceof Error ? error.message : 'Unknown error' }); - template = this.getDefaultTemplate(); - } - - // Generate new summary - const summaryResult = await this.generateSummary(documentId, document.extracted_text, {}); - const newSummary = summaryResult.summary; - const newAnalysisData = summaryResult.analysisData; - - // Save new markdown file - const timestamp = new Date().toISOString().replace(/[:.]/g, '-'); - const markdownPath = `uploads/summaries/${documentId}_${timestamp}.md`; - const fullMarkdownPath = path.join(process.cwd(), markdownPath); - - await this.saveMarkdownFile(fullMarkdownPath, newSummary); - - // Generate PDF - const pdfPath = markdownPath.replace('.md', '.pdf'); - const fullPdfPath = path.join(process.cwd(), pdfPath); - - await pdfGenerationService.generatePDFFromMarkdown(newSummary, fullPdfPath); - - // Update document with new summary - const updateData = { - generated_summary: newSummary, - analysis_data: newAnalysisData, - summary_markdown_path: markdownPath, - summary_pdf_path: pdfPath, - status: 'completed' as const, - processing_completed_at: new Date() - }; - - const updated = await DocumentModel.updateById(documentId, updateData); - if (!updated) { - throw new Error('Failed to update document with new summary'); - } 
- - logger.info('Summary regeneration completed successfully', { - documentId, - newSummaryLength: newSummary.length, - markdownPath, - pdfPath - }); - - } catch (error) { - logger.error('Summary regeneration failed', { - documentId, - error: error instanceof Error ? error.message : 'Unknown error' - }); - - // Update status to failed - await this.updateDocumentStatus(documentId, 'failed'); - throw error; - } - } - - /** - * Clean up the original uploaded file after successful processing - */ - private async cleanupOriginalFile(documentId: string): Promise<void> { - try { - const document = await DocumentModel.findById(documentId); - if (!document || !document.file_path) { - logger.warn(`No file path found for document: ${documentId}`); - return; - } - - // Check if file exists before attempting to delete - if (await fileStorageService.fileExists(document.file_path)) { - await fileStorageService.deleteFile(document.file_path); - logger.info(`Cleaned up original uploaded file: ${document.file_path}`); - } else { - logger.warn(`Original file not found for cleanup: ${document.file_path}`); - } - } catch (error) { - logger.error(`Failed to cleanup original file: ${documentId}`, error); - // Don't throw error - cleanup failure shouldn't fail the entire process - } - } -} - -export const documentProcessingService = new DocumentProcessingService(); -export default documentProcessingService; \ No newline at end of file diff --git a/backend/src/services/jobQueueService.ts b/backend/src/services/jobQueueService.ts index ff71dcf..0b42613 100644 --- a/backend/src/services/jobQueueService.ts +++ b/backend/src/services/jobQueueService.ts @@ -2,10 +2,18 @@ import { EventEmitter } from 'events'; import path from 'path'; import { logger, StructuredLogger } from '../utils/logger'; import { config } from '../config/env'; -import { ProcessingOptions } from './documentProcessingService'; import { unifiedDocumentProcessor } from './unifiedDocumentProcessor'; import { uploadMonitoringService }
from './uploadMonitoringService'; +// Define ProcessingOptions interface locally since documentProcessingService was removed +export interface ProcessingOptions { + strategy?: string; + fileBuffer?: Buffer; + fileName?: string; + mimeType?: string; + [key: string]: any; +} + export interface Job { id: string; type: 'document_processing'; diff --git a/backend/src/services/ragDocumentProcessor.ts b/backend/src/services/ragDocumentProcessor.ts deleted file mode 100644 index ded2c54..0000000 --- a/backend/src/services/ragDocumentProcessor.ts +++ /dev/null @@ -1,410 +0,0 @@ -import { logger } from '../utils/logger'; -import { llmService } from './llmService'; - -import { CIMReview } from './llmSchemas'; - -interface DocumentSection { - id: string; - type: 'executive_summary' | 'business_description' | 'financial_analysis' | 'market_analysis' | 'management' | 'investment_thesis'; - content: string; - pageRange: [number, number]; - keyMetrics: Record<string, any>; - relevanceScore: number; -} - -interface RAGQuery { - section: string; - context: string; - specificQuestions: string[]; -} - -interface RAGAnalysisResult { - success: boolean; - summary: string; - analysisData: CIMReview; - error?: string; - processingTime: number; - apiCalls: number; -} - -class RAGDocumentProcessor { - private sections: DocumentSection[] = []; - - private apiCallCount: number = 0; - - /** - * Process CIM document using RAG approach - */ - async processDocument(text: string, documentId: string): Promise<RAGAnalysisResult> { - const startTime = Date.now(); - this.apiCallCount = 0; - - logger.info('Starting RAG-based CIM processing', { documentId }); - - try { - // Step 1: Intelligent document segmentation - await this.segmentDocument(text); - - // Step 2: Extract key metrics and context - await this.extractKeyMetrics(); - - // Step 3: Generate comprehensive analysis using RAG - const analysis = await this.generateRAGAnalysis(); - - // Step 4: Create final summary - const summary = await this.createFinalSummary(analysis); - 
const processingTime = Date.now() - startTime; - - logger.info('RAG processing completed successfully', { - documentId, - processingTime, - apiCalls: this.apiCallCount, - sections: this.sections.length - }); - - return { - success: true, - summary, - analysisData: analysis, - processingTime, - apiCalls: this.apiCallCount - }; - - } catch (error) { - const processingTime = Date.now() - startTime; - logger.error('RAG processing failed', { - documentId, - error: error instanceof Error ? error.message : 'Unknown error', - processingTime, - apiCalls: this.apiCallCount - }); - - return { - success: false, - summary: '', - analysisData: {} as CIMReview, - error: error instanceof Error ? error.message : 'Unknown error', - processingTime, - apiCalls: this.apiCallCount - }; - } - } - - /** - * Segment document into logical sections with metadata - */ - private async segmentDocument(text: string): Promise<void> { - logger.info('Segmenting document into logical sections'); - - // Use LLM to identify and segment document sections - const segmentationPrompt = ` - Analyze this CIM document and identify its logical sections. For each section, provide: - 1. Section type (executive_summary, business_description, financial_analysis, market_analysis, management, investment_thesis) - 2. Start and end page numbers - 3. Key topics covered - 4. Relevance to investment analysis (1-10 scale) - - Document text: - ${text.substring(0, 50000)} // First 50K chars for section identification - - Return as JSON array of sections. - `; - - const segmentationResult = await this.callLLM({ - prompt: segmentationPrompt, - systemPrompt: 'You are an expert at analyzing CIM document structure.
Identify logical sections accurately.', - maxTokens: 2000, - temperature: 0.1 - }); - - if (segmentationResult.success) { - try { - const sections = JSON.parse(segmentationResult.content); - this.sections = sections.map((section: any, index: number) => ({ - id: `section_${index}`, - type: section.type, - content: this.extractSectionContent(text, section.pageRange), - pageRange: section.pageRange, - keyMetrics: {}, - relevanceScore: section.relevanceScore - })); - } catch (error) { - logger.error('Failed to parse section segmentation', { error }); - // Fallback to rule-based segmentation - this.sections = this.fallbackSegmentation(text); - } - } - } - - /** - * Extract key metrics from each section - */ - private async extractKeyMetrics(): Promise<void> { - logger.info('Extracting key metrics from document sections'); - - for (const section of this.sections) { - const metricsPrompt = ` - Extract key financial and business metrics from this section: - - Section Type: ${section.type} - Content: ${section.content.substring(0, 10000)} - - Focus on: - - Revenue, EBITDA, margins - - Growth rates, market size - - Customer metrics, employee count - - Key risks and opportunities - - Return as JSON object.
- `; - - const metricsResult = await this.callLLM({ - prompt: metricsPrompt, - systemPrompt: 'Extract precise numerical and qualitative metrics from CIM sections.', - maxTokens: 1500, - temperature: 0.1 - }); - - if (metricsResult.success) { - try { - section.keyMetrics = JSON.parse(metricsResult.content); - } catch (error) { - logger.warn('Failed to parse metrics for section', { sectionId: section.id, error }); - } - } - } - } - - /** - * Generate analysis using RAG approach - */ - private async generateRAGAnalysis(): Promise<CIMReview> { - logger.info('Generating RAG-based analysis'); - - // Create queries for each section of the BPCP template - const queries: RAGQuery[] = [ - { - section: 'dealOverview', - context: 'Extract deal-specific information including company name, industry, geography, transaction details', - specificQuestions: [ - 'What is the target company name?', - 'What industry/sector does it operate in?', - 'Where is the company headquartered?', - 'What type of transaction is this?', - 'What is the stated reason for sale?' - ] - }, - { - section: 'businessDescription', - context: 'Analyze the company\'s core operations, products/services, and customer base', - specificQuestions: [ - 'What are the core operations?', - 'What are the key products/services?', - 'What is the revenue mix?', - 'Who are the key customers?', - 'What is the unique value proposition?' - ] - }, - { - section: 'financialSummary', - context: 'Extract and analyze financial performance, trends, and quality metrics', - specificQuestions: [ - 'What are the revenue trends?', - 'What are the EBITDA margins?', - 'What is the quality of earnings?', - 'What are the growth drivers?', - 'What is the working capital intensity?'
- ] - }, - { - section: 'marketIndustryAnalysis', - context: 'Analyze market size, growth, competition, and industry trends', - specificQuestions: [ - 'What is the market size (TAM/SAM)?', - 'What is the market growth rate?', - 'Who are the key competitors?', - 'What are the barriers to entry?', - 'What are the key industry trends?' - ] - }, - { - section: 'managementTeamOverview', - context: 'Evaluate management team quality, experience, and post-transaction intentions', - specificQuestions: [ - 'Who are the key leaders?', - 'What is their experience level?', - 'What are their post-transaction intentions?', - 'How is the organization structured?' - ] - }, - { - section: 'preliminaryInvestmentThesis', - context: 'Develop investment thesis based on all available information', - specificQuestions: [ - 'What are the key attractions?', - 'What are the potential risks?', - 'What are the value creation levers?', - 'How does this align with BPCP strategy?' - ] - } - ]; - - const analysis: any = {}; - - // Process each query using RAG - for (const query of queries) { - const relevantSections = this.findRelevantSections(query); - const queryContext = this.buildQueryContext(relevantSections, query); - - const analysisResult = await this.callLLM({ - prompt: this.buildRAGPrompt(query, queryContext), - systemPrompt: 'You are an expert investment analyst. 
Provide precise, structured analysis based on the provided context.', - maxTokens: 2000, - temperature: 0.1 - }); - - if (analysisResult.success) { - try { - analysis[query.section] = JSON.parse(analysisResult.content); - } catch (error) { - logger.warn('Failed to parse analysis for section', { section: query.section, error }); - } - } - } - - return analysis as CIMReview; - } - - /** - * Find sections relevant to a specific query - */ - private findRelevantSections(query: RAGQuery): DocumentSection[] { - const relevanceMap: Record<string, string[]> = { - dealOverview: ['executive_summary'], - businessDescription: ['business_description', 'executive_summary'], - financialSummary: ['financial_analysis', 'executive_summary'], - marketIndustryAnalysis: ['market_analysis', 'executive_summary'], - managementTeamOverview: ['management', 'executive_summary'], - preliminaryInvestmentThesis: ['investment_thesis', 'executive_summary', 'business_description'] - }; - - const relevantTypes = relevanceMap[query.section] || []; - return this.sections.filter(section => - relevantTypes.includes(section.type) && section.relevanceScore >= 5 - ); - } - - /** - * Build context for a specific query - */ - private buildQueryContext(sections: DocumentSection[], query: RAGQuery): string { - let context = `Query: ${query.context}\n\n`; - context += `Specific Questions:\n${query.specificQuestions.map(q => `- ${q}`).join('\n')}\n\n`; - context += `Relevant Document Sections:\n\n`; - - for (const section of sections) { - context += `Section: ${section.type}\n`; - context += `Relevance Score: ${section.relevanceScore}/10\n`; - context += `Key Metrics: ${JSON.stringify(section.keyMetrics, null, 2)}\n`; - context += `Content: ${section.content.substring(0, 5000)}\n\n`; - } - - return context; - } - - /** - * Build RAG prompt for specific analysis - */ - private buildRAGPrompt(query: RAGQuery, context: string): string { - return ` - Based on the following context from a CIM document, provide a comprehensive
analysis for the ${query.section} section. - - ${context} - - Please provide your analysis in the exact JSON format required for the BPCP CIM Review Template. - Focus on answering the specific questions listed above. - Use "Not specified in CIM" for any information not available in the provided context. - `; - } - - /** - * Create final summary from RAG analysis - */ - private async createFinalSummary(analysis: CIMReview): Promise<string> { - logger.info('Creating final summary from RAG analysis'); - - const summaryPrompt = ` - Create a comprehensive markdown summary from the following BPCP CIM analysis: - - ${JSON.stringify(analysis, null, 2)} - - Format as a professional BPCP CIM Review Template with proper markdown structure. - `; - - const summaryResult = await this.callLLM({ - prompt: summaryPrompt, - systemPrompt: 'Create a professional, well-structured markdown summary for BPCP investment committee.', - maxTokens: 3000, - temperature: 0.1 - }); - - return summaryResult.success ? summaryResult.content : 'Summary generation failed'; - } - - /** - * Fallback segmentation if LLM segmentation fails - */ - private fallbackSegmentation(text: string): DocumentSection[] { - // Rule-based segmentation as fallback - const sections: DocumentSection[] = []; - const patterns = [ - { type: 'executive_summary', pattern: /(?:executive\s+summary|overview|introduction)/i }, - { type: 'business_description', pattern: /(?:business\s+description|company\s+overview|operations)/i }, - { type: 'financial_analysis', pattern: /(?:financial|financials|performance|results)/i }, - { type: 'market_analysis', pattern: /(?:market|industry|competitive)/i }, - { type: 'management', pattern: /(?:management|leadership|team)/i }, - { type: 'investment_thesis', pattern: /(?:investment|opportunity|thesis)/i } - ]; - - // Simple text splitting based on patterns - const textLength = text.length; - const sectionSize = Math.floor(textLength / patterns.length); - - patterns.forEach((pattern, index) => { -
const start = index * sectionSize; - const end = Math.min((index + 1) * sectionSize, textLength); - - sections.push({ - id: `section_${index}`, - type: pattern.type as any, - content: text.substring(start, end), - pageRange: [Math.floor(start / 1000), Math.floor(end / 1000)], - keyMetrics: {}, - relevanceScore: 7 - }); - }); - - return sections; - } - - /** - * Extract content for specific page range - */ - private extractSectionContent(text: string, pageRange: [number, number]): string { - // Rough estimation: 1000 characters per page - const startChar = pageRange[0] * 1000; - const endChar = pageRange[1] * 1000; - return text.substring(startChar, endChar); - } - - /** - * Wrapper for LLM calls to track API usage - */ - private async callLLM(request: any): Promise<any> { - this.apiCallCount++; - return await llmService.processCIMDocument(request.prompt, '', {}); - } -} - -export const ragDocumentProcessor = new RAGDocumentProcessor(); \ No newline at end of file diff --git a/backend/src/services/unifiedDocumentProcessor.ts b/backend/src/services/unifiedDocumentProcessor.ts index 3e087ee..472d082 100644 --- a/backend/src/services/unifiedDocumentProcessor.ts +++ b/backend/src/services/unifiedDocumentProcessor.ts @@ -1,37 +1,124 @@ import { logger } from '../utils/logger'; import { config } from '../config/env'; -import { documentProcessingService } from './documentProcessingService'; -import { ragDocumentProcessor } from './ragDocumentProcessor'; import { optimizedAgenticRAGProcessor } from './optimizedAgenticRAGProcessor'; -import { documentAiGenkitProcessor } from './documentAiGenkitProcessor'; +import { documentAiProcessor } from './documentAiProcessor'; import { CIMReview } from './llmSchemas'; -import { documentController } from '../controllers/documentController'; + +// Default empty CIMReview object +const defaultCIMReview: CIMReview = { + dealOverview: { + targetCompanyName: '', + industrySector: '', + geography: '', + dealSource: '', + transactionType: '', 
dateCIMReceived: '', + dateReviewed: '', + reviewers: '', + cimPageCount: '', + statedReasonForSale: '', + employeeCount: '' + }, + businessDescription: { + coreOperationsSummary: '', + keyProductsServices: '', + uniqueValueProposition: '', + customerBaseOverview: { + keyCustomerSegments: '', + customerConcentrationRisk: '', + typicalContractLength: '' + }, + keySupplierOverview: { + dependenceConcentrationRisk: '' + } + }, + marketIndustryAnalysis: { + estimatedMarketSize: '', + estimatedMarketGrowthRate: '', + keyIndustryTrends: '', + competitiveLandscape: { + keyCompetitors: '', + targetMarketPosition: '', + basisOfCompetition: '' + }, + barriersToEntry: '' + }, + financialSummary: { + financials: { + fy3: { + revenue: '', + revenueGrowth: '', + grossProfit: '', + grossMargin: '', + ebitda: '', + ebitdaMargin: '' + }, + fy2: { + revenue: '', + revenueGrowth: '', + grossProfit: '', + grossMargin: '', + ebitda: '', + ebitdaMargin: '' + }, + fy1: { + revenue: '', + revenueGrowth: '', + grossProfit: '', + grossMargin: '', + ebitda: '', + ebitdaMargin: '' + }, + ltm: { + revenue: '', + revenueGrowth: '', + grossProfit: '', + grossMargin: '', + ebitda: '', + ebitdaMargin: '' + } + }, + qualityOfEarnings: '', + revenueGrowthDrivers: '', + marginStabilityAnalysis: '', + capitalExpenditures: '', + workingCapitalIntensity: '', + freeCashFlowQuality: '' + }, + managementTeamOverview: { + keyLeaders: '', + managementQualityAssessment: '', + postTransactionIntentions: '', + organizationalStructure: '' + }, + preliminaryInvestmentThesis: { + keyAttractions: '', + potentialRisks: '', + valueCreationLevers: '', + alignmentWithFundStrategy: '' + }, + keyQuestionsNextSteps: { + criticalQuestions: '', + missingInformation: '', + preliminaryRecommendation: '', + rationaleForRecommendation: '', + proposedNextSteps: '' + } +}; interface ProcessingResult { success: boolean; summary: string; analysisData: CIMReview; - processingStrategy: 'chunking' | 'rag' | 'agentic_rag' | 
'optimized_agentic_rag' | 'document_ai_genkit'; + processingStrategy: 'document_ai_agentic_rag'; processingTime: number; apiCalls: number; error: string | undefined; } -interface ComparisonResult { - chunking: ProcessingResult; - rag: ProcessingResult; - agenticRag: ProcessingResult; - winner: 'chunking' | 'rag' | 'agentic_rag' | 'tie'; - performanceMetrics: { - timeDifference: number; - apiCallDifference: number; - qualityScore: number; - }; -} - class UnifiedDocumentProcessor { /** - * Process document using the configured strategy + * Process document using Document AI + Agentic RAG strategy */ async processDocument( documentId: string, @@ -39,218 +126,88 @@ class UnifiedDocumentProcessor { text: string, options: any = {} ): Promise<ProcessingResult> { - const strategy = options.strategy || config.processingStrategy; + const strategy = options.strategy || 'document_ai_agentic_rag'; logger.info('Processing document with unified processor', { documentId, strategy, - configStrategy: config.processingStrategy, textLength: text.length }); - if (strategy === 'rag') { - return await this.processWithRAG(documentId, text); - } else if (strategy === 'agentic_rag') { - return await this.processWithAgenticRAG(documentId, userId, text); - } else if (strategy === 'optimized_agentic_rag') { - return await this.processWithOptimizedAgenticRAG(documentId, userId, text, options); - } else if (strategy === 'document_ai_genkit') { - return await this.processWithDocumentAiGenkit(documentId, userId, text, options); + // Only support document_ai_agentic_rag strategy + if (strategy === 'document_ai_agentic_rag') { + return await this.processWithDocumentAiAgenticRag(documentId, userId, text, options); } else { - return await this.processWithChunking(documentId, userId, text, options); + throw new Error(`Unsupported processing strategy: ${strategy}.
Only 'document_ai_agentic_rag' is supported.`); } } /** - * Process document using RAG approach + * Process document using Document AI + Agentic RAG approach */ - private async processWithRAG(documentId: string, text: string): Promise<ProcessingResult> { - logger.info('Using RAG processing strategy', { documentId }); - - const result = await ragDocumentProcessor.processDocument(text, documentId); - - return { - success: result.success, - summary: result.summary, - analysisData: result.analysisData, - processingStrategy: 'rag', - processingTime: result.processingTime, - apiCalls: result.apiCalls, - error: result.error || undefined - }; - } - - /** - * Process document using agentic RAG approach - */ - private async processWithAgenticRAG( - documentId: string, - _userId: string, - text: string - ): Promise<ProcessingResult> { - logger.info('Using agentic RAG processing strategy', { documentId }); - - try { - // If text is empty, extract it from the document - let extractedText = text; - if (!text || text.length === 0) { - logger.info('Extracting text for agentic RAG processing', { documentId }); - extractedText = await documentController.getDocumentText(documentId); - } - - const result = await optimizedAgenticRAGProcessor.processLargeDocument(documentId, extractedText, {}); - - return { - success: result.success, - summary: result.summary || '', - analysisData: result.analysisData || {} as CIMReview, - processingStrategy: 'agentic_rag', - processingTime: result.processingTime, - apiCalls: Math.ceil(result.processedChunks / 5), // Estimate API calls - error: result.error || undefined - }; - } catch (error) { - logger.error('Agentic RAG processing failed', { documentId, error }); - - return { - success: false, - summary: '', - analysisData: {} as CIMReview, - processingStrategy: 'agentic_rag', - processingTime: 0, - apiCalls: 0, - error: error instanceof Error ?
error.message : 'Unknown error' - }; - } - } - - /** - * Process document using optimized agentic RAG approach for large documents - */ - private async processWithOptimizedAgenticRAG( - documentId: string, - _userId: string, - text: string, - _options: any - ): Promise<ProcessingResult> { - logger.info('Using optimized agentic RAG processing strategy', { documentId, textLength: text.length }); - - const startTime = Date.now(); - - try { - // If text is empty, extract it from the document - let extractedText = text; - if (!text || text.length === 0) { - logger.info('Extracting text for optimized agentic RAG processing', { documentId }); - extractedText = await documentController.getDocumentText(documentId); - } - - // Use the optimized processor for large documents - const optimizedResult = await optimizedAgenticRAGProcessor.processLargeDocument( - documentId, - extractedText, - { - enableSemanticChunking: true, - enableMetadataEnrichment: true, - similarityThreshold: 0.8 - } - ); - - // Return the complete result from the optimized processor - return { - success: optimizedResult.success, - summary: optimizedResult.summary || `Document successfully processed with optimized agentic RAG. Created ${optimizedResult.processedChunks} chunks with ${optimizedResult.averageChunkSize} average size.`, - analysisData: optimizedResult.analysisData || {} as CIMReview, - processingStrategy: 'optimized_agentic_rag', - processingTime: optimizedResult.processingTime, - apiCalls: Math.ceil(optimizedResult.processedChunks / 5), // Estimate API calls - error: optimizedResult.error - }; - } catch (error) { - logger.error('Optimized agentic RAG processing failed', { documentId, error }); - - console.log('❌ Unified document processor - optimized agentic RAG failed for document:', documentId); - console.log('❌ Error:', error instanceof Error ?
error.message : String(error)); - - return { - success: false, - summary: '', - analysisData: {} as CIMReview, - processingStrategy: 'optimized_agentic_rag', - processingTime: Date.now() - startTime, - apiCalls: 0, - error: error instanceof Error ? error.message : 'Unknown error' - }; - } - } - - /** - * Process document using Document AI + Genkit approach - */ - private async processWithDocumentAiGenkit( + private async processWithDocumentAiAgenticRag( documentId: string, userId: string, text: string, options: any ): Promise<ProcessingResult> { - logger.info('Using Document AI + Genkit processing strategy', { documentId }); - - const startTime = Date.now(); + logger.info('Using Document AI + Agentic RAG processing strategy', { documentId }); try { - // Get the file buffer from options if available, otherwise use text - const fileBuffer = options.fileBuffer || Buffer.from(text); - const fileName = options.fileName || `document-${documentId}.pdf`; - const mimeType = options.mimeType || 'application/pdf'; + const startTime = Date.now(); - logger.info('Document AI processing with file data', { + // Extract file buffer from options + const { fileBuffer, fileName, mimeType } = options; + + if (!fileBuffer || !fileName || !mimeType) { + throw new Error('Missing required options: fileBuffer, fileName, mimeType'); + } + + // Process with Document AI + Agentic RAG + const result = await documentAiProcessor.processDocument( documentId, - fileSize: fileBuffer.length, + userId, + fileBuffer, fileName, mimeType - }); - - const result = await documentAiGenkitProcessor.processDocument( - documentId, - userId, - fileBuffer, - fileName, - mimeType ); - - if (!result.success) { - logger.error('Document AI processing failed', { - documentId, - error: result.error, - metadata: result.metadata - }); + + const processingTime = Date.now() - startTime; + + if (result.success) { + return { + success: true, + summary: result.content, + analysisData: result.metadata?.agenticRagResult?.analysisData || {}, +
processingStrategy: 'document_ai_agentic_rag', + processingTime, + apiCalls: result.metadata?.agenticRagResult?.apiCalls || 0, + error: undefined + }; + } else { + return { + success: false, + summary: '', + analysisData: defaultCIMReview, + processingStrategy: 'document_ai_agentic_rag', + processingTime, + apiCalls: 0, + error: result.error || 'Unknown processing error' + }; } - - return { - success: result.success, - summary: result.content || '', - analysisData: (result.metadata?.agenticRagResult?.analysisData as CIMReview) || {} as CIMReview, - processingStrategy: 'document_ai_genkit', - processingTime: Date.now() - startTime, - apiCalls: 1, // Document AI + Agentic RAG typically uses fewer API calls - error: result.error || undefined - }; } catch (error) { - const errorMessage = error instanceof Error ? error.message : String(error); - const errorStack = error instanceof Error ? error.stack : undefined; - - logger.error('Document AI + Genkit processing failed with exception', { - documentId, - error: errorMessage, - stack: errorStack + const errorMessage = error instanceof Error ? 
error.message : 'Unknown error'; + logger.error('Document AI + Agentic RAG processing failed', { + documentId, + error: errorMessage }); return { success: false, summary: '', - analysisData: {} as CIMReview, - processingStrategy: 'document_ai_genkit', - processingTime: Date.now() - startTime, + analysisData: defaultCIMReview, + processingStrategy: 'document_ai_agentic_rag', + processingTime: 0, apiCalls: 0, error: errorMessage }; @@ -258,213 +215,31 @@ class UnifiedDocumentProcessor { } /** - * Process document using chunking approach - */ - private async processWithChunking( - documentId: string, - userId: string, - text: string, - options: any - ): Promise { - logger.info('Using chunking processing strategy', { documentId }); - - const startTime = Date.now(); - - try { - const result = await documentProcessingService.processDocument(documentId, userId, options); - - // Estimate API calls for chunking (this is approximate) - const estimatedApiCalls = this.estimateChunkingApiCalls(text); - - return { - success: result.success, - summary: result.summary || '', - analysisData: (result.analysis as CIMReview) || {} as CIMReview, - processingStrategy: 'chunking', - processingTime: Date.now() - startTime, - apiCalls: estimatedApiCalls, - error: result.error || undefined - }; - } catch (error) { - return { - success: false, - summary: '', - analysisData: {} as CIMReview, - processingStrategy: 'chunking', - processingTime: Date.now() - startTime, - apiCalls: 0, - error: error instanceof Error ? 
error.message : 'Unknown error' - }; - } - } - - /** - * Compare all processing strategies - */ - async compareProcessingStrategies( - documentId: string, - userId: string, - text: string, - options: any = {} - ): Promise { - logger.info('Comparing processing strategies', { documentId }); - - // Process with all strategies - const [chunkingResult, ragResult, agenticRagResult] = await Promise.all([ - this.processWithChunking(documentId, userId, text, options), - this.processWithRAG(documentId, text), - this.processWithAgenticRAG(documentId, userId, text) - ]); - - // Calculate performance metrics - const timeDifference = chunkingResult.processingTime - ragResult.processingTime; - const apiCallDifference = chunkingResult.apiCalls - ragResult.apiCalls; - const qualityScore = this.calculateQualityScore(chunkingResult, ragResult); - - // Determine winner - let winner: 'chunking' | 'rag' | 'agentic_rag' | 'tie' = 'tie'; - - // Check which strategies were successful - const successfulStrategies: Array<{ name: string; result: ProcessingResult }> = []; - if (chunkingResult.success) successfulStrategies.push({ name: 'chunking', result: chunkingResult }); - if (ragResult.success) successfulStrategies.push({ name: 'rag', result: ragResult }); - if (agenticRagResult.success) successfulStrategies.push({ name: 'agentic_rag', result: agenticRagResult }); - - if (successfulStrategies.length === 0) { - winner = 'tie'; - } else if (successfulStrategies.length === 1) { - winner = successfulStrategies[0]?.name as 'chunking' | 'rag' | 'agentic_rag' || 'tie'; - } else { - // Multiple successful strategies, compare performance - const scores = successfulStrategies.map(strategy => { - const result = strategy.result; - const quality = this.calculateQualityScore(result, result); // Self-comparison for baseline - const timeScore = 1 / (1 + result.processingTime / 60000); // Normalize to 1 minute - const apiScore = 1 / (1 + result.apiCalls / 10); // Normalize to 10 API calls - return { - name: 
strategy.name, - score: quality * 0.5 + timeScore * 0.25 + apiScore * 0.25 - }; - }); - - scores.sort((a, b) => b.score - a.score); - winner = scores[0]?.name as 'chunking' | 'rag' | 'agentic_rag' || 'tie'; - } - - return { - chunking: chunkingResult, - rag: ragResult, - agenticRag: agenticRagResult, - winner, - performanceMetrics: { - timeDifference, - apiCallDifference, - qualityScore - } - }; - } - - /** - * Estimate API calls for chunking approach - */ - private estimateChunkingApiCalls(text: string): number { - const chunkSize = config.llm.chunkSize; - const estimatedTokens = Math.ceil(text.length / 4); // Rough token estimation - const chunks = Math.ceil(estimatedTokens / chunkSize); - return chunks + 1; // +1 for final synthesis - } - - /** - * Calculate quality score based on result completeness - */ - private calculateQualityScore(chunkingResult: ProcessingResult, ragResult: ProcessingResult): number { - if (!chunkingResult.success && !ragResult.success) return 0.5; - if (!chunkingResult.success) return 1.0; - if (!ragResult.success) return 0.0; - - // Compare summary length and structure - const chunkingScore = this.analyzeSummaryQuality(chunkingResult.summary); - const ragScore = this.analyzeSummaryQuality(ragResult.summary); - - return ragScore / (chunkingScore + ragScore); - } - - /** - * Analyze summary quality based on length and structure - */ - private analyzeSummaryQuality(summary: string): number { - if (!summary) return 0; - - // Check for markdown structure - const hasHeaders = (summary.match(/#{1,6}\s/g) || []).length; - const hasLists = (summary.match(/[-*+]\s/g) || []).length; - const hasBold = (summary.match(/\*\*.*?\*\*/g) || []).length; - - // Length factor (longer summaries tend to be more comprehensive) - const lengthFactor = Math.min(summary.length / 5000, 1); - - // Structure factor - const structureFactor = Math.min((hasHeaders + hasLists + hasBold) / 10, 1); - - return (lengthFactor * 0.7) + (structureFactor * 0.3); - } - - /** - * 
Get processing statistics + * Get processing statistics (simplified) */ async getProcessingStats(): Promise<{ totalDocuments: number; - chunkingSuccess: number; - ragSuccess: number; - agenticRagSuccess: number; + documentAiAgenticRagSuccess: number; averageProcessingTime: { - chunking: number; - rag: number; - agenticRag: number; + documentAiAgenticRag: number; }; averageApiCalls: { - chunking: number; - rag: number; - agenticRag: number; + documentAiAgenticRag: number; }; }> { - // This would typically query a database for processing statistics - // For now, return mock data + // This would need to be implemented based on actual database queries + // For now, return placeholder data return { totalDocuments: 0, - chunkingSuccess: 0, - ragSuccess: 0, - agenticRagSuccess: 0, + documentAiAgenticRagSuccess: 0, averageProcessingTime: { - chunking: 0, - rag: 0, - agenticRag: 0 + documentAiAgenticRag: 0 }, averageApiCalls: { - chunking: 0, - rag: 0, - agenticRag: 0 + documentAiAgenticRag: 0 } }; } - - /** - * Switch processing strategy for a document - */ - async switchStrategy( - documentId: string, - userId: string, - text: string, - newStrategy: 'chunking' | 'rag' | 'agentic_rag', - options: any = {} - ): Promise { - logger.info('Switching processing strategy', { documentId, newStrategy }); - - return await this.processDocument(documentId, userId, text, { - ...options, - strategy: newStrategy - }); - } } export const unifiedDocumentProcessor = new UnifiedDocumentProcessor(); \ No newline at end of file diff --git a/backend/src/services/vectorDocumentProcessor.ts b/backend/src/services/vectorDocumentProcessor.ts deleted file mode 100644 index 1a813e4..0000000 --- a/backend/src/services/vectorDocumentProcessor.ts +++ /dev/null @@ -1,500 +0,0 @@ -import { vectorDatabaseService } from './vectorDatabaseService'; -import { llmService } from './llmService'; -import { logger } from '../utils/logger'; -import { DocumentChunk } from '../models/VectorDatabaseModel'; - -export 
interface ChunkingOptions { - chunkSize: number; - chunkOverlap: number; - maxChunks: number; -} - -export interface VectorProcessingResult { - totalChunks: number; - chunksWithEmbeddings: number; - processingTime: number; - averageChunkSize: number; -} - -export interface TextBlock { - type: 'paragraph' | 'table' | 'heading' | 'list_item'; - content: string; -} - -export class VectorDocumentProcessor { - - /** - * Store enriched chunks with metadata from agenticRAGProcessor - */ - async storeDocumentChunks(enrichedChunks: Array<{ - content: string; - chunkIndex: number; - startPosition: number; - endPosition: number; - sectionType?: string; - metadata?: { - hasFinancialData: boolean; - hasMetrics: boolean; - keyTerms: string[]; - importance: 'high' | 'medium' | 'low'; - conceptDensity: number; - }; - }>, options?: { - documentId: string; - indexingStrategy?: string; - similarity_threshold?: number; - enable_hybrid_search?: boolean; - }): Promise { - const startTime = Date.now(); - - try { - const documentChunks: DocumentChunk[] = []; - - for (const chunk of enrichedChunks) { - // Generate embedding for the chunk - const embedding = await vectorDatabaseService.generateEmbeddings(chunk.content); - - // Create DocumentChunk with enhanced metadata - const documentChunk: DocumentChunk = { - id: `${options?.documentId}-chunk-${chunk.chunkIndex}`, - documentId: options?.documentId || '', - content: chunk.content, - embedding, - chunkIndex: chunk.chunkIndex, - metadata: { - ...chunk.metadata, - sectionType: chunk.sectionType, - chunkSize: chunk.content.length, - processingStrategy: options?.indexingStrategy || 'hierarchical', - startPosition: chunk.startPosition, - endPosition: chunk.endPosition - }, - createdAt: new Date(), - updatedAt: new Date() - }; - - documentChunks.push(documentChunk); - } - - // Store all chunks in vector database - await vectorDatabaseService.storeDocumentChunks(documentChunks); - - const processingTime = Date.now() - startTime; - const 
averageImportance = this.calculateAverageImportance(enrichedChunks); - - logger.info(`Stored ${documentChunks.length} enriched chunks`, { - documentId: options?.documentId, - processingTime, - averageImportance, - indexingStrategy: options?.indexingStrategy - }); - - } catch (error) { - logger.error('Failed to store enriched chunks', error); - throw error; - } - } - - /** - * Calculate average importance score for logging - */ - private calculateAverageImportance(chunks: Array<{ metadata?: { importance: string } }>): string { - const importanceScores = chunks - .map(c => c.metadata?.importance) - .filter(Boolean); - - if (importanceScores.length === 0) return 'unknown'; - - const highCount = importanceScores.filter(i => i === 'high').length; - const mediumCount = importanceScores.filter(i => i === 'medium').length; - - if (highCount > importanceScores.length / 2) return 'high'; - if (mediumCount + highCount > importanceScores.length / 2) return 'medium'; - return 'low'; - } - - /** - * Identifies structured blocks of text from a raw string using heuristics. - * This is the core of the improved ingestion pipeline. - * @param text The raw text from a PDF extraction. - */ - private identifyTextBlocks(text: string): TextBlock[] { - const blocks: TextBlock[] = []; - // Normalize line endings and remove excessive blank lines to regularize input - const lines = text.replace(/\r\n/g, '\n').split('\n'); - - let currentParagraph = ''; - - for (let i = 0; i < lines.length; i++) { - const line = lines[i]; - if (line === undefined) continue; - const trimmedLine = line.trim(); - - // If we encounter a blank line, the current paragraph (if any) has ended. - if (trimmedLine === '') { - if (currentParagraph.trim()) { - blocks.push({ type: 'paragraph', content: currentParagraph.trim() }); - currentParagraph = ''; - } - continue; - } - - // Heuristic for tables: A line with at least 2 instances of multiple spaces is likely a table row.
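The table-row heuristic referenced above can be exercised in isolation. A minimal sketch (illustrative only, not part of the diff; the sample lines are assumptions for demonstration):

```typescript
// Table-row heuristic from identifyTextBlocks: a line containing two or more
// runs of 2+ whitespace characters is treated as columnar (table-like) data.
const TABLE_ROW = /(\s{2,}.*){2,}/;

function isTableLike(line: string): boolean {
  return TABLE_ROW.test(line);
}

// Columnar data with wide gaps is flagged; ordinary prose is not.
console.log(isTableLike("Revenue    2021    2022")); // true
console.log(isTableLike("Plain prose sentence."));   // false
```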
- // This is a strong indicator of columnar data in plain text. - const isTableLike = /(\s{2,}.*){2,}/.test(line); - - if (isTableLike) { - if (currentParagraph.trim()) { - blocks.push({ type: 'paragraph', content: currentParagraph.trim() }); - currentParagraph = ''; - } - // Greedily consume subsequent lines that also look like part of the table. - let tableContent = line; - while (i + 1 < lines.length && /(\s{2,}.*){2,}/.test(lines[i + 1] || '')) { - i++; - tableContent += '\n' + lines[i]; - } - blocks.push({ type: 'table', content: tableContent }); - continue; - } - - // Heuristic for headings: A short line (under 80 chars) that doesn't end with a period. - // Often in Title Case, but we won't strictly enforce that to be more flexible. - const isHeadingLike = trimmedLine.length < 80 && !trimmedLine.endsWith('.'); - if (i + 1 < lines.length && (lines[i+1] || '').trim() === '' && isHeadingLike) { - if (currentParagraph.trim()) { - blocks.push({ type: 'paragraph', content: currentParagraph.trim() }); - currentParagraph = ''; - } - blocks.push({ type: 'heading', content: trimmedLine }); - i++; // Skip the blank line after the heading - continue; - } - - // Heuristic for list items (bullets and numbered items) - if (trimmedLine.match(/^(\*|-|\d+\.)\s/)) { - if (currentParagraph.trim()) { - blocks.push({ type: 'paragraph', content: currentParagraph.trim() }); - currentParagraph = ''; - } - blocks.push({ type: 'list_item', content: trimmedLine }); - continue; - } - - // Otherwise, append the line to the current paragraph. - currentParagraph += (currentParagraph ? ' ' : '') + trimmedLine; - } - - // Add the last remaining paragraph if it exists. - if (currentParagraph.trim()) { - blocks.push({ type: 'paragraph', content: currentParagraph.trim() }); - } - - logger.info(`Identified ${blocks.length} semantic blocks from text.`); - return blocks; - } - - /** - * Generates a text summary for a table to be used for embedding. - * @param tableText The raw text of the table.
- */ - private async getSummaryForTable(tableText: string): Promise<string> { - const prompt = `The following text is an OCR'd table from a financial document. It may be messy.\n Summarize the key information in this table in a few clear, narrative sentences.\n Focus on the main metrics, trends, and time periods.\n Do not return a markdown table. Return only a natural language summary.\n\n Table Text:\n ---\n ${tableText}\n ---\n Summary:`; - - try { - const result = await llmService.processCIMDocument(prompt, '', { agentName: 'table_summarizer' }); - // Handle both string and object responses from the LLM - if (result.success) { - if (typeof result.jsonOutput === 'string') { - return result.jsonOutput; - } - if (typeof result.jsonOutput === 'object' && (result.jsonOutput as any)?.summary) { - return (result.jsonOutput as any).summary; - } - } - logger.warn('Table summarization failed or returned invalid format, falling back to raw text.', { tableText }); - return tableText; // Fallback - } catch (error) { - logger.error('Error during table summarization', { error }); - return tableText; // Fallback - } - } - - /** - * Process document text into chunks and generate embeddings using the new heuristic-based strategy.
- */ - async processDocumentForVectorSearch( - documentId: string, - text: string, - metadata: Record<string, any> = {} - ): Promise<VectorProcessingResult> { - const startTime = Date.now(); - - try { - logger.info(`Starting HEURISTIC vector processing for document: ${documentId}`); - - // Step 1: Identify semantic blocks from the document text - const blocks = this.identifyTextBlocks(text); - - // Step 2: Generate embeddings for each block, with differential processing - const chunksWithEmbeddings = await this.generateEmbeddingsForBlocks( - documentId, - blocks, - metadata - ); - - // Step 3: Store chunks in vector database - await vectorDatabaseService.storeDocumentChunks(chunksWithEmbeddings); - - const processingTime = Date.now() - startTime; - const averageChunkSize = chunksWithEmbeddings.length > 0 ? chunksWithEmbeddings.reduce((sum: number, chunk: any) => sum + chunk.content.length, 0) / chunksWithEmbeddings.length : 0; - - logger.info(`Heuristic vector processing completed for document: ${documentId}`, { - totalChunks: blocks.length, - chunksWithEmbeddings: chunksWithEmbeddings.length, - processingTime, - averageChunkSize: Math.round(averageChunkSize) - }); - - return { - totalChunks: blocks.length, - chunksWithEmbeddings: chunksWithEmbeddings.length, - processingTime, - averageChunkSize: Math.round(averageChunkSize) - }; - } catch (error) { - logger.error(`Heuristic vector processing failed for document: ${documentId}`, error); - throw error; - } - } - - /** - * Generates embeddings for the identified text blocks, applying special logic for tables.
- */ - private async generateEmbeddingsForBlocks( - documentId: string, - blocks: TextBlock[], - metadata: Record<string, any> - ): Promise<DocumentChunk[]> { - const chunksWithEmbeddings: DocumentChunk[] = []; - - for (let i = 0; i < blocks.length; i++) { - const block = blocks[i]; - if (!block || !block.content) continue; - - let contentToEmbed = block.content; - const blockMetadata: any = { - ...metadata, - block_type: block.type, - chunkIndex: i, - totalChunks: blocks.length, - chunkSize: block.content.length, - }; - - try { - // Differential processing for tables - if (block.type === 'table') { - logger.info(`Summarizing table chunk ${i}...`); - contentToEmbed = await this.getSummaryForTable(block.content); - // Store the original table text in the metadata for later retrieval - blockMetadata.original_table = block.content; - } - - const embedding = await vectorDatabaseService.generateEmbeddings(contentToEmbed); - - const documentChunk: DocumentChunk = { - id: `${documentId}-chunk-${i}`, - documentId, - content: contentToEmbed, // This is the summary for tables, or the raw text for others - metadata: blockMetadata, - embedding, - chunkIndex: i, - createdAt: new Date(), - updatedAt: new Date() - }; - - chunksWithEmbeddings.push(documentChunk); - - if (blocks.length > 10 && (i + 1) % 10 === 0) { - logger.info(`Generated embeddings for ${i + 1}/${blocks.length} blocks`); - } - } catch (error) { - logger.error(`Failed to generate embedding for block ${i}`, { error, blockType: block.type }); - // Continue with other chunks, do not halt the entire process - } - } - - return chunksWithEmbeddings; - } - - /** - * Enhanced search with intelligent filtering and ranking - */ - async searchRelevantContent( - query: string, - options: { - documentId?: string; - limit?: number; - similarityThreshold?: number; - filters?: Record<string, any>; - prioritizeFinancial?: boolean; - boostImportance?: boolean; - enableReranking?: boolean; - } = {} - ) { - try { - // Enhanced search parameters - const searchOptions = { -
...options, - limit: Math.min(options.limit || 5, 20), // Cap at 20 for performance - similarityThreshold: options.similarityThreshold || 0.7, // Higher threshold for quality - }; - - // Add metadata filters for better relevance - if (options.prioritizeFinancial) { - searchOptions.filters = { - ...searchOptions.filters, - 'metadata.hasFinancialData': true - }; - } - - const rawResults = await vectorDatabaseService.search(query, searchOptions); - - // Post-process results for enhanced ranking - const enhancedResults = this.rankSearchResults(rawResults, query, options); - - // Apply reranking if enabled - let finalResults = enhancedResults; - if (options.enableReranking !== false) { - finalResults = await this.rerankResults(query, enhancedResults, options.limit || 5); - } - - logger.info(`Enhanced vector search completed`, { - query: query.substring(0, 100) + (query.length > 100 ? '...' : ''), - rawResultsCount: rawResults.length, - enhancedResultsCount: enhancedResults.length, - finalResultsCount: finalResults.length, - documentId: options.documentId, - prioritizeFinancial: options.prioritizeFinancial, - enableReranking: options.enableReranking !== false, - avgRelevanceScore: finalResults.length > 0 ? 
- Math.round((finalResults.reduce((sum: number, r: any) => sum + (r.similarity || 0), 0) / finalResults.length) * 100) / 100 : 0 - }); - - return finalResults; - } catch (error) { - logger.error('Enhanced vector search failed', { query, options, error }); - throw error; - } - } - - /** - * Rank search results based on multiple criteria - */ - private rankSearchResults(results: any[], query: string, options: any): any[] { - return results - .map(result => ({ - ...result, - enhancedScore: this.calculateEnhancedScore(result, query, options) - })) - .sort((a, b) => b.enhancedScore - a.enhancedScore) - .slice(0, options.limit || 5); - } - - /** - * Calculate enhanced relevance score - */ - private calculateEnhancedScore(result: any, query: string, options: any): number { - let score = result.similarity || 0; - - // Boost based on importance - if (options.boostImportance && result.metadata?.importance) { - if (result.metadata.importance === 'high') score += 0.2; - else if (result.metadata.importance === 'medium') score += 0.1; - } - - // Boost based on concept density - if (result.metadata?.conceptDensity) { - score += result.metadata.conceptDensity * 0.1; - } - - // Boost financial content if query suggests financial context - if (/financial|revenue|profit|ebitda|margin|cost|cash|debt/i.test(query)) { - if (result.metadata?.hasFinancialData) score += 0.15; - if (result.metadata?.hasMetrics) score += 0.1; - } - - // Boost based on section type relevance - if (result.metadata?.sectionType) { - const sectionBoosts: Record<string, number> = { - 'executive_summary': 0.1, - 'financial': 0.15, - 'market_analysis': 0.1, - 'management': 0.05 - }; - score += sectionBoosts[result.metadata.sectionType] || 0; - } - - // Boost if query terms appear in key terms - if (result.metadata?.keyTerms) { - const queryWords = query.toLowerCase().split(/\s+/); - const keyTermMatches = result.metadata.keyTerms.filter((term: string) => - queryWords.some(word => term.toLowerCase().includes(word)) - ).length; -
score += keyTermMatches * 0.05; - } - - return Math.min(score, 1.0); // Cap at 1.0 - } - - /** - * Rerank results using cross-encoder approach - */ - private async rerankResults(query: string, candidates: any[], topK: number = 5): Promise<any[]> { - try { - // Create reranking prompt - const rerankingPrompt = `Given the query: "${query}"

-Please rank the following document chunks by relevance (1 = most relevant, ${candidates.length} = least relevant). Consider: -- Semantic similarity to the query -- Financial/business relevance -- Information completeness -- Factual accuracy

-Document chunks: -${candidates.map((c, i) => `${i + 1}. ${c.content.substring(0, 200)}...`).join('\n')}

-Return only a JSON array of indices in order of relevance: [1, 3, 2, ...]`; - - const result = await llmService.processCIMDocument(rerankingPrompt, '', { - agentName: 'reranker', - maxTokens: 1000 - }); - - if (result.success && typeof result.jsonOutput === 'object') { - const ranking = Array.isArray(result.jsonOutput) ? result.jsonOutput as number[] : null; - if (ranking) { - // Apply the ranking - const reranked = ranking - .map(index => candidates[index - 1]) // Convert 1-based to 0-based - .filter(Boolean) // Remove any undefined entries - .slice(0, topK); - - logger.info(`Reranked ${candidates.length} candidates to ${reranked.length} results`); - return reranked; - } - } - - // Fallback to original ranking if reranking fails - logger.warn('Reranking failed, using original ranking'); - return candidates.slice(0, topK); - } catch (error) { - logger.error('Reranking failed', error); - return candidates.slice(0, topK); - } - } - - // ... other methods like findSimilarDocuments, etc. remain unchanged ...
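The index-mapping and fallback logic inside rerankResults can be sketched standalone (illustrative only, not part of the diff; the Candidate shape and helper name are assumptions):

```typescript
interface Candidate { content: string }

// Apply a 1-based ranking array (as returned by the reranking LLM) to the
// candidate list; fall back to the original similarity ordering when no
// valid ranking is available.
function applyRanking(candidates: Candidate[], ranking: number[] | null, topK: number): Candidate[] {
  if (!ranking) {
    return candidates.slice(0, topK); // fallback: keep original ordering
  }
  return ranking
    .map(index => candidates[index - 1]) // convert 1-based LLM indices to 0-based
    .filter((c): c is Candidate => Boolean(c)) // drop out-of-range entries
    .slice(0, topK);
}

const candidates = [{ content: "a" }, { content: "b" }, { content: "c" }];
console.log(applyRanking(candidates, [2, 3, 1], 2).map(c => c.content)); // ["b", "c"]
console.log(applyRanking(candidates, null, 2).map(c => c.content));      // ["a", "b"]
```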
-} - -export const vectorDocumentProcessor = new VectorDocumentProcessor(); \ No newline at end of file diff --git a/backend/src/test/__tests__/uploadPipeline.integration.test.ts b/backend/src/test/__tests__/uploadPipeline.integration.test.ts index 203d1f5..5da3e98 100644 --- a/backend/src/test/__tests__/uploadPipeline.integration.test.ts +++ b/backend/src/test/__tests__/uploadPipeline.integration.test.ts @@ -4,7 +4,6 @@ import { fileStorageService } from '../../services/fileStorageService'; import { documentController } from '../../controllers/documentController'; import { unifiedDocumentProcessor } from '../../services/unifiedDocumentProcessor'; import { uploadMonitoringService } from '../../services/uploadMonitoringService'; -import { handleFileUpload } from '../../middleware/upload'; import { verifyFirebaseToken } from '../../middleware/firebaseAuth'; import { addCorrelationId } from '../../middleware/validation'; @@ -13,10 +12,11 @@ jest.mock('../../services/fileStorageService'); jest.mock('../../services/unifiedDocumentProcessor'); jest.mock('../../services/uploadMonitoringService'); jest.mock('../../middleware/firebaseAuth'); -jest.mock('../../middleware/upload'); // Mock Firebase Admin jest.mock('firebase-admin', () => ({ + apps: [], + initializeApp: jest.fn(), auth: () => ({ verifyIdToken: jest.fn().mockResolvedValue({ uid: 'test-user-id', @@ -36,18 +36,9 @@ jest.mock('../../models/DocumentModel', () => ({ }, })); -describe('Upload Pipeline Integration Tests', () => { +describe('Firebase Storage Direct Upload Pipeline Tests', () => { let app: express.Application; - const mockFile = { - originalname: 'test-document.pdf', - filename: '1234567890-abc123.pdf', - path: '/tmp/1234567890-abc123.pdf', - size: 1024, - mimetype: 'application/pdf', - buffer: Buffer.from('test file content'), - }; - const mockUser = { uid: 'test-user-id', email: 'test@example.com', @@ -62,24 +53,40 @@ describe('Upload Pipeline Integration Tests', () => { next(); }); - 
-      (handleFileUpload as jest.Mock).mockImplementation((req: any, res: any, next: any) => {
-        req.file = mockFile;
-        next();
+      // Mock file storage service for new upload flow
+      (fileStorageService.generateSignedUploadUrl as jest.Mock).mockResolvedValue(
+        'https://storage.googleapis.com/test-bucket/uploads/test-user-id/1234567890-test-document.pdf?signature=...'
+      );
+
+      (fileStorageService.getFile as jest.Mock).mockResolvedValue(Buffer.from('test file content'));
+
+      // Mock document model
+      const { DocumentModel } = require('../../models/DocumentModel');
+      DocumentModel.create.mockResolvedValue({
+        id: '123e4567-e89b-12d3-a456-426614174000',
+        user_id: mockUser.uid,
+        original_file_name: 'test-document.pdf',
+        file_path: 'uploads/test-user-id/1234567890-test-document.pdf',
+        file_size: 1024,
+        status: 'uploading',
+        created_at: new Date(),
+        updated_at: new Date()
       });
-      (fileStorageService.storeFile as jest.Mock).mockResolvedValue({
-        success: true,
-        fileInfo: {
-          originalName: 'test-document.pdf',
-          filename: '1234567890-abc123.pdf',
-          path: 'uploads/test-user-id/1234567890-abc123.pdf',
-          size: 1024,
-          mimetype: 'application/pdf',
-          uploadedAt: new Date(),
-          gcsPath: 'uploads/test-user-id/1234567890-abc123.pdf',
-        },
+      DocumentModel.findById.mockResolvedValue({
+        id: '123e4567-e89b-12d3-a456-426614174000',
+        user_id: mockUser.uid,
+        original_file_name: 'test-document.pdf',
+        file_path: 'uploads/test-user-id/1234567890-test-document.pdf',
+        file_size: 1024,
+        status: 'uploading',
+        created_at: new Date(),
+        updated_at: new Date()
       });
+      DocumentModel.updateById.mockResolvedValue(true);
+
+      // Mock unified document processor
       (unifiedDocumentProcessor.processDocument as jest.Mock).mockResolvedValue({
         success: true,
         documentId: '123e4567-e89b-12d3-a456-426614174000',
@@ -93,166 +100,115 @@ describe('Upload Pipeline Integration Tests', () => {
     app.use(express.json());
     app.use(verifyFirebaseToken);
     app.use(addCorrelationId);
-    app.post('/upload', handleFileUpload, documentController.uploadDocument);
+
+    // Add routes for testing
+    app.post('/upload-url', documentController.getUploadUrl);
+    app.post('/:id/confirm-upload', documentController.confirmUpload);
   });

-  describe('Complete Upload Pipeline', () => {
-    it('should successfully process a complete file upload', async () => {
+  describe('Upload URL Generation', () => {
+    it('should successfully get upload URL', async () => {
       const response = await request(app)
-        .post('/upload')
-        .attach('file', Buffer.from('test content'), 'test-document.pdf')
+        .post('/upload-url')
+        .send({
+          fileName: 'test-document.pdf',
+          fileSize: 1024,
+          contentType: 'application/pdf'
+        })
+        .expect(200);
+
+      expect(response.body.documentId).toBeDefined();
+      expect(response.body.uploadUrl).toBeDefined();
+      expect(response.body.filePath).toBeDefined();
+    });
+
+    it('should reject non-PDF files', async () => {
+      const response = await request(app)
+        .post('/upload-url')
+        .send({
+          fileName: 'test-document.txt',
+          fileSize: 1024,
+          contentType: 'text/plain'
+        })
+        .expect(400);
+
+      expect(response.body.error).toBe('Only PDF files are supported');
+    });
+
+    it('should reject files larger than 50MB', async () => {
+      const response = await request(app)
+        .post('/upload-url')
+        .send({
+          fileName: 'large-document.pdf',
+          fileSize: 60 * 1024 * 1024, // 60MB
+          contentType: 'application/pdf'
+        })
+        .expect(400);
+
+      expect(response.body.error).toBe('File size exceeds 50MB limit');
+    });
+
+    it('should handle missing required fields', async () => {
+      const response = await request(app)
+        .post('/upload-url')
+        .send({
+          fileName: 'test-document.pdf'
+          // Missing fileSize and contentType
+        })
+        .expect(400);
+
+      expect(response.body.error).toBe('Missing required fields: fileName, fileSize, contentType');
+    });
+  });
+
+  describe('Upload Confirmation', () => {
+    it('should successfully confirm upload and trigger processing', async () => {
+      // First create a document record
+      const { DocumentModel } = require('../../models/DocumentModel');
+      const document = await DocumentModel.create({
+        user_id: mockUser.uid,
+        original_file_name: 'test-document.pdf',
+        file_path: 'uploads/test-user-id/1234567890-test-document.pdf',
+        file_size: 1024,
+        status: 'uploading'
+      });
+
+      const response = await request(app)
+        .post(`/${document.id}/confirm-upload`)
         .expect(200);

       expect(response.body.success).toBe(true);
-      expect(response.body.documentId).toBeDefined();
+      expect(response.body.documentId).toBe(document.id);
       expect(response.body.status).toBe('processing');
-
-      // Verify file storage was called
-      expect(fileStorageService.storeFile).toHaveBeenCalledWith(mockFile, mockUser.uid);
-
-      // Verify document processing was called
-      expect(unifiedDocumentProcessor.processDocument).toHaveBeenCalledWith(
-        expect.objectContaining({
-          userId: mockUser.uid,
-          fileInfo: expect.objectContaining({
-            originalName: 'test-document.pdf',
-            gcsPath: 'uploads/test-user-id/1234567890-abc123.pdf',
-          }),
-        })
-      );
-
-      // Verify monitoring was called
-      expect(uploadMonitoringService.trackUploadEvent).toHaveBeenCalled();
     });

-    it('should handle file storage failures gracefully', async () => {
-      (fileStorageService.storeFile as jest.Mock).mockResolvedValue({
-        success: false,
-        error: 'GCS upload failed',
-      });
-
+    it('should handle confirm upload for non-existent document', async () => {
+      const fakeId = '12345678-1234-1234-1234-123456789012';
+
       const response = await request(app)
-        .post('/upload')
-        .attach('file', Buffer.from('test content'), 'test-document.pdf')
-        .expect(500);
+        .post(`/${fakeId}/confirm-upload`)
+        .expect(404);

-      expect(response.body.error).toContain('Failed to store file');
-      expect(unifiedDocumentProcessor.processDocument).not.toHaveBeenCalled();
-    });
-
-    it('should handle document processing failures gracefully', async () => {
-      (unifiedDocumentProcessor.processDocument as jest.Mock).mockResolvedValue({
-        success: false,
-        error: 'Processing failed',
-      });
-
-      const response = await request(app)
-        .post('/upload')
-        .attach('file', Buffer.from('test content'), 'test-document.pdf')
-        .expect(500);
-
-      expect(response.body.error).toContain('Failed to process document');
-    });
-
-    it('should handle large file uploads', async () => {
-      const largeFile = {
-        ...mockFile,
-        size: 50 * 1024 * 1024, // 50MB
-      };
-
-      (handleFileUpload as jest.Mock).mockImplementation((req: any, res: any, next: any) => {
-        req.file = largeFile;
-        next();
-      });
-
-      await request(app)
-        .post('/upload')
-        .attach('file', Buffer.alloc(50 * 1024 * 1024), 'large-document.pdf')
-        .expect(200);
-
-      expect(fileStorageService.storeFile).toHaveBeenCalledWith(largeFile, mockUser.uid);
-    });
-
-    it('should handle unsupported file types', async () => {
-      const unsupportedFile = {
-        ...mockFile,
-        mimetype: 'application/exe',
-        originalname: 'malicious.exe',
-      };
-
-      (handleFileUpload as jest.Mock).mockImplementation((req: any, res: any, next: any) => {
-        req.file = unsupportedFile;
-        next();
-      });
-
-      const response = await request(app)
-        .post('/upload')
-        .attach('file', Buffer.from('test content'), 'malicious.exe')
-        .expect(400);
-
-      expect(response.body.error).toContain('Unsupported file type');
-    });
-
-    it('should track upload progress correctly', async () => {
-      const response = await request(app)
-        .post('/upload')
-        .attach('file', Buffer.from('test content'), 'test-document.pdf')
-        .expect(200);
-
-      // Verify monitoring events were tracked
-      expect(uploadMonitoringService.trackUploadEvent).toHaveBeenCalledWith(
-        expect.objectContaining({
-          userId: mockUser.uid,
-          fileInfo: expect.objectContaining({
-            originalName: 'test-document.pdf',
-            size: 1024,
-          }),
-          status: 'success',
-          stage: 'file_storage',
-        })
-      );
+      expect(response.body.error).toBe('Document not found');
     });
   });

-  describe('Error Scenarios and Recovery', () => {
-    it('should handle GCS connection failures', async () => {
-      (fileStorageService.storeFile as jest.Mock).mockRejectedValue(
+  describe('Error Handling', () => {
+    it('should handle GCS connection failures during URL generation', async () => {
+      (fileStorageService.generateSignedUploadUrl as jest.Mock).mockRejectedValue(
        new Error('GCS connection timeout')
       );

       const response = await request(app)
-        .post('/upload')
-        .attach('file', Buffer.from('test content'), 'test-document.pdf')
+        .post('/upload-url')
+        .send({
+          fileName: 'test-document.pdf',
+          fileSize: 1024,
+          contentType: 'application/pdf'
+        })
         .expect(500);

-      expect(response.body.error).toContain('Internal server error');
-    });
-
-    it('should handle partial upload failures', async () => {
-      // Mock storage success but processing failure
-      (fileStorageService.storeFile as jest.Mock).mockResolvedValue({
-        success: true,
-        fileInfo: {
-          originalName: 'test-document.pdf',
-          filename: '1234567890-abc123.pdf',
-          path: 'uploads/test-user-id/1234567890-abc123.pdf',
-          size: 1024,
-          mimetype: 'application/pdf',
-          uploadedAt: new Date(),
-          gcsPath: 'uploads/test-user-id/1234567890-abc123.pdf',
-        },
-      });
-
-      (unifiedDocumentProcessor.processDocument as jest.Mock).mockRejectedValue(
-        new Error('Processing service unavailable')
-      );
-
-      const response = await request(app)
-        .post('/upload')
-        .attach('file', Buffer.from('test content'), 'test-document.pdf')
-        .expect(500);
-
-      expect(response.body.error).toContain('Failed to process document');
+      expect(response.body.error).toBe('Failed to generate upload URL');
     });

     it('should handle authentication failures', async () => {
@@ -261,106 +217,44 @@ describe('Upload Pipeline Integration Tests', () => {
       });

       const response = await request(app)
-        .post('/upload')
-        .attach('file', Buffer.from('test content'), 'test-document.pdf')
+        .post('/upload-url')
+        .send({
+          fileName: 'test-document.pdf',
+          fileSize: 1024,
+          contentType: 'application/pdf'
+        })
         .expect(401);

       expect(response.body.error).toBe('Invalid token');
     });
-
-    it('should handle missing file uploads', async () => {
-      (handleFileUpload as jest.Mock).mockImplementation((req: any, res: any, next: any) => {
-        req.file = undefined;
-        next();
-      });
-
-      const response = await request(app)
-        .post('/upload')
-        .expect(400);
-
-      expect(response.body.error).toContain('No file uploaded');
-    });
   });

   describe('Performance and Scalability', () => {
-    it('should handle concurrent uploads', async () => {
+    it('should handle concurrent upload URL requests', async () => {
       const concurrentRequests = 5;
-      const promises = [];
+      const promises: any[] = [];

       for (let i = 0; i < concurrentRequests; i++) {
         promises.push(
           request(app)
-            .post('/upload')
-            .attach('file', Buffer.from(`test content ${i}`), `test-document-${i}.pdf`)
+            .post('/upload-url')
+            .send({
+              fileName: `test-document-${i}.pdf`,
+              fileSize: 1024,
+              contentType: 'application/pdf'
+            })
         );
       }

       const responses = await Promise.all(promises);

-      responses.forEach(response => {
+      responses.forEach((response: any) => {
         expect(response.status).toBe(200);
-        expect(response.body.success).toBe(true);
+        expect(response.body.documentId).toBeDefined();
+        expect(response.body.uploadUrl).toBeDefined();
       });

-      expect(fileStorageService.storeFile).toHaveBeenCalledTimes(concurrentRequests);
-    });
-
-    it('should handle upload timeout scenarios', async () => {
-      (fileStorageService.storeFile as jest.Mock).mockImplementation(
-        () => new Promise(resolve => setTimeout(() => resolve({
-          success: true,
-          fileInfo: {
-            originalName: 'test-document.pdf',
-            filename: '1234567890-abc123.pdf',
-            path: 'uploads/test-user-id/1234567890-abc123.pdf',
-            size: 1024,
-            mimetype: 'application/pdf',
-            uploadedAt: new Date(),
-            gcsPath: 'uploads/test-user-id/1234567890-abc123.pdf',
-          },
-        }), 30000)) // 30 second delay
-      );
-
-      await request(app)
-        .post('/upload')
-        .attach('file', Buffer.from('test content'), 'test-document.pdf')
-        .timeout(35000) // 35 second timeout
-        .expect(200);
-    });
-  });
-
-  describe('Data Integrity and Validation', () => {
-    it('should validate file metadata correctly', async () => {
-      const response = await request(app)
-        .post('/upload')
-        .attach('file', Buffer.from('test content'), 'test-document.pdf')
-        .expect(200);
-
-      // Verify file metadata is preserved
-      expect(fileStorageService.storeFile).toHaveBeenCalledWith(
-        expect.objectContaining({
-          originalname: 'test-document.pdf',
-          size: 1024,
-          mimetype: 'application/pdf',
-        }),
-        mockUser.uid
-      );
-    });
-
-    it('should generate unique file paths for each upload', async () => {
-      const uploads = [];
-      for (let i = 0; i < 3; i++) {
-        uploads.push(
-          request(app)
-            .post('/upload')
-            .attach('file', Buffer.from(`test content ${i}`), `test-document-${i}.pdf`)
-        );
-      }
-
-      const responses = await Promise.all(uploads);
-
-      // Verify each upload was called
-      expect(fileStorageService.storeFile).toHaveBeenCalledTimes(3);
+      expect(fileStorageService.generateSignedUploadUrl).toHaveBeenCalledTimes(concurrentRequests);
     });
   });
 });
\ No newline at end of file
diff --git a/backend/src/types/express.d.ts b/backend/src/types/express.d.ts
index 95bf120..688bf08 100644
--- a/backend/src/types/express.d.ts
+++ b/backend/src/types/express.d.ts
@@ -3,8 +3,6 @@ import { Request } from 'express';
 declare global {
   namespace Express {
     interface Request {
-      file?: Express.Multer.File;
-      files?: Express.Multer.File[];
       correlationId?: string;
     }
   }
diff --git a/currrent_output.json b/currrent_output.json
index f7736e9..b69f1be 100644
--- a/currrent_output.json
+++ b/currrent_output.json
@@ -294,7 +294,7 @@
       "processedAt": "2025-08-01T01:36:18.949+00:00",
       "uploadedBy": "UthFrGPrQLY6bzNL46aIOHck4yi1",
       "fileSize": 5768711,
-      "summary": "# CIM Analysis: 2025-04-23 Stax Holding Company, LLC Confidential Information Presentation for Stax Holding Company, LLC - April 2025.pdf\n\n## Executive Summary\nSample analysis generated by Document AI + Genkit integration.\n\n## Key Findings\n- Document processed successfully\n- AI analysis completed\n- Integration working as expected\n\n---\n*Generated by Document AI + Genkit integration*",
+      "summary": "# CIM Analysis: 2025-04-23 Stax Holding Company, LLC Confidential Information Presentation for Stax Holding Company, LLC - April 2025.pdf\n\n## Executive Summary\nSample analysis generated by Document AI + Agentic RAG integration.\n\n## Key Findings\n- Document processed successfully\n- AI analysis completed\n- Integration working as expected\n\n---\n*Generated by Document AI + Agentic RAG integration*",
       "error": null
     },
     {
diff --git a/frontend/src/services/documentService.ts b/frontend/src/services/documentService.ts
index ef9de35..67a204c 100644
--- a/frontend/src/services/documentService.ts
+++ b/frontend/src/services/documentService.ts
@@ -390,91 +390,7 @@ class DocumentService {
     });
   }

-  /**
-   * Legacy multipart upload method (kept for compatibility)
-   */
-  async uploadDocumentLegacy(
-    file: File,
-    onProgress?: (progress: number) => void,
-    signal?: AbortSignal
-  ): Promise {
-    try {
-      // Check authentication before upload
-      const token = await authService.getToken();
-      if (!token) {
-        throw new Error('Authentication required. Please log in to upload documents.');
-      }
-      console.log('📤 Starting legacy multipart upload...');
-      console.log('📤 File:', file.name, 'Size:', file.size, 'Type:', file.type);
-      console.log('📤 Token available:', !!token);
-
-      const formData = new FormData();
-      formData.append('document', file);
-
-      // Always use optimized agentic RAG processing - no strategy selection needed
-      formData.append('processingStrategy', 'optimized_agentic_rag');
-
-      const response = await apiClient.post('/documents/upload', formData, {
-        headers: {
-          'Content-Type': 'multipart/form-data',
-        },
-        signal, // Add abort signal support
-        onUploadProgress: (progressEvent) => {
-          if (onProgress && progressEvent.total) {
-            const progress = Math.round((progressEvent.loaded * 100) / progressEvent.total);
-            onProgress(progress);
-          }
-        },
-      });
-
-      console.log('✅ Legacy document upload successful:', response.data);
-      return response.data;
-    } catch (error: any) {
-      console.error('❌ Legacy document upload failed:', error);
-
-      // Provide more specific error messages
-      if (error.response?.status === 401) {
-        if (error.response?.data?.error === 'No valid authorization header') {
-          throw new Error('Authentication required. Please log in to upload documents.');
-        } else if (error.response?.data?.error === 'Token expired') {
-          throw new Error('Your session has expired. Please log in again.');
-        } else if (error.response?.data?.error === 'Invalid token') {
-          throw new Error('Authentication failed. Please log in again.');
-        } else {
-          throw new Error('Authentication error. Please log in again.');
-        }
-      } else if (error.response?.status === 400) {
-        if (error.response?.data?.error === 'No file uploaded') {
-          throw new Error('No file was selected for upload.');
-        } else if (error.response?.data?.error === 'File too large') {
-          throw new Error('File is too large. Please select a smaller file.');
-        } else if (error.response?.data?.error === 'File type not allowed') {
-          throw new Error('File type not supported. Please upload a PDF or text file.');
-        } else {
-          throw new Error(`Upload failed: ${error.response?.data?.error || 'Bad request'}`);
-        }
-      } else if (error.response?.status === 413) {
-        throw new Error('File is too large. Please select a smaller file.');
-      } else if (error.response?.status >= 500) {
-        throw new Error('Server error. Please try again later.');
-      } else if (error.code === 'ERR_NETWORK') {
-        throw new Error('Network error. Please check your connection and try again.');
-      } else if (error.name === 'AbortError') {
-        throw new Error('Upload was cancelled.');
-      }
-
-      // Handle GCS-specific errors
-      if (error.response?.data?.type === 'storage_error' ||
-          error.message?.includes('GCS') ||
-          error.message?.includes('storage.googleapis.com')) {
-        throw GCSErrorHandler.createGCSError(error, 'upload');
-      }
-
-      // Generic error fallback
-      throw new Error(error.response?.data?.error || error.message || 'Upload failed');
-    }
-  }

   /**
    * Get all documents for the current user