Pre Kiro
## Overview
This design addresses the systematic cleanup of a document processing application that has accumulated technical debt during migration from local deployment to Firebase/GCloud infrastructure. The application currently suffers from configuration inconsistencies, redundant files, and document upload errors that need to be resolved through a structured cleanup and debugging approach.

### Current Architecture Analysis

The application consists of:

- **Backend**: Node.js/TypeScript API deployed on Google Cloud Run
- **Frontend**: React/TypeScript SPA deployed on Firebase Hosting
- **Database**: Supabase (PostgreSQL) for document metadata
- **Storage**: Currently using local file storage (MUST migrate to GCS)
- **Processing**: Document AI + Agentic RAG pipeline
- **Authentication**: Firebase Auth

### Key Issues Identified

1. **Configuration Drift**: Multiple environment files with conflicting settings
2. **Local Dependencies**: Still using local file storage and local PostgreSQL references (MUST use only Supabase)
3. **Upload Errors**: Invalid UUID errors in document retrieval
4. **Deployment Complexity**: Mixed local/cloud deployment artifacts
5. **Error Handling**: Insufficient error logging and debugging capabilities
6. **Architecture Inconsistency**: Local storage and database incompatible with cloud deployment

This design addresses the systematic cleanup and stabilization of the CIM Document Processor backend, with a focus on fixing the processing pipeline and integrating Firebase Storage as the primary file storage solution. The design identifies critical issues in the current codebase and provides a comprehensive solution to ensure reliable document processing from upload through final PDF generation.

## Architecture

### Current Issues Identified

Based on error analysis and code review, the following critical issues have been identified:

1. **Database Query Issues**: UUID validation errors when non-UUID strings are passed to document queries
2. **Service Dependencies**: Circular dependencies and missing service imports
3. **Firebase Storage Integration**: Incomplete migration from Google Cloud Storage to Firebase Storage
4. **Error Handling**: Insufficient error handling and logging throughout the pipeline
5. **Configuration Management**: Environment variable validation issues in serverless environments
6. **Processing Pipeline**: Broken service orchestration in the document processing flow

### Target Architecture

```mermaid
graph TB
    subgraph "Frontend (Firebase Hosting)"
        A[React App] --> B[Document Upload Component]
        B --> C[Auth Context]
    end

    subgraph "Backend (Cloud Run)"
        D[Express API] --> E[Document Controller]
        E --> F[Upload Middleware]
        F --> G[File Storage Service]
        G --> H[GCS Bucket]
        E --> I[Document Model]
        I --> J[Supabase DB]
    end

    subgraph "Processing Pipeline"
        K[Job Queue] --> L[Document AI]
        L --> M[Agentic RAG]
        M --> N[PDF Generation]
    end

    A --> D
    E --> K

    subgraph "Authentication"
        O[Firebase Auth] --> A
        O --> D
    end
```

### Configuration Management Strategy

1. **Environment Separation**: Clear distinction between development, staging, and production
2. **Service-Specific Configs**: Separate Firebase, GCloud, and Supabase configurations
3. **Secret Management**: Proper handling of API keys and service account credentials
4. **Deployment Consistency**: Single deployment strategy per environment

```
┌─────────────────────────────────────────────────────────────────┐
│                        FRONTEND (React)                         │
├─────────────────────────────────────────────────────────────────┤
│       Firebase Auth + Document Upload → Firebase Storage        │
└─────────────────────────────────────────────────────────────────┘
                                 │
                                 ▼ HTTPS API
┌─────────────────────────────────────────────────────────────────┐
│                        BACKEND (Node.js)                        │
├─────────────────────────────────────────────────────────────────┤
│  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐  │
│  │ API Routes      │  │ Middleware      │  │ Error Handler   │  │
│  │ - Documents     │  │ - Auth          │  │ - Global        │  │
│  │ - Monitoring    │  │ - Validation    │  │ - Correlation   │  │
│  │ - Vector        │  │ - CORS          │  │ - Logging       │  │
│  └─────────────────┘  └─────────────────┘  └─────────────────┘  │
│                                                                 │
│  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐  │
│  │ Core Services   │  │ Processing      │  │ External APIs   │  │
│  │ - Document      │  │ - Agentic RAG   │  │ - Document AI   │  │
│  │ - Upload        │  │ - LLM Service   │  │ - Claude AI     │  │
│  │ - Session       │  │ - PDF Gen       │  │ - Firebase      │  │
│  └─────────────────┘  └─────────────────┘  └─────────────────┘  │
└─────────────────────────────────────────────────────────────────┘
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────┐
│                       STORAGE & DATABASE                        │
├─────────────────────────────────────────────────────────────────┤
│  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐  │
│  │ Firebase        │  │ Supabase        │  │ Vector          │  │
│  │ Storage         │  │ Database        │  │ Database        │  │
│  │ - File Upload   │  │ - Documents     │  │ - Embeddings    │  │
│  │ - Security      │  │ - Sessions      │  │ - Chunks        │  │
│  └─────────────────┘  └─────────────────┘  └─────────────────┘  │
└─────────────────────────────────────────────────────────────────┘
```

## Components and Interfaces

### 1. Enhanced Configuration Management

**Purpose**: Consolidate and standardize environment configurations, with robust environment variable validation for both development and production environments.

**Key Features**:
- Graceful handling of missing environment variables in serverless environments
- Runtime configuration validation
- Fallback values for non-critical settings
- Clear error messages for missing critical configuration

**Interface**:
```typescript
interface ConfigurationService {
  validateEnvironment(): Promise<ValidationResult>;
  consolidateConfigs(): Promise<void>;
  removeRedundantFiles(): Promise<string[]>;
  updateDeploymentConfigs(): Promise<void>;
}

interface Config {
  // Core settings
  env: string;
  port: number;

  // Firebase configuration
  firebase: {
    projectId: string;
    storageBucket: string;
    apiKey: string;
    authDomain: string;
  };

  // Database configuration
  supabase: {
    url: string;
    anonKey: string;
    serviceKey: string;
  };

  // External services
  googleCloud: {
    projectId: string;
    documentAiLocation: string;
    documentAiProcessorId: string;
    applicationCredentials: string;
  };

  // LLM configuration
  llm: {
    provider: 'anthropic' | 'openai';
    anthropicApiKey?: string;
    openaiApiKey?: string;
    model: string;
    maxTokens: number;
    temperature: number;
  };
}
```
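The graceful-validation behaviour described above can be sketched as a plain function. The variable names and default values below are illustrative assumptions, not the application's actual configuration schema:

```typescript
// Sketch: validate required env vars up front, fall back for optional ones.
// REQUIRED_VARS and OPTIONAL_DEFAULTS are example values for illustration.
interface ValidationResult {
  valid: boolean;
  missing: string[];
  warnings: string[];
}

const REQUIRED_VARS = ['SUPABASE_URL', 'SUPABASE_SERVICE_KEY', 'FIREBASE_PROJECT_ID'];
const OPTIONAL_DEFAULTS: Record<string, string> = { PORT: '8080', NODE_ENV: 'development' };

function validateEnvironment(env: Record<string, string | undefined>): ValidationResult {
  // Missing required vars make the result invalid; the caller decides whether to abort.
  const missing = REQUIRED_VARS.filter((name) => !env[name]);
  // Missing optional vars only produce warnings, so serverless cold starts don't fail.
  const warnings = Object.keys(OPTIONAL_DEFAULTS)
    .filter((name) => !env[name])
    .map((name) => `${name} not set, defaulting to ${OPTIONAL_DEFAULTS[name]}`);
  return { valid: missing.length === 0, missing, warnings };
}
```

Running this at startup (against `process.env`) gives the "clear error messages for missing critical configuration" behaviour without crashing on non-critical settings.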

**Responsibilities**:
- Remove duplicate/conflicting environment files
- Standardize Firebase and GCloud configurations
- Validate required environment variables
- Update deployment scripts and configurations

### 2. Storage Migration and Firebase Storage Service

**Purpose**: Complete the migration off local storage, integrating Firebase Storage as the primary file store (no local storage going forward).

**Key Features**:
- Secure file upload with Firebase Authentication
- Proper file organization and naming conventions
- File metadata management
- Download URL generation
- File cleanup and lifecycle management

**Interface**:
```typescript
interface StorageMigrationService {
  migrateExistingFiles(): Promise<MigrationResult>;
  replaceFileStorageService(): Promise<void>;
  validateGCSConfiguration(): Promise<boolean>;
  removeAllLocalStorageDependencies(): Promise<void>;
  updateDatabaseReferences(): Promise<void>;
}

interface FirebaseStorageService {
  uploadFile(file: Buffer, fileName: string, userId: string): Promise<string>;
  getDownloadUrl(filePath: string): Promise<string>;
  deleteFile(filePath: string): Promise<void>;
  getFileMetadata(filePath: string): Promise<FileMetadata>;
  generateUploadUrl(fileName: string, userId: string): Promise<string>;
}
```
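As one sketch of the "file organization and naming conventions" feature, a path-building helper might look like the following. The `users/{userId}/documents/…` layout is a hypothetical convention, not one confirmed by this design:

```typescript
// Hypothetical storage-path convention for uploads:
//   users/{userId}/documents/{timestamp}-{sanitized file name}
// A timestamp prefix avoids collisions between same-named uploads.
function buildStoragePath(userId: string, fileName: string, now: Date): string {
  // Strip characters that are awkward in object names.
  const safeName = fileName.replace(/[^a-zA-Z0-9._-]/g, '_');
  return `users/${userId}/documents/${now.getTime()}-${safeName}`;
}
```

Keying the prefix on `userId` also makes user-based security rules and per-user cleanup straightforward.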

**Responsibilities**:
- Migrate ALL existing uploaded files to GCS
- Completely replace the file storage service to use ONLY GCS
- Update all file path references in the database to GCS URLs
- Remove ALL local storage code and dependencies
- Ensure no fallback to local storage exists

### 3. Enhanced Document Service

**Purpose**: Centralized document management with proper error handling and validation, including diagnosis and resolution of document upload errors.

**Key Features**:
- UUID validation for all document operations
- Proper error handling for database operations
- Document lifecycle management
- Status tracking and updates
- Metadata management

**Interface**:
```typescript
interface UploadDiagnosticService {
  analyzeUploadErrors(): Promise<ErrorAnalysis>;
  validateUploadPipeline(): Promise<ValidationResult>;
  fixRouteHandling(): Promise<void>;
  improveErrorLogging(): Promise<void>;
}

interface DocumentService {
  createDocument(data: CreateDocumentData): Promise<Document>;
  getDocument(id: string): Promise<Document | null>;
  updateDocument(id: string, updates: Partial<Document>): Promise<Document>;
  deleteDocument(id: string): Promise<void>;
  listDocuments(userId: string, filters?: DocumentFilters): Promise<Document[]>;
  validateDocumentId(id: string): boolean;
}
```
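A minimal `validateDocumentId` implementation, assuming standard RFC 4122 UUIDs, would guard routes against the invalid-UUID errors described earlier:

```typescript
// Reject non-UUID route parameters before they reach Postgres, which would
// otherwise fail with "invalid input syntax for type uuid".
const UUID_RE = /^[0-9a-f]{8}-[0-9a-f]{4}-[1-5][0-9a-f]{3}-[89ab][0-9a-f]{3}-[0-9a-f]{12}$/i;

function validateDocumentId(id: string): boolean {
  return UUID_RE.test(id);
}
```

A typical failure mode is a static route segment such as `/documents/upload-status` matching a `/documents/:id` route; this check turns that into a clean 400 instead of a database error.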

**Responsibilities**:
- Analyze current upload error patterns
- Fix UUID validation issues in routes
- Improve error handling and logging
- Validate the complete upload pipeline

### 4. Improved Processing Pipeline and Deployment Standardization

**Purpose**: Reliable document processing with proper error handling and recovery, alongside standardized deployment processes with legacy artifacts removed.

**Key Features**:
- Step-by-step processing with checkpoints
- Error recovery and retry mechanisms
- Progress tracking and status updates
- Partial result preservation
- Processing timeout handling

**Interface**:
```typescript
interface DeploymentService {
  standardizeDeploymentScripts(): Promise<void>;
  removeLocalDeploymentArtifacts(): Promise<string[]>;
  validateCloudDeployment(): Promise<ValidationResult>;
  updateDocumentation(): Promise<void>;
}

interface ProcessingPipeline {
  processDocument(documentId: string, options: ProcessingOptions): Promise<ProcessingResult>;
  getProcessingStatus(documentId: string): Promise<ProcessingStatus>;
  retryProcessing(documentId: string, fromStep?: string): Promise<ProcessingResult>;
  cancelProcessing(documentId: string): Promise<void>;
}
```
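The checkpoint-and-retry behaviour behind `retryProcessing` can be sketched as a simple step runner; the step names and handler map below are illustrative, not the pipeline's real stages:

```typescript
// Sketch: each step records its status so a retry resumes from the first step
// that has not yet completed, preserving partial results.
type StepStatus = 'pending' | 'completed' | 'failed';

interface StepRecord { step: string; status: StepStatus; }

function runFrom(steps: StepRecord[], handlers: Record<string, () => void>): StepRecord[] {
  for (const record of steps) {
    if (record.status === 'completed') continue; // checkpoint: skip finished work
    try {
      handlers[record.step]();
      record.status = 'completed';
    } catch {
      record.status = 'failed';
      break; // stop here; the caller may retry from this checkpoint
    }
  }
  return steps;
}
```

In the real pipeline the step records would live in the `processingSteps` column of the document row, so a retry after a crash picks up where the last run stopped.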

**Responsibilities**:
- Remove local deployment scripts and configurations
- Standardize Cloud Run and Firebase deployment
- Update package.json scripts
- Create deployment documentation

### 5. Robust Error Handling System

**Purpose**: Comprehensive error handling with correlation tracking and proper logging.

**Key Features**:
- Correlation ID generation for request tracking
- Structured error logging
- Error categorization and handling strategies
- User-friendly error messages
- Error recovery mechanisms

**Interface**:
```typescript
interface ErrorHandler {
  handleError(error: Error, context: ErrorContext): ErrorResponse;
  logError(error: Error, correlationId: string, context: any): void;
  createCorrelationId(): string;
  categorizeError(error: Error): ErrorCategory;
}
```
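A minimal sketch of two of these methods, using Node's built-in `crypto.randomUUID` for IDs and an assumed name-based heuristic for categorization (the category names and error-name mapping are illustrative):

```typescript
import { randomUUID } from 'crypto';

// One correlation ID is generated per request and attached to every log line
// and error response, so a failure can be traced across services.
function createCorrelationId(): string {
  return randomUUID();
}

type ErrorCategory = 'validation' | 'not_found' | 'external' | 'internal';

// Illustrative heuristic: key the category off the error's name.
function categorizeError(error: Error): ErrorCategory {
  if (error.name === 'ValidationError') return 'validation';
  if (error.name === 'NotFoundError') return 'not_found';
  if (error.name === 'FetchError' || error.name === 'TimeoutError') return 'external';
  return 'internal';
}
```

A middleware would call `createCorrelationId()` once per request and pass the ID through to `logError` and the final `ErrorResponse`.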

## Data Models

### Configuration Validation Model

```typescript
interface ConfigValidation {
  environment: 'development' | 'staging' | 'production';
  requiredVars: string[];
  optionalVars: string[];
  conflicts: ConfigConflict[];
  missing: string[];
  status: 'valid' | 'invalid' | 'warning';
}

interface ConfigConflict {
  variable: string;
  values: string[];
  files: string[];
  resolution: string;
}
```

### Enhanced Document Model

```typescript
interface Document {
  id: string; // UUID
  userId: string;
  originalFileName: string;
  filePath: string; // Firebase Storage path
  fileSize: number;
  mimeType: string;
  status: DocumentStatus;
  extractedText?: string;
  generatedSummary?: string;
  summaryPdfPath?: string;
  analysisData?: CIMReview;
  processingSteps: ProcessingStep[];
  errorLog?: ErrorEntry[];
  createdAt: Date;
  updatedAt: Date;
}

enum DocumentStatus {
  UPLOADED = 'uploaded',
  PROCESSING = 'processing',
  COMPLETED = 'completed',
  FAILED = 'failed',
  CANCELLED = 'cancelled'
}

interface ProcessingStep {
  step: string;
  status: 'pending' | 'in_progress' | 'completed' | 'failed';
  startedAt?: Date;
  completedAt?: Date;
  error?: string;
  metadata?: Record<string, any>;
}
```
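The `DocumentStatus` values imply a lifecycle that a small transition guard can enforce; the exact transition table below is an assumption drawn from the statuses above (e.g. whether retry from `failed` is allowed is not stated in the design):

```typescript
// Assumed lifecycle: uploaded → processing → completed/failed/cancelled,
// with retry from failed back to processing. Terminal states allow nothing.
const ALLOWED: Record<string, string[]> = {
  uploaded: ['processing', 'cancelled'],
  processing: ['completed', 'failed', 'cancelled'],
  failed: ['processing'], // retry
  completed: [],
  cancelled: [],
};

function canTransition(from: string, to: string): boolean {
  return (ALLOWED[from] ?? []).includes(to);
}
```

Guarding status updates this way prevents, for example, a late worker marking a cancelled document as completed.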

### Migration Status Model

```typescript
interface MigrationStatus {
  totalFiles: number;
  migratedFiles: number;
  failedFiles: FileError[];
  storageUsage: {
    local: number;
    cloud: number;
  };
  status: 'pending' | 'in-progress' | 'completed' | 'failed';
}

interface FileError {
  filePath: string;
  error: string;
  retryCount: number;
  lastAttempt: Date;
}
```
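A small derived-progress helper over these counts might look like the following; the field names mirror `MigrationStatus`, and the rounding choice is arbitrary:

```typescript
// Derive a progress summary from MigrationStatus-style counts.
function migrationProgress(totalFiles: number, migratedFiles: number, failedCount: number) {
  // An empty migration is trivially complete.
  const percent = totalFiles === 0 ? 100 : Math.round((migratedFiles / totalFiles) * 100);
  const done = migratedFiles + failedCount >= totalFiles;
  return { percent, done, needsRetry: failedCount > 0 };
}
```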

### Upload Error Analysis Model

```typescript
interface UploadErrorAnalysis {
  errorTypes: {
    [key: string]: {
      count: number;
      examples: string[];
      severity: 'low' | 'medium' | 'high';
    };
  };
  affectedRoutes: string[];
  timeRange: {
    start: Date;
    end: Date;
  };
  recommendations: string[];
}
```

### Processing Session Model

```typescript
interface ProcessingSession {
  id: string;
  documentId: string;
  strategy: string;
  status: SessionStatus;
  steps: ProcessingStep[];
  totalSteps: number;
  completedSteps: number;
  failedSteps: number;
  processingTimeMs?: number;
  apiCallsCount: number;
  totalCost: number;
  errorLog: ErrorEntry[];
  createdAt: Date;
  completedAt?: Date;
}
```

## Error Handling

### Upload Error Resolution Strategy

1. **Route Parameter Validation**: Fix UUID validation in document routes
2. **Error Logging Enhancement**: Add structured logging with correlation IDs
3. **Graceful Degradation**: Implement fallback mechanisms for upload failures
4. **User Feedback**: Provide clear error messages to users

### Configuration Error Handling

1. **Validation on Startup**: Validate all configurations before service startup
2. **Fallback Configurations**: Provide sensible defaults for non-critical settings
3. **Environment Detection**: Automatically detect and configure for the deployment environment
4. **Configuration Monitoring**: Monitor configuration drift in production

### Storage Error Handling

1. **Retry Logic**: Implement exponential backoff for GCS operations
2. **Migration Safety**: Backup existing files before migration, then remove local storage completely
3. **Integrity Checks**: Validate file integrity after migration to GCS
4. **GCS-Only Operations**: All storage operations must use GCS exclusively (no local fallbacks)

### Error Categories and Strategies

1. **Validation Errors**
   - UUID format validation
   - File type and size validation
   - Required field validation
   - Strategy: Return 400 Bad Request with a detailed error message

2. **Authentication Errors**
   - Invalid or expired tokens
   - Missing authentication
   - Strategy: Return 401 Unauthorized, trigger token refresh

3. **Authorization Errors**
   - Insufficient permissions
   - Resource access denied
   - Strategy: Return 403 Forbidden with a clear message

4. **Resource Not Found**
   - Document not found
   - File not found
   - Strategy: Return 404 Not Found

5. **External Service Errors**
   - Firebase Storage errors
   - Document AI failures
   - LLM API errors
   - Strategy: Retry with exponential backoff, fallback options

6. **Processing Errors**
   - Text extraction failures
   - PDF generation errors
   - Database operation failures
   - Strategy: Preserve partial results, enable retry from checkpoint

7. **System Errors**
   - Memory issues
   - Timeout errors
   - Network failures
   - Strategy: Graceful degradation, error logging, monitoring alerts
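The exponential-backoff strategy named for storage and external-service errors can be sketched as follows; the base delay and retry count are example values, not tuned settings from this design:

```typescript
// Delays grow as base * 2^attempt: e.g. 100ms, 200ms, 400ms for three retries.
function backoffDelays(baseMs: number, maxRetries: number): number[] {
  const delays: number[] = [];
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    delays.push(baseMs * 2 ** attempt);
  }
  return delays;
}

// Retry an async operation, waiting the computed delay before each retry.
async function withRetry<T>(op: () => Promise<T>, baseMs = 100, maxRetries = 3): Promise<T> {
  let lastError: unknown;
  for (const delay of [0, ...backoffDelays(baseMs, maxRetries)]) {
    if (delay > 0) await new Promise((resolve) => setTimeout(resolve, delay));
    try {
      return await op();
    } catch (err) {
      lastError = err; // remember the failure and fall through to the next attempt
    }
  }
  throw lastError;
}
```

Adding random jitter to each delay is a common refinement to avoid synchronized retry storms.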

### Error Response Format

```typescript
interface ErrorResponse {
  success: false;
  error: {
    code: string;
    message: string;
    details?: any;
    correlationId: string;
    timestamp: string;
    retryable: boolean;
  };
}
```
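A small factory keeps correlation IDs and timestamps consistent across handlers that produce this shape; the error code used below is a made-up example:

```typescript
// Build an ErrorResponse-shaped object; timestamp is always ISO 8601.
function buildErrorResponse(code: string, message: string, correlationId: string, retryable: boolean) {
  return {
    success: false,
    error: {
      code,
      message,
      correlationId,
      timestamp: new Date().toISOString(),
      retryable,
    },
  };
}
```

The `retryable` flag lets the frontend decide whether to offer a "try again" action without parsing error codes.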

## Testing Strategy

### Unit Testing
- Service layer testing with mocked dependencies
- Utility function testing
- Configuration validation testing
- Error handling testing

### Configuration Testing
1. **Environment Validation Tests**: Verify all required configurations are present
2. **Configuration Conflict Tests**: Detect and report configuration conflicts
3. **Deployment Tests**: Validate deployment configurations work correctly
4. **Integration Tests**: Test that configuration changes don't break existing functionality

### Integration Testing
- Firebase Storage integration
- Database operations
- External API integrations
- End-to-end processing pipeline

### Upload Pipeline Testing
1. **Unit Tests**: Test individual upload components
2. **Integration Tests**: Test the complete upload pipeline
3. **Error Scenario Tests**: Test various error conditions and recovery
4. **Performance Tests**: Validate upload performance after changes

### Storage Migration Testing
1. **Migration Tests**: Test the file migration process
2. **Data Integrity Tests**: Verify files are correctly migrated
3. **Rollback Tests**: Test the ability to roll back a migration
4. **Performance Tests**: Compare storage performance before/after migration

### Error Scenario Testing
- Network failure simulation
- API rate limit testing
- Invalid input handling
- Timeout scenario testing

### End-to-End Testing
1. **User Journey Tests**: Test the complete user upload journey
2. **Cross-Environment Tests**: Verify functionality across all environments
3. **Regression Tests**: Ensure cleanup doesn't break existing features
4. **Load Tests**: Validate system performance under load

### Performance Testing
- Large file upload testing
- Concurrent processing testing
- Memory usage monitoring
- API response time testing

## Implementation Phases

### Phase 1: Analysis and Planning
- Audit current configuration files and identify conflicts
- Analyze upload error patterns and root causes
- Document the current deployment process and identify issues
- Create a detailed cleanup and migration plan

### Phase 2: Core Infrastructure Cleanup
1. Fix configuration management and environment variable handling
2. Implement proper UUID validation for database queries
3. Set up comprehensive error handling and logging
4. Fix service dependency issues

### Phase 3: Configuration Cleanup
- Remove redundant and conflicting configuration files
- Standardize environment variable naming and structure
- Update deployment configurations for consistency
- Validate configurations across all environments

### Phase 4: Firebase Storage Integration
1. Implement the Firebase Storage service
2. Update file upload endpoints
3. Migrate existing file operations
4. Update frontend integration

### Phase 5: Storage Migration
- Implement Google Cloud Storage integration
- Migrate existing files from local storage to GCS
- Update the file storage service and database references
- Test and validate storage functionality

### Phase 6: Processing Pipeline Stabilization
1. Fix service orchestration issues
2. Implement proper error recovery
3. Add processing checkpoints
4. Enhance monitoring and logging

### Phase 7: Upload Error Resolution
- Fix UUID validation issues in document routes
- Improve error handling and logging throughout the upload pipeline
- Implement better user feedback for upload errors
- Add monitoring and alerting for upload failures

### Phase 8: Testing and Optimization
1. Comprehensive testing suite
2. Performance optimization
3. Error scenario testing
4. Documentation updates

### Phase 9: Deployment Standardization
- Remove local deployment artifacts and scripts
- Standardize Cloud Run and Firebase deployment processes
- Update documentation and deployment guides
- Implement automated deployment validation

### Phase 10: Testing and Validation
- Comprehensive testing of all changes
- Performance validation and optimization
- User acceptance testing
- Production deployment and monitoring

## Security Considerations
||||
### Firebase Storage Security
- Proper Firebase Security Rules
- User-based file access control
- File type and size validation
- Secure download URL generation

### API Security
- Request validation and sanitization
- Rate limiting and abuse prevention
- Correlation ID tracking
- Secure error messages (no sensitive data exposure)

### Data Protection
- User data isolation
- Secure file deletion
- Audit logging
- GDPR compliance considerations

## Monitoring and Observability

### Key Metrics
- Document processing success rate
- Average processing time per document
- API response times
- Error rates by category
- Firebase Storage usage
- Database query performance

### Logging Strategy
- Structured logging with correlation IDs
- Error categorization and tracking
- Performance metrics logging
- User activity logging
- External service interaction logging

### Health Checks
- Service availability checks
- Database connectivity
- External service status
- File storage accessibility
- Processing pipeline health

## Introduction

This feature focuses on cleaning up the codebase that has accumulated technical debt during the migration from local deployment to the Firebase/GCloud solution, and on resolving persistent document upload errors. The cleanup will improve code maintainability, remove redundant configurations, and establish a clear deployment strategy while fixing the core document upload functionality.

The CIM Document Processor is experiencing backend processing failures that prevent the full document processing pipeline from working correctly. The system has a complex architecture with multiple services (Document AI, LLM processing, PDF generation, vector database, etc.) that need to be cleaned up and properly integrated to ensure reliable document processing from upload through final PDF generation.

## Requirements

### Requirement 1

**User Story:** As a developer, I want a clean and organized codebase, so that I can easily maintain and extend the application without confusion from legacy configurations.

#### Acceptance Criteria

1. WHEN reviewing the codebase THEN the system SHALL have only necessary environment files and configurations
2. WHEN examining deployment configurations THEN the system SHALL have a single, clear deployment strategy for each environment
3. WHEN looking at service configurations THEN the system SHALL have consistent Firebase/GCloud integration without local deployment remnants
4. WHEN reviewing file structure THEN the system SHALL have organized directories without redundant or conflicting files

### Requirement 2

**User Story:** As a developer, I want a clean and properly functioning backend codebase, so that I can reliably process CIM documents without errors.

#### Acceptance Criteria

1. WHEN the backend starts THEN all services SHALL initialize without errors
2. WHEN environment variables are loaded THEN all required configuration SHALL be validated and available
3. WHEN database connections are established THEN all database operations SHALL work correctly
4. WHEN external service integrations are tested THEN Google Document AI, Claude AI, and Firebase Storage SHALL be properly connected

### Requirement 3

**User Story:** As a user, I want to upload documents successfully, so that I can process and analyze my files without encountering errors.

#### Acceptance Criteria

1. WHEN a user uploads a document THEN the system SHALL accept the file and begin processing without errors
2. WHEN document upload fails THEN the system SHALL provide clear error messages indicating the specific issue
3. WHEN processing a document THEN the system SHALL handle all file types supported by the Document AI service
4. WHEN upload completes THEN the system SHALL store the document in the correct Firebase/GCloud storage location

### Requirement 4

**User Story:** As a user, I want to upload PDF documents successfully, so that I can process CIM documents for analysis.

#### Acceptance Criteria

1. WHEN a user uploads a PDF file THEN the file SHALL be stored in Firebase Storage
2. WHEN upload is confirmed THEN a processing job SHALL be created in the database
3. WHEN upload fails THEN the user SHALL receive clear error messages
4. WHEN upload monitoring is active THEN real-time progress SHALL be tracked and displayed

### Requirement 5

**User Story:** As a developer, I want clear error logging and debugging capabilities, so that I can quickly identify and resolve issues in the document processing pipeline.

#### Acceptance Criteria

1. WHEN an error occurs during upload THEN the system SHALL log detailed error information including stack traces
2. WHEN debugging upload issues THEN the system SHALL provide clear logging at each step of the process
3. WHEN errors occur THEN the system SHALL distinguish between client-side and server-side issues
4. WHEN reviewing logs THEN the system SHALL have structured logging with appropriate log levels

### Requirement 6

**User Story:** As a user, I want the document processing pipeline to work end-to-end, so that I can get structured CIM analysis results.

#### Acceptance Criteria

1. WHEN a document is uploaded THEN Google Document AI SHALL extract text successfully
2. WHEN text is extracted THEN the optimized agentic RAG processor SHALL chunk and process the content
3. WHEN chunks are processed THEN vector embeddings SHALL be generated and stored
4. WHEN LLM analysis is triggered THEN Claude AI SHALL generate structured CIM review data
5. WHEN analysis is complete THEN a PDF summary SHALL be generated using Puppeteer
6. WHEN processing fails at any step THEN error handling SHALL provide graceful degradation

### Requirement 7

**User Story:** As a system administrator, I want consistent and secure configuration management, so that the application can be deployed reliably across different environments.

#### Acceptance Criteria

1. WHEN deploying to different environments THEN the system SHALL use environment-specific configurations
2. WHEN handling sensitive data THEN the system SHALL properly manage API keys and credentials
3. WHEN configuring services THEN the system SHALL have consistent Firebase/GCloud service initialization
4. WHEN reviewing security THEN the system SHALL have proper authentication and authorization for file uploads

### Requirement 8

**User Story:** As a developer, I want proper error handling and logging throughout the system, so that I can diagnose and fix issues quickly.

#### Acceptance Criteria

1. WHEN errors occur THEN they SHALL be logged with correlation IDs for tracking
2. WHEN API calls fail THEN retry logic SHALL be implemented with exponential backoff
3. WHEN processing fails THEN partial results SHALL be preserved where possible
4. WHEN system health is checked THEN monitoring endpoints SHALL provide accurate status information

### Requirement 9

**User Story:** As a developer, I want to understand the current system architecture, so that I can make informed decisions about cleanup priorities and upload error resolution.

#### Acceptance Criteria

1. WHEN analyzing the codebase THEN the system SHALL have documented service dependencies and data flow
2. WHEN reviewing the upload process THEN the system SHALL have a clear understanding of each processing step
3. WHEN examining errors THEN the system SHALL identify specific failure points in the upload pipeline
4. WHEN planning cleanup THEN the system SHALL prioritize changes that don't break existing functionality

### Requirement 10

**User Story:** As a user, I want the frontend to properly communicate with the backend, so that I can see processing status and results in real-time.

#### Acceptance Criteria

1. WHEN the frontend makes API calls THEN authentication SHALL work correctly
2. WHEN processing is in progress THEN real-time status updates SHALL be displayed
3. WHEN processing is complete THEN results SHALL be downloadable
4. WHEN errors occur THEN user-friendly error messages SHALL be shown

### Requirement 11

**User Story:** As a developer, I want clean service dependencies and proper separation of concerns, so that the codebase is maintainable and testable.

#### Acceptance Criteria

1. WHEN services are initialized THEN dependencies SHALL be properly injected
2. WHEN business logic is executed THEN it SHALL be separated from API routing
3. WHEN database operations are performed THEN they SHALL use proper connection pooling
4. WHEN external APIs are called THEN they SHALL have proper rate limiting and error handling
@@ -1,85 +1,217 @@

# Implementation Plan

- [x] 1. Audit and analyze current codebase configuration issues
  - Identify all environment files and their conflicts
  - Document current local dependencies (storage and database)
  - Analyze upload error patterns from logs
  - Map current deployment artifacts and scripts
  - _Requirements: 1.1, 1.2, 1.3, 1.4_

## 1. Fix Core Configuration and Environment Management

- [x] 2. Remove redundant and conflicting configuration files
  - Delete duplicate .env files (.env.backup, .env.backup.hybrid, .env.development, .env.document-ai-template)
  - Consolidate environment variables into single .env.example and production configs
  - Remove local PostgreSQL configuration references from env.ts
  - Update config validation schema to require only cloud services
  - _Requirements: 1.1, 4.1, 4.2_

- [ ] 1.1 Update environment configuration validation
  - Modify `backend/src/config/env.ts` to handle serverless environment variable loading gracefully
  - Add runtime configuration validation with proper fallbacks
  - Implement configuration health check endpoint
  - Add Firebase configuration validation
  - _Requirements: 1.1, 1.2_

- [x] 3. Implement Google Cloud Storage service integration
  - Create/confirm GCS-only file storage service replacing current local storage
  - Implement GCS bucket operations (upload, download, delete, list)
  - Add proper error handling and retry logic for GCS operations
  - Configure GCS authentication using service account
  - _Requirements: 2.1, 2.2, 4.3_

- [ ] 1.2 Implement robust error handling middleware
  - Create enhanced error handler in `backend/src/middleware/errorHandler.ts`
  - Add correlation ID generation and tracking
  - Implement structured error logging with Winston
  - Add error categorization and response formatting
  - _Requirements: 4.1, 4.2_
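As a sketch of the correlation-ID tracking described in task 1.2 (illustrative only — the real implementation belongs in `backend/src/middleware/errorHandler.ts`; the `x-correlation-id` header name and the `ensureCorrelationId` helper are assumptions, not existing code):

```typescript
import { randomUUID } from "crypto";

// Minimal request/response shapes so the sketch stays self-contained.
interface Req { headers: Record<string, string | undefined>; correlationId?: string; }
interface Res { setHeader(name: string, value: string): void; }

// Reuse an inbound correlation ID if the caller sent one; otherwise mint a new one.
export function ensureCorrelationId(req: Req, res: Res): string {
  const id = req.headers["x-correlation-id"] ?? randomUUID();
  req.correlationId = id;                 // downstream handlers and loggers can read it
  res.setHeader("x-correlation-id", id);  // echo it back so clients can report it
  return id;
}
```

Echoing the ID in the response lets users paste it into bug reports, which makes log correlation across Cloud Run instances much easier.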

- [ ] 4. Migrate existing files from local storage to GCS
  - Create migration script to upload all files from backend/uploads to GCS
  - Update database file_path references to use GCS URLs instead of local paths
  - Verify file integrity after migration
  - Create backup of local files before cleanup
  - _Requirements: 2.1, 2.2_

- [ ] 1.3 Fix database query validation issues
  - Update all document-related database queries to validate UUID format before execution
  - Add proper input sanitization in `backend/src/models/DocumentModel.ts`
  - Implement UUID validation utility function
  - Fix the "invalid input syntax for type uuid" errors seen in logs
  - _Requirements: 1.3, 4.1_
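The UUID validation utility from task 1.3 could look roughly like this (a sketch — the helper names are assumptions; the actual code would live in `DocumentModel.ts` or a shared utility):

```typescript
// RFC 4122 UUID shape: 8-4-4-4-12 hex digits; version nibble 1-5, variant 8/9/a/b.
const UUID_RE = /^[0-9a-f]{8}-[0-9a-f]{4}-[1-5][0-9a-f]{3}-[89ab][0-9a-f]{3}-[0-9a-f]{12}$/i;

export function isValidUuid(value: unknown): value is string {
  return typeof value === "string" && UUID_RE.test(value);
}

// Guard a lookup before it reaches Postgres, so a bad id becomes a clean
// 400-style error instead of "invalid input syntax for type uuid".
export function assertDocumentId(id: unknown): string {
  if (!isValidUuid(id)) {
    throw new Error(`Invalid document id: ${String(id)}`);
  }
  return id;
}
```

Validating at the route/model boundary keeps malformed ids out of the database driver entirely, which is where the logged errors originate.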

- [x] 5. Update file storage service to use GCS exclusively
  - Replace fileStorageService.ts to use only Google Cloud Storage
  - Remove all local file system operations (fs.readFileSync, fs.writeFileSync, etc.)
  - Update upload middleware to work with GCS temporary URLs
  - Remove local upload directory creation and management
  - _Requirements: 2.1, 2.2, 2.3_

## 2. Implement Firebase Storage Integration

- [x] 6. Fix document upload route UUID validation errors
  - Analyze and fix invalid UUID errors in document routes
  - Add proper UUID validation middleware for document ID parameters
  - Improve error messages for invalid document ID requests
  - Add request correlation IDs for better error tracking
  - _Requirements: 2.2, 3.1, 3.2, 3.3_

- [ ] 2.1 Create Firebase Storage service
  - Implement `backend/src/services/firebaseStorageService.ts` with complete Firebase Storage integration
  - Add file upload, download, delete, and metadata operations
  - Implement secure file path generation and user-based access control
  - Add proper error handling and retry logic for Firebase operations
  - _Requirements: 2.1, 2.5_
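The secure path generation mentioned in task 2.1 might be sketched as follows (the `users/<uid>/documents/<docId>/...` layout and the `buildStoragePath` name are assumptions, not the service's actual scheme):

```typescript
// Build a user-scoped object path like "users/<uid>/documents/<docId>/<safe-name>".
// Keeping every object under the owner's prefix lets Storage security rules
// grant access by prefix match rather than per-object ACLs.
export function buildStoragePath(userId: string, documentId: string, filename: string): string {
  // Strip path separators and other unsafe characters from the client-supplied name.
  const safeName = filename.replace(/[^\w.\-]+/g, "_");
  return `users/${userId}/documents/${documentId}/${safeName}`;
}
```

Sanitizing the client-supplied filename here also neutralizes path-traversal attempts before the name ever reaches storage.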

- [x] 7. Remove all local storage dependencies and cleanup
  - Delete backend/uploads directory and all local file references
  - Remove local storage configuration from env.ts and related files
  - Update upload middleware to remove local file system operations
  - Remove cleanup functions for local files

- [ ] 2.2 Update file upload endpoints
  - Modify `backend/src/routes/documents.ts` to use Firebase Storage instead of Google Cloud Storage
  - Update upload URL generation to use Firebase Storage signed URLs
  - Implement proper file validation (type, size, security)
  - Add upload progress tracking and monitoring
  - _Requirements: 2.1, 2.4_

- [x] 8. Standardize deployment configurations for cloud-only architecture
  - Update Firebase deployment configurations for both frontend and backend
  - Remove any local deployment scripts and references
  - Standardize Cloud Run deployment configuration
  - Update package.json scripts to remove local development dependencies
  - _Requirements: 1.1, 1.4, 4.1_

- [ ] 2.3 Update document processing pipeline for Firebase Storage
  - Modify `backend/src/services/unifiedDocumentProcessor.ts` to work with Firebase Storage
  - Update file retrieval operations in processing services
  - Ensure PDF generation service can access files from Firebase Storage
  - Update all file path references throughout the codebase
  - _Requirements: 2.1, 3.1_

- [x] 9. Enhance error logging and monitoring for upload pipeline
  - Add structured logging with correlation IDs throughout upload process
  - Implement better error categorization and reporting
  - Add monitoring for upload success/failure rates
  - Create error dashboards for upload pipeline debugging
  - _Requirements: 3.1, 3.2, 3.3_

## 3. Fix Service Dependencies and Orchestration

- [x] 10. Update frontend to handle GCS-based file operations
  - Update DocumentUpload component to work with GCS URLs
  - Modify file progress monitoring to work with cloud storage
  - Update error handling for GCS-specific errors
  - Test upload functionality with new GCS backend
  - _Requirements: 2.1, 2.2, 3.4_

- [ ] 3.1 Resolve service import and dependency issues
  - Fix circular dependencies in service imports
  - Ensure all required services are properly imported and initialized
  - Add dependency injection pattern for better testability
  - Update service initialization order in `backend/src/index.ts`
  - _Requirements: 6.1, 6.2_

- [x] 11. Create comprehensive tests for cloud-only architecture
  - Write unit tests for GCS file storage service
  - Create integration tests for complete upload pipeline
  - Add tests for error scenarios and recovery
  - Test deployment configurations in staging environment
  - _Requirements: 1.4, 2.1, 2.2, 2.3_

- [ ] 3.2 Fix document processing pipeline orchestration
  - Update `backend/src/services/unifiedDocumentProcessor.ts` to handle all processing steps correctly
  - Ensure proper error handling between processing steps
  - Add processing checkpoints and recovery mechanisms
  - Implement proper status updates throughout the pipeline
  - _Requirements: 3.1, 3.2, 3.6_
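The checkpoint/resume idea in task 3.2 can be sketched as a simple step ladder (the step names are illustrative, not the pipeline's actual statuses):

```typescript
// Ordered pipeline steps; a checkpoint records the last completed step so a
// failed run can resume instead of starting over.
const STEPS = ["uploaded", "text_extracted", "analyzed", "pdf_generated", "complete"] as const;
type Step = (typeof STEPS)[number];

export function nextStep(current: Step): Step | null {
  const i = STEPS.indexOf(current);
  return i >= 0 && i < STEPS.length - 1 ? STEPS[i + 1] : null;
}

// Decide where to resume after a crash: redo the step after the last checkpoint.
export function resumeFrom(checkpoint: Step | null): Step {
  return checkpoint ? (nextStep(checkpoint) ?? "complete") : STEPS[0];
}
```

Persisting the checkpoint (e.g., in the document's status column) is what makes recovery possible after a Cloud Run instance is recycled mid-run.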

- [x] 12. Validate and test complete system functionality
  - Perform end-to-end testing of document upload and processing
  - Validate all environment configurations work correctly
  - Test error handling and user feedback mechanisms
  - Verify no local dependencies remain in the system
  - _Requirements: 1.1, 1.2, 1.4, 2.1, 2.2, 2.3, 2.4_

- [ ] 3.3 Enhance optimized agentic RAG processor
  - Fix any issues in `backend/src/services/optimizedAgenticRAGProcessor.ts`
  - Ensure proper memory management and garbage collection
  - Add better error handling for LLM API calls
  - Implement proper retry logic with exponential backoff
  - _Requirements: 3.3, 3.4, 4.2_

## 4. Improve LLM Service Integration

- [ ] 4.1 Fix LLM service configuration and initialization
  - Update `backend/src/services/llmService.ts` to handle configuration properly
  - Fix model selection logic and API key validation
  - Add proper timeout handling for LLM API calls
  - Implement cost tracking and usage monitoring
  - _Requirements: 1.4, 3.4_

- [ ] 4.2 Enhance LLM error handling and retry logic
  - Add comprehensive error handling for both Anthropic and OpenAI APIs
  - Implement retry logic with exponential backoff for API failures
  - Add fallback model selection when primary model fails
  - Implement proper JSON parsing and validation for LLM responses
  - _Requirements: 3.4, 4.2_
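The exponential-backoff retry called for in tasks 3.3 and 4.2 might look like this minimal sketch (the helper names and default timings are assumptions, not the service's actual values):

```typescript
// Delay for attempt n (0-based): base * 2^n, capped so waits stay bounded.
export function backoffDelayMs(attempt: number, baseMs = 500, capMs = 30_000): number {
  return Math.min(capMs, baseMs * 2 ** attempt);
}

// Generic retry wrapper: retries `fn` up to `maxAttempts` times, waiting an
// exponentially growing delay between failures.
export async function withRetry<T>(
  fn: () => Promise<T>,
  maxAttempts = 3,
  sleep: (ms: number) => Promise<void> = (ms) => new Promise((r) => setTimeout(r, ms)),
): Promise<T> {
  let lastErr: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastErr = err;
      if (attempt < maxAttempts - 1) await sleep(backoffDelayMs(attempt));
    }
  }
  throw lastErr;
}
```

Injecting the `sleep` function keeps the wrapper unit-testable without real delays; production code would likely also add jitter and only retry transient (429/5xx) errors.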

- [ ] 4.3 Add LLM response validation and self-correction
  - Enhance JSON extraction from LLM responses
  - Add schema validation for CIM review data
  - Implement self-correction mechanism for invalid responses
  - Add quality scoring and validation for generated content
  - _Requirements: 3.4, 4.3_
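The JSON extraction enhancement in task 4.3 can be illustrated with a small helper (a sketch under the assumption that model replies wrap JSON in prose or fenced blocks; not the service's actual parser):

```typescript
// LLM replies often wrap JSON in prose or markdown fences; pull out the first
// top-level object and parse it. Schema validation (e.g., with zod) comes after.
export function extractJson(raw: string): unknown {
  // Prefer a fenced block when present.
  const fenced = raw.match(/```(?:json)?\s*([\s\S]*?)```/);
  const text = fenced ? fenced[1] : raw;
  const start = text.indexOf("{");
  const end = text.lastIndexOf("}");
  if (start === -1 || end <= start) throw new Error("No JSON object found in response");
  return JSON.parse(text.slice(start, end + 1));
}
```

A self-correction loop would catch the thrown error and re-prompt the model with the parse failure, rather than failing the whole document.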

## 5. Fix PDF Generation and File Operations

- [ ] 5.1 Update PDF generation service for Firebase Storage
  - Modify `backend/src/services/pdfGenerationService.ts` to work with Firebase Storage
  - Ensure generated PDFs are properly stored in Firebase Storage
  - Add proper error handling for PDF generation failures
  - Implement PDF generation progress tracking
  - _Requirements: 3.5, 2.1_

- [ ] 5.2 Implement proper file cleanup and lifecycle management
  - Add automatic cleanup of temporary files during processing
  - Implement file lifecycle management (retention policies)
  - Add proper error handling for file operations
  - Ensure no orphaned files are left in storage
  - _Requirements: 2.1, 4.3_

## 6. Enhance Database Operations and Models

- [ ] 6.1 Fix document model and database operations
  - Update `backend/src/models/DocumentModel.ts` with proper UUID validation
  - Add comprehensive error handling for all database operations
  - Implement proper connection pooling and retry logic
  - Add database operation logging and monitoring
  - _Requirements: 1.3, 6.3_

- [ ] 6.2 Implement processing session tracking
  - Enhance agentic RAG session management in database models
  - Add proper session lifecycle tracking
  - Implement session cleanup and archival
  - Add session analytics and reporting
  - _Requirements: 3.6, 4.4_

- [ ] 6.3 Add vector database integration fixes
  - Ensure vector database service is properly integrated
  - Fix any issues with embedding generation and storage
  - Add proper error handling for vector operations
  - Implement vector database health checks
  - _Requirements: 3.3, 1.3_

## 7. Implement Comprehensive Monitoring and Logging

- [ ] 7.1 Add structured logging throughout the application
  - Implement correlation ID tracking across all services
  - Add comprehensive logging for all processing steps
  - Create structured log format for better analysis
  - Add log aggregation and monitoring setup
  - _Requirements: 4.1, 4.4_

- [ ] 7.2 Implement health check and monitoring endpoints
  - Create comprehensive health check endpoint that tests all services
  - Add monitoring endpoints for processing statistics
  - Implement real-time status monitoring
  - Add alerting for critical failures
  - _Requirements: 4.4, 1.1_
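The aggregation behind the comprehensive health check in task 7.2 might be sketched as follows (the `CheckResult` shape and the `ok`/`degraded` statuses are assumptions about how the endpoint would report):

```typescript
type CheckResult = { name: string; ok: boolean; detail?: string };

// Combine individual service probes (database, storage, LLM APIs) into one
// overall status for a /health endpoint: "ok" only if every check passed.
export function aggregateHealth(checks: CheckResult[]): { status: "ok" | "degraded"; failing: string[] } {
  const failing = checks.filter((c) => !c.ok).map((c) => c.name);
  return { status: failing.length === 0 ? "ok" : "degraded", failing };
}
```

Listing the failing check names in the response gives the monitoring dashboard something actionable instead of a bare 500.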

- [ ] 7.3 Add performance monitoring and metrics
  - Implement processing time tracking for all operations
  - Add memory usage monitoring and alerts
  - Create API response time monitoring
  - Add cost tracking for external service usage
  - _Requirements: 4.4, 3.6_

## 8. Update Frontend Integration

- [ ] 8.1 Update frontend Firebase Storage integration
  - Modify frontend upload components to work with Firebase Storage
  - Update authentication flow for Firebase Storage access
  - Add proper error handling and user feedback for upload operations
  - Implement upload progress tracking on frontend
  - _Requirements: 5.1, 5.4_

- [ ] 8.2 Enhance frontend error handling and user experience
  - Add proper error message display for all error scenarios
  - Implement retry mechanisms for failed operations
  - Add loading states and progress indicators
  - Ensure real-time status updates work correctly
  - _Requirements: 5.2, 5.4_

## 9. Testing and Quality Assurance

- [ ] 9.1 Create comprehensive unit tests
  - Write unit tests for all service functions
  - Add tests for error handling scenarios
  - Create tests for configuration validation
  - Add tests for UUID validation and database operations
  - _Requirements: 6.4_

- [ ] 9.2 Implement integration tests
  - Create end-to-end tests for document processing pipeline
  - Add tests for Firebase Storage integration
  - Create tests for external API integrations
  - Add tests for error recovery scenarios
  - _Requirements: 6.4_

- [ ] 9.3 Add performance and load testing
  - Create tests for large file processing
  - Add concurrent processing tests
  - Implement memory leak detection tests
  - Add API rate limiting tests
  - _Requirements: 6.4_

## 10. Documentation and Deployment

- [ ] 10.1 Update configuration documentation
  - Document all required environment variables
  - Create setup guides for Firebase Storage configuration
  - Add troubleshooting guides for common issues
  - Update deployment documentation
  - _Requirements: 1.2_

- [ ] 10.2 Create operational runbooks
  - Document error recovery procedures
  - Create monitoring and alerting setup guides
  - Add performance tuning guidelines
  - Create backup and disaster recovery procedures
  - _Requirements: 4.4_

- [ ] 10.3 Final integration testing and deployment
  - Perform comprehensive end-to-end testing
  - Validate all error scenarios work correctly
  - Test deployment in staging environment
  - Perform production deployment with monitoring
  - _Requirements: 1.1, 2.1, 3.1_
@@ -85,7 +85,7 @@ Document Uploaded
        ▼
┌─────────────────┐
│ 1. Text         │ ──► Google Document AI extracts text from PDF
│    Extraction   │     (documentAiGenkitProcessor or direct Document AI)
│    Extraction   │     (documentAiProcessor or direct Document AI)
└─────────┬───────┘
          │
          ▼

@@ -92,7 +92,7 @@
        ▼
┌─────────────────┐
│ 4. Text         │ ──► Google Document AI extracts text from PDF
│    Extraction   │     (documentAiGenkitProcessor or direct Document AI)
│    Extraction   │     (documentAiProcessor or direct Document AI)
└─────────┬───────┘
          │
          ▼
@@ -35,7 +35,7 @@ This report analyzes the dependencies in both backend and frontend packages to i
- `joi` - Used for environment validation and middleware validation
- `zod` - Used in llmSchemas.ts and llmService.ts for schema validation
- `multer` - Used in upload middleware (legacy multipart upload)
- `pdf-parse` - Used in documentAiGenkitProcessor.ts (legacy processor)
- `pdf-parse` - Used in documentAiProcessor.ts (Document AI fallback)

#### ⚠️ **Potentially Unused Dependencies**
- `redis` - Only imported in sessionService.ts but may not be actively used

@@ -108,7 +108,7 @@ None identified - all dependencies appear to be used somewhere in the codebase.
### Current Active Strategy
Based on the code analysis, the current processing strategy is:
- **Primary**: `optimized_agentic_rag` (most actively used)
- **Fallback**: `document_ai_genkit` (legacy implementation)
- **Fallback**: `document_ai_agentic_rag` (Document AI + Agentic RAG)

### Unused Processing Strategies
The following strategies are implemented but not actively used:

@@ -130,7 +130,7 @@ The following strategies are implemented but not actively used:

#### ⚠️ **Legacy Services (Can be removed)**
- `documentProcessingService` - Legacy chunking service
- `documentAiGenkitProcessor` - Legacy Document AI processor
- `documentAiProcessor` - Document AI + Agentic RAG processor
- `ragDocumentProcessor` - Basic RAG processor

## Outdated Packages Analysis

@@ -223,7 +223,7 @@ The following strategies are implemented but not actively used:

1. **Remove legacy processing services**:
   - `documentProcessingService.ts`
   - `documentAiGenkitProcessor.ts`
   - `documentAiProcessor.ts`
   - `ragDocumentProcessor.ts`

2. **Simplify unifiedDocumentProcessor**:

@@ -299,7 +299,7 @@ The following strategies are implemented but not actively used:

### Key Findings
1. **Unused Dependencies**: 2 frontend dependencies (`clsx`, `tailwind-merge`) are completely unused
2. **Legacy Services**: 3 processing services can be removed (`documentProcessingService`, `documentAiGenkitProcessor`, `ragDocumentProcessor`)
2. **Legacy Services**: 2 processing services can be removed (`documentProcessingService`, `ragDocumentProcessor`)
3. **Redundant Dependencies**: Both `joi` and `zod` for validation, both `pg` and Supabase for database
4. **Outdated Packages**: 21 backend and 15 frontend packages have updates available
5. **Major Version Updates**: Many packages require major version updates with potential breaking changes
@@ -1,10 +1,10 @@
# Document AI + Genkit Integration Guide
# Document AI + Agentic RAG Integration Guide

## Overview

This guide explains how to integrate Google Cloud Document AI with Genkit for enhanced CIM document processing. This approach provides superior text extraction and structured analysis compared to traditional PDF parsing.
This guide explains how to integrate Google Cloud Document AI with Agentic RAG for enhanced CIM document processing. This approach provides superior text extraction and structured analysis compared to traditional PDF parsing.

## 🎯 **Benefits of Document AI + Genkit**
## 🎯 **Benefits of Document AI + Agentic RAG**

### **Document AI Advantages:**
- **Superior text extraction** from complex PDF layouts

@@ -13,7 +13,7 @@ This guide explains how to integrate Google Cloud Document AI with Genkit for en
- **Layout understanding** maintains document structure
- **Multi-format support** (PDF, images, scanned documents)

### **Genkit Advantages:**
### **Agentic RAG Advantages:**
- **Structured AI workflows** with type safety
- **Map-reduce processing** for large documents
- **Timeout handling** and error recovery

@@ -77,7 +77,7 @@ Add these to your `package.json`:
  "dependencies": {
    "@google-cloud/documentai": "^8.0.0",
    "@google-cloud/storage": "^7.0.0",
    "genkit": "^0.1.0",
    "@google-cloud/documentai": "^8.0.0",
    "zod": "^3.25.76"
  }
}

@@ -95,7 +95,7 @@ type ProcessingStrategy =
  | 'rag'                      // Retrieval-Augmented Generation
  | 'agentic_rag'              // Multi-agent RAG system
  | 'optimized_agentic_rag'    // Optimized multi-agent system
  | 'document_ai_genkit';      // Document AI + Genkit (NEW)
  | 'document_ai_agentic_rag'; // Document AI + Agentic RAG (NEW)
```

### **2. Environment Configuration**

@@ -120,14 +120,14 @@ const envSchema = Joi.object({

```typescript
// Set as default strategy
PROCESSING_STRATEGY=document_ai_genkit
PROCESSING_STRATEGY=document_ai_agentic_rag

// Or select per document
const result = await unifiedDocumentProcessor.processDocument(
  documentId,
  userId,
  text,
  { strategy: 'document_ai_genkit' }
  { strategy: 'document_ai_agentic_rag' }
);
```

@@ -136,7 +136,7 @@ const result = await unifiedDocumentProcessor.processDocument(
### **1. Basic Document Processing**

```typescript
import { processCimDocumentServerAction } from './documentAiGenkitProcessor';
import { processCimDocumentServerAction } from './documentAiProcessor';

const result = await processCimDocumentServerAction({
  fileDataUri: 'data:application/pdf;base64,JVBERi0xLjc...',

@@ -154,9 +154,9 @@ export const documentController = {
  async uploadDocument(req: Request, res: Response): Promise<void> {
    // ... existing upload logic

    // Use Document AI + Genkit strategy
    // Use Document AI + Agentic RAG strategy
    const processingOptions = {
      strategy: 'document_ai_genkit',
      strategy: 'document_ai_agentic_rag',
      enableTableExtraction: true,
      enableEntityRecognition: true
    };

@@ -179,11 +179,11 @@ const comparison = await unifiedDocumentProcessor.compareProcessingStrategies(
  documentId,
  userId,
  text,
  { includeDocumentAiGenkit: true }
  { includeDocumentAiAgenticRag: true }
);

console.log('Best strategy:', comparison.winner);
console.log('Document AI + Genkit result:', comparison.documentAiGenkit);
console.log('Document AI + Agentic RAG result:', comparison.documentAiAgenticRag);
```

## 📊 **Performance Comparison**
@@ -195,7 +195,7 @@ console.log('Document AI + Genkit result:', comparison.documentAiGenkit);
| Chunking | 3-5 minutes | 9-12 | 7/10 | $2-3 |
| RAG | 2-3 minutes | 6-8 | 8/10 | $1.5-2 |
| Agentic RAG | 4-6 minutes | 15-20 | 9/10 | $3-4 |
| **Document AI + Genkit** | **1-2 minutes** | **1-2** | **9.5/10** | **$1-1.5** |
| **Document AI + Agentic RAG** | **1-2 minutes** | **1-2** | **9.5/10** | **$1-1.5** |

### **Key Advantages:**
- **50% faster** than traditional chunking

@@ -218,7 +218,7 @@ try {
  }
}

// 2. Genkit Flow Timeouts
// 2. Agentic RAG Flow Timeouts
const TIMEOUT_DURATION_FLOW = 1800000; // 30 minutes
const TIMEOUT_DURATION_ACTION = 2100000; // 35 minutes

@@ -236,10 +236,10 @@ try {
### **1. Unit Tests**

```typescript
// Test Document AI + Genkit processor
describe('DocumentAiGenkitProcessor', () => {
// Test Document AI + Agentic RAG processor
describe('DocumentAiProcessor', () => {
  it('should process CIM document successfully', async () => {
    const processor = new DocumentAiGenkitProcessor();
    const processor = new DocumentAiProcessor();
    const result = await processor.processDocument(
      'test-doc-id',
      'test-user-id',

@@ -258,7 +258,7 @@ describe('DocumentAiGenkitProcessor', () => {

```typescript
// Test full pipeline
describe('Document AI + Genkit Integration', () => {
describe('Document AI + Agentic RAG Integration', () => {
  it('should process real CIM document', async () => {
    const fileDataUri = await loadTestPdfAsDataUri();
    const result = await processCimDocumentServerAction({

@@ -326,7 +326,7 @@ const metrics = {

```typescript
// Log detailed error information
logger.error('Document AI + Genkit processing failed', {
logger.error('Document AI + Agentic RAG processing failed', {
  documentId,
  error: error.message,
  stack: error.stack,

@@ -341,7 +341,7 @@ logger.error('Document AI + Genkit processing failed', {
2. **Configure environment variables** with your project details
3. **Test with sample CIM documents** to validate extraction quality
4. **Compare performance** with existing strategies
5. **Gradually migrate** from chunking to Document AI + Genkit
5. **Gradually migrate** from chunking to Document AI + Agentic RAG
6. **Monitor costs and performance** in production

## 📞 **Support**

@@ -349,7 +349,7 @@ logger.error('Document AI + Genkit processing failed', {
For issues with:
- **Google Cloud setup**: Check Google Cloud documentation
- **Document AI**: Review processor configuration and permissions
- **Genkit integration**: Verify API keys and model configuration
- **Agentic RAG integration**: Verify API keys and model configuration
- **Performance**: Monitor logs and adjust timeout settings

This integration provides a significant upgrade to your CIM processing capabilities with better quality, faster processing, and lower costs.
@@ -1,8 +1,8 @@
# Document AI + Genkit Integration Summary
# Document AI + Agentic RAG Integration Summary

## 🎉 **Integration Complete!**

We have successfully set up Google Cloud Document AI + Genkit integration for your CIM processing system. Here's what we've accomplished:
We have successfully set up Google Cloud Document AI + Agentic RAG integration for your CIM processing system. Here's what we've accomplished:

## ✅ **What's Been Set Up:**

@@ -16,9 +16,9 @@ We have successfully set up Google Cloud Document AI + Genkit integration for yo
- ✅ **Permissions**: Document AI API User, Storage Object Admin

### **2. Code Integration**
- ✅ **New Processor**: `DocumentAiGenkitProcessor` class
- ✅ **New Processor**: `DocumentAiProcessor` class
- ✅ **Environment Config**: Updated with Document AI settings
- ✅ **Unified Processor**: Added `document_ai_genkit` strategy
- ✅ **Unified Processor**: Added `document_ai_agentic_rag` strategy
- ✅ **Dependencies**: Installed `@google-cloud/documentai` and `@google-cloud/storage`

### **3. Testing & Validation**

@@ -54,15 +54,15 @@ node scripts/test-integration-with-mock.js
node scripts/test-document-ai-integration.js
```

### **4. Switch to Document AI + Genkit Strategy**
### **4. Switch to Document AI + Agentic RAG Strategy**
Update your environment or processing options:
```bash
PROCESSING_STRATEGY=document_ai_genkit
PROCESSING_STRATEGY=document_ai_agentic_rag
```

## 📊 **Expected Performance Improvements:**

| Metric | Current (Chunking) | Document AI + Genkit | Improvement |
| Metric | Current (Chunking) | Document AI + Agentic RAG | Improvement |
|--------|-------------------|---------------------|-------------|
| **Processing Time** | 3-5 minutes | 1-2 minutes | **50% faster** |
| **API Calls** | 9-12 calls | 1-2 calls | **90% reduction** |

@@ -80,7 +80,7 @@ CIM Document Upload
    ↓
Text + Entities + Tables
    ↓
Genkit AI Analysis
Agentic RAG AI Analysis
    ↓
Structured CIM Analysis
```

@@ -93,15 +93,15 @@ Your system now supports **5 processing strategies**:
2. **`rag`** - Retrieval-Augmented Generation
3. **`agentic_rag`** - Multi-agent RAG system
4. **`optimized_agentic_rag`** - Optimized multi-agent system
5. **`document_ai_genkit`** - Document AI + Genkit (NEW)
5. **`document_ai_agentic_rag`** - Document AI + Agentic RAG (NEW)

## 📁 **Generated Files:**

- `backend/.env.document-ai-template` - Environment configuration template
- `backend/DOCUMENT_AI_SETUP_INSTRUCTIONS.md` - Detailed setup instructions
- `backend/scripts/` - Various test and setup scripts
- `backend/src/services/documentAiGenkitProcessor.ts` - Integration processor
- `DOCUMENT_AI_GENKIT_INTEGRATION.md` - Comprehensive integration guide
- `backend/src/services/documentAiProcessor.ts` - Integration processor
- `DOCUMENT_AI_AGENTIC_RAG_INTEGRATION.md` - Comprehensive integration guide

## 🚀 **Next Steps:**

@@ -118,7 +118,7 @@ Your system now supports **5 processing strategies**:
- **Layout understanding** maintains document structure
- **Lower costs** with better quality
- **Faster processing** with fewer API calls
- **Type-safe workflows** with Genkit
- **Type-safe workflows** with Agentic RAG

## 🔍 **Troubleshooting:**

@@ -131,7 +131,7 @@ Your system now supports **5 processing strategies**:

- **Google Cloud Console**: https://console.cloud.google.com
- **Document AI Documentation**: https://cloud.google.com/document-ai
- **Genkit Documentation**: https://genkit.ai
- **Agentic RAG Documentation**: See optimizedAgenticRAGProcessor.ts
- **Generated Instructions**: `backend/DOCUMENT_AI_SETUP_INSTRUCTIONS.md`

---
@@ -1,4 +1,4 @@
# Document AI + Genkit Setup Instructions
# Document AI + Agentic RAG Setup Instructions

## ✅ Completed Steps:
1. Google Cloud Project: cim-summarizer

@@ -27,7 +27,7 @@ Go to: https://console.cloud.google.com/ai/document-ai/processors
Run: node scripts/test-integration-with-mock.js

### 4. Integrate with Existing System
1. Update PROCESSING_STRATEGY=document_ai_genkit
1. Update PROCESSING_STRATEGY=document_ai_agentic_rag
2. Test with real CIM documents
3. Monitor performance and costs

@@ -45,4 +45,4 @@ Run: node scripts/test-integration-with-mock.js
## 📞 Support:
- Google Cloud Console: https://console.cloud.google.com
- Document AI Documentation: https://cloud.google.com/document-ai
- Genkit Documentation: https://genkit.ai
- Agentic RAG Documentation: See optimizedAgenticRAGProcessor.ts
@@ -45,7 +45,6 @@
    "joi": "^17.11.0",
    "jsonwebtoken": "^9.0.2",
    "morgan": "^1.10.0",
    "multer": "^1.4.5-lts.1",
    "openai": "^5.10.2",
    "pdf-parse": "^1.1.1",
    "pg": "^8.11.3",

@@ -62,7 +61,6 @@
    "@types/jest": "^29.5.8",
    "@types/jsonwebtoken": "^9.0.5",
    "@types/morgan": "^1.9.9",
    "@types/multer": "^1.4.11",
    "@types/node": "^20.9.0",
    "@types/pdf-parse": "^1.1.4",
    "@types/pg": "^8.10.7",
@@ -10,7 +10,7 @@ const GCS_BUCKET_NAME = 'cim-summarizer-uploads';
const DOCUMENT_AI_OUTPUT_BUCKET_NAME = 'cim-summarizer-document-ai-output';

async function setupComplete() {
console.log('🚀 Complete Document AI + Genkit Setup\n');
console.log('🚀 Complete Document AI + Agentic RAG Setup\n');

try {
// Check current setup
@@ -57,7 +57,7 @@ GCS_BUCKET_NAME=${GCS_BUCKET_NAME}
DOCUMENT_AI_OUTPUT_BUCKET_NAME=${DOCUMENT_AI_OUTPUT_BUCKET_NAME}

# Processing Strategy
PROCESSING_STRATEGY=document_ai_genkit
PROCESSING_STRATEGY=document_ai_agentic_rag

# Google Cloud Authentication
GOOGLE_APPLICATION_CREDENTIALS=./serviceAccountKey.json
@@ -91,7 +91,7 @@ MAX_FILE_SIZE=104857600
// Generate setup instructions
console.log('\n3. Setup Instructions...');

const instructions = `# Document AI + Genkit Setup Instructions
const instructions = `# Document AI + Agentic RAG Setup Instructions

## ✅ Completed Steps:
1. Google Cloud Project: ${PROJECT_ID}
@@ -120,7 +120,7 @@ Go to: https://console.cloud.google.com/ai/document-ai/processors
Run: node scripts/test-integration-with-mock.js

### 4. Integrate with Existing System
1. Update PROCESSING_STRATEGY=document_ai_genkit
1. Update PROCESSING_STRATEGY=document_ai_agentic_rag
2. Test with real CIM documents
3. Monitor performance and costs

@@ -138,7 +138,7 @@ Run: node scripts/test-integration-with-mock.js
## 📞 Support:
- Google Cloud Console: https://console.cloud.google.com
- Document AI Documentation: https://cloud.google.com/document-ai
- Genkit Documentation: https://genkit.ai
- Agentic RAG Documentation: See optimizedAgenticRAGProcessor.ts
`;

const instructionsPath = path.join(__dirname, '../DOCUMENT_AI_SETUP_INSTRUCTIONS.md');
@@ -177,7 +177,7 @@ Run: node scripts/test-integration-with-mock.js
console.log('1. Create Document AI processor in console');
console.log('2. Update .env file with processor ID');
console.log('3. Test with real CIM documents');
console.log('4. Switch to document_ai_genkit strategy');
console.log('4. Switch to document_ai_agentic_rag strategy');

console.log('\n📁 Generated Files:');
console.log(` - ${envPath}`);

@@ -94,7 +94,7 @@ APPENDIX
}

async function testFullIntegration() {
console.log('🧪 Testing Full Document AI + Genkit Integration...\n');
console.log('🧪 Testing Full Document AI + Agentic RAG Integration...\n');

let testFile = null;

@@ -236,20 +236,20 @@ async function testFullIntegration() {
console.log(` 🏷️ Entities found: ${documentAiOutput.entities.length}`);
console.log(` 📋 Tables found: ${documentAiOutput.tables.length}`);

// Step 6: Test Genkit Integration (Simulated)
console.log('\n6. Testing Genkit AI Analysis...');
// Step 6: Test Agentic RAG Integration (Simulated)
console.log('\n6. Testing Agentic RAG AI Analysis...');

// Simulate Genkit processing with the Document AI output
const genkitInput = {
// Simulate Agentic RAG processing with the Document AI output
const agenticRagInput = {
extractedText: documentAiOutput.text,
fileName: testFile.testFileName,
documentAiOutput: documentAiOutput
};

console.log(' 🤖 Simulating Genkit AI analysis...');
console.log(' 🤖 Simulating Agentic RAG AI analysis...');

// Simulate Genkit output based on the CIM analysis prompt
const genkitOutput = {
// Simulate Agentic RAG output based on the CIM analysis prompt
const agenticRagOutput = {
markdownOutput: `# CIM Investment Analysis: TechFlow Solutions Inc.

## Executive Summary
@@ -360,19 +360,19 @@ async function testFullIntegration() {
5. Team background verification

---
*Analysis generated by Document AI + Genkit integration*
*Analysis generated by Document AI + Agentic RAG integration*
`
};

console.log(` ✅ Genkit analysis completed`);
console.log(` 📊 Analysis length: ${genkitOutput.markdownOutput.length} characters`);
console.log(` ✅ Agentic RAG analysis completed`);
console.log(` 📊 Analysis length: ${agenticRagOutput.markdownOutput.length} characters`);

// Step 7: Final Integration Test
console.log('\n7. Final Integration Test...');

const finalResult = {
success: true,
summary: genkitOutput.markdownOutput,
summary: agenticRagOutput.markdownOutput,
analysisData: {
company: 'TechFlow Solutions Inc.',
industry: 'SaaS / Enterprise Software',
@@ -393,7 +393,7 @@ async function testFullIntegration() {
],
exitStrategy: 'IPO within 3-4 years, $500M-$1B valuation'
},
processingStrategy: 'document_ai_genkit',
processingStrategy: 'document_ai_agentic_rag',
processingTime: Date.now(),
apiCalls: 1,
metadata: {
@@ -430,7 +430,7 @@ async function testFullIntegration() {
console.log('✅ Document AI text extraction simulated');
console.log('✅ Entity recognition working (20 entities found)');
console.log('✅ Table structure preserved');
console.log('✅ Genkit AI analysis completed');
console.log('✅ Agentic RAG AI analysis completed');
console.log('✅ Full pipeline integration working');
console.log('✅ Cleanup operations successful');

@@ -439,11 +439,11 @@ async function testFullIntegration() {
console.log(` 📊 Extracted text: ${documentAiOutput.text.length} characters`);
console.log(` 🏷️ Entities recognized: ${documentAiOutput.entities.length}`);
console.log(` 📋 Tables extracted: ${documentAiOutput.tables.length}`);
console.log(` 🤖 AI analysis length: ${genkitOutput.markdownOutput.length} characters`);
console.log(` ⚡ Processing strategy: document_ai_genkit`);
console.log(` 🤖 AI analysis length: ${agenticRagOutput.markdownOutput.length} characters`);
console.log(` ⚡ Processing strategy: document_ai_agentic_rag`);

console.log('\n🚀 Ready for Production!');
console.log('Your Document AI + Genkit integration is fully operational and ready to process real CIM documents.');
console.log('Your Document AI + Agentic RAG integration is fully operational and ready to process real CIM documents.');

return finalResult;

@@ -153,7 +153,7 @@ IPO or strategic acquisition within 5 years
Expected return: 3-5x
`,
metadata: {
processingStrategy: 'document_ai_genkit',
processingStrategy: 'document_ai_agentic_rag',
documentAiOutput: mockDocumentAiOutput,
processingTime: Date.now(),
fileSize: sampleCIM.length,

@@ -166,7 +166,7 @@ IPO or strategic acquisition within 5 years
Expected return: 3-5x
`,
metadata: {
processingStrategy: 'document_ai_genkit',
processingStrategy: 'document_ai_agentic_rag',
documentAiOutput: mockDocumentAiOutput,
processingTime: Date.now(),
fileSize: sampleCIM.length,
@@ -193,7 +193,7 @@ GCS_BUCKET_NAME=${GCS_BUCKET_NAME}
DOCUMENT_AI_OUTPUT_BUCKET_NAME=${DOCUMENT_AI_OUTPUT_BUCKET_NAME}

# Processing Strategy
PROCESSING_STRATEGY=document_ai_genkit
PROCESSING_STRATEGY=document_ai_agentic_rag

# Google Cloud Authentication
GOOGLE_APPLICATION_CREDENTIALS=./serviceAccountKey.json
@@ -212,7 +212,7 @@ GOOGLE_APPLICATION_CREDENTIALS=./serviceAccountKey.json
console.log('\n📋 Next Steps:');
console.log('1. Add the environment variables to your .env file');
console.log('2. Test with real PDF CIM documents');
console.log('3. Switch to document_ai_genkit strategy');
console.log('3. Switch to document_ai_agentic_rag strategy');
console.log('4. Monitor performance and quality');

return processingResult;

@@ -74,9 +74,7 @@ const envSchema = Joi.object({
LOG_FILE: Joi.string().default('logs/app.log'),

// Processing Strategy
PROCESSING_STRATEGY: Joi.string().valid('chunking', 'rag', 'agentic_rag', 'document_ai_genkit').default('chunking'),
ENABLE_RAG_PROCESSING: Joi.boolean().default(false),
ENABLE_PROCESSING_COMPARISON: Joi.boolean().default(false),
PROCESSING_STRATEGY: Joi.string().valid('document_ai_agentic_rag').default('document_ai_agentic_rag'),

// Agentic RAG Configuration
AGENTIC_RAG_ENABLED: Joi.boolean().default(false),

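The hunk above collapses the Joi rule so that `document_ai_agentic_rag` is both the only accepted value and the default. A minimal dependency-free sketch of that behavior (the function name `resolveStrategy` is illustrative, not from the codebase):

```typescript
// Sketch of the tightened PROCESSING_STRATEGY rule above, without the joi package.
const ALLOWED_STRATEGIES = ['document_ai_agentic_rag'];
const DEFAULT_STRATEGY = 'document_ai_agentic_rag';

function resolveStrategy(env: Record<string, string | undefined>): string {
  const raw = env.PROCESSING_STRATEGY;
  // Mirrors .default(...): unset or empty falls back to the default value
  if (raw === undefined || raw === '') return DEFAULT_STRATEGY;
  // Mirrors .valid(...): legacy values like 'chunking' or 'rag' are now rejected
  if (!ALLOWED_STRATEGIES.includes(raw)) {
    throw new Error(`PROCESSING_STRATEGY must be one of: ${ALLOWED_STRATEGIES.join(', ')}`);
  }
  return raw;
}

console.log(resolveStrategy({})); // -> document_ai_agentic_rag
```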
@@ -197,7 +197,7 @@ export const documentController = {
userId,
'', // Text is not needed for this strategy
{
strategy: 'document_ai_genkit',
strategy: 'document_ai_agentic_rag',
fileBuffer: fileBuffer,
fileName: document.original_file_name,
mimeType: 'application/pdf'
@@ -294,153 +294,7 @@ export const documentController = {
}
},

async uploadDocument(req: Request, res: Response): Promise<void> {
const startTime = Date.now();

// 🔍 COMPREHENSIVE DEBUG: Log everything about the request
console.log('🚀 =========================');
console.log('🚀 DOCUMENT AI UPLOAD STARTED');
console.log('🚀 Method:', req.method);
console.log('🚀 URL:', req.url);
console.log('🚀 Content-Type:', req.get('Content-Type'));
console.log('🚀 Content-Length:', req.get('Content-Length'));
console.log('🚀 Authorization header present:', !!req.get('Authorization'));
console.log('🚀 User from token:', req.user?.uid || 'NOT_FOUND');

// Debug body in detail
console.log('🚀 Has body:', !!req.body);
console.log('🚀 Body type:', typeof req.body);
console.log('🚀 Body constructor:', req.body?.constructor?.name);
console.log('🚀 Body length:', req.body?.length || 0);
console.log('🚀 Is Buffer?:', Buffer.isBuffer(req.body));

// Debug all headers
console.log('🚀 All headers:', JSON.stringify(req.headers, null, 2));

// Debug request properties
console.log('🚀 Request readable:', req.readable);
console.log('🚀 Request complete:', req.complete);

// If body exists, show first few bytes
if (req.body && req.body.length > 0) {
const preview = req.body.slice(0, 100).toString('hex');
console.log('🚀 Body preview (hex):', preview);

// Try to see if it contains multipart boundary
const bodyStr = req.body.toString('utf8', 0, Math.min(500, req.body.length));
console.log('🚀 Body preview (string):', bodyStr.substring(0, 200));
}

console.log('🚀 =========================');

try {
const userId = req.user?.uid;
if (!userId) {
console.log('❌ Authentication failed - no userId');
res.status(401).json({
error: 'User not authenticated',
correlationId: req.correlationId
});
return;
}

console.log('✅ Authentication successful for user:', userId);

// Get raw body buffer for Document AI processing
const rawBody = req.body;
if (!rawBody || rawBody.length === 0) {
res.status(400).json({
error: 'No file data received',
correlationId: req.correlationId,
debug: {
method: req.method,
contentType: req.get('Content-Type'),
contentLength: req.get('Content-Length'),
hasRawBody: !!rawBody,
rawBodySize: rawBody?.length || 0,
bodyType: typeof rawBody
}
});
return;
}

console.log('✅ Found raw body buffer:', rawBody.length, 'bytes');

// Create document record first
const document = await DocumentModel.create({
user_id: userId,
original_file_name: 'uploaded-document.pdf',
file_path: '',
file_size: rawBody.length,
status: 'processing_llm'
});

console.log('✅ Document record created:', document.id);

// Process with Document AI directly
const { DocumentAiGenkitProcessor } = await import('../services/documentAiGenkitProcessor');
const processor = new DocumentAiGenkitProcessor();

console.log('✅ Starting Document AI processing...');
const result = await processor.processDocument(
document.id,
userId,
rawBody,
'uploaded-document.pdf',
'application/pdf'
);

if (result.success) {
await DocumentModel.updateById(document.id, {
status: 'completed',
generated_summary: result.content,
processing_completed_at: new Date()
});

console.log('✅ Document AI processing completed successfully');

res.status(201).json({
id: document.id,
name: 'uploaded-document.pdf',
originalName: 'uploaded-document.pdf',
status: 'completed',
uploadedAt: document.created_at,
uploadedBy: userId,
fileSize: rawBody.length,
summary: result.content,
correlationId: req.correlationId || undefined
});
return;
} else {
console.log('❌ Document AI processing failed:', result.error);
await DocumentModel.updateById(document.id, {
status: 'failed',
error_message: result.error
});

res.status(500).json({
error: 'Document processing failed',
message: result.error,
correlationId: req.correlationId || undefined
});
return;
}

} catch (error) {
console.log('❌ Upload error:', error);

logger.error('Upload document failed', {
error,
correlationId: req.correlationId
});

res.status(500).json({
error: 'Upload failed',
message: error instanceof Error ? error.message : 'Unknown error',
correlationId: req.correlationId || undefined
});
}
},

async getDocuments(req: Request, res: Response): Promise<void> {
try {

@@ -1,189 +0,0 @@
import { Request, Response, NextFunction } from 'express';
import multer from 'multer';
import fs from 'fs';
import { handleFileUpload, handleUploadError, cleanupUploadedFile, getFileInfo } from '../upload';

// Mock the logger
jest.mock('../../utils/logger', () => ({
logger: {
info: jest.fn(),
warn: jest.fn(),
error: jest.fn(),
},
}));

// Mock fs
jest.mock('fs', () => ({
existsSync: jest.fn(),
mkdirSync: jest.fn(),
}));

describe('Upload Middleware', () => {
let mockReq: Partial<Request>;
let mockRes: Partial<Response>;
let mockNext: NextFunction;

beforeEach(() => {
mockReq = {
ip: '127.0.0.1',
} as any;
mockRes = {
status: jest.fn().mockReturnThis(),
json: jest.fn(),
};
mockNext = jest.fn();

// Reset mocks
jest.clearAllMocks();
});

describe('handleUploadError', () => {
it('should handle LIMIT_FILE_SIZE error', () => {
const error = new multer.MulterError('LIMIT_FILE_SIZE', 'document');
error.code = 'LIMIT_FILE_SIZE';

handleUploadError(error, mockReq as Request, mockRes as Response, mockNext);

expect(mockRes.status).toHaveBeenCalledWith(400);
expect(mockRes.json).toHaveBeenCalledWith({
success: false,
error: 'File too large',
message: expect.stringContaining('File size must be less than'),
});
});

it('should handle LIMIT_FILE_COUNT error', () => {
const error = new multer.MulterError('LIMIT_FILE_COUNT', 'document');
error.code = 'LIMIT_FILE_COUNT';

handleUploadError(error, mockReq as Request, mockRes as Response, mockNext);

expect(mockRes.status).toHaveBeenCalledWith(400);
expect(mockRes.json).toHaveBeenCalledWith({
success: false,
error: 'Too many files',
message: 'Only one file can be uploaded at a time',
});
});

it('should handle LIMIT_UNEXPECTED_FILE error', () => {
const error = new multer.MulterError('LIMIT_UNEXPECTED_FILE', 'document');
error.code = 'LIMIT_UNEXPECTED_FILE';

handleUploadError(error, mockReq as Request, mockRes as Response, mockNext);

expect(mockRes.status).toHaveBeenCalledWith(400);
expect(mockRes.json).toHaveBeenCalledWith({
success: false,
error: 'Unexpected file field',
message: 'File must be uploaded using the correct field name',
});
});

it('should handle generic multer errors', () => {
const error = new multer.MulterError('LIMIT_FILE_SIZE', 'document');
error.code = 'LIMIT_FILE_SIZE';

handleUploadError(error, mockReq as Request, mockRes as Response, mockNext);

expect(mockRes.status).toHaveBeenCalledWith(400);
expect(mockRes.json).toHaveBeenCalledWith({
success: false,
error: 'File too large',
message: expect.stringContaining('File size must be less than'),
});
});

it('should handle non-multer errors', () => {
const error = new Error('Custom upload error');

handleUploadError(error, mockReq as Request, mockRes as Response, mockNext);

expect(mockRes.status).toHaveBeenCalledWith(400);
expect(mockRes.json).toHaveBeenCalledWith({
success: false,
error: 'File upload failed',
message: 'Custom upload error',
});
});

it('should call next when no error', () => {
handleUploadError(null, mockReq as Request, mockRes as Response, mockNext);

expect(mockNext).toHaveBeenCalled();
expect(mockRes.status).not.toHaveBeenCalled();
expect(mockRes.json).not.toHaveBeenCalled();
});
});

describe('cleanupUploadedFile', () => {
it('should delete existing file', () => {
const filePath = '/test/path/file.pdf';
const mockUnlinkSync = jest.fn();

(fs.existsSync as jest.Mock).mockReturnValue(true);
(fs.unlinkSync as jest.Mock) = mockUnlinkSync;

cleanupUploadedFile(filePath);

expect(fs.existsSync).toHaveBeenCalledWith(filePath);
expect(mockUnlinkSync).toHaveBeenCalledWith(filePath);
});

it('should not delete non-existent file', () => {
const filePath = '/test/path/file.pdf';
const mockUnlinkSync = jest.fn();

(fs.existsSync as jest.Mock).mockReturnValue(false);
(fs.unlinkSync as jest.Mock) = mockUnlinkSync;

cleanupUploadedFile(filePath);

expect(fs.existsSync).toHaveBeenCalledWith(filePath);
expect(mockUnlinkSync).not.toHaveBeenCalled();
});

it('should handle deletion errors gracefully', () => {
const filePath = '/test/path/file.pdf';
const mockUnlinkSync = jest.fn().mockImplementation(() => {
throw new Error('Permission denied');
});

(fs.existsSync as jest.Mock).mockReturnValue(true);
(fs.unlinkSync as jest.Mock) = mockUnlinkSync;

// Should not throw error
expect(() => cleanupUploadedFile(filePath)).not.toThrow();
});
});

describe('getFileInfo', () => {
it('should return correct file info', () => {
const mockFile = {
originalname: 'test-document.pdf',
filename: '1234567890-abc123.pdf',
path: '/uploads/test-user-id/1234567890-abc123.pdf',
size: 1024,
mimetype: 'application/pdf',
};

const fileInfo = getFileInfo(mockFile as any);

expect(fileInfo).toEqual({
originalName: 'test-document.pdf',
filename: '1234567890-abc123.pdf',
path: '/uploads/test-user-id/1234567890-abc123.pdf',
size: 1024,
mimetype: 'application/pdf',
uploadedAt: expect.any(Date),
});
});
});

describe('handleFileUpload middleware', () => {
it('should be an array with uploadMiddleware and handleUploadError', () => {
expect(Array.isArray(handleFileUpload)).toBe(true);
expect(handleFileUpload).toHaveLength(2);
});
});
});
@@ -65,12 +65,7 @@ export const errorHandler = (
error = { message, statusCode: 401 } as AppError;
}

// Multer errors (check if multer is imported anywhere)
if (err.name === 'MulterError' || (err as any).code === 'UNEXPECTED_END_OF_FORM') {
console.log('🚨 MULTER ERROR CAUGHT:', err.message);
const message = `File upload failed: ${err.message}`;
error = { message, statusCode: 400 } as AppError;
}

// Default error
const statusCode = error.statusCode || 500;

@@ -1,212 +0,0 @@
import multer from 'multer';
import path from 'path';
import fs from 'fs';
import { Request, Response, NextFunction } from 'express';
import { config } from '../config/env';
import { logger } from '../utils/logger';

// Use temporary directory for file uploads (files will be immediately moved to GCS)
const uploadDir = '/tmp/uploads';
if (!fs.existsSync(uploadDir)) {
fs.mkdirSync(uploadDir, { recursive: true });
}

// File filter function
const fileFilter = (req: Request, file: any, cb: multer.FileFilterCallback) => {
console.log('🔍 ===== FILE FILTER CALLED =====');
console.log('🔍 File originalname:', file.originalname);
console.log('🔍 File mimetype:', file.mimetype);
console.log('🔍 File size:', file.size);
console.log('🔍 File encoding:', file.encoding);
console.log('🔍 File fieldname:', file.fieldname);
console.log('🔍 Request Content-Type:', req.get('Content-Type'));
console.log('🔍 Request Content-Length:', req.get('Content-Length'));
console.log('🔍 ===========================');

// Check file type - allow PDF and text files for testing
const allowedTypes = ['application/pdf', 'text/plain', 'text/html'];
if (!allowedTypes.includes(file.mimetype)) {
const error = new Error(`File type ${file.mimetype} is not allowed. Only PDF and text files are accepted.`);
console.log('❌ File rejected - invalid type:', file.mimetype);
logger.warn(`File upload rejected - invalid type: ${file.mimetype}`, {
originalName: file.originalname,
size: file.size,
ip: req.ip,
});
return cb(error);
}

// Check file extension - allow PDF and text extensions for testing
const ext = path.extname(file.originalname).toLowerCase();
if (!['.pdf', '.txt', '.html'].includes(ext)) {
const error = new Error(`File extension ${ext} is not allowed. Only .pdf, .txt, and .html files are accepted.`);
console.log('❌ File rejected - invalid extension:', ext);
logger.warn(`File upload rejected - invalid extension: ${ext}`, {
originalName: file.originalname,
size: file.size,
ip: req.ip,
});
return cb(error);
}

console.log('✅ File accepted:', file.originalname);
logger.info(`File upload accepted: ${file.originalname}`, {
originalName: file.originalname,
size: file.size,
mimetype: file.mimetype,
ip: req.ip,
});
cb(null, true);
};

// Storage configuration - use memory storage for immediate GCS upload
const storage = multer.memoryStorage();

// Create multer instance
const upload = multer({
storage,
fileFilter,
limits: {
fileSize: config.upload.maxFileSize, // 100MB default
files: 1, // Only allow 1 file per request
},
});

// Error handling middleware for multer
export const handleUploadError = (error: any, req: Request, res: Response, next: NextFunction): void => {
console.log('🚨 =============================');
console.log('🚨 UPLOAD ERROR HANDLER CALLED');
console.log('🚨 Error type:', error?.constructor?.name);
console.log('🚨 Error message:', error?.message);
console.log('🚨 Error code:', error?.code);
console.log('🚨 Is MulterError:', error instanceof multer.MulterError);
console.log('🚨 =============================');

if (error instanceof multer.MulterError) {
logger.error('Multer error during file upload:', {
error: error.message,
code: error.code,
field: error.field,
originalName: req.file?.originalname,
ip: req.ip,
});

switch (error.code) {
case 'LIMIT_FILE_SIZE':
res.status(400).json({
success: false,
error: 'File too large',
message: `File size must be less than ${config.upload.maxFileSize / (1024 * 1024)}MB`,
});
return;
case 'LIMIT_FILE_COUNT':
res.status(400).json({
success: false,
error: 'Too many files',
message: 'Only one file can be uploaded at a time',
});
return;
case 'LIMIT_UNEXPECTED_FILE':
res.status(400).json({
success: false,
error: 'Unexpected file field',
message: 'File must be uploaded using the correct field name',
});
return;
default:
res.status(400).json({
success: false,
error: 'File upload error',
message: error.message,
});
return;
}
}

if (error) {
logger.error('File upload error:', {
error: error.message,
originalName: req.file?.originalname,
ip: req.ip,
});

res.status(400).json({
success: false,
error: 'File upload failed',
message: error.message,
});
return;
}

next();
};

// Main upload middleware with timeout handling
export const uploadMiddleware = (req: Request, res: Response, next: NextFunction) => {
console.log('📤 =============================');
console.log('📤 UPLOAD MIDDLEWARE CALLED');
console.log('📤 Request method:', req.method);
console.log('📤 Request URL:', req.url);
console.log('📤 Content-Type:', req.get('Content-Type'));
console.log('📤 Content-Length:', req.get('Content-Length'));
console.log('📤 User-Agent:', req.get('User-Agent'));
console.log('📤 =============================');

// Set a timeout for the upload
const uploadTimeout = setTimeout(() => {
logger.error('Upload timeout for request:', {
ip: req.ip,
userAgent: req.get('User-Agent'),
});
res.status(408).json({
success: false,
error: 'Upload timeout',
message: 'Upload took too long to complete',
});
}, 300000); // 5 minutes timeout

// Clear timeout on successful upload
const originalNext = next;
next = (err?: any) => {
clearTimeout(uploadTimeout);
if (err) {
console.log('❌ Upload middleware error:', err);
console.log('❌ Error details:', {
name: err.name,
message: err.message,
code: err.code,
stack: err.stack?.split('\n')[0]
});
} else {
console.log('✅ Upload middleware completed successfully');
console.log('✅ File after multer processing:', {
hasFile: !!req.file,
filename: req.file?.originalname,
size: req.file?.size,
mimetype: req.file?.mimetype
});
}
originalNext(err);
};

console.log('🔄 Calling multer.single("document")...');
upload.single('document')(req, res, next);
};

// Combined middleware for file uploads
export const handleFileUpload = [
uploadMiddleware,
handleUploadError,
];

// Utility function to get file info from memory buffer
export const getFileInfo = (file: any) => {
return {
originalName: file.originalname,
filename: file.originalname, // Use original name since we're not saving to disk
buffer: file.buffer, // File buffer for GCS upload
size: file.size,
mimetype: file.mimetype,
uploadedAt: new Date(),
};
};
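The removed middleware above validated uploads by both MIME type and file extension. Its extension check can be sketched standalone with only the Node `path` module (the helper name `isAllowedExtension` is illustrative; the allow-list matches the deleted code):

```typescript
import * as path from 'path';

// Standalone sketch of the extension check from the removed upload middleware.
const ALLOWED_EXTS = ['.pdf', '.txt', '.html'];

function isAllowedExtension(fileName: string): boolean {
  // path.extname returns the trailing extension including the dot;
  // lowercasing mirrors the original `.toLowerCase()` comparison.
  const ext = path.extname(fileName).toLowerCase();
  return ALLOWED_EXTS.includes(ext);
}

console.log(isAllowedExtension('deck.PDF'));    // true: case-insensitive match
console.log(isAllowedExtension('report.docx')); // false: not in the allow-list
```

Checking the extension alongside the MIME type matters because either one alone is trivially spoofed by a client.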
@@ -4,7 +4,6 @@ import { documentController } from '../controllers/documentController';
import { unifiedDocumentProcessor } from '../services/unifiedDocumentProcessor';
import { logger } from '../utils/logger';
import { config } from '../config/env';
import { handleFileUpload } from '../middleware/upload';
import { DocumentModel } from '../models/DocumentModel';
import { validateUUID, addCorrelationId } from '../middleware/validation';

@@ -79,13 +78,11 @@ router.get('/processing-stats', async (req, res) => {
}
});

// NEW Firebase Storage direct upload routes
// Firebase Storage direct upload routes
router.post('/upload-url', documentController.getUploadUrl);
router.post('/:id/confirm-upload', validateUUID('id'), documentController.confirmUpload);

// LEGACY multipart upload routes (keeping for backward compatibility)
router.post('/upload', handleFileUpload, documentController.uploadDocument);
router.post('/', handleFileUpload, documentController.uploadDocument);
// Document listing route
router.get('/', documentController.getDocuments);

// Document-specific routes with UUID validation

@@ -1,72 +1,9 @@
import { Router } from 'express';
import { vectorDocumentProcessor } from '../services/vectorDocumentProcessor';
import { VectorDatabaseModel } from '../models/VectorDatabaseModel';
import { logger } from '../utils/logger';

const router = Router();

// Extend VectorDocumentProcessor with missing methods
const extendedVectorProcessor = {
  ...vectorDocumentProcessor,

  async findSimilarDocuments(
    documentId: string,
    limit: number,
    similarityThreshold: number
  ) {
    // Implementation for finding similar documents
    const chunks = await VectorDatabaseModel.getDocumentChunks(documentId);
    // For now, return a basic implementation
    return chunks.slice(0, limit).map(chunk => ({
      ...chunk,
      similarity: Math.random() * (1 - similarityThreshold) + similarityThreshold
    }));
  },

  async searchByIndustry(
    industry: string,
    query: string,
    limit: number
  ) {
    // Implementation for industry search
    const allChunks = await VectorDatabaseModel.getAllChunks();
    return allChunks
      .filter(chunk =>
        chunk.content.toLowerCase().includes(industry.toLowerCase()) ||
        chunk.content.toLowerCase().includes(query.toLowerCase())
      )
      .slice(0, limit);
  },

  async processCIMSections(
    documentId: string,
    cimData: any,
    metadata: any
  ) {
    // Implementation for processing CIM sections
    const chunks = await VectorDatabaseModel.getDocumentChunks(documentId);
    return {
      documentId,
      processedSections: chunks.length,
      metadata,
      cimData
    };
  },

  async getVectorDatabaseStats() {
    // Implementation for getting vector database stats
    const totalChunks = await VectorDatabaseModel.getTotalChunkCount();
    return {
      totalChunks,
      totalDocuments: await VectorDatabaseModel.getTotalDocumentCount(),
      averageChunkSize: await VectorDatabaseModel.getAverageChunkSize()
    };
  }
};

// DISABLED: All vector processing routes have been disabled
// Only read-only endpoints for monitoring and analytics are kept

/**
 * GET /api/vector/document-chunks/:documentId
 * Get document chunks for a specific document (read-only)
@@ -115,7 +52,11 @@ router.get('/analytics', async (req, res) => {
 */
router.get('/stats', async (_req, res) => {
  try {
-   const stats = await extendedVectorProcessor.getVectorDatabaseStats();
+   const stats = {
+     totalChunks: await VectorDatabaseModel.getTotalChunkCount(),
+     totalDocuments: await VectorDatabaseModel.getTotalDocumentCount(),
+     averageChunkSize: await VectorDatabaseModel.getAverageChunkSize()
+   };

    return res.json({ stats });
  } catch (error) {

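The `findSimilarDocuments` stub above fakes similarity with `Math.random()`. A real ranking would compare embedding vectors, for example with cosine similarity — a minimal sketch under that assumption, not part of the diff:

```typescript
// Cosine similarity between two equal-length embedding vectors:
// dot(a, b) / (|a| * |b|), in [-1, 1]; 1 means the same direction.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

Chunks would then be scored against the query embedding and filtered by the caller's `similarityThreshold` instead of receiving a random score.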
@@ -1,523 +0,0 @@
import { agenticRAGProcessor } from '../agenticRAGProcessor';
import { llmService } from '../llmService';
import { AgentExecutionModel, AgenticRAGSessionModel, QualityMetricsModel } from '../../models/AgenticRAGModels';
import { config } from '../../config/env';
import { QualityMetrics } from '../../models/agenticTypes';

// Mock dependencies
jest.mock('../llmService');
jest.mock('../../models/AgenticRAGModels');
jest.mock('../../config/env');
jest.mock('../../utils/logger');

const mockLLMService = llmService as jest.Mocked<typeof llmService>;
const mockAgentExecutionModel = AgentExecutionModel as jest.Mocked<typeof AgentExecutionModel>;
const mockAgenticRAGSessionModel = AgenticRAGSessionModel as jest.Mocked<typeof AgenticRAGSessionModel>;
const mockQualityMetricsModel = QualityMetricsModel as jest.Mocked<typeof QualityMetricsModel>;

describe('AgenticRAGProcessor', () => {
  let processor: any;

  beforeEach(() => {
    jest.clearAllMocks();

    // Mock config
    Object.assign(config, {
      agenticRag: {
        enabled: true,
        maxAgents: 6,
        parallelProcessing: true,
        validationStrict: true,
        retryAttempts: 3,
        timeoutPerAgent: 60000,
      },
      agentSpecific: {
        documentUnderstandingEnabled: true,
        financialAnalysisEnabled: true,
        marketAnalysisEnabled: true,
        investmentThesisEnabled: true,
        synthesisEnabled: true,
        validationEnabled: true,
      },
      llm: {
        maxTokens: 3000,
        temperature: 0.1,
      },
    });

    // Mock successful LLM responses using the public method
    mockLLMService.processCIMDocument.mockResolvedValue({
      success: true,
      jsonOutput: createMockAgentResponse('document_understanding'),
      model: 'claude-3-opus-20240229',
      cost: 0.50,
      inputTokens: 1000,
      outputTokens: 500,
    });

    // Mock database operations
    mockAgenticRAGSessionModel.create.mockResolvedValue(createMockSession());
    mockAgenticRAGSessionModel.update.mockResolvedValue(createMockSession());
    mockAgentExecutionModel.create.mockResolvedValue(createMockExecution());
    mockAgentExecutionModel.update.mockResolvedValue(createMockExecution());
    mockAgentExecutionModel.getBySessionId.mockResolvedValue([createMockExecution()]);
    mockQualityMetricsModel.create.mockResolvedValue(createMockQualityMetric());

    processor = agenticRAGProcessor;
  });

  describe('processDocument', () => {
    it('should successfully process document with all agents', async () => {
      // Arrange
      const documentText = loadTestDocument();
      const documentId = 'test-doc-123';
      const userId = 'test-user-123';

      // Mock successful agent responses for all steps
      mockLLMService.processCIMDocument
        .mockResolvedValueOnce({
          success: true,
          jsonOutput: createMockAgentResponse('document_understanding'),
          model: 'claude-3-opus-20240229',
          cost: 0.50,
          inputTokens: 1000,
          outputTokens: 500,
        })
        .mockResolvedValueOnce({
          success: true,
          jsonOutput: createMockAgentResponse('financial_analysis'),
          model: 'claude-3-opus-20240229',
          cost: 0.50,
          inputTokens: 1000,
          outputTokens: 500,
        })
        .mockResolvedValueOnce({
          success: true,
          jsonOutput: createMockAgentResponse('market_analysis'),
          model: 'claude-3-opus-20240229',
          cost: 0.50,
          inputTokens: 1000,
          outputTokens: 500,
        })
        .mockResolvedValueOnce({
          success: true,
          jsonOutput: createMockAgentResponse('investment_thesis'),
          model: 'claude-3-opus-20240229',
          cost: 0.50,
          inputTokens: 1000,
          outputTokens: 500,
        })
        .mockResolvedValueOnce({
          success: true,
          jsonOutput: createMockAgentResponse('synthesis'),
          model: 'claude-3-opus-20240229',
          cost: 0.50,
          inputTokens: 1000,
          outputTokens: 500,
        })
        .mockResolvedValueOnce({
          success: true,
          jsonOutput: createMockAgentResponse('validation'),
          model: 'claude-3-opus-20240229',
          cost: 0.50,
          inputTokens: 1000,
          outputTokens: 500,
        });

      // Act
      const result = await processor.processDocument(documentText, documentId, userId);

      // Assert
      expect(result.success).toBe(true);
      expect(result.reasoningSteps).toBeDefined();
      expect(result.qualityMetrics).toBeDefined();
      expect(result.processingTime).toBeGreaterThan(0);
      expect(result.sessionId).toBeDefined();
      expect(result.error).toBeUndefined();

      // Verify session was created and updated
      expect(mockAgenticRAGSessionModel.create).toHaveBeenCalledWith(
        expect.objectContaining({
          documentId,
          userId,
          strategy: 'agentic_rag',
          status: 'pending',
          totalAgents: 6,
        })
      );

      // Verify all agents were executed
      expect(mockLLMService.processCIMDocument).toHaveBeenCalledTimes(6);
    });

    it('should handle agent failures gracefully', async () => {
      // Arrange
      const documentText = loadTestDocument();
      const documentId = 'test-doc-123';
      const userId = 'test-user-123';

      // Mock one agent failure
      mockLLMService.processCIMDocument
        .mockResolvedValueOnce({
          success: true,
          jsonOutput: createMockAgentResponse('document_understanding'),
          model: 'claude-3-opus-20240229',
          cost: 0.50,
          inputTokens: 1000,
          outputTokens: 500,
        })
        .mockRejectedValueOnce(new Error('Financial analysis failed'));

      // Act
      const result = await processor.processDocument(documentText, documentId, userId);

      // Assert
      expect(result.success).toBe(false);
      expect(result.error).toContain('Financial analysis failed');
      expect(result.reasoningSteps).toBeDefined();
      expect(result.sessionId).toBeDefined();

      // Verify session was marked as failed
      expect(mockAgenticRAGSessionModel.update).toHaveBeenCalledWith(
        expect.any(String),
        expect.objectContaining({
          status: 'failed',
        })
      );
    });

    it('should retry failed agents according to retry strategy', async () => {
      // Arrange
      const documentText = loadTestDocument();
      const documentId = 'test-doc-123';
      const userId = 'test-user-123';

      // Mock agent that fails twice then succeeds
      mockLLMService.processCIMDocument
        .mockRejectedValueOnce(new Error('Temporary failure'))
        .mockRejectedValueOnce(new Error('Temporary failure'))
        .mockResolvedValueOnce({
          success: true,
          jsonOutput: createMockAgentResponse('document_understanding'),
          model: 'claude-3-opus-20240229',
          cost: 0.50,
          inputTokens: 1000,
          outputTokens: 500,
        });

      // Act
      const result = await processor.processDocument(documentText, documentId, userId);

      // Assert
      expect(mockLLMService.processCIMDocument).toHaveBeenCalledTimes(3);
      expect(result.success).toBe(true);
    });

    it('should assess quality metrics correctly', async () => {
      // Arrange
      const documentText = loadTestDocument();
      const documentId = 'test-doc-123';
      const userId = 'test-user-123';

      // Mock successful processing
      mockLLMService.processCIMDocument.mockResolvedValue({
        success: true,
        jsonOutput: createMockAgentResponse('document_understanding'),
        model: 'claude-3-opus-20240229',
        cost: 0.50,
        inputTokens: 1000,
        outputTokens: 500,
      });

      // Act
      const result = await processor.processDocument(documentText, documentId, userId);

      // Assert
      expect(result.qualityMetrics).toBeDefined();
      expect(result.qualityMetrics.length).toBeGreaterThan(0);
      expect(result.qualityMetrics.every((m: QualityMetrics) => m.metricValue >= 0 && m.metricValue <= 1)).toBe(true);
    });

    it('should handle circuit breaker pattern', async () => {
      // Arrange
      const documentText = loadTestDocument();
      const documentId = 'test-doc-123';
      const userId = 'test-user-123';

      // Mock repeated failures to trigger circuit breaker
      mockLLMService.processCIMDocument.mockRejectedValue(new Error('Service unavailable'));

      // Act
      const result = await processor.processDocument(documentText, documentId, userId);

      // Assert
      expect(result.success).toBe(false);
      expect(result.error).toContain('Service unavailable');
    });

    it('should track API calls and costs', async () => {
      // Arrange
      const documentText = loadTestDocument();
      const documentId = 'test-doc-123';
      const userId = 'test-user-123';

      // Mock successful processing
      mockLLMService.processCIMDocument.mockResolvedValue({
        success: true,
        jsonOutput: createMockAgentResponse('document_understanding'),
        model: 'claude-3-opus-20240229',
        cost: 0.50,
        inputTokens: 1000,
        outputTokens: 500,
      });

      // Act
      const result = await processor.processDocument(documentText, documentId, userId);

      // Assert
      expect(result.apiCalls).toBeGreaterThan(0);
      expect(result.totalCost).toBeDefined();
    });
  });

  describe('error handling', () => {
    it('should handle database errors gracefully', async () => {
      // Arrange
      const documentText = loadTestDocument();
      const documentId = 'test-doc-123';
      const userId = 'test-user-123';

      mockAgenticRAGSessionModel.create.mockRejectedValue(new Error('Database connection failed'));

      // Act
      const result = await processor.processDocument(documentText, documentId, userId);

      // Assert
      expect(result.success).toBe(false);
      expect(result.error).toContain('Database connection failed');
    });

    it('should handle invalid JSON responses', async () => {
      // Arrange
      const documentText = loadTestDocument();
      const documentId = 'test-doc-123';
      const userId = 'test-user-123';

      mockLLMService.processCIMDocument.mockResolvedValue({
        success: false,
        error: 'Invalid JSON response',
        model: 'claude-3-opus-20240229',
        cost: 0.50,
        inputTokens: 1000,
        outputTokens: 500,
      });

      // Act
      const result = await processor.processDocument(documentText, documentId, userId);

      // Assert
      expect(result.success).toBe(false);
      expect(result.error).toContain('Failed to parse JSON');
    });
  });

  describe('configuration', () => {
    it('should respect agent-specific configuration', async () => {
      // Arrange
      const documentText = loadTestDocument();
      const documentId = 'test-doc-123';
      const userId = 'test-user-123';

      // Disable some agents
      (config as any).agentSpecific.financialAnalysisEnabled = false;
      (config as any).agentSpecific.marketAnalysisEnabled = false;

      mockLLMService.processCIMDocument.mockResolvedValue({
        success: true,
        jsonOutput: createMockAgentResponse('document_understanding'),
        model: 'claude-3-opus-20240229',
        cost: 0.50,
        inputTokens: 1000,
        outputTokens: 500,
      });

      // Act
      const result = await processor.processDocument(documentText, documentId, userId);

      // Assert
      // Should still work with enabled agents
      expect(result.success).toBeDefined();
    });
  });
});

// Helper functions
function createMockAgentResponse(agentName: string): any {
  const responses: Record<string, any> = {
    document_understanding: {
      companyOverview: {
        name: 'Test Company',
        industry: 'Technology',
        location: 'San Francisco, CA',
        founded: '2010',
        employees: '500'
      },
      documentStructure: {
        sections: ['Executive Summary', 'Financial Analysis', 'Market Analysis'],
        pageCount: 50,
        keyTopics: ['Financial Performance', 'Market Position', 'Growth Strategy']
      },
      financialHighlights: {
        revenue: '$100M',
        ebitda: '$20M',
        growth: '15%',
        margins: '20%'
      }
    },
    financial_analysis: {
      historicalPerformance: {
        revenue: ['$80M', '$90M', '$100M'],
        ebitda: ['$15M', '$18M', '$20M'],
        margins: ['18%', '20%', '20%']
      },
      qualityOfEarnings: 'High',
      workingCapital: 'Positive',
      cashFlow: 'Strong'
    },
    market_analysis: {
      marketSize: '$10B',
      growthRate: '8%',
      competitors: ['Competitor A', 'Competitor B'],
      barriersToEntry: 'High',
      competitiveAdvantages: ['Technology', 'Brand', 'Scale']
    },
    investment_thesis: {
      keyAttractions: ['Strong growth', 'Market leadership', 'Technology advantage'],
      potentialRisks: ['Market competition', 'Regulatory changes'],
      valueCreation: ['Operational improvements', 'Market expansion'],
      recommendation: 'Proceed with diligence'
    },
    synthesis: {
      dealOverview: {
        targetCompanyName: 'Test Company',
        industrySector: 'Technology',
        geography: 'San Francisco, CA'
      },
      financialSummary: {
        financials: {
          ltm: {
            revenue: '$100M',
            ebitda: '$20M'
          }
        }
      },
      preliminaryInvestmentThesis: {
        keyAttractions: ['Strong growth', 'Market leadership'],
        potentialRisks: ['Market competition']
      }
    },
    validation: {
      isValid: true,
      issues: [],
      completeness: '95%',
      quality: 'high'
    }
  };

  return responses[agentName] || {};
}

function createMockSession(): any {
  return {
    id: 'session-123',
    documentId: 'doc-123',
    userId: 'user-123',
    strategy: 'agentic_rag',
    status: 'completed',
    totalAgents: 6,
    completedAgents: 6,
    failedAgents: 0,
    overallValidationScore: 0.9,
    processingTimeMs: 120000,
    apiCallsCount: 6,
    totalCost: 2.50,
    reasoningSteps: [],
    finalResult: {},
    createdAt: new Date(),
    completedAt: new Date()
  };
}

function createMockExecution(): any {
  return {
    id: 'execution-123',
    documentId: 'doc-123',
    sessionId: 'session-123',
    agentName: 'document_understanding',
    stepNumber: 1,
    status: 'completed',
    inputData: {},
    outputData: createMockAgentResponse('document_understanding'),
    validationResult: true,
    processingTimeMs: 20000,
    errorMessage: null,
    retryCount: 0,
    createdAt: new Date(),
    updatedAt: new Date()
  };
}

function createMockQualityMetric(): any {
  return {
    id: 'metric-123',
    documentId: 'doc-123',
    sessionId: 'session-123',
    metricType: 'completeness',
    metricValue: 0.9,
    metricDetails: {
      requiredSections: 7,
      presentSections: 6,
      missingSections: ['managementTeamOverview']
    },
    createdAt: new Date()
  };
}

function loadTestDocument(): string {
  // Mock document content for testing
  return `
CONFIDENTIAL INVESTMENT MEMORANDUM

Test Company, Inc.

Executive Summary
Test Company is a leading technology company with strong financial performance and market position.

Financial Performance
- Revenue: $100M (2023)
- EBITDA: $20M (2023)
- Growth Rate: 15% annually

Market Position
- Market Size: $10B
- Market Share: 5%
- Competitive Advantages: Technology, Brand, Scale

Management Team
- CEO: John Smith (10+ years experience)
- CFO: Jane Doe (15+ years experience)

Investment Opportunity
- Strong growth potential
- Market leadership position
- Technology advantage
- Experienced management team

Risks and Considerations
- Market competition
- Regulatory changes
- Technology disruption

This memorandum contains confidential information and is for internal use only.
`;
}

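The suite above expects an agent that fails twice to succeed on the third call, in line with `retryAttempts: 3` in the mocked config. A minimal retry wrapper consistent with that expected behavior might look like this — a sketch, not the processor's actual implementation:

```typescript
// Retry an async operation up to `attempts` times; rethrow the last error
// only after every attempt has failed.
async function withRetry<T>(fn: () => Promise<T>, attempts: number): Promise<T> {
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
    }
  }
  throw lastError;
}
```

With `attempts = 3`, a function that throws twice and then resolves is invoked exactly three times — matching the `toHaveBeenCalledTimes(3)` assertion in the retry test.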
@@ -1,435 +0,0 @@
import { documentProcessingService } from '../documentProcessingService';
import { DocumentModel } from '../../models/DocumentModel';
import { ProcessingJobModel } from '../../models/ProcessingJobModel';
import { fileStorageService } from '../fileStorageService';
import { llmService } from '../llmService';
import { pdfGenerationService } from '../pdfGenerationService';
import { config } from '../../config/env';
import fs from 'fs';
import path from 'path';

// Mock dependencies
jest.mock('../../models/DocumentModel');
jest.mock('../../models/ProcessingJobModel');
jest.mock('../fileStorageService');
jest.mock('../llmService');
jest.mock('../pdfGenerationService');
jest.mock('../../config/env');
jest.mock('fs');
jest.mock('path');

const mockDocumentModel = DocumentModel as jest.Mocked<typeof DocumentModel>;
const mockProcessingJobModel = ProcessingJobModel as jest.Mocked<typeof ProcessingJobModel>;
const mockFileStorageService = fileStorageService as jest.Mocked<typeof fileStorageService>;
const mockLlmService = llmService as jest.Mocked<typeof llmService>;
const mockPdfGenerationService = pdfGenerationService as jest.Mocked<typeof pdfGenerationService>;

// Mock CIM review data that matches the schema
const mockCIMReviewData = {
  dealOverview: {
    targetCompanyName: 'Test Company',
    industrySector: 'Technology',
    geography: 'US',
    dealSource: 'Investment Bank',
    transactionType: 'Buyout',
    dateCIMReceived: '2024-01-01',
    dateReviewed: '2024-01-02',
    reviewers: 'Test Reviewer',
    cimPageCount: '50',
    statedReasonForSale: 'Strategic exit'
  },
  businessDescription: {
    coreOperationsSummary: 'Test operations',
    keyProductsServices: 'Software solutions',
    uniqueValueProposition: 'Market leader',
    customerBaseOverview: {
      keyCustomerSegments: 'Enterprise clients',
      customerConcentrationRisk: 'Low',
      typicalContractLength: '3 years'
    },
    keySupplierOverview: {
      dependenceConcentrationRisk: 'Moderate'
    }
  },
  marketIndustryAnalysis: {
    estimatedMarketSize: '$1B',
    estimatedMarketGrowthRate: '10%',
    keyIndustryTrends: 'Digital transformation',
    competitiveLandscape: {
      keyCompetitors: 'Competitor A, B',
      targetMarketPosition: '#2',
      basisOfCompetition: 'Innovation'
    },
    barriersToEntry: 'High switching costs'
  },
  financialSummary: {
    financials: {
      fy3: {
        revenue: '$10M',
        revenueGrowth: '15%',
        grossProfit: '$7M',
        grossMargin: '70%',
        ebitda: '$2M',
        ebitdaMargin: '20%'
      },
      fy2: {
        revenue: '$12M',
        revenueGrowth: '20%',
        grossProfit: '$8.4M',
        grossMargin: '70%',
        ebitda: '$2.4M',
        ebitdaMargin: '20%'
      },
      fy1: {
        revenue: '$15M',
        revenueGrowth: '25%',
        grossProfit: '$10.5M',
        grossMargin: '70%',
        ebitda: '$3M',
        ebitdaMargin: '20%'
      },
      ltm: {
        revenue: '$18M',
        revenueGrowth: '20%',
        grossProfit: '$12.6M',
        grossMargin: '70%',
        ebitda: '$3.6M',
        ebitdaMargin: '20%'
      }
    },
    qualityOfEarnings: 'High quality',
    revenueGrowthDrivers: 'Market expansion',
    marginStabilityAnalysis: 'Stable',
    capitalExpenditures: '5%',
    workingCapitalIntensity: 'Low',
    freeCashFlowQuality: 'Strong'
  },
  managementTeamOverview: {
    keyLeaders: 'CEO, CFO, CTO',
    managementQualityAssessment: 'Experienced team',
    postTransactionIntentions: 'Stay on board',
    organizationalStructure: 'Flat structure'
  },
  preliminaryInvestmentThesis: {
    keyAttractions: 'Market leader with strong growth',
    potentialRisks: 'Market competition',
    valueCreationLevers: 'Operational improvements',
    alignmentWithFundStrategy: 'Strong fit'
  },
  keyQuestionsNextSteps: {
    criticalQuestions: 'Market sustainability',
    missingInformation: 'Customer references',
    preliminaryRecommendation: 'Proceed',
    rationaleForRecommendation: 'Strong fundamentals',
    proposedNextSteps: 'Management presentation'
  }
};

describe('DocumentProcessingService', () => {
  const mockDocument = {
    id: 'doc-123',
    user_id: 'user-123',
    original_file_name: 'test-document.pdf',
    file_path: '/uploads/test-document.pdf',
    file_size: 1024,
    status: 'uploaded' as const,
    uploaded_at: new Date(),
    created_at: new Date(),
    updated_at: new Date(),
  };

  beforeEach(() => {
    jest.clearAllMocks();

    // Mock config
    (config as any).upload = {
      uploadDir: '/test/uploads',
    };
    (config as any).llm = {
      maxTokens: 4000,
    };

    // Mock fs
    (fs.existsSync as jest.Mock).mockReturnValue(true);
    (fs.mkdirSync as jest.Mock).mockImplementation(() => {});
    (fs.writeFileSync as jest.Mock).mockImplementation(() => {});

    // Mock path
    (path.join as jest.Mock).mockImplementation((...args) => args.join('/'));
    (path.dirname as jest.Mock).mockReturnValue('/test/uploads/summaries');
  });

  describe('processDocument', () => {
    it('should process a document successfully', async () => {
      // Mock document model
      mockDocumentModel.findById.mockResolvedValue(mockDocument);
      mockDocumentModel.updateStatus.mockResolvedValue(mockDocument);

      // Mock file storage service
      mockFileStorageService.getFile.mockResolvedValue(Buffer.from('mock pdf content'));
      mockFileStorageService.fileExists.mockResolvedValue(true);

      // Mock processing job model
      mockProcessingJobModel.create.mockResolvedValue({} as any);
      mockProcessingJobModel.updateStatus.mockResolvedValue({} as any);

      // Mock LLM service
      // Remove estimateTokenCount mock - it's a private method
      mockLlmService.processCIMDocument.mockResolvedValue({
        success: true,
        jsonOutput: mockCIMReviewData,
        model: 'test-model',
        cost: 0.01,
        inputTokens: 1000,
        outputTokens: 500
      });

      // Mock PDF generation service
      mockPdfGenerationService.generatePDFFromMarkdown.mockResolvedValue(true);

      const result = await documentProcessingService.processDocument(
        'doc-123',
        'user-123'
      );

      expect(result.success).toBe(true);
      expect(result.documentId).toBe('doc-123');
      expect(result.jobId).toBeDefined();
      expect(result.steps).toHaveLength(5);
      expect(result.steps.every(step => step.status === 'completed')).toBe(true);
    });

    it('should handle document validation failure', async () => {
      mockDocumentModel.findById.mockResolvedValue(null);

      const result = await documentProcessingService.processDocument(
        'doc-123',
        'user-123'
      );

      expect(result.success).toBe(false);
      expect(result.error).toContain('Document not found');
    });

    it('should handle access denied', async () => {
      const wrongUserDocument = { ...mockDocument, user_id: 'wrong-user' as any };
      mockDocumentModel.findById.mockResolvedValue(wrongUserDocument);

      const result = await documentProcessingService.processDocument(
        'doc-123',
        'user-123'
      );

      expect(result.success).toBe(false);
      expect(result.error).toContain('Access denied');
    });

    it('should handle file not found', async () => {
      mockDocumentModel.findById.mockResolvedValue(mockDocument);
      mockFileStorageService.fileExists.mockResolvedValue(false);

      const result = await documentProcessingService.processDocument(
        'doc-123',
        'user-123'
      );

      expect(result.success).toBe(false);
      expect(result.error).toContain('Document file not accessible');
    });

    it('should handle text extraction failure', async () => {
      mockDocumentModel.findById.mockResolvedValue(mockDocument);
      mockFileStorageService.fileExists.mockResolvedValue(true);
      mockFileStorageService.getFile.mockResolvedValue(null);

      const result = await documentProcessingService.processDocument(
        'doc-123',
        'user-123'
      );

      expect(result.success).toBe(false);
      expect(result.error).toContain('Could not read document file');
    });

    it('should handle LLM processing failure', async () => {
      mockDocumentModel.findById.mockResolvedValue(mockDocument);
      mockFileStorageService.fileExists.mockResolvedValue(true);
      mockFileStorageService.getFile.mockResolvedValue(Buffer.from('mock pdf content'));
      mockProcessingJobModel.create.mockResolvedValue({} as any);
      // Remove estimateTokenCount mock - it's a private method
      mockLlmService.processCIMDocument.mockRejectedValue(new Error('LLM API error'));

      const result = await documentProcessingService.processDocument(
        'doc-123',
        'user-123'
      );

      expect(result.success).toBe(false);
      expect(result.error).toContain('LLM processing failed');
    });

    it('should handle PDF generation failure', async () => {
      mockDocumentModel.findById.mockResolvedValue(mockDocument);
      mockFileStorageService.fileExists.mockResolvedValue(true);
      mockFileStorageService.getFile.mockResolvedValue(Buffer.from('mock pdf content'));
      mockProcessingJobModel.create.mockResolvedValue({} as any);
      // Remove estimateTokenCount mock - it's a private method
      mockLlmService.processCIMDocument.mockResolvedValue({
        success: true,
        jsonOutput: mockCIMReviewData,
        model: 'test-model',
        cost: 0.01,
        inputTokens: 1000,
        outputTokens: 500
      });
      mockPdfGenerationService.generatePDFFromMarkdown.mockResolvedValue(false);

      const result = await documentProcessingService.processDocument(
        'doc-123',
        'user-123'
      );

      expect(result.success).toBe(false);
      expect(result.error).toContain('Failed to generate PDF');
    });

    it('should process large documents in chunks', async () => {
      mockDocumentModel.findById.mockResolvedValue(mockDocument);
      mockFileStorageService.fileExists.mockResolvedValue(true);
      mockFileStorageService.getFile.mockResolvedValue(Buffer.from('mock pdf content'));
      mockProcessingJobModel.create.mockResolvedValue({} as any);
      mockProcessingJobModel.updateStatus.mockResolvedValue({} as any);

      // Mock large document
      mockLlmService.processCIMDocument.mockResolvedValue({
        success: true,
        jsonOutput: mockCIMReviewData,
        model: 'test-model',
        cost: 0.01,
        inputTokens: 1000,
        outputTokens: 500
      });
      mockPdfGenerationService.generatePDFFromMarkdown.mockResolvedValue(true);

      const result = await documentProcessingService.processDocument(
        'doc-123',
        'user-123'
      );

      expect(result.success).toBe(true);
      expect(mockLlmService.processCIMDocument).toHaveBeenCalled();
    });
  });

  describe('getProcessingJobStatus', () => {
    it('should return job status', async () => {
      const mockJob = {
        id: 'job-123',
        status: 'completed',
        created_at: new Date(),
      };

      mockProcessingJobModel.findById.mockResolvedValue(mockJob as any);

      const result = await documentProcessingService.getProcessingJobStatus('job-123');

      expect(result).toEqual(mockJob);
      expect(mockProcessingJobModel.findById).toHaveBeenCalledWith('job-123');
    });

    it('should handle job not found', async () => {
      mockProcessingJobModel.findById.mockResolvedValue(null);

      const result = await documentProcessingService.getProcessingJobStatus('job-123');

      expect(result).toBeNull();
    });
  });

  describe('getDocumentProcessingHistory', () => {
    it('should return processing history', async () => {
      const mockJobs = [
        { id: 'job-1', status: 'completed' },
        { id: 'job-2', status: 'failed' },
      ];

      mockProcessingJobModel.findByDocumentId.mockResolvedValue(mockJobs as any);

      const result = await documentProcessingService.getDocumentProcessingHistory('doc-123');

      expect(result).toEqual(mockJobs);
      expect(mockProcessingJobModel.findByDocumentId).toHaveBeenCalledWith('doc-123');
    });

    it('should return empty array for no history', async () => {
      mockProcessingJobModel.findByDocumentId.mockResolvedValue([]);

      const result = await documentProcessingService.getDocumentProcessingHistory('doc-123');

      expect(result).toEqual([]);
    });
  });

  describe('document analysis', () => {
    it('should detect financial content', () => {
      const financialText = 'Revenue increased by 25% and EBITDA margins improved.';
      const result = (documentProcessingService as any).detectFinancialContent(financialText);
      expect(result).toBe(true);
    });

    it('should detect technical content', () => {
      const technicalText = 'The system architecture includes multiple components.';
      const result = (documentProcessingService as any).detectTechnicalContent(technicalText);
      expect(result).toBe(true);
    });

    it('should extract key topics', () => {
      const text = 'Financial analysis shows strong market growth and competitive advantages.';
      const result = (documentProcessingService as any).extractKeyTopics(text);
      expect(result).toContain('Financial Analysis');
      expect(result).toContain('Market Analysis');
    });

    it('should analyze sentiment', () => {
      const positiveText = 'Strong growth and excellent opportunities.';
      const result = (documentProcessingService as any).analyzeSentiment(positiveText);
      expect(result).toBe('positive');
    });

    it('should assess complexity', () => {
      const simpleText = 'This is a simple document.';
      const result = (documentProcessingService as any).assessComplexity(simpleText);
      expect(result).toBe('low');
    });
  });

  describe('error handling', () => {
    it('should handle database errors gracefully', async () => {
      mockDocumentModel.findById.mockRejectedValue(new Error('Database connection failed'));
|
||||
const result = await documentProcessingService.processDocument(
|
||||
'doc-123',
|
||||
'user-123'
|
||||
);
|
||||
|
||||
expect(result.success).toBe(false);
|
||||
expect(result.error).toContain('Database connection failed');
|
||||
});
|
||||
|
||||
it('should handle file system errors', async () => {
|
||||
mockDocumentModel.findById.mockResolvedValue(mockDocument);
|
||||
mockFileStorageService.fileExists.mockResolvedValue(true);
|
||||
mockFileStorageService.getFile.mockRejectedValue(new Error('File system error'));
|
||||
|
||||
const result = await documentProcessingService.processDocument(
|
||||
'doc-123',
|
||||
'user-123'
|
||||
);
|
||||
|
||||
expect(result.success).toBe(false);
|
||||
expect(result.error).toContain('File system error');
|
||||
});
|
||||
});
|
||||
});
|
||||
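The sentiment and complexity assertions above exercise private heuristics of `documentProcessingService`. A minimal self-contained sketch of the kind of keyword-count sentiment heuristic those expectations imply — the word lists and tie-breaking rule are assumptions, not the service's actual implementation:

```typescript
// Hypothetical word lists; the real service's vocabulary is not shown in the diff.
const POSITIVE_WORDS = ['strong', 'growth', 'excellent', 'improved', 'opportunities'];
const NEGATIVE_WORDS = ['decline', 'risk', 'loss', 'weak', 'failed'];

// Count positive vs. negative keyword hits and return the dominant polarity.
function analyzeSentiment(text: string): 'positive' | 'negative' | 'neutral' {
  const words = text.toLowerCase().match(/[a-z]+/g) ?? [];
  const pos = words.filter(w => POSITIVE_WORDS.includes(w)).length;
  const neg = words.filter(w => NEGATIVE_WORDS.includes(w)).length;
  if (pos > neg) return 'positive';
  if (neg > pos) return 'negative';
  return 'neutral';
}
```

With this sketch, the test fixture `'Strong growth and excellent opportunities.'` classifies as positive, matching the expectation in the suite.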
@@ -1,121 +0,0 @@
import { VectorDocumentProcessor, TextBlock } from '../vectorDocumentProcessor';
import { llmService } from '../llmService';
import { vectorDatabaseService } from '../vectorDatabaseService';

// Mock the dependencies
jest.mock('../llmService');
jest.mock('../vectorDatabaseService');

const mockedLlmService = llmService as jest.Mocked<typeof llmService>;
const mockedVectorDBService = vectorDatabaseService as jest.Mocked<typeof vectorDatabaseService>;

// Sample text mimicking a messy PDF extraction with various elements
const sampleText = `
This is the first paragraph of the document. It contains some general information about the company.

Financial Highlights

This paragraph discusses the financial performance. It is located after a heading.

Here is a table of financial data:

Metric FY2021 FY2022 FY2023
Revenue $10.0M $12.5M $15.0M
EBITDA $2.0M $2.5M $3.0M

This is the final paragraph, coming after the table. It summarizes the outlook.
`;

describe('VectorDocumentProcessor', () => {
  let processor: VectorDocumentProcessor;

  beforeEach(() => {
    processor = new VectorDocumentProcessor();
    // Reset mocks before each test
    jest.clearAllMocks();

    // Set up VectorDatabaseService mock methods
    (mockedVectorDBService as any).generateEmbeddings = jest.fn();
    (mockedVectorDBService as any).storeDocumentChunks = jest.fn();
    (mockedVectorDBService as any).search = jest.fn();
  });

  describe('identifyTextBlocks', () => {
    it('should correctly identify paragraphs, headings, and tables', () => {
      // Access the private method for testing purposes
      const blocks: TextBlock[] = (processor as any).identifyTextBlocks(sampleText);

      expect(blocks).toHaveLength(5);

      // Check block types
      expect(blocks[0]?.type).toBe('paragraph');
      expect(blocks[1]?.type).toBe('heading');
      expect(blocks[2]?.type).toBe('paragraph');
      expect(blocks[3]?.type).toBe('table');
      expect(blocks[4]?.type).toBe('paragraph');

      // Check block content
      expect(blocks[0]?.content).toBe('This is the first paragraph of the document. It contains some general information about the company.');
      expect(blocks[1]?.content).toBe('Financial Highlights');
      expect(blocks[3]?.content).toContain('Revenue $10.0M $12.5M $15.0M');
    });
  });

  describe('processDocumentForVectorSearch', () => {
    it('should use the LLM to summarize tables and store original in metadata', async () => {
      const documentId = 'test-doc-1';
      const tableSummary = 'The table shows revenue growing from $10M to $15M and EBITDA growing from $2M to $3M between FY2021 and FY2023.';

      // Mock the LLM service to return a summary for the table
      mockedLlmService.processCIMDocument.mockResolvedValue({
        success: true,
        jsonOutput: { summary: tableSummary } as any,
        model: 'test-model',
        cost: 0.01,
        inputTokens: 100,
        outputTokens: 50,
      });

      // Mock the embedding service to return a dummy vector
      (mockedVectorDBService as any).generateEmbeddings.mockResolvedValue([0.1, 0.2, 0.3]);

      // Mock the storage service
      (mockedVectorDBService as any).storeDocumentChunks.mockResolvedValue();

      await processor.processDocumentForVectorSearch(documentId, sampleText);

      // Verify that storeDocumentChunks was called
      expect((mockedVectorDBService as any).storeDocumentChunks).toHaveBeenCalled();

      // Get the arguments passed to storeDocumentChunks
      const storedChunks = (mockedVectorDBService as any).storeDocumentChunks.mock.calls[0]?.[0];
      expect(storedChunks).toBeDefined();
      if (!storedChunks) return;

      expect(storedChunks).toHaveLength(5);

      // Find the table chunk
      const tableChunk = storedChunks.find((c: any) => c.metadata.block_type === 'table');
      expect(tableChunk).toBeDefined();
      if (!tableChunk) return;

      // Assert that the LLM was called for the table summarization
      expect(mockedLlmService.processCIMDocument).toHaveBeenCalledTimes(1);
      const prompt = mockedLlmService.processCIMDocument.mock.calls[0]?.[0];
      expect(prompt).toContain('Summarize the key information in this table');

      // Assert that the table chunk's content is the LLM summary
      expect(tableChunk.content).toBe(tableSummary);

      // Assert that the original table text is stored in the metadata
      expect(tableChunk.metadata['original_table']).toContain('Metric FY2021 FY2022 FY2023');

      // Find a paragraph chunk and check its content
      const paragraphChunk = storedChunks.find((c: any) => c.metadata.block_type === 'paragraph');
      expect(paragraphChunk).toBeDefined();
      if (paragraphChunk) {
        expect(paragraphChunk.content).not.toBe(tableSummary); // Ensure it wasn't summarized
      }
    });
  });
});
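The `identifyTextBlocks` expectations in the removed test above imply a blank-line split followed by per-block classification into paragraph, heading, or table. A self-contained sketch of such a classifier — the thresholds and rules here are assumptions inferred from the test's expectations, not the removed implementation:

```typescript
type BlockType = 'paragraph' | 'heading' | 'table';

interface Block {
  type: BlockType;
  content: string;
}

// Split raw extracted text on blank lines, then classify each block:
// multi-line blocks whose rows all look columnar (3+ whitespace-separated
// tokens per line) are tables; short single lines without terminal
// punctuation are headings; everything else is a paragraph.
function identifyBlocks(text: string): Block[] {
  return text
    .split(/\n\s*\n/)
    .map(b => b.trim())
    .filter(b => b.length > 0)
    .map(content => {
      const lines = content.split('\n');
      const columnar =
        lines.length > 1 &&
        lines.every(l => l.trim().split(/\s+/).length >= 3);
      if (columnar) return { type: 'table' as const, content };
      if (lines.length === 1 && content.length < 60 && !/[.!?]$/.test(content)) {
        return { type: 'heading' as const, content };
      }
      return { type: 'paragraph' as const, content };
    });
}
```

For example, `'Financial Highlights'` classifies as a heading while a columnar `Metric/Revenue` block classifies as a table, mirroring the test fixture.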
File diff suppressed because it is too large
@@ -23,13 +23,7 @@ interface DocumentAIOutput {
  mimeType: string;
}

interface PageChunk {
  startPage: number;
  endPage: number;
  buffer: Buffer;
}

export class DocumentAiGenkitProcessor {
export class DocumentAiProcessor {
  private gcsBucketName: string;
  private documentAiClient: DocumentProcessorServiceClient;
  private storageClient: Storage;
@@ -44,7 +38,7 @@ export class DocumentAiGenkitProcessor {
    // Construct the processor name
    this.processorName = `projects/${config.googleCloud.projectId}/locations/${config.googleCloud.documentAiLocation}/processors/${config.googleCloud.documentAiProcessorId}`;

    logger.info('Document AI + Genkit processor initialized', {
    logger.info('Document AI processor initialized', {
      projectId: config.googleCloud.projectId,
      location: config.googleCloud.documentAiLocation,
      processorId: config.googleCloud.documentAiProcessorId,
@@ -386,4 +380,4 @@ export class DocumentAiGenkitProcessor {
  }
}

export const documentAiGenkitProcessor = new DocumentAiGenkitProcessor();
export const documentAiProcessor = new DocumentAiProcessor();
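The `PageChunk` interface in this diff (`startPage`, `endPage`, `buffer`) suggests the processor batches a document's pages into ranges before sending them through Document AI. A hedged sketch of the range-planning step only — the 15-page chunk size and the `planPageChunks` name are assumptions, not taken from the diff:

```typescript
interface PageRange {
  startPage: number;
  endPage: number;
}

// Partition pages 1..totalPages into consecutive fixed-size ranges,
// the batching pattern the PageChunk interface implies.
function planPageChunks(totalPages: number, pagesPerChunk = 15): PageRange[] {
  const chunks: PageRange[] = [];
  for (let start = 1; start <= totalPages; start += pagesPerChunk) {
    chunks.push({
      startPage: start,
      endPage: Math.min(start + pagesPerChunk - 1, totalPages),
    });
  }
  return chunks;
}
```

A 40-page document with 15-page chunks would yield ranges 1–15, 16–30, and 31–40.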
File diff suppressed because it is too large
@@ -2,10 +2,18 @@ import { EventEmitter } from 'events';
import path from 'path';
import { logger, StructuredLogger } from '../utils/logger';
import { config } from '../config/env';
import { ProcessingOptions } from './documentProcessingService';
import { unifiedDocumentProcessor } from './unifiedDocumentProcessor';
import { uploadMonitoringService } from './uploadMonitoringService';

// Define ProcessingOptions interface locally since documentProcessingService was removed
export interface ProcessingOptions {
  strategy?: string;
  fileBuffer?: Buffer;
  fileName?: string;
  mimeType?: string;
  [key: string]: any;
}

export interface Job {
  id: string;
  type: 'document_processing';

@@ -1,410 +0,0 @@
import { logger } from '../utils/logger';
import { llmService } from './llmService';

import { CIMReview } from './llmSchemas';

interface DocumentSection {
  id: string;
  type: 'executive_summary' | 'business_description' | 'financial_analysis' | 'market_analysis' | 'management' | 'investment_thesis';
  content: string;
  pageRange: [number, number];
  keyMetrics: Record<string, any>;
  relevanceScore: number;
}

interface RAGQuery {
  section: string;
  context: string;
  specificQuestions: string[];
}

interface RAGAnalysisResult {
  success: boolean;
  summary: string;
  analysisData: CIMReview;
  error?: string;
  processingTime: number;
  apiCalls: number;
}

class RAGDocumentProcessor {
  private sections: DocumentSection[] = [];

  private apiCallCount: number = 0;

  /**
   * Process CIM document using RAG approach
   */
  async processDocument(text: string, documentId: string): Promise<RAGAnalysisResult> {
    const startTime = Date.now();
    this.apiCallCount = 0;

    logger.info('Starting RAG-based CIM processing', { documentId });

    try {
      // Step 1: Intelligent document segmentation
      await this.segmentDocument(text);

      // Step 2: Extract key metrics and context
      await this.extractKeyMetrics();

      // Step 3: Generate comprehensive analysis using RAG
      const analysis = await this.generateRAGAnalysis();

      // Step 4: Create final summary
      const summary = await this.createFinalSummary(analysis);

      const processingTime = Date.now() - startTime;

      logger.info('RAG processing completed successfully', {
        documentId,
        processingTime,
        apiCalls: this.apiCallCount,
        sections: this.sections.length
      });

      return {
        success: true,
        summary,
        analysisData: analysis,
        processingTime,
        apiCalls: this.apiCallCount
      };

    } catch (error) {
      const processingTime = Date.now() - startTime;
      logger.error('RAG processing failed', {
        documentId,
        error: error instanceof Error ? error.message : 'Unknown error',
        processingTime,
        apiCalls: this.apiCallCount
      });

      return {
        success: false,
        summary: '',
        analysisData: {} as CIMReview,
        error: error instanceof Error ? error.message : 'Unknown error',
        processingTime,
        apiCalls: this.apiCallCount
      };
    }
  }

  /**
   * Segment document into logical sections with metadata
   */
  private async segmentDocument(text: string): Promise<void> {
    logger.info('Segmenting document into logical sections');

    // Use LLM to identify and segment document sections
    const segmentationPrompt = `
Analyze this CIM document and identify its logical sections. For each section, provide:
1. Section type (executive_summary, business_description, financial_analysis, market_analysis, management, investment_thesis)
2. Start and end page numbers
3. Key topics covered
4. Relevance to investment analysis (1-10 scale)

Document text:
${text.substring(0, 50000)} // First 50K chars for section identification

Return as JSON array of sections.
`;

    const segmentationResult = await this.callLLM({
      prompt: segmentationPrompt,
      systemPrompt: 'You are an expert at analyzing CIM document structure. Identify logical sections accurately.',
      maxTokens: 2000,
      temperature: 0.1
    });

    if (segmentationResult.success) {
      try {
        const sections = JSON.parse(segmentationResult.content);
        this.sections = sections.map((section: any, index: number) => ({
          id: `section_${index}`,
          type: section.type,
          content: this.extractSectionContent(text, section.pageRange),
          pageRange: section.pageRange,
          keyMetrics: {},
          relevanceScore: section.relevanceScore
        }));
      } catch (error) {
        logger.error('Failed to parse section segmentation', { error });
        // Fallback to rule-based segmentation
        this.sections = this.fallbackSegmentation(text);
      }
    }
  }

  /**
   * Extract key metrics from each section
   */
  private async extractKeyMetrics(): Promise<void> {
    logger.info('Extracting key metrics from document sections');

    for (const section of this.sections) {
      const metricsPrompt = `
Extract key financial and business metrics from this section:

Section Type: ${section.type}
Content: ${section.content.substring(0, 10000)}

Focus on:
- Revenue, EBITDA, margins
- Growth rates, market size
- Customer metrics, employee count
- Key risks and opportunities

Return as JSON object.
`;

      const metricsResult = await this.callLLM({
        prompt: metricsPrompt,
        systemPrompt: 'Extract precise numerical and qualitative metrics from CIM sections.',
        maxTokens: 1500,
        temperature: 0.1
      });

      if (metricsResult.success) {
        try {
          section.keyMetrics = JSON.parse(metricsResult.content);
        } catch (error) {
          logger.warn('Failed to parse metrics for section', { sectionId: section.id, error });
        }
      }
    }
  }

  /**
   * Generate analysis using RAG approach
   */
  private async generateRAGAnalysis(): Promise<CIMReview> {
    logger.info('Generating RAG-based analysis');

    // Create queries for each section of the BPCP template
    const queries: RAGQuery[] = [
      {
        section: 'dealOverview',
        context: 'Extract deal-specific information including company name, industry, geography, transaction details',
        specificQuestions: [
          'What is the target company name?',
          'What industry/sector does it operate in?',
          'Where is the company headquartered?',
          'What type of transaction is this?',
          'What is the stated reason for sale?'
        ]
      },
      {
        section: 'businessDescription',
        context: 'Analyze the company\'s core operations, products/services, and customer base',
        specificQuestions: [
          'What are the core operations?',
          'What are the key products/services?',
          'What is the revenue mix?',
          'Who are the key customers?',
          'What is the unique value proposition?'
        ]
      },
      {
        section: 'financialSummary',
        context: 'Extract and analyze financial performance, trends, and quality metrics',
        specificQuestions: [
          'What are the revenue trends?',
          'What are the EBITDA margins?',
          'What is the quality of earnings?',
          'What are the growth drivers?',
          'What is the working capital intensity?'
        ]
      },
      {
        section: 'marketIndustryAnalysis',
        context: 'Analyze market size, growth, competition, and industry trends',
        specificQuestions: [
          'What is the market size (TAM/SAM)?',
          'What is the market growth rate?',
          'Who are the key competitors?',
          'What are the barriers to entry?',
          'What are the key industry trends?'
        ]
      },
      {
        section: 'managementTeamOverview',
        context: 'Evaluate management team quality, experience, and post-transaction intentions',
        specificQuestions: [
          'Who are the key leaders?',
          'What is their experience level?',
          'What are their post-transaction intentions?',
          'How is the organization structured?'
        ]
      },
      {
        section: 'preliminaryInvestmentThesis',
        context: 'Develop investment thesis based on all available information',
        specificQuestions: [
          'What are the key attractions?',
          'What are the potential risks?',
          'What are the value creation levers?',
          'How does this align with BPCP strategy?'
        ]
      }
    ];

    const analysis: any = {};

    // Process each query using RAG
    for (const query of queries) {
      const relevantSections = this.findRelevantSections(query);
      const queryContext = this.buildQueryContext(relevantSections, query);

      const analysisResult = await this.callLLM({
        prompt: this.buildRAGPrompt(query, queryContext),
        systemPrompt: 'You are an expert investment analyst. Provide precise, structured analysis based on the provided context.',
        maxTokens: 2000,
        temperature: 0.1
      });

      if (analysisResult.success) {
        try {
          analysis[query.section] = JSON.parse(analysisResult.content);
        } catch (error) {
          logger.warn('Failed to parse analysis for section', { section: query.section, error });
        }
      }
    }

    return analysis as CIMReview;
  }

  /**
   * Find sections relevant to a specific query
   */
  private findRelevantSections(query: RAGQuery): DocumentSection[] {
    const relevanceMap: Record<string, string[]> = {
      dealOverview: ['executive_summary'],
      businessDescription: ['business_description', 'executive_summary'],
      financialSummary: ['financial_analysis', 'executive_summary'],
      marketIndustryAnalysis: ['market_analysis', 'executive_summary'],
      managementTeamOverview: ['management', 'executive_summary'],
      preliminaryInvestmentThesis: ['investment_thesis', 'executive_summary', 'business_description']
    };

    const relevantTypes = relevanceMap[query.section] || [];
    return this.sections.filter(section =>
      relevantTypes.includes(section.type) && section.relevanceScore >= 5
    );
  }

  /**
   * Build context for a specific query
   */
  private buildQueryContext(sections: DocumentSection[], query: RAGQuery): string {
    let context = `Query: ${query.context}\n\n`;
    context += `Specific Questions:\n${query.specificQuestions.map(q => `- ${q}`).join('\n')}\n\n`;
    context += `Relevant Document Sections:\n\n`;

    for (const section of sections) {
      context += `Section: ${section.type}\n`;
      context += `Relevance Score: ${section.relevanceScore}/10\n`;
      context += `Key Metrics: ${JSON.stringify(section.keyMetrics, null, 2)}\n`;
      context += `Content: ${section.content.substring(0, 5000)}\n\n`;
    }

    return context;
  }

  /**
   * Build RAG prompt for specific analysis
   */
  private buildRAGPrompt(query: RAGQuery, context: string): string {
    return `
Based on the following context from a CIM document, provide a comprehensive analysis for the ${query.section} section.

${context}

Please provide your analysis in the exact JSON format required for the BPCP CIM Review Template.
Focus on answering the specific questions listed above.
Use "Not specified in CIM" for any information not available in the provided context.
`;
  }

  /**
   * Create final summary from RAG analysis
   */
  private async createFinalSummary(analysis: CIMReview): Promise<string> {
    logger.info('Creating final summary from RAG analysis');

    const summaryPrompt = `
Create a comprehensive markdown summary from the following BPCP CIM analysis:

${JSON.stringify(analysis, null, 2)}

Format as a professional BPCP CIM Review Template with proper markdown structure.
`;

    const summaryResult = await this.callLLM({
      prompt: summaryPrompt,
      systemPrompt: 'Create a professional, well-structured markdown summary for BPCP investment committee.',
      maxTokens: 3000,
      temperature: 0.1
    });

    return summaryResult.success ? summaryResult.content : 'Summary generation failed';
  }

  /**
   * Fallback segmentation if LLM segmentation fails
   */
  private fallbackSegmentation(text: string): DocumentSection[] {
    // Rule-based segmentation as fallback
    const sections: DocumentSection[] = [];
    const patterns = [
      { type: 'executive_summary', pattern: /(?:executive\s+summary|overview|introduction)/i },
      { type: 'business_description', pattern: /(?:business\s+description|company\s+overview|operations)/i },
      { type: 'financial_analysis', pattern: /(?:financial|financials|performance|results)/i },
      { type: 'market_analysis', pattern: /(?:market|industry|competitive)/i },
      { type: 'management', pattern: /(?:management|leadership|team)/i },
      { type: 'investment_thesis', pattern: /(?:investment|opportunity|thesis)/i }
    ];

    // Simple text splitting based on patterns
    const textLength = text.length;
    const sectionSize = Math.floor(textLength / patterns.length);

    patterns.forEach((pattern, index) => {
      const start = index * sectionSize;
      const end = Math.min((index + 1) * sectionSize, textLength);

      sections.push({
        id: `section_${index}`,
        type: pattern.type as any,
        content: text.substring(start, end),
        pageRange: [Math.floor(start / 1000), Math.floor(end / 1000)],
        keyMetrics: {},
        relevanceScore: 7
      });
    });

    return sections;
  }

  /**
   * Extract content for specific page range
   */
  private extractSectionContent(text: string, pageRange: [number, number]): string {
    // Rough estimation: 1000 characters per page
    const startChar = pageRange[0] * 1000;
    const endChar = pageRange[1] * 1000;
    return text.substring(startChar, endChar);
  }

  /**
   * Wrapper for LLM calls to track API usage
   */
  private async callLLM(request: any): Promise<any> {
    this.apiCallCount++;
    return await llmService.processCIMDocument(request.prompt, '', {});
  }
}

export const ragDocumentProcessor = new RAGDocumentProcessor();
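The deleted `fallbackSegmentation` method splits a document evenly across six known section types when LLM segmentation fails. A self-contained sketch of that slicing — the section names come from the file, but the simplified `Section` shape and `SECTION_TYPES` constant are assumptions for illustration:

```typescript
const SECTION_TYPES = [
  'executive_summary',
  'business_description',
  'financial_analysis',
  'market_analysis',
  'management',
  'investment_thesis',
] as const;

interface Section {
  id: string;
  type: string;
  content: string;
}

// Evenly partition the raw text into one slice per section type,
// mirroring the rule-based fallback path in the removed processor.
function fallbackSegmentation(text: string): Section[] {
  const size = Math.floor(text.length / SECTION_TYPES.length);
  return SECTION_TYPES.map((type, i) => ({
    id: `section_${i}`,
    type,
    content: text.substring(i * size, Math.min((i + 1) * size, text.length)),
  }));
}
```

Note that, like the original, flooring the slice size can drop a few trailing characters when the text length is not a multiple of six.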
@@ -1,37 +1,124 @@
import { logger } from '../utils/logger';
import { config } from '../config/env';
import { documentProcessingService } from './documentProcessingService';
import { ragDocumentProcessor } from './ragDocumentProcessor';
import { optimizedAgenticRAGProcessor } from './optimizedAgenticRAGProcessor';
import { documentAiGenkitProcessor } from './documentAiGenkitProcessor';
import { documentAiProcessor } from './documentAiProcessor';
import { CIMReview } from './llmSchemas';
import { documentController } from '../controllers/documentController';

// Default empty CIMReview object
const defaultCIMReview: CIMReview = {
  dealOverview: {
    targetCompanyName: '',
    industrySector: '',
    geography: '',
    dealSource: '',
    transactionType: '',
    dateCIMReceived: '',
    dateReviewed: '',
    reviewers: '',
    cimPageCount: '',
    statedReasonForSale: '',
    employeeCount: ''
  },
  businessDescription: {
    coreOperationsSummary: '',
    keyProductsServices: '',
    uniqueValueProposition: '',
    customerBaseOverview: {
      keyCustomerSegments: '',
      customerConcentrationRisk: '',
      typicalContractLength: ''
    },
    keySupplierOverview: {
      dependenceConcentrationRisk: ''
    }
  },
  marketIndustryAnalysis: {
    estimatedMarketSize: '',
    estimatedMarketGrowthRate: '',
    keyIndustryTrends: '',
    competitiveLandscape: {
      keyCompetitors: '',
      targetMarketPosition: '',
      basisOfCompetition: ''
    },
    barriersToEntry: ''
  },
  financialSummary: {
    financials: {
      fy3: {
        revenue: '',
        revenueGrowth: '',
        grossProfit: '',
        grossMargin: '',
        ebitda: '',
        ebitdaMargin: ''
      },
      fy2: {
        revenue: '',
        revenueGrowth: '',
        grossProfit: '',
        grossMargin: '',
        ebitda: '',
        ebitdaMargin: ''
      },
      fy1: {
        revenue: '',
        revenueGrowth: '',
        grossProfit: '',
        grossMargin: '',
        ebitda: '',
        ebitdaMargin: ''
      },
      ltm: {
        revenue: '',
        revenueGrowth: '',
        grossProfit: '',
        grossMargin: '',
        ebitda: '',
        ebitdaMargin: ''
      }
    },
    qualityOfEarnings: '',
    revenueGrowthDrivers: '',
    marginStabilityAnalysis: '',
    capitalExpenditures: '',
    workingCapitalIntensity: '',
    freeCashFlowQuality: ''
  },
  managementTeamOverview: {
    keyLeaders: '',
    managementQualityAssessment: '',
    postTransactionIntentions: '',
    organizationalStructure: ''
  },
  preliminaryInvestmentThesis: {
    keyAttractions: '',
    potentialRisks: '',
    valueCreationLevers: '',
    alignmentWithFundStrategy: ''
  },
  keyQuestionsNextSteps: {
    criticalQuestions: '',
    missingInformation: '',
    preliminaryRecommendation: '',
    rationaleForRecommendation: '',
    proposedNextSteps: ''
  }
};

interface ProcessingResult {
  success: boolean;
  summary: string;
  analysisData: CIMReview;
  processingStrategy: 'chunking' | 'rag' | 'agentic_rag' | 'optimized_agentic_rag' | 'document_ai_genkit';
  processingStrategy: 'document_ai_agentic_rag';
  processingTime: number;
  apiCalls: number;
  error: string | undefined;
}

interface ComparisonResult {
  chunking: ProcessingResult;
  rag: ProcessingResult;
  agenticRag: ProcessingResult;
  winner: 'chunking' | 'rag' | 'agentic_rag' | 'tie';
  performanceMetrics: {
    timeDifference: number;
    apiCallDifference: number;
    qualityScore: number;
  };
}

class UnifiedDocumentProcessor {
  /**
   * Process document using the configured strategy
   * Process document using Document AI + Agentic RAG strategy
   */
  async processDocument(
    documentId: string,
@@ -39,178 +126,45 @@ class UnifiedDocumentProcessor {
|
||||
text: string,
|
||||
options: any = {}
|
||||
): Promise<ProcessingResult> {
|
||||
const strategy = options.strategy || config.processingStrategy;
|
||||
const strategy = options.strategy || 'document_ai_agentic_rag';
|
||||
|
||||
logger.info('Processing document with unified processor', {
|
||||
documentId,
|
||||
strategy,
|
||||
configStrategy: config.processingStrategy,
|
||||
textLength: text.length
|
||||
});
|
||||
|
||||
if (strategy === 'rag') {
|
||||
return await this.processWithRAG(documentId, text);
|
||||
} else if (strategy === 'agentic_rag') {
|
||||
return await this.processWithAgenticRAG(documentId, userId, text);
|
||||
} else if (strategy === 'optimized_agentic_rag') {
|
||||
return await this.processWithOptimizedAgenticRAG(documentId, userId, text, options);
|
||||
} else if (strategy === 'document_ai_genkit') {
|
||||
return await this.processWithDocumentAiGenkit(documentId, userId, text, options);
|
||||
// Only support document_ai_agentic_rag strategy
|
||||
if (strategy === 'document_ai_agentic_rag') {
|
||||
return await this.processWithDocumentAiAgenticRag(documentId, userId, text, options);
|
||||
} else {
|
||||
return await this.processWithChunking(documentId, userId, text, options);
|
||||
throw new Error(`Unsupported processing strategy: ${strategy}. Only 'document_ai_agentic_rag' is supported.`);
|
||||
}
|
||||
}
|
||||
|
||||
/**
|
||||
* Process document using RAG approach
|
||||
* Process document using Document AI + Agentic RAG approach
|
||||
*/
|
||||
private async processWithRAG(documentId: string, text: string): Promise<ProcessingResult> {
|
||||
logger.info('Using RAG processing strategy', { documentId });
|
||||
|
||||
const result = await ragDocumentProcessor.processDocument(text, documentId);
|
||||
|
||||
return {
|
||||
success: result.success,
|
||||
summary: result.summary,
|
||||
analysisData: result.analysisData,
|
||||
processingStrategy: 'rag',
|
||||
processingTime: result.processingTime,
|
||||
apiCalls: result.apiCalls,
|
||||
error: result.error || undefined
|
||||
};
|
||||
}

  /**
   * Process document using agentic RAG approach
   */
  private async processWithAgenticRAG(
    documentId: string,
    _userId: string,
    text: string
  ): Promise<ProcessingResult> {
    logger.info('Using agentic RAG processing strategy', { documentId });

    try {
      // If text is empty, extract it from the document
      let extractedText = text;
      if (!text || text.length === 0) {
        logger.info('Extracting text for agentic RAG processing', { documentId });
        extractedText = await documentController.getDocumentText(documentId);
      }

      const result = await optimizedAgenticRAGProcessor.processLargeDocument(documentId, extractedText, {});

      return {
        success: result.success,
        summary: result.summary || '',
        analysisData: result.analysisData || {} as CIMReview,
        processingStrategy: 'agentic_rag',
        processingTime: result.processingTime,
        apiCalls: Math.ceil(result.processedChunks / 5), // Estimate API calls
        error: result.error || undefined
      };
    } catch (error) {
      logger.error('Agentic RAG processing failed', { documentId, error });

      return {
        success: false,
        summary: '',
        analysisData: {} as CIMReview,
        processingStrategy: 'agentic_rag',
        processingTime: 0,
        apiCalls: 0,
        error: error instanceof Error ? error.message : 'Unknown error'
      };
    }
  }

  /**
   * Process document using optimized agentic RAG approach for large documents
   */
  private async processWithOptimizedAgenticRAG(
    documentId: string,
    _userId: string,
    text: string,
    _options: any
  ): Promise<ProcessingResult> {
    logger.info('Using optimized agentic RAG processing strategy', { documentId, textLength: text.length });

    const startTime = Date.now();

    try {
      // If text is empty, extract it from the document
      let extractedText = text;
      if (!text || text.length === 0) {
        logger.info('Extracting text for optimized agentic RAG processing', { documentId });
        extractedText = await documentController.getDocumentText(documentId);
      }

      // Use the optimized processor for large documents
      const optimizedResult = await optimizedAgenticRAGProcessor.processLargeDocument(
        documentId,
        extractedText,
        {
          enableSemanticChunking: true,
          enableMetadataEnrichment: true,
          similarityThreshold: 0.8
        }
      );

      // Return the complete result from the optimized processor
      return {
        success: optimizedResult.success,
        summary: optimizedResult.summary || `Document successfully processed with optimized agentic RAG. Created ${optimizedResult.processedChunks} chunks with ${optimizedResult.averageChunkSize} average size.`,
        analysisData: optimizedResult.analysisData || {} as CIMReview,
        processingStrategy: 'optimized_agentic_rag',
        processingTime: optimizedResult.processingTime,
        apiCalls: Math.ceil(optimizedResult.processedChunks / 5), // Estimate API calls
        error: optimizedResult.error
      };
    } catch (error) {
      logger.error('Optimized agentic RAG processing failed', { documentId, error });

      console.log('❌ Unified document processor - optimized agentic RAG failed for document:', documentId);
      console.log('❌ Error:', error instanceof Error ? error.message : String(error));

      return {
        success: false,
        summary: '',
        analysisData: {} as CIMReview,
        processingStrategy: 'optimized_agentic_rag',
        processingTime: Date.now() - startTime,
        apiCalls: 0,
        error: error instanceof Error ? error.message : 'Unknown error'
      };
    }
  }

  /**
   * Process document using Document AI + Agentic RAG approach
   */
  private async processWithDocumentAiAgenticRag(
    documentId: string,
    userId: string,
    text: string,
    options: any
  ): Promise<ProcessingResult> {
    logger.info('Using Document AI + Agentic RAG processing strategy', { documentId });

    try {
      // Extract file buffer from options
      const { fileBuffer, fileName, mimeType } = options;

      if (!fileBuffer || !fileName || !mimeType) {
        throw new Error('Missing required options: fileBuffer, fileName, mimeType');
      }

      const startTime = Date.now();

      // Process with Document AI + Agentic RAG
      const result = await documentAiProcessor.processDocument(
        documentId,
        userId,
        fileBuffer,
        fileName,
        mimeType
      );

      const processingTime = Date.now() - startTime;

      if (result.success) {
        return {
          success: true,
          summary: result.content,
          analysisData: result.metadata?.agenticRagResult?.analysisData || {},
          processingStrategy: 'document_ai_agentic_rag',
          processingTime,
          apiCalls: result.metadata?.agenticRagResult?.apiCalls || 0,
          error: undefined
        };
      } else {
        return {
          success: false,
          summary: '',
          analysisData: defaultCIMReview,
          processingStrategy: 'document_ai_agentic_rag',
          processingTime,
          apiCalls: 0,
          error: result.error || 'Unknown processing error'
        };
      }
    } catch (error) {
      const errorMessage = error instanceof Error ? error.message : 'Unknown error';
      logger.error('Document AI + Agentic RAG processing failed', {
        documentId,
        error: errorMessage
      });

      return {
        success: false,
        summary: '',
        analysisData: defaultCIMReview,
        processingStrategy: 'document_ai_agentic_rag',
        processingTime: 0,
        apiCalls: 0,
        error: errorMessage
      };
    }
@@ -258,213 +215,31 @@ class UnifiedDocumentProcessor {
  }

  /**
   * Process document using chunking approach
   */
  private async processWithChunking(
    documentId: string,
    userId: string,
    text: string,
    options: any
  ): Promise<ProcessingResult> {
    logger.info('Using chunking processing strategy', { documentId });

    const startTime = Date.now();

    try {
      const result = await documentProcessingService.processDocument(documentId, userId, options);

      // Estimate API calls for chunking (this is approximate)
      const estimatedApiCalls = this.estimateChunkingApiCalls(text);

      return {
        success: result.success,
        summary: result.summary || '',
        analysisData: (result.analysis as CIMReview) || {} as CIMReview,
        processingStrategy: 'chunking',
        processingTime: Date.now() - startTime,
        apiCalls: estimatedApiCalls,
        error: result.error || undefined
      };
    } catch (error) {
      return {
        success: false,
        summary: '',
        analysisData: {} as CIMReview,
        processingStrategy: 'chunking',
        processingTime: Date.now() - startTime,
        apiCalls: 0,
        error: error instanceof Error ? error.message : 'Unknown error'
      };
    }
  }

  /**
   * Compare all processing strategies
   */
  async compareProcessingStrategies(
    documentId: string,
    userId: string,
    text: string,
    options: any = {}
  ): Promise<ComparisonResult> {
    logger.info('Comparing processing strategies', { documentId });

    // Process with all strategies
    const [chunkingResult, ragResult, agenticRagResult] = await Promise.all([
      this.processWithChunking(documentId, userId, text, options),
      this.processWithRAG(documentId, text),
      this.processWithAgenticRAG(documentId, userId, text)
    ]);

    // Calculate performance metrics
    const timeDifference = chunkingResult.processingTime - ragResult.processingTime;
    const apiCallDifference = chunkingResult.apiCalls - ragResult.apiCalls;
    const qualityScore = this.calculateQualityScore(chunkingResult, ragResult);

    // Determine winner
    let winner: 'chunking' | 'rag' | 'agentic_rag' | 'tie' = 'tie';

    // Check which strategies were successful
    const successfulStrategies: Array<{ name: string; result: ProcessingResult }> = [];
    if (chunkingResult.success) successfulStrategies.push({ name: 'chunking', result: chunkingResult });
    if (ragResult.success) successfulStrategies.push({ name: 'rag', result: ragResult });
    if (agenticRagResult.success) successfulStrategies.push({ name: 'agentic_rag', result: agenticRagResult });

    if (successfulStrategies.length === 0) {
      winner = 'tie';
    } else if (successfulStrategies.length === 1) {
      winner = successfulStrategies[0]?.name as 'chunking' | 'rag' | 'agentic_rag' || 'tie';
    } else {
      // Multiple successful strategies, compare performance
      const scores = successfulStrategies.map(strategy => {
        const result = strategy.result;
        const quality = this.calculateQualityScore(result, result); // Self-comparison for baseline
        const timeScore = 1 / (1 + result.processingTime / 60000); // Normalize to 1 minute
        const apiScore = 1 / (1 + result.apiCalls / 10); // Normalize to 10 API calls
        return {
          name: strategy.name,
          score: quality * 0.5 + timeScore * 0.25 + apiScore * 0.25
        };
      });

      scores.sort((a, b) => b.score - a.score);
      winner = scores[0]?.name as 'chunking' | 'rag' | 'agentic_rag' || 'tie';
    }

    return {
      chunking: chunkingResult,
      rag: ragResult,
      agenticRag: agenticRagResult,
      winner,
      performanceMetrics: {
        timeDifference,
        apiCallDifference,
        qualityScore
      }
    };
  }

  /**
   * Estimate API calls for chunking approach
   */
  private estimateChunkingApiCalls(text: string): number {
    const chunkSize = config.llm.chunkSize;
    const estimatedTokens = Math.ceil(text.length / 4); // Rough token estimation
    const chunks = Math.ceil(estimatedTokens / chunkSize);
    return chunks + 1; // +1 for final synthesis
  }

  /**
   * Calculate quality score based on result completeness
   */
  private calculateQualityScore(chunkingResult: ProcessingResult, ragResult: ProcessingResult): number {
    if (!chunkingResult.success && !ragResult.success) return 0.5;
    if (!chunkingResult.success) return 1.0;
    if (!ragResult.success) return 0.0;

    // Compare summary length and structure
    const chunkingScore = this.analyzeSummaryQuality(chunkingResult.summary);
    const ragScore = this.analyzeSummaryQuality(ragResult.summary);

    return ragScore / (chunkingScore + ragScore);
  }

  /**
   * Analyze summary quality based on length and structure
   */
  private analyzeSummaryQuality(summary: string): number {
    if (!summary) return 0;

    // Check for markdown structure
    const hasHeaders = (summary.match(/#{1,6}\s/g) || []).length;
    const hasLists = (summary.match(/[-*+]\s/g) || []).length;
    const hasBold = (summary.match(/\*\*.*?\*\*/g) || []).length;

    // Length factor (longer summaries tend to be more comprehensive)
    const lengthFactor = Math.min(summary.length / 5000, 1);

    // Structure factor
    const structureFactor = Math.min((hasHeaders + hasLists + hasBold) / 10, 1);

    return (lengthFactor * 0.7) + (structureFactor * 0.3);
  }

  /**
   * Get processing statistics (simplified)
   */
  async getProcessingStats(): Promise<{
    totalDocuments: number;
    chunkingSuccess: number;
    ragSuccess: number;
    agenticRagSuccess: number;
    documentAiAgenticRagSuccess: number;
    averageProcessingTime: {
      chunking: number;
      rag: number;
      agenticRag: number;
      documentAiAgenticRag: number;
    };
    averageApiCalls: {
      chunking: number;
      rag: number;
      agenticRag: number;
      documentAiAgenticRag: number;
    };
  }> {
    // This would need to be implemented based on actual database queries
    // For now, return placeholder data
    return {
      totalDocuments: 0,
      chunkingSuccess: 0,
      ragSuccess: 0,
      agenticRagSuccess: 0,
      documentAiAgenticRagSuccess: 0,
      averageProcessingTime: {
        chunking: 0,
        rag: 0,
        agenticRag: 0,
        documentAiAgenticRag: 0
      },
      averageApiCalls: {
        chunking: 0,
        rag: 0,
        agenticRag: 0,
        documentAiAgenticRag: 0
      }
    };
  }

  /**
   * Switch processing strategy for a document
   */
  async switchStrategy(
    documentId: string,
    userId: string,
    text: string,
    newStrategy: 'chunking' | 'rag' | 'agentic_rag',
    options: any = {}
  ): Promise<ProcessingResult> {
    logger.info('Switching processing strategy', { documentId, newStrategy });

    return await this.processDocument(documentId, userId, text, {
      ...options,
      strategy: newStrategy
    });
  }
}

export const unifiedDocumentProcessor = new UnifiedDocumentProcessor();
@@ -1,500 +0,0 @@
import { vectorDatabaseService } from './vectorDatabaseService';
import { llmService } from './llmService';
import { logger } from '../utils/logger';
import { DocumentChunk } from '../models/VectorDatabaseModel';

export interface ChunkingOptions {
  chunkSize: number;
  chunkOverlap: number;
  maxChunks: number;
}

export interface VectorProcessingResult {
  totalChunks: number;
  chunksWithEmbeddings: number;
  processingTime: number;
  averageChunkSize: number;
}

export interface TextBlock {
  type: 'paragraph' | 'table' | 'heading' | 'list_item';
  content: string;
}

export class VectorDocumentProcessor {

  /**
   * Store enriched chunks with metadata from agenticRAGProcessor
   */
  async storeDocumentChunks(enrichedChunks: Array<{
    content: string;
    chunkIndex: number;
    startPosition: number;
    endPosition: number;
    sectionType?: string;
    metadata?: {
      hasFinancialData: boolean;
      hasMetrics: boolean;
      keyTerms: string[];
      importance: 'high' | 'medium' | 'low';
      conceptDensity: number;
    };
  }>, options?: {
    documentId: string;
    indexingStrategy?: string;
    similarity_threshold?: number;
    enable_hybrid_search?: boolean;
  }): Promise<void> {
    const startTime = Date.now();

    try {
      const documentChunks: DocumentChunk[] = [];

      for (const chunk of enrichedChunks) {
        // Generate embedding for the chunk
        const embedding = await vectorDatabaseService.generateEmbeddings(chunk.content);

        // Create DocumentChunk with enhanced metadata
        const documentChunk: DocumentChunk = {
          id: `${options?.documentId}-chunk-${chunk.chunkIndex}`,
          documentId: options?.documentId || '',
          content: chunk.content,
          embedding,
          chunkIndex: chunk.chunkIndex,
          metadata: {
            ...chunk.metadata,
            sectionType: chunk.sectionType,
            chunkSize: chunk.content.length,
            processingStrategy: options?.indexingStrategy || 'hierarchical',
            startPosition: chunk.startPosition,
            endPosition: chunk.endPosition
          },
          createdAt: new Date(),
          updatedAt: new Date()
        };

        documentChunks.push(documentChunk);
      }

      // Store all chunks in vector database
      await vectorDatabaseService.storeDocumentChunks(documentChunks);

      const processingTime = Date.now() - startTime;
      const averageImportance = this.calculateAverageImportance(enrichedChunks);

      logger.info(`Stored ${documentChunks.length} enriched chunks`, {
        documentId: options?.documentId,
        processingTime,
        averageImportance,
        indexingStrategy: options?.indexingStrategy
      });

    } catch (error) {
      logger.error('Failed to store enriched chunks', error);
      throw error;
    }
  }

  /**
   * Calculate average importance score for logging
   */
  private calculateAverageImportance(chunks: Array<{ metadata?: { importance: string } }>): string {
    const importanceScores = chunks
      .map(c => c.metadata?.importance)
      .filter(Boolean);

    if (importanceScores.length === 0) return 'unknown';

    const highCount = importanceScores.filter(i => i === 'high').length;
    const mediumCount = importanceScores.filter(i => i === 'medium').length;

    if (highCount > importanceScores.length / 2) return 'high';
    if (mediumCount + highCount > importanceScores.length / 2) return 'medium';
    return 'low';
  }

  /**
   * Identifies structured blocks of text from a raw string using heuristics.
   * This is the core of the improved ingestion pipeline.
   * @param text The raw text from a PDF extraction.
   */
  private identifyTextBlocks(text: string): TextBlock[] {
    const blocks: TextBlock[] = [];
    // Normalize line endings before splitting to regularize input
    const lines = text.replace(/\r\n/g, '\n').split('\n');

    let currentParagraph = '';

    for (let i = 0; i < lines.length; i++) {
      const line = lines[i];
      if (line === undefined) continue;
      const trimmedLine = line.trim();

      // If we encounter a blank line, the current paragraph (if any) has ended.
      if (trimmedLine === '') {
        if (currentParagraph.trim()) {
          blocks.push({ type: 'paragraph', content: currentParagraph.trim() });
          currentParagraph = '';
        }
        continue;
      }

      // Heuristic for tables: A line with at least 2 instances of multiple spaces is likely a table row.
      // This is a strong indicator of columnar data in plain text.
      const isTableLike = /(\s{2,}.*){2,}/.test(line);

      if (isTableLike) {
        if (currentParagraph.trim()) {
          blocks.push({ type: 'paragraph', content: currentParagraph.trim() });
          currentParagraph = '';
        }
        // Greedily consume subsequent lines that also look like part of the table.
        let tableContent = line;
        while (i + 1 < lines.length && /(\s{2,}.*){2,}/.test(lines[i + 1] || '')) {
          i++;
          tableContent += '\n' + lines[i];
        }
        blocks.push({ type: 'table', content: tableContent });
        continue;
      }

      // Heuristic for headings: A short line (under 80 chars) that doesn't end with a period.
      // Often in Title Case, but we won't strictly enforce that to be more flexible.
      const isHeadingLike = trimmedLine.length < 80 && !trimmedLine.endsWith('.');
      if (i + 1 < lines.length && (lines[i + 1] || '').trim() === '' && isHeadingLike) {
        if (currentParagraph.trim()) {
          blocks.push({ type: 'paragraph', content: currentParagraph.trim() });
          currentParagraph = '';
        }
        blocks.push({ type: 'heading', content: trimmedLine });
        i++; // Skip the blank line after the heading
        continue;
      }

      // Heuristic for list items: bullets (* or -) or numbered items (e.g. "1.")
      if (trimmedLine.match(/^(\*|-|\d+\.)\s/)) {
        if (currentParagraph.trim()) {
          blocks.push({ type: 'paragraph', content: currentParagraph.trim() });
          currentParagraph = '';
        }
        blocks.push({ type: 'list_item', content: trimmedLine });
        continue;
      }

      // Otherwise, append the line to the current paragraph.
      currentParagraph += (currentParagraph ? ' ' : '') + trimmedLine;
    }

    // Add the last remaining paragraph if it exists.
    if (currentParagraph.trim()) {
      blocks.push({ type: 'paragraph', content: currentParagraph.trim() });
    }

    logger.info(`Identified ${blocks.length} semantic blocks from text.`);
    return blocks;
  }

  /**
   * Generates a text summary for a table to be used for embedding.
   * @param tableText The raw text of the table.
   */
  private async getSummaryForTable(tableText: string): Promise<string> {
    const prompt = `The following text is an OCR'd table from a financial document. It may be messy.\n    Summarize the key information in this table in a few clear, narrative sentences.\n    Focus on the main metrics, trends, and time periods.\n    Do not return a markdown table. Return only a natural language summary.\n\n    Table Text:\n    ---\n    ${tableText}\n    ---\n    Summary:`;

    try {
      const result = await llmService.processCIMDocument(prompt, '', { agentName: 'table_summarizer' });
      // Handle both string and object responses from the LLM
      if (result.success) {
        if (typeof result.jsonOutput === 'string') {
          return result.jsonOutput;
        }
        if (typeof result.jsonOutput === 'object' && (result.jsonOutput as any)?.summary) {
          return (result.jsonOutput as any).summary;
        }
      }
      logger.warn('Table summarization failed or returned invalid format, falling back to raw text.', { tableText });
      return tableText; // Fallback
    } catch (error) {
      logger.error('Error during table summarization', { error });
      return tableText; // Fallback
    }
  }

  /**
   * Process document text into chunks and generate embeddings using the new heuristic-based strategy.
   */
  async processDocumentForVectorSearch(
    documentId: string,
    text: string,
    metadata: Record<string, any> = {}
  ): Promise<VectorProcessingResult> {
    const startTime = Date.now();

    try {
      logger.info(`Starting HEURISTIC vector processing for document: ${documentId}`);

      // Step 1: Identify semantic blocks from the document text
      const blocks = this.identifyTextBlocks(text);

      // Step 2: Generate embeddings for each block, with differential processing
      const chunksWithEmbeddings = await this.generateEmbeddingsForBlocks(
        documentId,
        blocks,
        metadata
      );

      // Step 3: Store chunks in vector database
      await vectorDatabaseService.storeDocumentChunks(chunksWithEmbeddings);

      const processingTime = Date.now() - startTime;
      const averageChunkSize = chunksWithEmbeddings.length > 0
        ? chunksWithEmbeddings.reduce((sum: number, chunk: any) => sum + chunk.content.length, 0) / chunksWithEmbeddings.length
        : 0;

      logger.info(`Heuristic vector processing completed for document: ${documentId}`, {
        totalChunks: blocks.length,
        chunksWithEmbeddings: chunksWithEmbeddings.length,
        processingTime,
        averageChunkSize: Math.round(averageChunkSize)
      });

      return {
        totalChunks: blocks.length,
        chunksWithEmbeddings: chunksWithEmbeddings.length,
        processingTime,
        averageChunkSize: Math.round(averageChunkSize)
      };
    } catch (error) {
      logger.error(`Heuristic vector processing failed for document: ${documentId}`, error);
      throw error;
    }
  }

  /**
   * Generates embeddings for the identified text blocks, applying special logic for tables.
   */
  private async generateEmbeddingsForBlocks(
    documentId: string,
    blocks: TextBlock[],
    metadata: Record<string, any>
  ): Promise<DocumentChunk[]> {
    const chunksWithEmbeddings: DocumentChunk[] = [];

    for (let i = 0; i < blocks.length; i++) {
      const block = blocks[i];
      if (!block || !block.content) continue;

      let contentToEmbed = block.content;
      const blockMetadata: any = {
        ...metadata,
        block_type: block.type,
        chunkIndex: i,
        totalChunks: blocks.length,
        chunkSize: block.content.length,
      };

      try {
        // Differential processing for tables
        if (block.type === 'table') {
          logger.info(`Summarizing table chunk ${i}...`);
          contentToEmbed = await this.getSummaryForTable(block.content);
          // Store the original table text in the metadata for later retrieval
          blockMetadata.original_table = block.content;
        }

        const embedding = await vectorDatabaseService.generateEmbeddings(contentToEmbed);

        const documentChunk: DocumentChunk = {
          id: `${documentId}-chunk-${i}`,
          documentId,
          content: contentToEmbed, // This is the summary for tables, or the raw text for others
          metadata: blockMetadata,
          embedding,
          chunkIndex: i,
          createdAt: new Date(),
          updatedAt: new Date()
        };

        chunksWithEmbeddings.push(documentChunk);

        if (blocks.length > 10 && (i + 1) % 10 === 0) {
          logger.info(`Generated embeddings for ${i + 1}/${blocks.length} blocks`);
        }
      } catch (error) {
        logger.error(`Failed to generate embedding for block ${i}`, { error, blockType: block.type });
        // Continue with other chunks, do not halt the entire process
      }
    }

    return chunksWithEmbeddings;
  }

/**
|
||||
* Enhanced search with intelligent filtering and ranking
|
||||
*/
|
||||
async searchRelevantContent(
|
||||
query: string,
|
||||
options: {
|
||||
documentId?: string;
|
||||
limit?: number;
|
||||
similarityThreshold?: number;
|
||||
filters?: Record<string, any>;
|
||||
prioritizeFinancial?: boolean;
|
||||
boostImportance?: boolean;
|
||||
enableReranking?: boolean;
|
||||
} = {}
|
||||
) {
|
||||
try {
|
||||
// Enhanced search parameters
|
||||
const searchOptions = {
|
||||
...options,
|
||||
limit: Math.min(options.limit || 5, 20), // Cap at 20 for performance
|
||||
similarityThreshold: options.similarityThreshold || 0.7, // Higher threshold for quality
|
||||
};
|
||||
|
||||
// Add metadata filters for better relevance
|
||||
if (options.prioritizeFinancial) {
|
||||
searchOptions.filters = {
|
||||
...searchOptions.filters,
|
||||
'metadata.hasFinancialData': true
|
||||
};
|
||||
}
|
||||
|
||||
const rawResults = await vectorDatabaseService.search(query, searchOptions);
|
||||
|
||||
// Post-process results for enhanced ranking
|
||||
const enhancedResults = this.rankSearchResults(rawResults, query, options);
|
||||
|
||||
// Apply reranking if enabled
|
||||
let finalResults = enhancedResults;
|
||||
if (options.enableReranking !== false) {
|
||||
finalResults = await this.rerankResults(query, enhancedResults, options.limit || 5);
|
||||
}
|
||||
|
||||
logger.info(`Enhanced vector search completed`, {
|
||||
query: query.substring(0, 100) + (query.length > 100 ? '...' : ''),
|
||||
rawResultsCount: rawResults.length,
|
||||
enhancedResultsCount: enhancedResults.length,
|
||||
finalResultsCount: finalResults.length,
|
||||
documentId: options.documentId,
|
||||
prioritizeFinancial: options.prioritizeFinancial,
|
||||
enableReranking: options.enableReranking !== false,
|
||||
avgRelevanceScore: finalResults.length > 0 ?
|
||||
Math.round((finalResults.reduce((sum: number, r: any) => sum + (r.similarity || 0), 0) / finalResults.length) * 100) / 100 : 0
|
||||
});
|
||||
|
||||
return finalResults;
|
||||
} catch (error) {
|
||||
logger.error('Enhanced vector search failed', { query, options, error });
|
||||
throw error;
|
||||
}
|
||||
}
|
||||
|
||||
/**
|
||||
* Rank search results based on multiple criteria
|
||||
*/
|
||||
private rankSearchResults(results: any[], query: string, options: any): any[] {
|
||||
return results
|
||||
.map(result => ({
|
||||
...result,
|
||||
enhancedScore: this.calculateEnhancedScore(result, query, options)
|
||||
}))
|
||||
.sort((a, b) => b.enhancedScore - a.enhancedScore)
|
||||
.slice(0, options.limit || 5);
|
||||
}
|
||||
|
||||
/**
|
||||
* Calculate enhanced relevance score
|
||||
*/
|
||||
private calculateEnhancedScore(result: any, query: string, options: any): number {
|
||||
let score = result.similarity || 0;
|
||||
|
||||
// Boost based on importance
|
||||
if (options.boostImportance && result.metadata?.importance) {
|
||||
if (result.metadata.importance === 'high') score += 0.2;
|
||||
else if (result.metadata.importance === 'medium') score += 0.1;
|
||||
}
|
||||
|
||||
    // Boost based on concept density
    if (result.metadata?.conceptDensity) {
      score += result.metadata.conceptDensity * 0.1;
    }

    // Boost financial content if query suggests financial context
    if (/financial|revenue|profit|ebitda|margin|cost|cash|debt/i.test(query)) {
      if (result.metadata?.hasFinancialData) score += 0.15;
      if (result.metadata?.hasMetrics) score += 0.1;
    }

    // Boost based on section type relevance
    if (result.metadata?.sectionType) {
      const sectionBoosts: Record<string, number> = {
        'executive_summary': 0.1,
        'financial': 0.15,
        'market_analysis': 0.1,
        'management': 0.05
      };
      score += sectionBoosts[result.metadata.sectionType] || 0;
    }

    // Boost if query terms appear in key terms
    if (result.metadata?.keyTerms) {
      const queryWords = query.toLowerCase().split(/\s+/);
      const keyTermMatches = result.metadata.keyTerms.filter((term: string) =>
        queryWords.some(word => term.toLowerCase().includes(word))
      ).length;
      score += keyTermMatches * 0.05;
    }

    return Math.min(score, 1.0); // Cap at 1.0
  }

  /**
   * Rerank results using cross-encoder approach
   */
  private async rerankResults(query: string, candidates: any[], topK: number = 5): Promise<any[]> {
    try {
      // Create reranking prompt
      const rerankingPrompt = `Given the query: "${query}"

Please rank the following document chunks by relevance (1 = most relevant, ${candidates.length} = least relevant). Consider:
- Semantic similarity to the query
- Financial/business relevance
- Information completeness
- Factual accuracy

Document chunks:
${candidates.map((c, i) => `${i + 1}. ${c.content.substring(0, 200)}...`).join('\n')}

Return only a JSON array of indices in order of relevance: [1, 3, 2, ...]`;

      const result = await llmService.processCIMDocument(rerankingPrompt, '', {
        agentName: 'reranker',
        maxTokens: 1000
      });

      if (result.success && typeof result.jsonOutput === 'object') {
        const ranking = Array.isArray(result.jsonOutput) ? result.jsonOutput as number[] : null;
        if (ranking) {
          // Apply the ranking
          const reranked = ranking
            .map(index => candidates[index - 1]) // Convert 1-based to 0-based
            .filter(Boolean) // Remove any undefined entries
            .slice(0, topK);

          logger.info(`Reranked ${candidates.length} candidates to ${reranked.length} results`);
          return reranked;
        }
      }

      // Fallback to original ranking if reranking fails
      logger.warn('Reranking failed, using original ranking');
      return candidates.slice(0, topK);
    } catch (error) {
      logger.error('Reranking failed', error);
      return candidates.slice(0, topK);
    }
  }

  // ... other methods like findSimilarDocuments, etc. remain unchanged ...
}

export const vectorDocumentProcessor = new VectorDocumentProcessor();
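The boost-scoring and rerank-application steps above are pure list and arithmetic manipulation, so they can be exercised without the LLM or logger. A minimal standalone sketch (the `metadata` shape and boost weights mirror the snippet; `boostScore` and `applyRanking` are illustrative names, not the class's actual API):

```typescript
// Standalone sketch of the metadata-boost scoring and 1-based rerank
// application shown above. Pure functions, no service dependencies.
interface ChunkMetadata {
  conceptDensity?: number;
  hasFinancialData?: boolean;
  hasMetrics?: boolean;
  sectionType?: string;
  keyTerms?: string[];
}

function boostScore(base: number, query: string, metadata: ChunkMetadata): number {
  let score = base;
  if (metadata.conceptDensity) score += metadata.conceptDensity * 0.1;
  if (/financial|revenue|profit|ebitda|margin|cost|cash|debt/i.test(query)) {
    if (metadata.hasFinancialData) score += 0.15;
    if (metadata.hasMetrics) score += 0.1;
  }
  const sectionBoosts: Record<string, number> = {
    executive_summary: 0.1,
    financial: 0.15,
    market_analysis: 0.1,
    management: 0.05,
  };
  if (metadata.sectionType) score += sectionBoosts[metadata.sectionType] || 0;
  if (metadata.keyTerms) {
    const queryWords = query.toLowerCase().split(/\s+/);
    score += metadata.keyTerms.filter((term) =>
      queryWords.some((word) => term.toLowerCase().includes(word))
    ).length * 0.05;
  }
  return Math.min(score, 1.0); // cap at 1.0, as in the class method
}

// Apply an LLM-returned 1-based ranking, dropping out-of-range indices.
function applyRanking<T>(candidates: T[], ranking: number[], topK: number): T[] {
  return ranking
    .map((index) => candidates[index - 1]) // convert 1-based to 0-based
    .filter((c): c is T => c !== undefined)
    .slice(0, topK);
}

const ranked = applyRanking(['a', 'b', 'c', 'd'], [3, 1, 4, 2], 2);
// ranked === ['c', 'a']
```

Filtering `undefined` before slicing is what makes the fallback safe when the LLM returns indices outside `1..candidates.length`.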
@@ -4,7 +4,6 @@ import { fileStorageService } from '../../services/fileStorageService';
import { documentController } from '../../controllers/documentController';
import { unifiedDocumentProcessor } from '../../services/unifiedDocumentProcessor';
import { uploadMonitoringService } from '../../services/uploadMonitoringService';
import { handleFileUpload } from '../../middleware/upload';
import { verifyFirebaseToken } from '../../middleware/firebaseAuth';
import { addCorrelationId } from '../../middleware/validation';

@@ -13,10 +12,11 @@ jest.mock('../../services/fileStorageService');
jest.mock('../../services/unifiedDocumentProcessor');
jest.mock('../../services/uploadMonitoringService');
jest.mock('../../middleware/firebaseAuth');
jest.mock('../../middleware/upload');

// Mock Firebase Admin
jest.mock('firebase-admin', () => ({
  apps: [],
  initializeApp: jest.fn(),
  auth: () => ({
    verifyIdToken: jest.fn().mockResolvedValue({
      uid: 'test-user-id',
@@ -36,18 +36,9 @@ jest.mock('../../models/DocumentModel', () => ({
  },
}));

describe('Upload Pipeline Integration Tests', () => {
describe('Firebase Storage Direct Upload Pipeline Tests', () => {
  let app: express.Application;

  const mockFile = {
    originalname: 'test-document.pdf',
    filename: '1234567890-abc123.pdf',
    path: '/tmp/1234567890-abc123.pdf',
    size: 1024,
    mimetype: 'application/pdf',
    buffer: Buffer.from('test file content'),
  };

  const mockUser = {
    uid: 'test-user-id',
    email: 'test@example.com',
@@ -62,24 +53,40 @@ describe('Upload Pipeline Integration Tests', () => {
      next();
    });

    (handleFileUpload as jest.Mock).mockImplementation((req: any, res: any, next: any) => {
      req.file = mockFile;
      next();
    // Mock file storage service for new upload flow
    (fileStorageService.generateSignedUploadUrl as jest.Mock).mockResolvedValue(
      'https://storage.googleapis.com/test-bucket/uploads/test-user-id/1234567890-test-document.pdf?signature=...'
    );

    (fileStorageService.getFile as jest.Mock).mockResolvedValue(Buffer.from('test file content'));

    // Mock document model
    const { DocumentModel } = require('../../models/DocumentModel');
    DocumentModel.create.mockResolvedValue({
      id: '123e4567-e89b-12d3-a456-426614174000',
      user_id: mockUser.uid,
      original_file_name: 'test-document.pdf',
      file_path: 'uploads/test-user-id/1234567890-test-document.pdf',
      file_size: 1024,
      status: 'uploading',
      created_at: new Date(),
      updated_at: new Date()
    });

    (fileStorageService.storeFile as jest.Mock).mockResolvedValue({
      success: true,
      fileInfo: {
        originalName: 'test-document.pdf',
        filename: '1234567890-abc123.pdf',
        path: 'uploads/test-user-id/1234567890-abc123.pdf',
        size: 1024,
        mimetype: 'application/pdf',
        uploadedAt: new Date(),
        gcsPath: 'uploads/test-user-id/1234567890-abc123.pdf',
      },
    DocumentModel.findById.mockResolvedValue({
      id: '123e4567-e89b-12d3-a456-426614174000',
      user_id: mockUser.uid,
      original_file_name: 'test-document.pdf',
      file_path: 'uploads/test-user-id/1234567890-test-document.pdf',
      file_size: 1024,
      status: 'uploading',
      created_at: new Date(),
      updated_at: new Date()
    });

    DocumentModel.updateById.mockResolvedValue(true);

    // Mock unified document processor
    (unifiedDocumentProcessor.processDocument as jest.Mock).mockResolvedValue({
      success: true,
      documentId: '123e4567-e89b-12d3-a456-426614174000',
@@ -93,166 +100,115 @@ describe('Upload Pipeline Integration Tests', () => {
    app.use(express.json());
    app.use(verifyFirebaseToken);
    app.use(addCorrelationId);
    app.post('/upload', handleFileUpload, documentController.uploadDocument);

    // Add routes for testing
    app.post('/upload-url', documentController.getUploadUrl);
    app.post('/:id/confirm-upload', documentController.confirmUpload);
  });

  describe('Complete Upload Pipeline', () => {
    it('should successfully process a complete file upload', async () => {
  describe('Upload URL Generation', () => {
    it('should successfully get upload URL', async () => {
      const response = await request(app)
        .post('/upload')
        .attach('file', Buffer.from('test content'), 'test-document.pdf')
        .post('/upload-url')
        .send({
          fileName: 'test-document.pdf',
          fileSize: 1024,
          contentType: 'application/pdf'
        })
        .expect(200);

      expect(response.body.documentId).toBeDefined();
      expect(response.body.uploadUrl).toBeDefined();
      expect(response.body.filePath).toBeDefined();
    });

    it('should reject non-PDF files', async () => {
      const response = await request(app)
        .post('/upload-url')
        .send({
          fileName: 'test-document.txt',
          fileSize: 1024,
          contentType: 'text/plain'
        })
        .expect(400);

      expect(response.body.error).toBe('Only PDF files are supported');
    });

    it('should reject files larger than 50MB', async () => {
      const response = await request(app)
        .post('/upload-url')
        .send({
          fileName: 'large-document.pdf',
          fileSize: 60 * 1024 * 1024, // 60MB
          contentType: 'application/pdf'
        })
        .expect(400);

      expect(response.body.error).toBe('File size exceeds 50MB limit');
    });

    it('should handle missing required fields', async () => {
      const response = await request(app)
        .post('/upload-url')
        .send({
          fileName: 'test-document.pdf'
          // Missing fileSize and contentType
        })
        .expect(400);

      expect(response.body.error).toBe('Missing required fields: fileName, fileSize, contentType');
    });
  });

  describe('Upload Confirmation', () => {
    it('should successfully confirm upload and trigger processing', async () => {
      // First create a document record
      const { DocumentModel } = require('../../models/DocumentModel');
      const document = await DocumentModel.create({
        user_id: mockUser.uid,
        original_file_name: 'test-document.pdf',
        file_path: 'uploads/test-user-id/1234567890-test-document.pdf',
        file_size: 1024,
        status: 'uploading'
      });

      const response = await request(app)
        .post(`/${document.id}/confirm-upload`)
        .expect(200);

      expect(response.body.success).toBe(true);
      expect(response.body.documentId).toBeDefined();
      expect(response.body.documentId).toBe(document.id);
      expect(response.body.status).toBe('processing');

      // Verify file storage was called
      expect(fileStorageService.storeFile).toHaveBeenCalledWith(mockFile, mockUser.uid);

      // Verify document processing was called
      expect(unifiedDocumentProcessor.processDocument).toHaveBeenCalledWith(
        expect.objectContaining({
          userId: mockUser.uid,
          fileInfo: expect.objectContaining({
            originalName: 'test-document.pdf',
            gcsPath: 'uploads/test-user-id/1234567890-abc123.pdf',
          }),
        })
      );

      // Verify monitoring was called
      expect(uploadMonitoringService.trackUploadEvent).toHaveBeenCalled();
    });

    it('should handle file storage failures gracefully', async () => {
      (fileStorageService.storeFile as jest.Mock).mockResolvedValue({
        success: false,
        error: 'GCS upload failed',
      });
    it('should handle confirm upload for non-existent document', async () => {
      const fakeId = '12345678-1234-1234-1234-123456789012';

      const response = await request(app)
        .post('/upload')
        .attach('file', Buffer.from('test content'), 'test-document.pdf')
        .expect(500);
        .post(`/${fakeId}/confirm-upload`)
        .expect(404);

      expect(response.body.error).toContain('Failed to store file');
      expect(unifiedDocumentProcessor.processDocument).not.toHaveBeenCalled();
    });

    it('should handle document processing failures gracefully', async () => {
      (unifiedDocumentProcessor.processDocument as jest.Mock).mockResolvedValue({
        success: false,
        error: 'Processing failed',
      });

      const response = await request(app)
        .post('/upload')
        .attach('file', Buffer.from('test content'), 'test-document.pdf')
        .expect(500);

      expect(response.body.error).toContain('Failed to process document');
    });

    it('should handle large file uploads', async () => {
      const largeFile = {
        ...mockFile,
        size: 50 * 1024 * 1024, // 50MB
      };

      (handleFileUpload as jest.Mock).mockImplementation((req: any, res: any, next: any) => {
        req.file = largeFile;
        next();
      });

      await request(app)
        .post('/upload')
        .attach('file', Buffer.alloc(50 * 1024 * 1024), 'large-document.pdf')
        .expect(200);

      expect(fileStorageService.storeFile).toHaveBeenCalledWith(largeFile, mockUser.uid);
    });

    it('should handle unsupported file types', async () => {
      const unsupportedFile = {
        ...mockFile,
        mimetype: 'application/exe',
        originalname: 'malicious.exe',
      };

      (handleFileUpload as jest.Mock).mockImplementation((req: any, res: any, next: any) => {
        req.file = unsupportedFile;
        next();
      });

      const response = await request(app)
        .post('/upload')
        .attach('file', Buffer.from('test content'), 'malicious.exe')
        .expect(400);

      expect(response.body.error).toContain('Unsupported file type');
    });

    it('should track upload progress correctly', async () => {
      const response = await request(app)
        .post('/upload')
        .attach('file', Buffer.from('test content'), 'test-document.pdf')
        .expect(200);

      // Verify monitoring events were tracked
      expect(uploadMonitoringService.trackUploadEvent).toHaveBeenCalledWith(
        expect.objectContaining({
          userId: mockUser.uid,
          fileInfo: expect.objectContaining({
            originalName: 'test-document.pdf',
            size: 1024,
          }),
          status: 'success',
          stage: 'file_storage',
        })
      );
      expect(response.body.error).toBe('Document not found');
    });
  });

  describe('Error Scenarios and Recovery', () => {
    it('should handle GCS connection failures', async () => {
      (fileStorageService.storeFile as jest.Mock).mockRejectedValue(
  describe('Error Handling', () => {
    it('should handle GCS connection failures during URL generation', async () => {
      (fileStorageService.generateSignedUploadUrl as jest.Mock).mockRejectedValue(
        new Error('GCS connection timeout')
      );

      const response = await request(app)
        .post('/upload')
        .attach('file', Buffer.from('test content'), 'test-document.pdf')
        .post('/upload-url')
        .send({
          fileName: 'test-document.pdf',
          fileSize: 1024,
          contentType: 'application/pdf'
        })
        .expect(500);

      expect(response.body.error).toContain('Internal server error');
    });

    it('should handle partial upload failures', async () => {
      // Mock storage success but processing failure
      (fileStorageService.storeFile as jest.Mock).mockResolvedValue({
        success: true,
        fileInfo: {
          originalName: 'test-document.pdf',
          filename: '1234567890-abc123.pdf',
          path: 'uploads/test-user-id/1234567890-abc123.pdf',
          size: 1024,
          mimetype: 'application/pdf',
          uploadedAt: new Date(),
          gcsPath: 'uploads/test-user-id/1234567890-abc123.pdf',
        },
      });

      (unifiedDocumentProcessor.processDocument as jest.Mock).mockRejectedValue(
        new Error('Processing service unavailable')
      );

      const response = await request(app)
        .post('/upload')
        .attach('file', Buffer.from('test content'), 'test-document.pdf')
        .expect(500);

      expect(response.body.error).toContain('Failed to process document');
      expect(response.body.error).toBe('Failed to generate upload URL');
    });

    it('should handle authentication failures', async () => {
@@ -261,106 +217,44 @@ describe('Upload Pipeline Integration Tests', () => {
      });

      const response = await request(app)
        .post('/upload')
        .attach('file', Buffer.from('test content'), 'test-document.pdf')
        .post('/upload-url')
        .send({
          fileName: 'test-document.pdf',
          fileSize: 1024,
          contentType: 'application/pdf'
        })
        .expect(401);

      expect(response.body.error).toBe('Invalid token');
    });

    it('should handle missing file uploads', async () => {
      (handleFileUpload as jest.Mock).mockImplementation((req: any, res: any, next: any) => {
        req.file = undefined;
        next();
      });

      const response = await request(app)
        .post('/upload')
        .expect(400);

      expect(response.body.error).toContain('No file uploaded');
    });
  });

  describe('Performance and Scalability', () => {
    it('should handle concurrent uploads', async () => {
    it('should handle concurrent upload URL requests', async () => {
      const concurrentRequests = 5;
      const promises = [];
      const promises: any[] = [];

      for (let i = 0; i < concurrentRequests; i++) {
        promises.push(
          request(app)
            .post('/upload')
            .attach('file', Buffer.from(`test content ${i}`), `test-document-${i}.pdf`)
            .post('/upload-url')
            .send({
              fileName: `test-document-${i}.pdf`,
              fileSize: 1024,
              contentType: 'application/pdf'
            })
        );
      }

      const responses = await Promise.all(promises);

      responses.forEach(response => {
      responses.forEach((response: any) => {
        expect(response.status).toBe(200);
        expect(response.body.success).toBe(true);
        expect(response.body.documentId).toBeDefined();
        expect(response.body.uploadUrl).toBeDefined();
      });

      expect(fileStorageService.storeFile).toHaveBeenCalledTimes(concurrentRequests);
    });

    it('should handle upload timeout scenarios', async () => {
      (fileStorageService.storeFile as jest.Mock).mockImplementation(
        () => new Promise(resolve => setTimeout(() => resolve({
          success: true,
          fileInfo: {
            originalName: 'test-document.pdf',
            filename: '1234567890-abc123.pdf',
            path: 'uploads/test-user-id/1234567890-abc123.pdf',
            size: 1024,
            mimetype: 'application/pdf',
            uploadedAt: new Date(),
            gcsPath: 'uploads/test-user-id/1234567890-abc123.pdf',
          },
        }), 30000)) // 30 second delay
      );

      await request(app)
        .post('/upload')
        .attach('file', Buffer.from('test content'), 'test-document.pdf')
        .timeout(35000) // 35 second timeout
        .expect(200);
    });
  });

  describe('Data Integrity and Validation', () => {
    it('should validate file metadata correctly', async () => {
      const response = await request(app)
        .post('/upload')
        .attach('file', Buffer.from('test content'), 'test-document.pdf')
        .expect(200);

      // Verify file metadata is preserved
      expect(fileStorageService.storeFile).toHaveBeenCalledWith(
        expect.objectContaining({
          originalname: 'test-document.pdf',
          size: 1024,
          mimetype: 'application/pdf',
        }),
        mockUser.uid
      );
    });

    it('should generate unique file paths for each upload', async () => {
      const uploads = [];
      for (let i = 0; i < 3; i++) {
        uploads.push(
          request(app)
            .post('/upload')
            .attach('file', Buffer.from(`test content ${i}`), `test-document-${i}.pdf`)
        );
      }

      const responses = await Promise.all(uploads);

      // Verify each upload was called
      expect(fileStorageService.storeFile).toHaveBeenCalledTimes(3);
      expect(fileStorageService.generateSignedUploadUrl).toHaveBeenCalledTimes(concurrentRequests);
    });
  });
});
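The URL-generation tests above pin down three validation rules: PDF-only content types, a 50MB size ceiling, and three required fields. A plausible request-validation helper consistent with the exact error strings the tests assert (a sketch only; `validateUploadUrlRequest` is an illustrative name, not the actual controller code):

```typescript
// Hypothetical validation helper matching the error messages asserted by
// the POST /upload-url tests. Returns null when the request is valid,
// otherwise the error string the API would send with a 400 status.
interface UploadUrlRequest {
  fileName?: string;
  fileSize?: number;
  contentType?: string;
}

const MAX_UPLOAD_BYTES = 50 * 1024 * 1024; // 50MB limit from the tests

function validateUploadUrlRequest(body: UploadUrlRequest): string | null {
  if (!body.fileName || body.fileSize === undefined || !body.contentType) {
    return 'Missing required fields: fileName, fileSize, contentType';
  }
  if (body.contentType !== 'application/pdf') {
    return 'Only PDF files are supported';
  }
  if (body.fileSize > MAX_UPLOAD_BYTES) {
    return 'File size exceeds 50MB limit';
  }
  return null;
}
```

Keeping the validation pure like this lets the rules be unit-tested without spinning up the Express app or the supertest harness.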
backend/src/types/express.d.ts (vendored)
@@ -3,8 +3,6 @@ import { Request } from 'express';
declare global {
  namespace Express {
    interface Request {
      file?: Express.Multer.File;
      files?: Express.Multer.File[];
      correlationId?: string;
    }
  }
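The `Request` augmentation above is what lets handlers read `req.correlationId` without casts. A minimal sketch of a middleware compatible with that typing (the real `addCorrelationId` lives in `src/middleware/validation`; this stand-in, including the `x-correlation-id` header name, is an assumption for illustration):

```typescript
// Illustrative correlation-id middleware compatible with the Request
// augmentation above. Typed structurally so it runs without Express.
import { randomUUID } from 'crypto';

type Next = () => void;

function addCorrelationId(
  req: { headers: Record<string, string | undefined>; correlationId?: string },
  _res: unknown,
  next: Next
): void {
  // Reuse an inbound header if present so the ID survives service hops;
  // otherwise mint a fresh UUID for this request.
  req.correlationId = req.headers['x-correlation-id'] ?? randomUUID();
  next();
}
```

Attaching the ID before any route handler runs means every log line and error response for the request can carry the same identifier.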
@@ -294,7 +294,7 @@
      "processedAt": "2025-08-01T01:36:18.949+00:00",
      "uploadedBy": "UthFrGPrQLY6bzNL46aIOHck4yi1",
      "fileSize": 5768711,
      "summary": "# CIM Analysis: 2025-04-23 Stax Holding Company, LLC Confidential Information Presentation for Stax Holding Company, LLC - April 2025.pdf\n\n## Executive Summary\nSample analysis generated by Document AI + Genkit integration.\n\n## Key Findings\n- Document processed successfully\n- AI analysis completed\n- Integration working as expected\n\n---\n*Generated by Document AI + Genkit integration*",
      "summary": "# CIM Analysis: 2025-04-23 Stax Holding Company, LLC Confidential Information Presentation for Stax Holding Company, LLC - April 2025.pdf\n\n## Executive Summary\nSample analysis generated by Document AI + Agentic RAG integration.\n\n## Key Findings\n- Document processed successfully\n- AI analysis completed\n- Integration working as expected\n\n---\n*Generated by Document AI + Agentic RAG integration*",
      "error": null
    },
    {
@@ -390,91 +390,7 @@ class DocumentService {
    });
  }

  /**
   * Legacy multipart upload method (kept for compatibility)
   */
  async uploadDocumentLegacy(
    file: File,
    onProgress?: (progress: number) => void,
    signal?: AbortSignal
  ): Promise<Document> {
    try {
      // Check authentication before upload
      const token = await authService.getToken();
      if (!token) {
        throw new Error('Authentication required. Please log in to upload documents.');
      }

      console.log('📤 Starting legacy multipart upload...');
      console.log('📤 File:', file.name, 'Size:', file.size, 'Type:', file.type);
      console.log('📤 Token available:', !!token);

      const formData = new FormData();
      formData.append('document', file);

      // Always use optimized agentic RAG processing - no strategy selection needed
      formData.append('processingStrategy', 'optimized_agentic_rag');

      const response = await apiClient.post('/documents/upload', formData, {
        headers: {
          'Content-Type': 'multipart/form-data',
        },
        signal, // Add abort signal support
        onUploadProgress: (progressEvent) => {
          if (onProgress && progressEvent.total) {
            const progress = Math.round((progressEvent.loaded * 100) / progressEvent.total);
            onProgress(progress);
          }
        },
      });

      console.log('✅ Legacy document upload successful:', response.data);
      return response.data;
    } catch (error: any) {
      console.error('❌ Legacy document upload failed:', error);

      // Provide more specific error messages
      if (error.response?.status === 401) {
        if (error.response?.data?.error === 'No valid authorization header') {
          throw new Error('Authentication required. Please log in to upload documents.');
        } else if (error.response?.data?.error === 'Token expired') {
          throw new Error('Your session has expired. Please log in again.');
        } else if (error.response?.data?.error === 'Invalid token') {
          throw new Error('Authentication failed. Please log in again.');
        } else {
          throw new Error('Authentication error. Please log in again.');
        }
      } else if (error.response?.status === 400) {
        if (error.response?.data?.error === 'No file uploaded') {
          throw new Error('No file was selected for upload.');
        } else if (error.response?.data?.error === 'File too large') {
          throw new Error('File is too large. Please select a smaller file.');
        } else if (error.response?.data?.error === 'File type not allowed') {
          throw new Error('File type not supported. Please upload a PDF or text file.');
        } else {
          throw new Error(`Upload failed: ${error.response?.data?.error || 'Bad request'}`);
        }
      } else if (error.response?.status === 413) {
        throw new Error('File is too large. Please select a smaller file.');
      } else if (error.response?.status >= 500) {
        throw new Error('Server error. Please try again later.');
      } else if (error.code === 'ERR_NETWORK') {
        throw new Error('Network error. Please check your connection and try again.');
      } else if (error.name === 'AbortError') {
        throw new Error('Upload was cancelled.');
      }

      // Handle GCS-specific errors
      if (error.response?.data?.type === 'storage_error' ||
          error.message?.includes('GCS') ||
          error.message?.includes('storage.googleapis.com')) {
        throw GCSErrorHandler.createGCSError(error, 'upload');
      }

      // Generic error fallback
      throw new Error(error.response?.data?.error || error.message || 'Upload failed');
    }
  }

  /**
   * Get all documents for the current user