Pre-cleanup commit: Current state before service layer consolidation
This commit is contained in:
533
APP_DESIGN_DOCUMENTATION.md
Normal file
@@ -0,0 +1,533 @@
# CIM Document Processor - Application Design Documentation

## Overview

The CIM Document Processor is a web application that processes Confidential Information Memorandums (CIMs) using AI to extract key business information and generate structured analysis reports. The system uses Google Document AI for text extraction and an optimized Agentic RAG (Retrieval-Augmented Generation) approach for intelligent document analysis.

## Architecture Overview

```
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│    Frontend     │    │     Backend     │    │    External     │
│    (React)      │◄──►│    (Node.js)    │◄──►│    Services     │
└─────────────────┘    └─────────────────┘    └─────────────────┘
                               │                       │
                               ▼                       ▼
                      ┌─────────────────┐    ┌─────────────────┐
                      │    Database     │    │  Google Cloud   │
                      │   (Supabase)    │    │    Services     │
                      └─────────────────┘    └─────────────────┘
```
## Core Components

### 1. Frontend (React + TypeScript)

**Location**: `frontend/src/`

**Key Components**:
- **App.tsx**: Main application with tabbed interface
- **DocumentUpload**: File upload with Firebase Storage integration
- **DocumentList**: Display and manage uploaded documents
- **DocumentViewer**: View processed documents and analysis
- **Analytics**: Dashboard for processing statistics
- **UploadMonitoringDashboard**: Real-time upload monitoring

**Authentication**: Firebase Authentication with protected routes

### 2. Backend (Node.js + Express + TypeScript)

**Location**: `backend/src/`

**Key Services**:
- **unifiedDocumentProcessor**: Main orchestrator for document processing
- **optimizedAgenticRAGProcessor**: Core AI processing engine
- **llmService**: LLM interaction service (Claude AI/OpenAI)
- **pdfGenerationService**: PDF report generation using Puppeteer
- **fileStorageService**: Google Cloud Storage operations
- **uploadMonitoringService**: Real-time upload tracking
- **agenticRAGDatabaseService**: Analytics and session management
- **sessionService**: User session management
- **jobQueueService**: Background job processing
- **uploadProgressService**: Upload progress tracking
## Data Flow

### 1. Document Upload Process

```
User Uploads PDF
        │
        ▼
┌─────────────────┐
│ 1. Get Upload   │ ──► Generate signed URL from Google Cloud Storage
│    URL          │
└─────────┬───────┘
          │
          ▼
┌─────────────────┐
│ 2. Upload to    │ ──► Direct upload to GCS bucket
│    GCS          │
└─────────┬───────┘
          │
          ▼
┌─────────────────┐
│ 3. Confirm      │ ──► Update database, create processing job
│    Upload       │
└─────────────────┘
```
### 2. Document Processing Pipeline

```
Document Uploaded
        │
        ▼
┌─────────────────┐
│ 1. Text         │ ──► Google Document AI extracts text from PDF
│    Extraction   │     (documentAiGenkitProcessor or direct Document AI)
└─────────┬───────┘
          │
          ▼
┌─────────────────┐
│ 2. Intelligent  │ ──► Split text into semantic chunks (4000 chars)
│    Chunking     │     with 200 char overlap
└─────────┬───────┘
          │
          ▼
┌─────────────────┐
│ 3. Vector       │ ──► Generate embeddings for each chunk
│    Embedding    │     (rate-limited to 5 concurrent calls)
└─────────┬───────┘
          │
          ▼
┌─────────────────┐
│ 4. LLM Analysis │ ──► llmService → Claude AI analyzes chunks
│                 │     and generates structured CIM review data
└─────────┬───────┘
          │
          ▼
┌─────────────────┐
│ 5. PDF          │ ──► pdfGenerationService generates summary PDF
│    Generation   │     using Puppeteer
└─────────┬───────┘
          │
          ▼
┌─────────────────┐
│ 6. Database     │ ──► Store analysis data, update document status
│    Storage      │
└─────────┬───────┘
          │
          ▼
┌─────────────────┐
│ 7. Complete     │ ──► Update session, notify user, cleanup
│    Processing   │
└─────────────────┘
```
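The chunking step (step 2 above) can be sketched as a small pure function. The 4000-character chunk size and 200-character overlap come from the pipeline description; the function name and the paragraph-boundary heuristic are illustrative assumptions, not the actual implementation:

```typescript
// Sketch of the intelligent-chunking step: split extracted text into
// ~4000-character chunks with 200 characters of overlap, preferring to
// break at paragraph boundaries. Names are illustrative, not the real API.
function createIntelligentChunks(
  text: string,
  chunkSize = 4000,
  overlap = 200
): string[] {
  const chunks: string[] = [];
  let start = 0;
  while (start < text.length) {
    let end = Math.min(start + chunkSize, text.length);
    if (end < text.length) {
      // Prefer a semantic boundary: the last paragraph break in the window.
      const breakAt = text.lastIndexOf("\n\n", end);
      if (breakAt > start + chunkSize / 2) end = breakAt;
    }
    chunks.push(text.slice(start, end));
    if (end === text.length) break;
    start = end - overlap; // next chunk re-reads the last 200 chars
  }
  return chunks;
}
```

The overlap means each chunk repeats the tail of its predecessor, so facts that straddle a chunk boundary are still visible to the LLM in at least one chunk.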
### 3. Error Handling Flow

```
Processing Error
        │
        ▼
┌─────────────────┐
│ Error Logging   │ ──► Log error with correlation ID
└─────────┬───────┘
          │
          ▼
┌─────────────────┐
│ Retry Logic     │ ──► Retry failed operation (up to 3 times)
└─────────┬───────┘
          │
          ▼
┌─────────────────┐
│ Graceful        │ ──► Return partial results or error message
│ Degradation     │
└─────────────────┘
```
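The retry-then-degrade flow above can be sketched as a small helper. This is a simplified synchronous version: the real services also log each failure with a correlation ID and sleep between attempts, which is omitted here, and the function name is an assumption:

```typescript
// Minimal sketch of the retry step: try an operation up to `maxAttempts`
// times, returning a fallback value ("graceful degradation") if every
// attempt fails. A real version would also log each failure with the
// request's correlation ID and back off between attempts.
function withRetry<T>(op: () => T, fallback: T, maxAttempts = 3): T {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return op();
    } catch {
      // e.g. logger.error({ correlationId, attempt, err }) in the real flow
    }
  }
  return fallback;
}
```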
## Key Services Explained

### 1. Unified Document Processor (`unifiedDocumentProcessor.ts`)

**Purpose**: Main orchestrator that routes documents to the appropriate processing strategy.

**Current Strategy**: `optimized_agentic_rag` (only active strategy)

**Methods**:
- `processDocument()`: Main processing entry point
- `processWithOptimizedAgenticRAG()`: Current active processing method
- `getProcessingStats()`: Returns processing statistics
### 2. Optimized Agentic RAG Processor (`optimizedAgenticRAGProcessor.ts`)

**Purpose**: Core AI processing engine that handles large documents efficiently.

**Key Features**:
- **Intelligent Chunking**: Splits text at semantic boundaries (sections, paragraphs)
- **Batch Processing**: Processes chunks in batches of 10 to manage memory
- **Rate Limiting**: Limits concurrent API calls to 5
- **Memory Optimization**: Tracks memory usage and processes efficiently

**Processing Steps**:
1. **Create Intelligent Chunks**: Split text into 4000-char chunks with semantic boundaries
2. **Process Chunks in Batches**: Generate embeddings and metadata for each chunk
3. **Store Chunks Optimized**: Save to vector database with batching
4. **Generate LLM Analysis**: Use llmService to analyze and create structured data
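The batch-processing idea can be sketched as follows. The batch size of 10 comes from the feature list above; `toBatches`, `limiter`, and `embed` are illustrative names, not the processor's actual API:

```typescript
// Sketch of batch processing: split chunk work into fixed-size batches
// (the processor uses batches of 10) so only one batch's embeddings are
// held in memory at a time.
function toBatches<T>(items: T[], batchSize = 10): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize));
  }
  return batches;
}

// Each batch would then be processed with a bounded number of concurrent
// API calls (5, per the rate-limiting feature), e.g.:
//   for (const batch of toBatches(chunks)) {
//     await Promise.all(batch.map((c) => limiter(() => embed(c))));
//   }
```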
### 3. LLM Service (`llmService.ts`)

**Purpose**: Handles all LLM interactions with Claude AI and OpenAI.

**Key Features**:
- **Model Selection**: Automatically selects optimal model based on task complexity
- **Retry Logic**: Implements retry mechanism for failed API calls
- **Cost Tracking**: Tracks token usage and API costs
- **Error Handling**: Graceful error handling with fallback options

**Methods**:
- `processCIMDocument()`: Main CIM analysis method
- `callLLM()`: Generic LLM call method
- `callAnthropic()`: Claude AI specific calls
- `callOpenAI()`: OpenAI specific calls
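The complexity-based model selection could look something like the sketch below. The Opus model ID matches the `LLM_MODEL` value in the configuration section; the tier names and the two cheaper model IDs are assumptions for illustration, not confirmed details of `llmService`:

```typescript
// Illustrative sketch of complexity-based model selection. Only the Opus
// model ID is taken from this document's configuration; the tiers and the
// other model IDs are assumptions.
type TaskComplexity = "simple" | "standard" | "complex";

function selectModel(complexity: TaskComplexity): string {
  switch (complexity) {
    case "complex":
      return "claude-3-opus-20240229"; // highest quality, highest cost
    case "standard":
      return "claude-3-sonnet-20240229"; // assumed mid-tier model
    default:
      return "claude-3-haiku-20240307"; // assumed cheap model for simple tasks
  }
}
```

Routing simple tasks to a cheaper model is what makes the cost-tracking feature actionable: the service can report how much each tier is spending.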
### 4. PDF Generation Service (`pdfGenerationService.ts`)

**Purpose**: Generates PDF reports from analysis data using Puppeteer.

**Key Features**:
- **HTML to PDF**: Converts HTML content to PDF using Puppeteer
- **Markdown Support**: Converts markdown to HTML then to PDF
- **Custom Styling**: Professional PDF formatting with CSS
- **CIM Review Templates**: Specialized templates for CIM analysis reports

**Methods**:
- `generateCIMReviewPDF()`: Generate CIM review PDF from analysis data
- `generatePDFFromMarkdown()`: Convert markdown to PDF
- `generatePDFBuffer()`: Generate PDF as buffer for immediate download
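The templating half of this service can be sketched as a pure function that builds the report HTML, which Puppeteer would then render (`page.setContent(html)` followed by `page.pdf()`). The section names follow the CIM review output format described later in this document; the interface and function name are illustrative assumptions:

```typescript
// Sketch of the HTML-templating step before Puppeteer rendering. The
// fields mirror a subset of the CIM review output format; names are
// illustrative, not the service's real types.
interface CimReview {
  dealOverview: string;
  businessDescription: string;
  financialSummary: string;
}

function renderCimReviewHtml(review: CimReview): string {
  return `<!DOCTYPE html>
<html><body>
  <h1>CIM Review</h1>
  <h2>Deal Overview</h2><p>${review.dealOverview}</p>
  <h2>Business Description</h2><p>${review.businessDescription}</p>
  <h2>Financial Summary</h2><p>${review.financialSummary}</p>
</body></html>`;
}
```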
### 5. File Storage Service (`fileStorageService.ts`)

**Purpose**: Handles all Google Cloud Storage operations.

**Key Operations**:
- `generateSignedUploadUrl()`: Creates secure upload URLs
- `getFile()`: Downloads files from GCS
- `uploadFile()`: Uploads files to GCS
- `deleteFile()`: Removes files from GCS
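One detail worth making concrete is the object-path convention this service would use: the Security section of this document mentions user-specific file storage paths, so every object lives under the uploading user's prefix. The exact layout below is an assumption, and `generateSignedUploadUrl()` would pass a path like this to GCS when signing the URL:

```typescript
// Illustrative helper for the user-specific storage paths mentioned under
// File Security. The exact prefix layout is an assumption.
function buildStoragePath(userId: string, documentId: string): string {
  // e.g. users/abc123/documents/550e8400-e29b-41d4-a716-446655440000.pdf
  return `users/${userId}/documents/${documentId}.pdf`;
}
```

Keeping the user ID in the path lets per-user access checks be enforced with a simple prefix match.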
### 6. Upload Monitoring Service (`uploadMonitoringService.ts`)

**Purpose**: Tracks upload progress and provides real-time monitoring.

**Key Features**:
- Real-time upload tracking
- Error analysis and reporting
- Performance metrics
- Health status monitoring
### 7. Session Service (`sessionService.ts`)

**Purpose**: Manages user sessions and authentication state.

**Key Features**:
- Session storage and retrieval
- Token management
- Session cleanup
- Security token blacklisting
### 8. Job Queue Service (`jobQueueService.ts`)

**Purpose**: Manages background job processing and queuing.

**Key Features**:
- Job queuing and scheduling
- Background processing
- Job status tracking
- Error recovery
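The queuing and status-tracking features can be sketched as a minimal in-memory queue. The real service runs jobs in the background and persists them; this synchronous version is illustrative only, and all names are assumptions:

```typescript
// Minimal in-memory sketch of the job-queue idea: enqueue jobs, track
// status, and drain them in FIFO order. The real service persists jobs
// and would retry failures ("error recovery") rather than just record them.
type JobStatus = "queued" | "running" | "completed" | "failed";

interface Job {
  id: string;
  status: JobStatus;
  run: () => void;
}

class JobQueue {
  private jobs: Job[] = [];

  enqueue(id: string, run: () => void): void {
    this.jobs.push({ id, status: "queued", run });
  }

  // Drain the queue, recording success or failure per job.
  drain(): Map<string, JobStatus> {
    const results = new Map<string, JobStatus>();
    for (const job of this.jobs) {
      job.status = "running";
      try {
        job.run();
        job.status = "completed";
      } catch {
        job.status = "failed";
      }
      results.set(job.id, job.status);
    }
    this.jobs = [];
    return results;
  }
}
```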
## Service Dependencies

```
unifiedDocumentProcessor
├── optimizedAgenticRAGProcessor
│   ├── llmService (for AI processing)
│   ├── vectorDatabaseService (for embeddings)
│   └── fileStorageService (for file operations)
├── pdfGenerationService (for PDF creation)
├── uploadMonitoringService (for tracking)
├── sessionService (for session management)
└── jobQueueService (for background processing)
```
## Database Schema

### Core Tables

#### 1. Documents Table
```sql
CREATE TABLE documents (
  id UUID PRIMARY KEY,
  user_id TEXT NOT NULL,
  original_file_name TEXT NOT NULL,
  file_path TEXT NOT NULL,
  file_size INTEGER NOT NULL,
  status TEXT NOT NULL,
  extracted_text TEXT,
  generated_summary TEXT,
  summary_pdf_path TEXT,
  analysis_data JSONB,
  created_at TIMESTAMP DEFAULT NOW(),
  updated_at TIMESTAMP DEFAULT NOW()
);
```

#### 2. Agentic RAG Sessions Table
```sql
CREATE TABLE agentic_rag_sessions (
  id UUID PRIMARY KEY,
  document_id UUID REFERENCES documents(id),
  strategy TEXT NOT NULL,
  status TEXT NOT NULL,
  total_agents INTEGER,
  completed_agents INTEGER,
  failed_agents INTEGER,
  overall_validation_score DECIMAL,
  processing_time_ms INTEGER,
  api_calls_count INTEGER,
  total_cost DECIMAL,
  created_at TIMESTAMP DEFAULT NOW(),
  completed_at TIMESTAMP
);
```

#### 3. Vector Database Tables
```sql
CREATE TABLE document_chunks (
  id UUID PRIMARY KEY,
  document_id UUID REFERENCES documents(id),
  content TEXT NOT NULL,
  embedding VECTOR(1536),
  chunk_index INTEGER,
  metadata JSONB,
  created_at TIMESTAMP DEFAULT NOW()
);
```
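The `embedding VECTOR(1536)` column exists so chunks can be ranked by similarity to a query embedding at retrieval time. With pgvector that ranking is a SQL `ORDER BY` on a distance operator; the underlying math is cosine similarity, which can be sketched directly:

```typescript
// Cosine similarity between two embedding vectors: 1 means identical
// direction, 0 means orthogonal (unrelated). This is the math a pgvector
// distance query ranks by; the vectors here are 1536-dimensional in the
// document_chunks table.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```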
## API Endpoints

### Active Endpoints

#### Document Management
- `POST /documents/upload-url` - Get signed upload URL
- `POST /documents/:id/confirm-upload` - Confirm upload and start processing
- `POST /documents/:id/process-optimized-agentic-rag` - Trigger AI processing
- `GET /documents/:id/download` - Download processed PDF
- `DELETE /documents/:id` - Delete document

#### Analytics & Monitoring
- `GET /documents/analytics` - Get processing analytics
- `GET /documents/:id/agentic-rag-sessions` - Get processing sessions
- `GET /monitoring/dashboard` - Get monitoring dashboard
- `GET /vector/stats` - Get vector database statistics

### Legacy Endpoints (Kept for Backward Compatibility)
- `POST /documents/upload` - Multipart file upload (legacy)
- `GET /documents` - List documents (basic CRUD)
## Configuration

### Environment Variables

**Backend** (`backend/src/config/env.ts`):
```typescript
// Google Cloud
GOOGLE_CLOUD_PROJECT_ID
GOOGLE_CLOUD_STORAGE_BUCKET
GOOGLE_APPLICATION_CREDENTIALS

// Document AI
GOOGLE_DOCUMENT_AI_LOCATION
GOOGLE_DOCUMENT_AI_PROCESSOR_ID

// Database
DATABASE_URL
SUPABASE_URL
SUPABASE_ANON_KEY

// AI Services
ANTHROPIC_API_KEY
OPENAI_API_KEY

// Processing
AGENTIC_RAG_ENABLED=true
PROCESSING_STRATEGY=optimized_agentic_rag

// LLM Configuration
LLM_PROVIDER=anthropic
LLM_MODEL=claude-3-opus-20240229
LLM_MAX_TOKENS=4000
LLM_TEMPERATURE=0.1
```

**Frontend** (`frontend/src/config/env.ts`):
```typescript
// API
VITE_API_BASE_URL
VITE_FIREBASE_API_KEY
VITE_FIREBASE_AUTH_DOMAIN
```
## Processing Strategy Details

### Current Strategy: Optimized Agentic RAG

**Why This Strategy**:
- Handles large documents efficiently
- Provides structured analysis output
- Optimizes memory usage and API costs
- Generates high-quality summaries

**How It Works**:
1. **Text Extraction**: Google Document AI extracts text from PDF
2. **Semantic Chunking**: Splits text at natural boundaries (sections, paragraphs)
3. **Vector Embedding**: Creates embeddings for each chunk
4. **LLM Analysis**: llmService calls Claude AI to analyze chunks and generate structured data
5. **PDF Generation**: pdfGenerationService creates summary PDF with analysis results

**Output Format**: Structured CIM Review data including:
- Deal Overview
- Business Description
- Market Analysis
- Financial Summary
- Management Team
- Investment Thesis
- Key Questions & Next Steps
## Error Handling

### Frontend Error Handling
- **Network Errors**: Automatic retry with exponential backoff
- **Authentication Errors**: Automatic token refresh or redirect to login
- **Upload Errors**: User-friendly error messages with retry options
- **Processing Errors**: Real-time error display with retry functionality

### Backend Error Handling
- **Validation Errors**: Input validation with detailed error messages
- **Processing Errors**: Graceful degradation with error logging
- **Storage Errors**: Retry logic for transient failures
- **Database Errors**: Connection pooling and retry mechanisms
- **LLM API Errors**: Retry logic with exponential backoff
- **PDF Generation Errors**: Fallback to text-only output

### Error Recovery Mechanisms
- **LLM API Failures**: Up to 3 retry attempts with different models
- **Processing Timeouts**: Graceful timeout handling with partial results
- **Memory Issues**: Automatic garbage collection and memory cleanup
- **File Storage Errors**: Retry with exponential backoff
## Monitoring & Analytics

### Real-time Monitoring
- Upload progress tracking
- Processing status updates
- Error rate monitoring
- Performance metrics
- API usage tracking
- Cost monitoring

### Analytics Dashboard
- Processing success rates
- Average processing times
- API usage statistics
- Cost tracking
- User activity metrics
- Error analysis reports
## Security

### Authentication
- Firebase Authentication
- JWT token validation
- Protected API endpoints
- User-specific data isolation
- Session management with secure token handling

### File Security
- Signed URLs for secure uploads
- File type validation (PDF only)
- File size limits (50MB max)
- User-specific file storage paths
- Secure file deletion

### API Security
- Rate limiting (1000 requests per 15 minutes)
- CORS configuration
- Input validation
- SQL injection prevention
- Request correlation IDs for tracking
## Performance Optimization

### Memory Management
- Batch processing to limit memory usage
- Garbage collection optimization
- Connection pooling for database
- Efficient chunking to minimize memory footprint

### API Optimization
- Rate limiting to prevent API quota exhaustion
- Caching for frequently accessed data
- Efficient chunking to minimize API calls
- Model selection based on task complexity

### Processing Optimization
- Concurrent processing with limits
- Intelligent chunking for optimal processing
- Background job processing
- Progress tracking for user feedback
## Deployment

### Backend Deployment
- **Firebase Functions**: Serverless deployment
- **Google Cloud Run**: Containerized deployment
- **Docker**: Container support

### Frontend Deployment
- **Firebase Hosting**: Static hosting
- **Vite**: Build tool
- **TypeScript**: Type safety
## Development Workflow

### Local Development
1. **Backend**: `npm run dev` (runs on port 5001)
2. **Frontend**: `npm run dev` (runs on port 5173)
3. **Database**: Supabase local development
4. **Storage**: Google Cloud Storage (development bucket)

### Testing
- **Unit Tests**: Jest for backend, Vitest for frontend
- **Integration Tests**: End-to-end testing
- **API Tests**: Supertest for backend endpoints
## Troubleshooting

### Common Issues
1. **Upload Failures**: Check GCS permissions and bucket configuration
2. **Processing Timeouts**: Increase timeout limits for large documents
3. **Memory Issues**: Monitor memory usage and adjust batch sizes
4. **API Quotas**: Check API usage and implement rate limiting
5. **PDF Generation Failures**: Check Puppeteer installation and memory
6. **LLM API Errors**: Verify API keys and check rate limits

### Debug Tools
- Real-time logging with correlation IDs
- Upload monitoring dashboard
- Processing session details
- Error analysis reports
- Performance metrics dashboard
This documentation provides a comprehensive overview of the CIM Document Processor architecture, helping junior programmers understand the system's design, data flow, and key components.
463
ARCHITECTURE_DIAGRAMS.md
Normal file
@@ -0,0 +1,463 @@
# CIM Document Processor - Architecture Diagrams

## System Architecture Overview

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                              FRONTEND (React)                               │
├─────────────────────────────────────────────────────────────────────────────┤
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐         │
│  │   Login     │  │  Document   │  │  Document   │  │  Analytics  │         │
│  │   Form      │  │  Upload     │  │  List       │  │  Dashboard  │         │
│  └─────────────┘  └─────────────┘  └─────────────┘  └─────────────┘         │
│                                                                             │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐         │
│  │  Document   │  │  Upload     │  │  Protected  │  │  Auth       │         │
│  │  Viewer     │  │  Monitoring │  │  Route      │  │  Context    │         │
│  └─────────────┘  └─────────────┘  └─────────────┘  └─────────────┘         │
└─────────────────────────────────────────────────────────────────────────────┘
                                      │
                                      ▼ HTTP/HTTPS
┌─────────────────────────────────────────────────────────────────────────────┐
│                              BACKEND (Node.js)                              │
├─────────────────────────────────────────────────────────────────────────────┤
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐         │
│  │  Document   │  │  Vector     │  │  Monitoring │  │  Auth       │         │
│  │  Routes     │  │  Routes     │  │  Routes     │  │  Middleware │         │
│  └─────────────┘  └─────────────┘  └─────────────┘  └─────────────┘         │
│                                                                             │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐         │
│  │  Unified    │  │  Optimized  │  │  LLM        │  │  PDF        │         │
│  │  Document   │  │  Agentic    │  │  Service    │  │  Generation │         │
│  │  Processor  │  │  RAG        │  │             │  │  Service    │         │
│  │             │  │  Processor  │  │             │  │             │         │
│  └─────────────┘  └─────────────┘  └─────────────┘  └─────────────┘         │
│                                                                             │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐         │
│  │  File       │  │  Upload     │  │  Session    │  │  Job Queue  │         │
│  │  Storage    │  │  Monitoring │  │  Service    │  │  Service    │         │
│  │  Service    │  │  Service    │  │             │  │             │         │
│  └─────────────┘  └─────────────┘  └─────────────┘  └─────────────┘         │
└─────────────────────────────────────────────────────────────────────────────┘
                                      │
                                      ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                              EXTERNAL SERVICES                              │
├─────────────────────────────────────────────────────────────────────────────┤
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐         │
│  │  Google     │  │  Google     │  │  Anthropic  │  │  Firebase   │         │
│  │  Document AI│  │  Cloud      │  │  Claude AI  │  │  Auth       │         │
│  │             │  │  Storage    │  │             │  │             │         │
│  └─────────────┘  └─────────────┘  └─────────────┘  └─────────────┘         │
└─────────────────────────────────────────────────────────────────────────────┘
                                      │
                                      ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                             DATABASE (Supabase)                             │
├─────────────────────────────────────────────────────────────────────────────┤
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐         │
│  │  Documents  │  │  Agentic    │  │  Document   │  │  Vector     │         │
│  │  Table      │  │  RAG        │  │  Chunks     │  │  Embeddings │         │
│  │             │  │  Sessions   │  │  Table      │  │  Table      │         │
│  └─────────────┘  └─────────────┘  └─────────────┘  └─────────────┘         │
└─────────────────────────────────────────────────────────────────────────────┘
```
## Document Processing Flow

```
┌─────────────────┐
│  User Uploads   │
│  PDF Document   │
└─────────┬───────┘
          │
          ▼
┌─────────────────┐
│ 1. Get Upload   │ ──► Generate signed URL from Google Cloud Storage
│    URL          │
└─────────┬───────┘
          │
          ▼
┌─────────────────┐
│ 2. Upload to    │ ──► Direct upload to GCS bucket
│    GCS          │
└─────────┬───────┘
          │
          ▼
┌─────────────────┐
│ 3. Confirm      │ ──► Update database, create processing job
│    Upload       │
└─────────┬───────┘
          │
          ▼
┌─────────────────┐
│ 4. Text         │ ──► Google Document AI extracts text from PDF
│    Extraction   │     (documentAiGenkitProcessor or direct Document AI)
└─────────┬───────┘
          │
          ▼
┌─────────────────┐
│ 5. Intelligent  │ ──► Split text into semantic chunks (4000 chars)
│    Chunking     │     with 200 char overlap
└─────────┬───────┘
          │
          ▼
┌─────────────────┐
│ 6. Vector       │ ──► Generate embeddings for each chunk
│    Embedding    │     (rate-limited to 5 concurrent calls)
└─────────┬───────┘
          │
          ▼
┌─────────────────┐
│ 7. LLM Analysis │ ──► llmService → Claude AI analyzes chunks
│                 │     and generates structured CIM review data
└─────────┬───────┘
          │
          ▼
┌─────────────────┐
│ 8. PDF          │ ──► pdfGenerationService generates summary PDF
│    Generation   │     using Puppeteer
└─────────┬───────┘
          │
          ▼
┌─────────────────┐
│ 9. Database     │ ──► Store analysis data, update document status
│    Storage      │
└─────────┬───────┘
          │
          ▼
┌─────────────────┐
│ 10. Complete    │ ──► Update session, notify user, cleanup
│     Processing  │
└─────────────────┘
```
## Error Handling Flow

```
Processing Error
        │
        ▼
┌─────────────────┐
│ Error Logging   │ ──► Log error with correlation ID
└─────────┬───────┘
          │
          ▼
┌─────────────────┐
│ Retry Logic     │ ──► Retry failed operation (up to 3 times)
└─────────┬───────┘
          │
          ▼
┌─────────────────┐
│ Graceful        │ ──► Return partial results or error message
│ Degradation     │
└─────────────────┘
```
## Component Dependency Map

### Backend Services

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                               CORE SERVICES                                 │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  ┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐          │
│  │  Unified        │    │  Optimized      │    │  LLM Service    │          │
│  │  Document       │───►│  Agentic RAG    │───►│                 │          │
│  │  Processor      │    │  Processor      │    │  (Claude AI/    │          │
│  │  (Orchestrator) │    │  (Core AI)      │    │   OpenAI)       │          │
│  └─────────────────┘    └─────────────────┘    └─────────────────┘          │
│           │                      │                      │                   │
│           ▼                      ▼                      ▼                   │
│  ┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐          │
│  │  PDF Generation │    │  File Storage   │    │  Upload         │          │
│  │  Service        │    │  Service        │    │  Monitoring     │          │
│  │  (Puppeteer)    │    │  (GCS)          │    │  Service        │          │
│  └─────────────────┘    └─────────────────┘    └─────────────────┘          │
│           │                      │                      │                   │
│           ▼                      ▼                      ▼                   │
│  ┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐          │
│  │  Session        │    │  Job Queue      │    │  Upload         │          │
│  │  Service        │    │  Service        │    │  Progress       │          │
│  │  (Auth Mgmt)    │    │  (Background)   │    │  Service        │          │
│  └─────────────────┘    └─────────────────┘    └─────────────────┘          │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
```
### Frontend Components

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                            FRONTEND COMPONENTS                              │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  ┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐          │
│  │  App.tsx        │    │  AuthContext    │    │  ProtectedRoute │          │
│  │  (Main App)     │───►│  (Auth State)   │───►│  (Route Guard)  │          │
│  └─────────────────┘    └─────────────────┘    └─────────────────┘          │
│           │                                                                 │
│           ▼                                                                 │
│  ┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐          │
│  │  DocumentUpload │    │  DocumentList   │    │  DocumentViewer │          │
│  │  (File Upload)  │    │  (Document Mgmt)│    │  (View Results) │          │
│  └─────────────────┘    └─────────────────┘    └─────────────────┘          │
│           │                      │                      │                   │
│           ▼                      ▼                      ▼                   │
│  ┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐          │
│  │  Analytics      │    │  Upload         │    │  LoginForm      │          │
│  │  (Dashboard)    │    │  Monitoring     │    │  (Auth)         │          │
│  │                 │    │  Dashboard      │    │                 │          │
│  └─────────────────┘    └─────────────────┘    └─────────────────┘          │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
```
## Service Dependencies Map

```
unifiedDocumentProcessor (Main Orchestrator)
│
├───► optimizedAgenticRAGProcessor
│     ├───► llmService (AI Processing)
│     ├───► vectorDatabaseService (Embeddings)
│     └───► fileStorageService (File Operations)
│
├───► pdfGenerationService (PDF Creation)
│     └───► Puppeteer (PDF Generation)
│
├───► uploadMonitoringService (Real-time Tracking)
│
├───► sessionService (Session Management)
│
└───► jobQueueService (Background Processing)
```
## API Endpoint Map

```
DOCUMENT ROUTES
  POST   /documents/upload-url                         ──► Get signed upload URL
  POST   /documents/:id/confirm-upload                 ──► Confirm upload & process
  POST   /documents/:id/process-optimized-agentic-rag  ──► AI processing
  GET    /documents/:id/download                       ──► Download PDF
  DELETE /documents/:id                                ──► Delete document
  GET    /documents/analytics                          ──► Get analytics
  GET    /documents/:id/agentic-rag-sessions           ──► Get sessions

MONITORING ROUTES
  GET    /monitoring/dashboard                         ──► Get monitoring dashboard
  GET    /monitoring/upload-metrics                    ──► Get upload metrics
  GET    /monitoring/upload-health                     ──► Get health status
  GET    /monitoring/real-time-stats                   ──► Get real-time stats
  GET    /monitoring/error-analysis                    ──► Get error analysis

VECTOR ROUTES
  GET    /vector/document-chunks/:documentId           ──► Get document chunks
  GET    /vector/analytics                             ──► Get vector analytics
  GET    /vector/stats                                 ──► Get vector stats
```
## Database Schema Map
|
||||
|
||||
```
|
||||
┌─────────────────────────────────────────────────────────────────────────────┐
|
||||
│ DATABASE SCHEMA │
|
||||
├─────────────────────────────────────────────────────────────────────────────┤
|
||||
│ │
|
||||
│ ┌─────────────────────────────────────────────────────────────────────────┐ │
|
||||
│ │ DOCUMENTS TABLE │ │
|
||||
│ │ │ │
|
||||
│ │ id (UUID) ──► Primary key │ │
|
||||
│ │ user_id (TEXT) ──► User identifier │ │
|
||||
│ │ original_file_name (TEXT) ──► Original filename │ │
|
||||
│ │ file_path (TEXT) ──► GCS file path │ │
|
||||
│ │ file_size (INTEGER) ──► File size in bytes │ │
|
||||
│ │ status (TEXT) ──► Processing status │ │
|
||||
│ │ extracted_text (TEXT) ──► Extracted text content │ │
|
||||
│ │ generated_summary (TEXT) ──► Generated summary │ │
|
||||
│ │ summary_pdf_path (TEXT) ──► PDF summary path │ │
|
||||
│ │ analysis_data (JSONB) ──► Structured analysis data │ │
|
||||
│ │ created_at (TIMESTAMP) ──► Creation timestamp │ │
|
||||
│ │ updated_at (TIMESTAMP) ──► Last update timestamp │ │
|
||||
│ └─────────────────────────────────────────────────────────────────────────┘ │
|
||||
│ │
|
||||
│ ┌─────────────────────────────────────────────────────────────────────────┐ │
|
||||
│ │ AGENTIC RAG SESSIONS TABLE │ │
|
||||
│ │ │ │
|
||||
│ │ id (UUID) ──► Primary key │ │
|
||||
│ │ document_id (UUID) ──► Foreign key to documents │ │
|
||||
│ │ strategy (TEXT) ──► Processing strategy used │ │
|
||||
│ │ status (TEXT) ──► Session status │ │
|
||||
│ │ total_agents (INTEGER) ──► Total agents in session │ │
|
||||
│ │ completed_agents (INTEGER) ──► Completed agents │ │
|
||||
│ │ failed_agents (INTEGER) ──► Failed agents │ │
|
||||
│ │ overall_validation_score (DECIMAL) ──► Quality score │ │
|
||||
│ │ processing_time_ms (INTEGER) ──► Processing time │ │
|
||||
│ │ api_calls_count (INTEGER) ──► Number of API calls │ │
|
||||
│ │ total_cost (DECIMAL) ──► Total processing cost │ │
|
||||
│ │ created_at (TIMESTAMP) ──► Creation timestamp │ │
|
||||
│ │ completed_at (TIMESTAMP) ──► Completion timestamp │ │
|
||||
│ └─────────────────────────────────────────────────────────────────────────┘ │
|
||||
│ │
|
||||
│ ┌─────────────────────────────────────────────────────────────────────────┐ │
|
||||
│ │ DOCUMENT CHUNKS TABLE │ │
|
||||
│ │ │ │
|
||||
│ │ id (UUID) ──► Primary key │ │
|
||||
│ │ document_id (UUID) ──► Foreign key to documents │ │
|
||||
│ │ content (TEXT) ──► Chunk content │ │
|
||||
│ │ embedding (VECTOR(1536)) ──► Vector embedding │ │
|
||||
│ │ chunk_index (INTEGER) ──► Chunk order │ │
|
||||
│ │ metadata (JSONB) ──► Chunk metadata │ │
|
||||
│ │ created_at (TIMESTAMP) ──► Creation timestamp │ │
|
||||
│ └─────────────────────────────────────────────────────────────────────────┘ │
|
||||
└─────────────────────────────────────────────────────────────────────────────┘
|
||||
```
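The tables above map naturally onto TypeScript row types in the backend. A sketch (field names follow the diagrams; the exact generated types and the full set of `status` values should be verified against the actual Supabase schema):

```typescript
// Row types mirroring the schema above; names follow the table diagrams,
// but verify them against the actual Supabase-generated types.
interface DocumentRow {
  id: string;                      // UUID primary key
  user_id: string;
  original_file_name: string;
  file_path: string;               // GCS file path
  file_size: number;               // bytes
  status: "uploaded" | "processing_llm" | "completed" | "failed"; // assumed value set
  extracted_text: string | null;
  generated_summary: string | null;
  summary_pdf_path: string | null;
  analysis_data: Record<string, unknown> | null; // JSONB
  created_at: string;              // ISO timestamp
  updated_at: string;
}

interface DocumentChunkRow {
  id: string;
  document_id: string;             // FK -> documents.id
  content: string;
  embedding: number[];             // stored as pgvector VECTOR(1536)
  chunk_index: number;
  metadata: Record<string, unknown>;
  created_at: string;
}

// pgvector accepts embeddings as a '[v1,v2,...]' literal on insert:
function toPgvectorLiteral(embedding: number[]): string {
  return `[${embedding.join(",")}]`;
}
```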

## File Structure Map

```
cim_summary/
├── backend/
│   ├── src/
│   │   ├── config/           # Configuration files
│   │   ├── controllers/      # Request handlers
│   │   ├── middleware/       # Express middleware
│   │   ├── models/           # Database models
│   │   ├── routes/           # API route definitions
│   │   ├── services/         # Business logic services
│   │   │   ├── unifiedDocumentProcessor.ts      # Main orchestrator
│   │   │   ├── optimizedAgenticRAGProcessor.ts  # Core AI processing
│   │   │   ├── llmService.ts                    # LLM interactions
│   │   │   ├── pdfGenerationService.ts          # PDF generation
│   │   │   ├── fileStorageService.ts            # GCS operations
│   │   │   ├── uploadMonitoringService.ts       # Real-time tracking
│   │   │   ├── sessionService.ts                # Session management
│   │   │   ├── jobQueueService.ts               # Background processing
│   │   │   └── uploadProgressService.ts         # Progress tracking
│   │   ├── utils/            # Utility functions
│   │   └── index.ts          # Main entry point
│   ├── scripts/              # Setup and utility scripts
│   └── package.json          # Backend dependencies
├── frontend/
│   ├── src/
│   │   ├── components/       # React components
│   │   ├── contexts/         # React contexts
│   │   ├── services/         # API service layer
│   │   ├── utils/            # Utility functions
│   │   ├── config/           # Frontend configuration
│   │   ├── App.tsx           # Main app component
│   │   └── main.tsx          # App entry point
│   └── package.json          # Frontend dependencies
└── README.md                 # Project documentation
```

## Key Data Flow Sequences

### 1. User Authentication Flow
```
User → LoginForm → Firebase Auth → AuthContext → ProtectedRoute → Dashboard
```

### 2. Document Upload Flow
```
User → DocumentUpload → documentService.uploadDocument() →
Backend /upload-url → GCS signed URL → Frontend upload →
Backend /confirm-upload → Database update → Processing trigger
```
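The upload flow above is the standard two-step signed-URL pattern. A minimal sketch (the endpoint paths come from the flow above, but the response shapes are assumptions; the transport `api` is injected so the sequence can be exercised without a server):

```typescript
// Two-step upload: request a signed URL, PUT the file to GCS, then confirm.
// `api` abstracts the HTTP call (e.g. a thin wrapper over fetch/axios).
type Api = (path: string, init: { method: string; body?: unknown }) => Promise<any>;

async function uploadDocument(api: Api, file: { name: string; bytes: Uint8Array }) {
  // 1. Ask the backend for a signed GCS upload URL
  const { uploadUrl, documentId } = await api("/upload-url", {
    method: "POST",
    body: { fileName: file.name, fileSize: file.bytes.length },
  });

  // 2. Upload the file bytes directly to GCS via the signed URL
  await api(uploadUrl, { method: "PUT", body: file.bytes });

  // 3. Confirm the upload so the backend updates the database and triggers processing
  return api("/confirm-upload", { method: "POST", body: { documentId } });
}
```

Because the file goes straight to GCS, the backend never buffers the upload, which is why the legacy multer multipart path can be retired.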

### 3. Document Processing Flow
```
Processing trigger → unifiedDocumentProcessor →
optimizedAgenticRAGProcessor → Document AI →
Chunking → Embeddings → llmService → Claude AI →
pdfGenerationService → PDF Generation →
Database update → User notification
```
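The "Chunking" step above can be sketched as fixed-size chunks with overlap so each embedding retains context from the previous chunk. The chunk size and overlap below are illustrative defaults, not the processor's actual configuration:

```typescript
// Minimal chunking sketch: fixed-size windows with `overlap` characters of
// shared context between consecutive chunks. Real chunkers often split on
// sentence or section boundaries instead.
function chunkText(text: string, chunkSize = 1000, overlap = 200): string[] {
  const chunks: string[] = [];
  let start = 0;
  while (start < text.length) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break; // last window reached the end
    start += chunkSize - overlap;                // step forward, keeping overlap
  }
  return chunks;
}
```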

### 4. Analytics Flow
```
User → Analytics component → documentService.getAnalytics() →
Backend /analytics → agenticRAGDatabaseService →
Database queries → Structured analytics data → Frontend display
```

### 5. Error Handling Flow
```
Error occurs → Error logging with correlation ID →
Retry logic (up to 3 attempts) →
Graceful degradation → User notification
```
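The retry step above can be sketched as a small helper: up to 3 attempts with a growing delay between them, then the error is surfaced so the caller can degrade gracefully. The backoff schedule here is illustrative:

```typescript
// Retry policy sketch: `attempts` tries with a linearly growing delay.
// On final failure the last error is rethrown for the caller to handle.
async function withRetry<T>(fn: () => Promise<T>, attempts = 3, delayMs = 2000): Promise<T> {
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (i < attempts - 1) {
        // back off before the next attempt
        await new Promise((resolve) => setTimeout(resolve, delayMs * (i + 1)));
      }
    }
  }
  throw lastError;
}
```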

## Processing Pipeline Details

### LLM Service Integration
```
optimizedAgenticRAGProcessor
          │
          ▼
┌─────────────────┐
│   llmService    │ ──► Model selection based on task complexity
└─────────┬───────┘
          │
          ▼
┌─────────────────┐
│   Claude AI     │ ──► Primary model (claude-3-opus-20240229)
│  (Anthropic)    │
└─────────┬───────┘
          │
          ▼
┌─────────────────┐
│     OpenAI      │ ──► Fallback model (if Claude fails)
│    (GPT-4)      │
└─────────────────┘
```
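The primary/fallback routing above reduces to a simple control-flow pattern. A sketch with the provider calls injected (the real SDK calls and model IDs live in `llmService`; these callers are stand-ins):

```typescript
// Fallback routing sketch: try the primary (Claude) caller first; if it
// throws, degrade to the fallback (OpenAI) caller and record which ran.
async function completeWithFallback(
  prompt: string,
  claude: (p: string) => Promise<string>,
  openai: (p: string) => Promise<string>
): Promise<{ text: string; provider: "anthropic" | "openai" }> {
  try {
    return { text: await claude(prompt), provider: "anthropic" };
  } catch {
    // Primary model failed (timeout, overload, etc.); use the fallback provider
    return { text: await openai(prompt), provider: "openai" };
  }
}
```

Recording the `provider` in the result makes it easy for the session tables above to track cost and API-call counts per provider.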

### PDF Generation Pipeline
```
Analysis Data
      │
      ▼
┌──────────────────────────────────────────────┐
│ pdfGenerationService.generateCIMReviewPDF()  │
└─────────┬────────────────────────────────────┘
          │
          ▼
┌─────────────────┐
│ HTML Generation │ ──► Convert analysis data to HTML
└─────────┬───────┘
          │
          ▼
┌─────────────────┐
│    Puppeteer    │ ──► Convert HTML to PDF
└─────────┬───────┘
          │
          ▼
┌─────────────────┐
│   PDF Buffer    │ ──► Return PDF as buffer for download
└─────────────────┘
```
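The "HTML Generation" step renders the structured analysis data into an HTML string that Puppeteer then prints to a PDF buffer. A sketch (the section layout is illustrative; the real template lives in `pdfGenerationService`):

```typescript
// HTML rendering sketch: one <section> per analysis field. A real template
// must also HTML-escape the values; this sketch assumes trusted input.
function renderAnalysisHtml(analysis: Record<string, string>): string {
  const sections = Object.entries(analysis)
    .map(([title, body]) => `<section><h2>${title}</h2><p>${body}</p></section>`)
    .join("\n");
  return `<!DOCTYPE html><html><body><h1>CIM Review</h1>${sections}</body></html>`;
}

// With Puppeteer the remaining steps would look roughly like (requires the
// puppeteer package; not executed here):
//   const browser = await puppeteer.launch();
//   const page = await browser.newPage();
//   await page.setContent(renderAnalysisHtml(data));
//   const pdfBuffer = await page.pdf({ format: "A4" });
//   await browser.close();
```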

This architecture provides a clear separation of concerns, scalable design, and comprehensive monitoring capabilities for the CIM Document Processor application.
325
DEPENDENCY_ANALYSIS_REPORT.md
Normal file
@@ -0,0 +1,325 @@
# Dependency Analysis Report - CIM Document Processor

## Executive Summary

This report analyzes the dependencies in both backend and frontend packages to identify:
- Unused dependencies that can be removed
- Outdated packages that should be updated
- Consolidation opportunities
- Dependencies that are actually being used vs. placeholder implementations

## Backend Dependencies Analysis

### Core Dependencies (Actively Used)

#### ✅ **Essential Dependencies**
- `express` - Main web framework
- `cors` - CORS middleware
- `helmet` - Security middleware
- `morgan` - HTTP request logging
- `express-rate-limit` - Rate limiting
- `dotenv` - Environment variable management
- `winston` - Logging framework
- `@supabase/supabase-js` - Database client
- `@google-cloud/storage` - Google Cloud Storage
- `@google-cloud/documentai` - Document AI processing
- `@anthropic-ai/sdk` - Claude AI integration
- `openai` - OpenAI integration
- `puppeteer` - PDF generation
- `uuid` - UUID generation
- `axios` - HTTP client

#### ✅ **Conditionally Used Dependencies**
- `bcryptjs` - Used in auth.ts and seed.ts (legacy auth system)
- `jsonwebtoken` - Used in auth.ts (legacy JWT system)
- `joi` - Used for environment validation and middleware validation
- `zod` - Used in llmSchemas.ts and llmService.ts for schema validation
- `multer` - Used in upload middleware (legacy multipart upload)
- `pdf-parse` - Used in documentAiGenkitProcessor.ts (legacy processor)

#### ⚠️ **Potentially Unused Dependencies**
- `redis` - Only imported in sessionService.ts but may not be actively used
- `pg` - PostgreSQL client (may be redundant with Supabase)

### Development Dependencies (Actively Used)

#### ✅ **Essential Dev Dependencies**
- `typescript` - TypeScript compiler
- `ts-node-dev` - Development server
- `jest` - Testing framework
- `supertest` - API testing
- `@types/*` - TypeScript type definitions
- `eslint` - Code linting
- `@typescript-eslint/*` - TypeScript ESLint rules

### Unused Dependencies Analysis

#### ❌ **Confirmed Unused**
None identified - all dependencies appear to be used somewhere in the codebase.

#### ⚠️ **Potentially Redundant**
1. **Validation Libraries**: Both `joi` and `zod` are used for validation
   - `joi`: Environment validation, middleware validation
   - `zod`: LLM schemas, service validation
   - **Recommendation**: Consider consolidating to just `zod` for consistency

2. **Database Clients**: Both `pg` and `@supabase/supabase-js`
   - `pg`: Direct PostgreSQL client
   - `@supabase/supabase-js`: Supabase client (includes PostgreSQL)
   - **Recommendation**: Remove `pg` if only using Supabase

3. **Authentication**: Both `bcryptjs`/`jsonwebtoken` and Firebase Auth
   - Legacy JWT system vs. Firebase Authentication
   - **Recommendation**: Remove legacy auth dependencies if fully migrated to Firebase
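The joi-to-zod consolidation mostly changes the validation *shape*: zod returns a result object instead of throwing. A dependency-free sketch of that target pattern applied to environment validation (this imitates zod's `safeParse` style without using the zod API, so it runs standalone; the variable names are illustrative):

```typescript
// Dependency-free sketch of the zod-style result-object pattern the
// consolidation targets: validate, then branch on `success` instead of
// wrapping the call in try/catch.
type ParseResult<T> = { success: true; data: T } | { success: false; error: string };

interface BackendEnv {
  SUPABASE_URL: string;
  SUPABASE_SERVICE_KEY: string;
}

function parseEnv(env: Record<string, string | undefined>): ParseResult<BackendEnv> {
  const { SUPABASE_URL, SUPABASE_SERVICE_KEY } = env;
  if (!SUPABASE_URL || !/^https?:\/\//.test(SUPABASE_URL)) {
    return { success: false, error: "SUPABASE_URL must be a valid URL" };
  }
  if (!SUPABASE_SERVICE_KEY) {
    return { success: false, error: "SUPABASE_SERVICE_KEY is required" };
  }
  return { success: true, data: { SUPABASE_URL, SUPABASE_SERVICE_KEY } };
}
```

With zod itself this collapses to a `z.object({...}).safeParse(process.env)` call, which is what makes retiring `joi` low-cost.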

## Frontend Dependencies Analysis

### Core Dependencies (Actively Used)

#### ✅ **Essential Dependencies**
- `react` - React framework
- `react-dom` - React DOM rendering
- `react-router-dom` - Client-side routing
- `axios` - HTTP client for API calls
- `firebase` - Firebase Authentication
- `lucide-react` - Icon library (used in 6 components)
- `react-dropzone` - File upload component

#### ❌ **Unused Dependencies**
- `clsx` - Not imported anywhere
- `tailwind-merge` - Not imported anywhere

### Development Dependencies (Actively Used)

#### ✅ **Essential Dev Dependencies**
- `typescript` - TypeScript compiler
- `vite` - Build tool and dev server
- `@vitejs/plugin-react` - React plugin for Vite
- `tailwindcss` - CSS framework
- `postcss` - CSS processing
- `autoprefixer` - CSS vendor prefixing
- `eslint` - Code linting
- `@typescript-eslint/*` - TypeScript ESLint rules
- `vitest` - Testing framework
- `@testing-library/*` - React testing utilities
## Processing Strategy Analysis

### Current Active Strategy
Based on the code analysis, the current processing strategy is:
- **Primary**: `optimized_agentic_rag` (most actively used)
- **Fallback**: `document_ai_genkit` (legacy implementation)

### Unused Processing Strategies
The following strategies are implemented but not actively used:
1. `chunking` - Legacy chunking strategy
2. `rag` - Basic RAG strategy
3. `agentic_rag` - Basic agentic RAG (superseded by optimized version)

### Services Analysis

#### ✅ **Actively Used Services**
- `unifiedDocumentProcessor` - Main orchestrator
- `optimizedAgenticRAGProcessor` - Core AI processing
- `llmService` - LLM interactions
- `pdfGenerationService` - PDF generation
- `fileStorageService` - GCS operations
- `uploadMonitoringService` - Real-time tracking
- `sessionService` - Session management
- `jobQueueService` - Background processing

#### ⚠️ **Legacy Services (Can be removed)**
- `documentProcessingService` - Legacy chunking service
- `documentAiGenkitProcessor` - Legacy Document AI processor
- `ragDocumentProcessor` - Basic RAG processor
## Outdated Packages Analysis

### Backend Outdated Packages
- `@types/express`: 4.17.23 → 5.0.3 (major version update)
- `@types/jest`: 29.5.14 → 30.0.0 (major version update)
- `@types/multer`: 1.4.13 → 2.0.0 (major version update)
- `@types/node`: 20.19.9 → 24.1.0 (major version update)
- `@types/pg`: 8.15.4 → 8.15.5 (patch update)
- `@types/supertest`: 2.0.16 → 6.0.3 (major version update)
- `@typescript-eslint/*`: 6.21.0 → 8.38.0 (major version update)
- `bcryptjs`: 2.4.3 → 3.0.2 (major version update)
- `dotenv`: 16.6.1 → 17.2.1 (major version update)
- `eslint`: 8.57.1 → 9.32.0 (major version update)
- `express`: 4.21.2 → 5.1.0 (major version update)
- `express-rate-limit`: 7.5.1 → 8.0.1 (major version update)
- `helmet`: 7.2.0 → 8.1.0 (major version update)
- `jest`: 29.7.0 → 30.0.5 (major version update)
- `multer`: 1.4.5-lts.2 → 2.0.2 (major version update)
- `openai`: 5.10.2 → 5.11.0 (minor update)
- `puppeteer`: 21.11.0 → 24.15.0 (major version update)
- `redis`: 4.7.1 → 5.7.0 (major version update)
- `supertest`: 6.3.4 → 7.1.4 (major version update)
- `typescript`: 5.8.3 → 5.9.2 (minor update)
- `zod`: 3.25.76 → 4.0.14 (major version update)

### Frontend Outdated Packages
- `@testing-library/jest-dom`: 6.6.3 → 6.6.4 (patch update)
- `@testing-library/react`: 13.4.0 → 16.3.0 (major version update)
- `@types/react`: 18.3.23 → 19.1.9 (major version update)
- `@types/react-dom`: 18.3.7 → 19.1.7 (major version update)
- `@typescript-eslint/*`: 6.21.0 → 8.38.0 (major version update)
- `eslint`: 8.57.1 → 9.32.0 (major version update)
- `eslint-plugin-react-hooks`: 4.6.2 → 5.2.0 (major version update)
- `lucide-react`: 0.294.0 → 0.536.0 (major version update)
- `react`: 18.3.1 → 19.1.1 (major version update)
- `react-dom`: 18.3.1 → 19.1.1 (major version update)
- `react-router-dom`: 6.30.1 → 7.7.1 (major version update)
- `tailwind-merge`: 2.6.0 → 3.3.1 (major version update)
- `tailwindcss`: 3.4.17 → 4.1.11 (major version update)
- `typescript`: 5.8.3 → 5.9.2 (minor update)
- `vite`: 4.5.14 → 7.0.6 (major version update)
- `vitest`: 0.34.6 → 3.2.4 (major version update)

### Update Strategy
**⚠️ Warning**: Many packages have major version updates that may include breaking changes. Update strategy:

1. **Immediate Updates** (Low Risk):
   - `@types/pg`: 8.15.4 → 8.15.5 (patch update)
   - `openai`: 5.10.2 → 5.11.0 (minor update)
   - `typescript`: 5.8.3 → 5.9.2 (minor update)
   - `@testing-library/jest-dom`: 6.6.3 → 6.6.4 (patch update)

2. **Major Version Updates** (Require Testing):
   - React ecosystem updates (React 18 → 19)
   - Express updates (Express 4 → 5)
   - Testing framework updates (Jest 29 → 30, Vitest 0.34 → 3.2)
   - Build tool updates (Vite 4 → 7)

3. **Recommendation**: Update major versions after dependency cleanup to minimize risk

## Recommendations

### Phase 1: Immediate Cleanup (Low Risk)

#### Backend
1. **Consolidate validation libraries**:
   - Migrate from `joi` to `zod` for consistency
   - Remove the `joi` dependency

2. **Remove legacy auth dependencies** (if Firebase Auth is fully implemented):
   ```bash
   npm uninstall bcryptjs jsonwebtoken
   npm uninstall @types/bcryptjs @types/jsonwebtoken
   ```

#### Frontend
1. **Remove unused dependencies**:
   ```bash
   npm uninstall clsx tailwind-merge
   ```
### Phase 2: Service Consolidation (Medium Risk)
|
||||
|
||||
1. **Remove legacy processing services**:
|
||||
- `documentProcessingService.ts`
|
||||
- `documentAiGenkitProcessor.ts`
|
||||
- `ragDocumentProcessor.ts`
|
||||
|
||||
2. **Simplify unifiedDocumentProcessor**:
|
||||
- Remove unused strategy methods
|
||||
- Keep only `optimized_agentic_rag` strategy
|
||||
|
||||
3. **Remove unused database client**:
|
||||
- Remove `pg` if only using Supabase
|
||||
|
||||
### Phase 3: Configuration Cleanup (Low Risk)
|
||||
|
||||
1. **Remove unused environment variables**:
|
||||
- Legacy auth configuration
|
||||
- Unused processing strategy configs
|
||||
- Unused LLM configurations
|
||||
|
||||
2. **Update configuration validation**:
|
||||
- Remove validation for unused configs
|
||||
- Simplify environment schema
|
||||
|
||||
### Phase 4: Route Cleanup (Medium Risk)
|
||||
|
||||
1. **Remove legacy upload endpoints**:
|
||||
- Keep only `/upload-url` and `/confirm-upload`
|
||||
- Remove multipart upload endpoints
|
||||
|
||||
2. **Remove unused analytics endpoints**:
|
||||
- Keep only actively used monitoring endpoints
|
||||
|
||||
## Impact Assessment

### Risk Levels
- **Low Risk**: Removing unused dependencies, updating packages
- **Medium Risk**: Removing legacy services, consolidating routes
- **High Risk**: Changing core processing logic

### Testing Requirements
- Unit tests for all active services
- Integration tests for the upload flow
- End-to-end tests for document processing
- Performance testing for the optimized agentic RAG

### Rollback Plan
- Keep a backup of removed files for 1-2 weeks
- Maintain feature flags for major changes
- Document all changes for easy rollback

## Next Steps

1. **Start with Phase 1** (unused dependencies)
2. **Test thoroughly** after each phase
3. **Document changes** for team reference
4. **Update deployment scripts** if needed
5. **Monitor performance** after cleanup

## Estimated Savings

### Bundle Size Reduction
- **Frontend**: ~50KB (removing unused dependencies)
- **Backend**: ~200KB (removing legacy services and dependencies)

### Maintenance Reduction
- **Fewer dependencies** to maintain and update
- **Simplified codebase** with fewer moving parts
- **Reduced security vulnerabilities** from unused packages

### Performance Improvement
- **Faster builds** with fewer dependencies
- **Reduced memory usage** from removed services
- **Simplified deployment** with fewer configuration options

## Summary

### Key Findings
1. **Unused Dependencies**: 2 frontend dependencies (`clsx`, `tailwind-merge`) are completely unused
2. **Legacy Services**: 3 processing services can be removed (`documentProcessingService`, `documentAiGenkitProcessor`, `ragDocumentProcessor`)
3. **Redundant Dependencies**: Both `joi` and `zod` for validation, both `pg` and Supabase for database access
4. **Outdated Packages**: 21 backend and 16 frontend packages have updates available
5. **Major Version Updates**: Many packages require major version updates with potential breaking changes

### Immediate Actions (Step 2 Complete)
1. ✅ **Dependency Analysis Complete** - All dependencies mapped and usage identified
2. ✅ **Outdated Packages Identified** - Version updates documented with risk assessment
3. ✅ **Cleanup Strategy Defined** - Phased approach with risk levels assigned
4. ✅ **Impact Assessment Complete** - Bundle size and maintenance savings estimated

### Next Steps (Step 3 - Service Layer Consolidation)
1. Remove unused frontend dependencies (`clsx`, `tailwind-merge`)
2. Remove legacy processing services
3. Consolidate validation libraries (migrate from `joi` to `zod`)
4. Remove the redundant database client (`pg` if only using Supabase)
5. Update low-risk package versions

### Risk Assessment
- **Low Risk**: Removing unused dependencies, updating minor/patch versions
- **Medium Risk**: Removing legacy services, consolidating libraries
- **High Risk**: Major version updates, core processing logic changes

This dependency analysis provides a clear roadmap for cleaning up the codebase while maintaining functionality and minimizing risk.
@@ -19,7 +19,7 @@
     "lint:fix": "eslint src --ext .ts --fix",
     "db:migrate": "ts-node src/scripts/setup-database.ts",
     "db:seed": "ts-node src/models/seed.ts",
-    "db:setup": "npm run db:migrate",
+    "db:setup": "npm run db:migrate && node scripts/setup_supabase.js",
     "deploy:firebase": "npm run build && firebase deploy --only functions",
     "deploy:cloud-run": "npm run build && gcloud run deploy cim-processor-backend --source . --region us-central1 --platform managed --allow-unauthenticated",
     "deploy:docker": "npm run build && docker build -t cim-processor-backend . && docker run -p 8080:8080 cim-processor-backend",
@@ -77,4 +77,4 @@
     "ts-node-dev": "^2.0.0",
     "typescript": "^5.2.2"
   }
}
}
23
backend/scripts/setup_supabase.js
Normal file
@@ -0,0 +1,23 @@
const { createClient } = require('@supabase/supabase-js');
const fs = require('fs');
const path = require('path');

const supabaseUrl = process.env.SUPABASE_URL;
const supabaseKey = process.env.SUPABASE_SERVICE_KEY;
const supabase = createClient(supabaseUrl, supabaseKey);

async function setupDatabase() {
  try {
    const sql = fs.readFileSync(path.join(__dirname, 'supabase_setup.sql'), 'utf8');
    const { error } = await supabase.rpc('exec', { sql });
    if (error) {
      console.error('Error setting up database:', error);
    } else {
      console.log('Database setup complete.');
    }
  } catch (error) {
    console.error('Error reading setup file:', error);
  }
}

setupDatabase();
21
backend/scripts/test_exec_sql.js
Normal file
@@ -0,0 +1,21 @@
require('dotenv').config();
const { createClient } = require('@supabase/supabase-js');

const supabaseUrl = process.env.SUPABASE_URL;
const supabaseKey = process.env.SUPABASE_SERVICE_KEY;
const supabase = createClient(supabaseUrl, supabaseKey);

async function testFunction() {
  try {
    const { error } = await supabase.rpc('exec_sql', { sql: 'SELECT 1' });
    if (error) {
      console.error('Error calling exec_sql:', error);
    } else {
      console.log('Successfully called exec_sql.');
    }
  } catch (error) {
    console.error('Error:', error);
  }
}

testFunction();
@@ -93,6 +93,13 @@ export const documentController = {
  },

  async confirmUpload(req: Request, res: Response): Promise<void> {
+   console.log('🔄 CONFIRM UPLOAD ENDPOINT CALLED');
+   console.log('🔄 Request method:', req.method);
+   console.log('🔄 Request path:', req.path);
+   console.log('🔄 Request params:', req.params);
+   console.log('🔄 Request body:', req.body);
+   console.log('🔄 Request headers:', Object.keys(req.headers));
+
    try {
      const userId = req.user?.uid;
      if (!userId) {
@@ -138,36 +145,50 @@ export const documentController = {
        status: 'processing_llm'
      });

-     // Acknowledge the request immediately
+     console.log('✅ Document status updated to processing_llm');
+
+     // Acknowledge the request immediately and return the document
      res.status(202).json({
        message: 'Upload confirmed, processing has started.',
        documentId: documentId,
+       document: document,
        status: 'processing'
      });

+     console.log('✅ Response sent, starting background processing...');
+
      // Process in the background
      (async () => {
        try {
          console.log('Background processing started.');
          // Download file from Firebase Storage for Document AI processing
          const { fileStorageService } = await import('../services/fileStorageService');

          let fileBuffer: Buffer | null = null;
+         let downloadError: string | null = null;
          for (let i = 0; i < 3; i++) {
-           await new Promise(resolve => setTimeout(resolve, 2000)); // 2 second delay
-           fileBuffer = await fileStorageService.getFile(document.file_path);
-           if (fileBuffer) {
-             break;
+           try {
+             await new Promise(resolve => setTimeout(resolve, 2000 * (i + 1)));
+             fileBuffer = await fileStorageService.getFile(document.file_path);
+             if (fileBuffer) {
+               console.log(`✅ File downloaded from storage on attempt ${i + 1}`);
+               break;
+             }
+           } catch (err) {
+             downloadError = err instanceof Error ? err.message : String(err);
+             console.log(`❌ File download attempt ${i + 1} failed:`, downloadError);
+           }
          }

          if (!fileBuffer) {
+           const errMsg = downloadError || 'Failed to download uploaded file';
+           console.log('Failed to download file from storage:', errMsg);
            await DocumentModel.updateById(documentId, {
              status: 'failed',
-             error_message: 'Failed to download uploaded file'
+             error_message: `Failed to download uploaded file: ${errMsg}`
            });
            return;
          }

          console.log('File downloaded, starting unified processor.');
          // Process with Unified Document Processor
          const { unifiedDocumentProcessor } = await import('../services/unifiedDocumentProcessor');

@@ -175,17 +196,28 @@ export const documentController = {
            documentId,
            userId,
            '', // Text is not needed for this strategy
-           { strategy: 'optimized_agentic_rag' }
+           {
+             strategy: 'document_ai_genkit',
+             fileBuffer: fileBuffer,
+             fileName: document.original_file_name,
+             mimeType: 'application/pdf'
+           }
          );

          if (result.success) {
+           console.log('✅ Processing successful.');
            // Update document with results
            await DocumentModel.updateById(documentId, {
              status: 'completed',
              generated_summary: result.summary,
              analysis_data: result.analysisData,
              processing_completed_at: new Date()
            });

+           console.log('✅ Document AI processing completed successfully for document:', documentId);
+           console.log('✅ Summary length:', result.summary?.length || 0);
+           console.log('✅ Processing time:', new Date().toISOString());
+
            // 🗑️ DELETE PDF after successful processing
            try {
              await fileStorageService.deleteFile(document.file_path);
@@ -201,11 +233,15 @@ export const documentController = {

            console.log('✅ Document AI processing completed successfully');
          } else {
+           console.log('❌ Processing failed:', result.error);
            await DocumentModel.updateById(documentId, {
              status: 'failed',
              error_message: result.error
            });

+           console.log('❌ Document AI processing failed for document:', documentId);
+           console.log('❌ Error:', result.error);
+
            // Also delete PDF on processing failure to avoid storage costs
            try {
              await fileStorageService.deleteFile(document.file_path);
@@ -215,14 +251,30 @@ export const documentController = {
          }
        }
      } catch (error) {
-       console.log('❌ Background processing error:', error);
+       const errorMessage = error instanceof Error ? error.message : 'Unknown error';
+       const errorStack = error instanceof Error ? error.stack : undefined;
+       const errorDetails = error instanceof Error ? {
+         name: error.name,
+         message: error.message,
+         stack: error.stack
+       } : {
+         type: typeof error,
+         value: error
+       };
+
+       console.log('❌ Background processing error:', errorMessage);
+       console.log('❌ Error details:', errorDetails);
+       console.log('❌ Error stack:', errorStack);
+
        logger.error('Background processing failed', {
-         error,
-         documentId
+         error: errorMessage,
+         errorDetails,
+         documentId,
+         stack: errorStack
        });
        await DocumentModel.updateById(documentId, {
          status: 'failed',
-         error_message: 'Background processing failed'
+         error_message: `Background processing failed: ${errorMessage}`
        });
      }
    })();
@@ -20,7 +20,11 @@ const app = express();

// Add this middleware to log all incoming requests
app.use((req, res, next) => {
- console.log(`Incoming request: ${req.method} ${req.path}`);
+ console.log(`🚀 Incoming request: ${req.method} ${req.path}`);
+ console.log(`🚀 Request headers:`, Object.keys(req.headers));
+ console.log(`🚀 Request body size:`, req.headers['content-length'] || 'unknown');
+ console.log(`🚀 Origin:`, req.headers['origin']);
+ console.log(`🚀 User-Agent:`, req.headers['user-agent']);
  next();
});
@@ -40,9 +44,12 @@ const allowedOrigins = [

app.use(cors({
  origin: function (origin, callback) {
+   console.log(`🌐 CORS check for origin: ${origin}`);
    if (!origin || allowedOrigins.indexOf(origin) !== -1) {
+     console.log(`✅ CORS allowed for origin: ${origin}`);
      callback(null, true);
    } else {
+     console.log(`❌ CORS blocked for origin: ${origin}`);
      logger.warn(`CORS blocked for origin: ${origin}`);
      callback(new Error('Not allowed by CORS'));
    }
@@ -117,7 +124,7 @@ app.use(errorHandler);

// Configure Firebase Functions v2 for larger uploads
export const api = onRequest({
- timeoutSeconds: 540, // 9 minutes
+ timeoutSeconds: 1800, // 30 minutes (increased from 9 minutes)
  memory: '2GiB',
  cpu: 1,
  maxInstances: 10,
@@ -15,14 +15,21 @@ export interface DocumentChunk {
   updatedAt: Date;
 }
 
+export interface VectorSearchResult {
+  documentId: string;
+  similarityScore: number;
+  chunkContent: string;
+  metadata: Record<string, any>;
+}
+
 export class VectorDatabaseModel {
   static async storeDocumentChunks(chunks: Omit<DocumentChunk, 'id' | 'createdAt' | 'updatedAt'>[]): Promise<void> {
     const supabase = getSupabaseServiceClient();
-    const { data, error } = await supabase
+    const { error } = await supabase
       .from('document_chunks')
       .insert(chunks.map(chunk => ({
         ...chunk,
-        embedding: `[${chunk.embedding.join(',')}]` // Format for pgvector
+        embedding: `[${chunk.embedding.join(',')}]`
       })));
 
     if (error) {
@@ -32,4 +39,104 @@ export class VectorDatabaseModel {
 
     logger.info(`Stored ${chunks.length} document chunks in vector database`);
   }
 
+  static async getDocumentChunks(documentId: string): Promise<DocumentChunk[]> {
+    const supabase = getSupabaseServiceClient();
+    const { data, error } = await supabase
+      .from('document_chunks')
+      .select('*')
+      .eq('document_id', documentId)
+      .order('chunk_index');
+
+    if (error) {
+      logger.error('Failed to get document chunks', error);
+      throw error;
+    }
+
+    return data || [];
+  }
+
+  static async getAllChunks(): Promise<DocumentChunk[]> {
+    const supabase = getSupabaseServiceClient();
+    const { data, error } = await supabase
+      .from('document_chunks')
+      .select('*')
+      .limit(1000);
+
+    if (error) {
+      logger.error('Failed to get all chunks', error);
+      throw error;
+    }
+
+    return data || [];
+  }
+
+  static async getTotalChunkCount(): Promise<number> {
+    const supabase = getSupabaseServiceClient();
+    const { count, error } = await supabase
+      .from('document_chunks')
+      .select('*', { count: 'exact', head: true });
+
+    if (error) {
+      logger.error('Failed to get total chunk count', error);
+      throw error;
+    }
+
+    return count || 0;
+  }
+
+  static async getTotalDocumentCount(): Promise<number> {
+    const supabase = getSupabaseServiceClient();
+    const { data, error } = await supabase.rpc('count_distinct_documents');
+
+    if (error) {
+      logger.error('Failed to get total document count', error);
+      throw error;
+    }
+
+    return data || 0;
+  }
+
+  static async getAverageChunkSize(): Promise<number> {
+    const supabase = getSupabaseServiceClient();
+    const { data, error } = await supabase.rpc('average_chunk_size');
+
+    if (error) {
+      logger.error('Failed to get average chunk size', error);
+      throw error;
+    }
+
+    return data || 0;
+  }
+
+  static async getSearchAnalytics(userId: string, days: number = 30): Promise<any[]> {
+    const supabase = getSupabaseServiceClient();
+    const { data, error } = await supabase.rpc('get_search_analytics', {
+      user_id_param: userId,
+      days_param: days
+    });
+
+    if (error) {
+      logger.error('Failed to get search analytics', error);
+      throw error;
+    }
+
+    return data || [];
+  }
+
+  static async getVectorDatabaseStats(): Promise<{
+    totalChunks: number;
+    totalDocuments: number;
+    averageSimilarity: number;
+  }> {
+    const supabase = getSupabaseServiceClient();
+    const { data, error } = await supabase.rpc('get_vector_database_stats');
+
+    if (error) {
+      logger.error('Failed to get vector database stats', error);
+      throw error;
+    }
+
+    return data[0] || { totalChunks: 0, totalDocuments: 0, averageSimilarity: 0 };
+  }
 }
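The `embedding: \`[${chunk.embedding.join(',')}]\`` line in the diff above relies on pgvector accepting a vector as a bracketed text literal. A minimal sketch of that formatting step (the helper name is illustrative, not part of the codebase):

```typescript
// Hypothetical helper showing the pgvector text input format used above:
// a vector column accepts a string literal such as "[0.1,0.2,0.3]".
function toPgvectorLiteral(embedding: number[]): string {
  return `[${embedding.join(',')}]`;
}

// Example: toPgvectorLiteral([0.25, -0.5, 1]) → '[0.25,-0.5,1]'
console.log(toPgvectorLiteral([0.25, -0.5, 1]));
```

Passing the raw `number[]` through the client would otherwise serialize to a JSON array, which pgvector does not parse as a vector literal.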
@@ -1,6 +1,6 @@
 import fs from 'fs';
 import path from 'path';
-import pool from '../config/database';
+import { getSupabaseServiceClient } from '../config/supabase';
 import logger from '../utils/logger';
 
 interface Migration {
@@ -16,24 +16,18 @@ class DatabaseMigrator {
     this.migrationsDir = path.join(__dirname, 'migrations');
   }
 
-  /**
-   * Get all migration files
-   */
   private async getMigrationFiles(): Promise<string[]> {
     try {
       const files = await fs.promises.readdir(this.migrationsDir);
       return files
         .filter(file => file.endsWith('.sql'))
-        .sort(); // Sort to ensure proper order
+        .sort();
     } catch (error) {
       logger.error('Error reading migrations directory:', error);
       throw error;
     }
   }
 
-  /**
-   * Load migration content
-   */
   private async loadMigration(fileName: string): Promise<Migration> {
     const filePath = path.join(this.migrationsDir, fileName);
     const sql = await fs.promises.readFile(filePath, 'utf-8');
@@ -45,68 +39,66 @@ class DatabaseMigrator {
     };
   }
 
-  /**
-   * Create migrations table if it doesn't exist
-   */
   private async createMigrationsTable(): Promise<void> {
-    const query = `
-      CREATE TABLE IF NOT EXISTS migrations (
-        id VARCHAR(255) PRIMARY KEY,
-        name VARCHAR(255) NOT NULL,
-        executed_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP
-      );
-    `;
+    const supabase = getSupabaseServiceClient();
+    const { error } = await supabase.rpc('exec_sql', {
+      sql: `
+        CREATE TABLE IF NOT EXISTS migrations (
+          id VARCHAR(255) PRIMARY KEY,
+          name VARCHAR(255) NOT NULL,
+          executed_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP
+        );
+      `
+    });
 
-    try {
-      await pool.query(query);
-      logger.info('Migrations table created or already exists');
-    } catch (error) {
+    if (error) {
       logger.error('Error creating migrations table:', error);
       throw error;
     }
+
+    logger.info('Migrations table created or already exists');
   }
 
-  /**
-   * Check if migration has been executed
-   */
   private async isMigrationExecuted(migrationId: string): Promise<boolean> {
-    const query = 'SELECT id FROM migrations WHERE id = $1';
-
-    try {
-      const result = await pool.query(query, [migrationId]);
-      return result.rows.length > 0;
-    } catch (error) {
+    const supabase = getSupabaseServiceClient();
+    const { data, error } = await supabase
+      .from('migrations')
+      .select('id')
+      .eq('id', migrationId);
+
+    if (error) {
       logger.error('Error checking migration status:', error);
       throw error;
     }
+
+    return data.length > 0;
   }
 
-  /**
-   * Mark migration as executed
-   */
   private async markMigrationExecuted(migrationId: string, name: string): Promise<void> {
-    const query = 'INSERT INTO migrations (id, name) VALUES ($1, $2)';
-
-    try {
-      await pool.query(query, [migrationId, name]);
-      logger.info(`Migration marked as executed: ${name}`);
-    } catch (error) {
+    const supabase = getSupabaseServiceClient();
+    const { error } = await supabase
+      .from('migrations')
+      .insert([{ id: migrationId, name }]);
+
+    if (error) {
       logger.error('Error marking migration as executed:', error);
       throw error;
     }
+
+    logger.info(`Migration marked as executed: ${name}`);
   }
 
-  /**
-   * Execute a single migration
-   */
   private async executeMigration(migration: Migration): Promise<void> {
     try {
       logger.info(`Executing migration: ${migration.name}`);
 
-      // Execute the migration SQL
-      await pool.query(migration.sql);
+      const supabase = getSupabaseServiceClient();
+      const { error } = await supabase.rpc('exec_sql', { sql: migration.sql });
+
+      if (error) {
+        throw error;
+      }
 
       // Mark as executed
       await this.markMigrationExecuted(migration.id, migration.name);
 
       logger.info(`Migration completed: ${migration.name}`);
@@ -116,25 +108,18 @@ class DatabaseMigrator {
     }
   }
 
-  /**
-   * Run all pending migrations
-   */
   async migrate(): Promise<void> {
     try {
       logger.info('Starting database migration...');
 
-      // Create migrations table
       await this.createMigrationsTable();
 
-      // Get all migration files
       const migrationFiles = await this.getMigrationFiles();
       logger.info(`Found ${migrationFiles.length} migration files`);
 
-      // Execute each migration
       for (const fileName of migrationFiles) {
         const migration = await this.loadMigration(fileName);
 
-        // Check if already executed
         const isExecuted = await this.isMigrationExecuted(migration.id);
 
         if (!isExecuted) {
@@ -150,21 +135,6 @@ class DatabaseMigrator {
       throw error;
     }
   }
 
-  /**
-   * Get migration status
-   */
-  async getMigrationStatus(): Promise<{ id: string; name: string; executed_at: Date }[]> {
-    const query = 'SELECT id, name, executed_at FROM migrations ORDER BY executed_at';
-
-    try {
-      const result = await pool.query(query);
-      return result.rows;
-    } catch (error) {
-      logger.error('Error getting migration status:', error);
-      throw error;
-    }
-  }
 }
 
-export default DatabaseMigrator;
+export default DatabaseMigrator;
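The migrator's core contract is idempotency: every `.sql` file runs at most once, in sorted file order, with executed ids recorded in a `migrations` table. A dependency-free sketch of that loop (in-memory stand-ins replace the Supabase calls; the names here are illustrative):

```typescript
// In-memory distillation of the migrator's run loop: sort, skip anything
// already recorded, execute the rest, and record each id as it completes.
type PendingMigration = { id: string; sql: string };

function runPending(
  migrations: PendingMigration[],
  executed: Set<string>,            // stands in for the migrations table
  exec: (sql: string) => void       // stands in for the exec_sql RPC
): string[] {
  const ran: string[] = [];
  for (const m of [...migrations].sort((a, b) => a.id.localeCompare(b.id))) {
    if (executed.has(m.id)) continue; // isMigrationExecuted
    exec(m.sql);                      // executeMigration
    executed.add(m.id);               // markMigrationExecuted
    ran.push(m.id);
  }
  return ran;
}
```

Running the function twice with the same inputs executes nothing the second time, which is the property the real migrator depends on at deploy time.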
@@ -1,26 +1,19 @@
 import { v4 as uuidv4 } from 'uuid';
 import bcrypt from 'bcryptjs';
 import { UserModel } from './UserModel';
 import { DocumentModel } from './DocumentModel';
 import { ProcessingJobModel } from './ProcessingJobModel';
 import logger from '../utils/logger';
 import { config } from '../config/env';
-import pool from '../config/database';
+import { getSupabaseServiceClient } from '../config/supabase';
 
 class DatabaseSeeder {
-  /**
-   * Seed the database with initial data
-   */
   async seed(): Promise<void> {
     try {
       logger.info('Starting database seeding...');
 
-      // Seed users
       await this.seedUsers();
 
-      // Seed documents (if any users were created)
       await this.seedDocuments();
 
-      // Seed processing jobs
       await this.seedProcessingJobs();
 
       logger.info('Database seeding completed successfully');
@@ -30,9 +23,6 @@ class DatabaseSeeder {
     }
   }
 
-  /**
-   * Seed users
-   */
   private async seedUsers(): Promise<void> {
     const users = [
       {
@@ -57,14 +47,11 @@ class DatabaseSeeder {
 
     for (const userData of users) {
       try {
-        // Check if user already exists
         const existingUser = await UserModel.findByEmail(userData.email);
 
         if (!existingUser) {
-          // Hash password
           const hashedPassword = await bcrypt.hash(userData.password, config.security.bcryptRounds);
 
-          // Create user
           await UserModel.create({
             ...userData,
             password: hashedPassword
@@ -80,12 +67,8 @@ class DatabaseSeeder {
       }
     }
   }
 
-  /**
-   * Seed documents
-   */
   private async seedDocuments(): Promise<void> {
     try {
-      // Get a user to associate documents with
       const user = await UserModel.findByEmail('user1@example.com');
 
       if (!user) {
@@ -98,28 +81,27 @@ class DatabaseSeeder {
           user_id: user.id,
           original_file_name: 'sample_cim_1.pdf',
           file_path: '/uploads/sample_cim_1.pdf',
-          file_size: 2048576, // 2MB
+          file_size: 2048576,
           status: 'completed' as const
         },
         {
           user_id: user.id,
           original_file_name: 'sample_cim_2.pdf',
           file_path: '/uploads/sample_cim_2.pdf',
-          file_size: 3145728, // 3MB
+          file_size: 3145728,
           status: 'processing_llm' as const
         },
         {
           user_id: user.id,
           original_file_name: 'sample_cim_3.pdf',
           file_path: '/uploads/sample_cim_3.pdf',
-          file_size: 1048576, // 1MB
+          file_size: 1048576,
           status: 'uploaded' as const
         }
       ];
 
       for (const docData of documents) {
         try {
-          // Check if document already exists (by file path)
           const existingDocs = await DocumentModel.findByUserId(user.id);
           const exists = existingDocs.some(doc => doc.file_path === docData.file_path);
 
@@ -138,12 +120,8 @@ class DatabaseSeeder {
       }
     }
   }
 
-  /**
-   * Seed processing jobs
-   */
   private async seedProcessingJobs(): Promise<void> {
     try {
-      // Get a document to associate jobs with
       const user = await UserModel.findByEmail('user1@example.com');
       if (!user) {
         logger.warn('No user found for seeding processing jobs');
@@ -157,7 +135,7 @@ class DatabaseSeeder {
         return;
       }
 
-      const document = documents[0]; // Use first document
+      const document = documents[0];
 
       if (!document) {
         logger.warn('No document found for seeding processing jobs');
@@ -187,7 +165,6 @@ class DatabaseSeeder {
 
       for (const jobData of jobs) {
        try {
-          // Check if job already exists
           const existingJobs = await ProcessingJobModel.findByDocumentId(document.id);
           const exists = existingJobs.some(job => job.type === jobData.type);
 
@@ -197,7 +174,6 @@ class DatabaseSeeder {
             type: jobData.type
           });
 
-          // Update status and progress
           await ProcessingJobModel.updateStatus(job.id, jobData.status);
           await ProcessingJobModel.updateProgress(job.id, jobData.progress);
 
@@ -214,23 +190,16 @@ class DatabaseSeeder {
       }
     }
   }
 
-  /**
-   * Clear all seeded data
-   */
   async clear(): Promise<void> {
     try {
       logger.info('Clearing seeded data...');
 
-      // Clear in reverse order to respect foreign key constraints
-      await pool.query('DELETE FROM processing_jobs');
-      await pool.query('DELETE FROM document_versions');
-      await pool.query('DELETE FROM document_feedback');
-      await pool.query('DELETE FROM documents');
-      await pool.query('DELETE FROM users WHERE email IN ($1, $2, $3)', [
-        'admin@example.com',
-        'user1@example.com',
-        'user2@example.com'
-      ]);
+      const supabase = getSupabaseServiceClient();
+      await supabase.from('processing_jobs').delete().neq('id', uuidv4());
+      await supabase.from('document_versions').delete().neq('id', uuidv4());
+      await supabase.from('document_feedback').delete().neq('id', uuidv4());
+      await supabase.from('documents').delete().neq('id', uuidv4());
+      await supabase.from('users').delete().in('email', ['admin@example.com', 'user1@example.com', 'user2@example.com']);
 
       logger.info('Seeded data cleared successfully');
     } catch (error) {
@@ -240,4 +209,4 @@ class DatabaseSeeder {
     }
   }
 }
 
-export default DatabaseSeeder;
+export default DatabaseSeeder;
@@ -23,16 +23,13 @@ const router = express.Router();
 router.use(verifyFirebaseToken);
 router.use(addCorrelationId);
 
-// NEW Firebase Storage direct upload routes
-router.post('/upload-url', documentController.getUploadUrl);
-router.post('/:id/confirm-upload', validateUUID('id'), documentController.confirmUpload);
 // Add logging middleware for document routes
 router.use((req, res, next) => {
   console.log(`📄 Document route accessed: ${req.method} ${req.path}`);
   next();
 });
 
-// LEGACY multipart upload routes (keeping for backward compatibility)
-router.post('/upload', handleFileUpload, documentController.uploadDocument);
-router.post('/', handleFileUpload, documentController.uploadDocument);
-router.get('/', documentController.getDocuments);
-
-// Analytics endpoints (MUST come before /:id routes to avoid conflicts)
+// Analytics endpoints (MUST come before ANY routes with :id parameters)
 router.get('/analytics', async (req, res) => {
   try {
     const userId = req.user?.uid;
@@ -44,11 +41,9 @@ router.get('/analytics', async (req, res) => {
     }
 
     const days = parseInt(req.query['days'] as string) || 30;
 
-    // Import the service here to avoid circular dependencies
     const { agenticRAGDatabaseService } = await import('../services/agenticRAGDatabaseService');
     const analytics = await agenticRAGDatabaseService.getAnalyticsData(days);
 
     return res.json({
       ...analytics,
       correlationId: req.correlationId || undefined
@@ -84,6 +79,15 @@ router.get('/processing-stats', async (req, res) => {
   }
 });
 
+// NEW Firebase Storage direct upload routes
+router.post('/upload-url', documentController.getUploadUrl);
+router.post('/:id/confirm-upload', validateUUID('id'), documentController.confirmUpload);
+
+// LEGACY multipart upload routes (keeping for backward compatibility)
+router.post('/upload', handleFileUpload, documentController.uploadDocument);
+router.post('/', handleFileUpload, documentController.uploadDocument);
+router.get('/', documentController.getDocuments);
+
 // Document-specific routes with UUID validation
 router.get('/:id', validateUUID('id'), documentController.getDocument);
 router.get('/:id/progress', validateUUID('id'), documentController.getDocumentProgress);
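The "MUST come before ANY routes with :id parameters" comment in the diff above reflects how Express dispatches: routes are tried in registration order, so a `/:id` route registered first would capture `GET /analytics` and treat `"analytics"` as a document id. A dependency-free sketch of that first-match behavior (a toy matcher, not the Express implementation):

```typescript
// Toy first-match router illustrating why /analytics must be registered
// before /:id: the first pattern that matches the path wins.
type Route = { pattern: string };

function firstMatch(routes: Route[], path: string): string | undefined {
  for (const r of routes) {
    if (r.pattern === path) return r.pattern; // exact match
    if (r.pattern.includes(':')) {
      const patternSegs = r.pattern.split('/');
      const pathSegs = path.split('/');
      if (
        patternSegs.length === pathSegs.length &&
        patternSegs.every((p, i) => p.startsWith(':') || p === pathSegs[i])
      ) {
        return r.pattern; // parameterized match, e.g. /:id
      }
    }
  }
  return undefined;
}
```

With `/:id` registered first, `firstMatch` returns `/:id` for the path `/analytics`; registering the exact route first avoids the shadowing (and in the real app, a UUID-validation failure on `"analytics"`).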
@@ -1,4 +1,8 @@
 import { logger } from '../utils/logger';
+import { DocumentProcessorServiceClient } from '@google-cloud/documentai';
+import { Storage } from '@google-cloud/storage';
+import { config } from '../config/env';
+import pdf from 'pdf-parse';
 
 interface ProcessingResult {
   success: boolean;
@@ -7,11 +11,46 @@ interface ProcessingResult {
   error?: string;
 }
 
+interface DocumentAIOutput {
+  text: string;
+  entities: Array<{
+    type: string;
+    mentionText: string;
+    confidence: number;
+  }>;
+  tables: Array<any>;
+  pages: Array<any>;
+  mimeType: string;
+}
+
+interface PageChunk {
+  startPage: number;
+  endPage: number;
+  buffer: Buffer;
+}
+
 export class DocumentAiGenkitProcessor {
   private gcsBucketName: string;
+  private documentAiClient: DocumentProcessorServiceClient;
+  private storageClient: Storage;
+  private processorName: string;
+  private readonly MAX_PAGES_PER_CHUNK = 30;
 
   constructor() {
-    this.gcsBucketName = process.env['GCS_BUCKET_NAME'] || 'cim-summarizer-uploads';
+    this.gcsBucketName = config.googleCloud.gcsBucketName;
+    this.documentAiClient = new DocumentProcessorServiceClient();
+    this.storageClient = new Storage();
+
+    // Construct the processor name
+    this.processorName = `projects/${config.googleCloud.projectId}/locations/${config.googleCloud.documentAiLocation}/processors/${config.googleCloud.documentAiProcessorId}`;
+
+    logger.info('Document AI + Genkit processor initialized', {
+      projectId: config.googleCloud.projectId,
+      location: config.googleCloud.documentAiLocation,
+      processorId: config.googleCloud.documentAiProcessorId,
+      processorName: this.processorName,
+      maxPagesPerChunk: this.MAX_PAGES_PER_CHUNK
+    });
   }
 
   async processDocument(
@@ -19,135 +58,331 @@ export class DocumentAiGenkitProcessor {
     userId: string,
     fileBuffer: Buffer,
     fileName: string,
-    _mimeType: string
+    mimeType: string
   ): Promise<ProcessingResult> {
     const startTime = Date.now();
 
     try {
-      logger.info('Starting Document AI + Genkit processing', {
+      logger.info('Starting Document AI + Agentic RAG processing', {
         documentId,
         userId,
         fileName,
-        fileSize: fileBuffer.length
+        fileSize: fileBuffer.length,
+        mimeType
       });
 
-      // Step 1: Upload file to GCS
-      const gcsFilePath = await this.uploadToGCS(fileBuffer, fileName);
-      logger.info('File uploaded to GCS', { gcsFilePath });
+      // Step 1: Extract text using Document AI or fallback
+      const extractedText = await this.extractTextFromDocument(fileBuffer, fileName, mimeType);
 
-      // Step 2: Process with Document AI
-      const documentAiOutput = await this.processWithDocumentAI(gcsFilePath);
-      logger.info('Document AI processing completed', {
-        textLength: documentAiOutput?.text?.length || 0,
-        entitiesCount: documentAiOutput?.entities?.length || 0
-      });
+      if (!extractedText) {
+        throw new Error('Failed to extract text from document');
+      }
 
-      // Step 3: Process with Genkit
-      const genkitOutput = await this.processWithGenkit(fileName);
-      logger.info('Genkit processing completed', {
-        outputLength: genkitOutput?.markdownOutput?.length || 0
-      });
+      logger.info('Text extraction completed', {
+        textLength: extractedText.length
+      });
 
-      // Step 4: Cleanup GCS files
-      await this.cleanupGCSFiles(gcsFilePath);
-      logger.info('GCS cleanup completed');
+      // Step 2: Process extracted text through Agentic RAG
+      const agenticRagResult = await this.processWithAgenticRAG(documentId, extractedText);
 
       const processingTime = Date.now() - startTime;
 
       return {
         success: true,
-        content: genkitOutput?.markdownOutput || 'No analysis generated',
+        content: agenticRagResult.summary || extractedText,
         metadata: {
-          processingStrategy: 'document_ai_genkit',
+          processingStrategy: 'document_ai_agentic_rag',
           processingTime,
-          documentAiOutput,
-          genkitOutput,
+          extractedTextLength: extractedText.length,
+          agenticRagResult,
           fileSize: fileBuffer.length,
-          fileName
+          fileName,
+          mimeType
         }
       };
 
     } catch (error) {
       const processingTime = Date.now() - startTime;
-      logger.error('Document AI + Genkit processing failed', {
-        documentId,
-        error: error instanceof Error ? error.message : String(error),
-        stack: error instanceof Error ? error.stack : undefined
-      });
+      const errorMessage = error instanceof Error ? error.message : String(error);
+      const errorStack = error instanceof Error ? error.stack : undefined;
+      const errorDetails = error instanceof Error ? {
+        name: error.name,
+        message: error.message,
+        stack: error.stack
+      } : {
+        type: typeof error,
+        value: error
+      };
+
+      logger.error('Document AI + Agentic RAG processing failed', {
+        documentId,
+        error: errorMessage,
+        errorDetails,
+        stack: errorStack,
+        processingTime
+      });
 
       return {
         success: false,
         content: '',
-        error: `Document AI + Genkit processing failed: ${error instanceof Error ? error.message : String(error)}`,
+        error: `Document AI + Agentic RAG processing failed: ${errorMessage}`,
         metadata: {
-          processingStrategy: 'document_ai_genkit',
+          processingStrategy: 'document_ai_agentic_rag',
           processingTime,
-          error: error instanceof Error ? error.message : String(error)
+          error: errorMessage,
+          errorDetails,
+          stack: errorStack
         }
       };
     }
   }
 
+  private async extractTextFromDocument(fileBuffer: Buffer, fileName: string, mimeType: string): Promise<string> {
+    try {
+      // Check document size first
+      const pdfData = await pdf(fileBuffer);
+      const totalPages = pdfData.numpages;
+
+      logger.info('PDF analysis completed', {
+        totalPages,
+        textLength: pdfData.text?.length || 0
+      });
+
+      // If document has more than 30 pages, use pdf-parse fallback
+      if (totalPages > this.MAX_PAGES_PER_CHUNK) {
+        logger.warn('Document exceeds Document AI page limit, using pdf-parse fallback', {
+          totalPages,
+          maxPagesPerChunk: this.MAX_PAGES_PER_CHUNK
+        });
+
+        return pdfData.text || '';
+      }
+
+      // For documents <= 30 pages, use Document AI
+      logger.info('Using Document AI for text extraction', {
+        totalPages,
+        maxPagesPerChunk: this.MAX_PAGES_PER_CHUNK
+      });
+
+      // Upload file to GCS
+      const gcsFilePath = await this.uploadToGCS(fileBuffer, fileName);
+
+      // Process with Document AI
+      const documentAiOutput = await this.processWithDocumentAI(gcsFilePath, mimeType);
+
+      // Cleanup GCS file
+      await this.cleanupGCSFiles(gcsFilePath);
+
+      return documentAiOutput.text;
+
+    } catch (error) {
+      logger.error('Text extraction failed, using pdf-parse fallback', {
+        error: error instanceof Error ? error.message : String(error)
+      });
+
+      // Fallback to pdf-parse
+      try {
+        const pdfData = await pdf(fileBuffer);
+        return pdfData.text || '';
+      } catch (fallbackError) {
+        logger.error('Both Document AI and pdf-parse failed', {
+          originalError: error instanceof Error ? error.message : String(error),
+          fallbackError: fallbackError instanceof Error ? fallbackError.message : String(fallbackError)
+        });
+        throw new Error('Failed to extract text from document using any method');
+      }
+    }
+  }
 
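The size gate in `extractTextFromDocument` reduces to a pure decision: Document AI for documents within the per-request page limit, the pdf-parse fallback otherwise. A minimal sketch of that rule (the function name is illustrative, not from the codebase):

```typescript
// Illustrative distillation of the extraction strategy above: documents at
// or under the page limit go to Document AI; larger ones fall back to
// local pdf-parse text extraction.
function chooseExtractor(
  totalPages: number,
  maxPages: number = 30 // mirrors MAX_PAGES_PER_CHUNK
): 'document_ai' | 'pdf_parse' {
  return totalPages <= maxPages ? 'document_ai' : 'pdf_parse';
}
```

Keeping the boundary at `<=` matters: a 30-page document is still eligible for Document AI, and only page 31 onward triggers the fallback.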
private async processWithAgenticRAG(documentId: string, extractedText: string): Promise<any> {
|
||||
try {
|
||||
logger.info('Processing extracted text with Agentic RAG', {
|
||||
documentId,
|
||||
textLength: extractedText.length
|
||||
});
|
||||
|
||||
// Import and use the optimized agentic RAG processor
|
||||
logger.info('Importing optimized agentic RAG processor...');
|
||||
const { optimizedAgenticRAGProcessor } = await import('./optimizedAgenticRAGProcessor');
|
||||
|
||||
logger.info('Agentic RAG processor imported successfully', {
|
||||
processorType: typeof optimizedAgenticRAGProcessor,
|
||||
hasProcessLargeDocument: typeof optimizedAgenticRAGProcessor?.processLargeDocument === 'function'
|
||||
});
|
||||
|
||||
logger.info('Calling processLargeDocument...');
|
||||
const result = await optimizedAgenticRAGProcessor.processLargeDocument(
|
||||
documentId,
|
||||
extractedText,
|
||||
{}
|
||||
);
|
||||
|
||||
logger.info('Agentic RAG processing completed', {
|
||||
success: result.success,
|
||||
summaryLength: result.summary?.length || 0,
|
||||
analysisDataKeys: result.analysisData ? Object.keys(result.analysisData) : [],
|
||||
resultType: typeof result
|
||||
});
|
||||
|
||||
return result;
|
||||
|
||||
} catch (error) {
|
||||
const errorMessage = error instanceof Error ? error.message : String(error);
|
||||
const errorStack = error instanceof Error ? error.stack : undefined;
|
||||
const errorDetails = error instanceof Error ? {
|
||||
name: error.name,
|
||||
message: error.message,
|
||||
stack: error.stack
|
||||
} : {
|
||||
type: typeof error,
|
||||
value: error
|
||||
};
|
||||
|
||||
logger.error('Agentic RAG processing failed', {
|
||||
documentId,
|
||||
error: errorMessage,
|
||||
errorDetails,
|
||||
stack: errorStack
|
||||
});
|
||||
throw error;
|
||||
}
|
||||
}
|
||||
|
||||
private async uploadToGCS(fileBuffer: Buffer, fileName: string): Promise<string> {
|
||||
// This is a placeholder implementation
|
||||
// In production, this would upload to Google Cloud Storage
|
||||
logger.info('Uploading file to GCS (placeholder)', { fileName, fileSize: fileBuffer.length });
|
||||
|
||||
// Simulate upload delay
|
||||
await new Promise(resolve => setTimeout(resolve, 100));
|
||||
|
||||
return `gs://${this.gcsBucketName}/uploads/${fileName}`;
|
||||
try {
|
||||
const bucket = this.storageClient.bucket(this.gcsBucketName);
|
||||
const file = bucket.file(`uploads/${Date.now()}_${fileName}`);
|
||||
|
||||
logger.info('Uploading file to GCS', {
|
||||
fileName,
|
||||
fileSize: fileBuffer.length,
|
||||
bucket: this.gcsBucketName,
|
||||
destination: file.name
|
||||
});
|
||||
|
||||
await file.save(fileBuffer, {
|
||||
metadata: {
|
||||
contentType: 'application/pdf'
|
||||
}
|
||||
});
|
||||
|
||||
logger.info('File uploaded successfully to GCS', {
|
||||
gcsPath: `gs://${this.gcsBucketName}/${file.name}`
|
||||
});
|
||||
|
||||
return `gs://${this.gcsBucketName}/${file.name}`;
|
||||
} catch (error) {
|
||||
logger.error('Failed to upload file to GCS', {
|
||||
fileName,
|
||||
error: error instanceof Error ? error.message : String(error)
|
||||
});
|
||||
throw error;
|
||||
}
|
||||
}
|
||||
|
||||
private async processWithDocumentAI(gcsFilePath: string): Promise<any> {
|
||||
// This is a placeholder implementation
|
||||
// In production, this would call Google Cloud Document AI
|
||||
logger.info('Processing with Document AI (placeholder)', { gcsFilePath });
|
||||
|
||||
// Simulate Document AI processing
|
||||
await new Promise(resolve => setTimeout(resolve, 200));
|
||||
|
||||
return {
|
||||
text: 'Sample extracted text from Document AI',
|
||||
entities: [
|
||||
{ type: 'COMPANY_NAME', mentionText: 'Sample Company', confidence: 0.95 },
|
||||
{ type: 'MONEY', mentionText: '$10M', confidence: 0.90 }
|
||||
],
|
||||
tables: []
|
||||
};
|
||||
}
|
||||
private async processWithDocumentAI(gcsFilePath: string, mimeType: string): Promise<DocumentAIOutput> {
|
||||
try {
|
||||
logger.info('Processing with Document AI', {
|
||||
gcsFilePath,
|
||||
processorName: this.processorName,
|
||||
mimeType
|
||||
});
|
||||
|
||||
private async processWithGenkit(fileName: string): Promise<any> {
|
||||
// This is a placeholder implementation
|
||||
// In production, this would call Genkit for AI analysis
|
||||
logger.info('Processing with Genkit (placeholder)', { fileName });
|
||||
|
||||
// Simulate Genkit processing
|
||||
await new Promise(resolve => setTimeout(resolve, 300));
|
||||
|
||||
return {
|
||||
markdownOutput: `# CIM Analysis: ${fileName}
|
||||
// Create the request
|
||||
const request = {
|
||||
name: this.processorName,
|
||||
rawDocument: {
|
||||
content: '', // We'll use GCS source instead
|
||||
mimeType: mimeType
|
||||
},
|
||||
gcsDocument: {
|
||||
gcsUri: gcsFilePath,
|
||||
mimeType: mimeType
|
||||
}
|
||||
};
|
||||
|
||||
## Executive Summary
|
||||
Sample analysis generated by Document AI + Genkit integration.
|
||||
logger.info('Sending Document AI request', {
|
||||
processorName: this.processorName,
|
||||
gcsUri: gcsFilePath
|
||||
});
|
||||
|
||||
## Key Findings
|
||||
- Document processed successfully
|
||||
- AI analysis completed
|
||||
- Integration working as expected
|
||||
// Process the document
|
||||
const [result] = await this.documentAiClient.processDocument(request);
|
||||
const { document } = result;
|
||||
|
||||
---
|
||||
*Generated by Document AI + Genkit integration*`
|
||||
};
|
||||
if (!document) {
|
||||
throw new Error('Document AI returned no document');
|
||||
}
|
||||
|
||||
logger.info('Document AI processing successful', {
|
||||
textLength: document.text?.length || 0,
|
||||
pagesCount: document.pages?.length || 0,
|
||||
entitiesCount: document.entities?.length || 0
|
||||
});
|
      // Extract text
      const text = document.text || '';

      // Extract entities
      const entities = document.entities?.map(entity => ({
        type: entity.type || 'UNKNOWN',
        mentionText: entity.mentionText || '',
        confidence: entity.confidence || 0
      })) || [];

      // Extract tables
      const tables = document.pages?.flatMap(page =>
        page.tables?.map(table => ({
          rows: table.headerRows?.length || 0,
          columns: table.bodyRows?.[0]?.cells?.length || 0
        })) || []
      ) || [];

      // Extract pages info
      const pages = document.pages?.map(page => ({
        pageNumber: page.pageNumber || 0,
        blocksCount: page.blocks?.length || 0
      })) || [];

      return {
        text,
        entities,
        tables,
        pages,
        mimeType: document.mimeType || mimeType
      };

    } catch (error) {
      logger.error('Document AI processing failed', {
        gcsFilePath,
        processorName: this.processorName,
        error: error instanceof Error ? error.message : String(error),
        stack: error instanceof Error ? error.stack : undefined
      });
      throw error;
    }
  }

  private async cleanupGCSFiles(gcsFilePath: string): Promise<void> {
    // This is a placeholder implementation
    // In production, this would delete files from Google Cloud Storage
    logger.info('Cleaning up GCS files (placeholder)', { gcsFilePath });

    // Simulate cleanup delay
    await new Promise(resolve => setTimeout(resolve, 50));
    try {
      const bucketName = gcsFilePath.replace('gs://', '').split('/')[0];
      const fileName = gcsFilePath.replace(`gs://${bucketName}/`, '');

      logger.info('Cleaning up GCS files', { gcsFilePath, bucketName, fileName });

      const bucket = this.storageClient.bucket(bucketName);
      const file = bucket.file(fileName);

      await file.delete();

      logger.info('GCS file cleanup completed', { gcsFilePath });
    } catch (error) {
      logger.warn('Failed to cleanup GCS files', {
        gcsFilePath,
        error: error instanceof Error ? error.message : String(error)
      });
      // Don't throw error for cleanup failures
    }
  }
}
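The cleanup path above derives the bucket name and object path from a `gs://` URI by string replacement. A standalone sketch of that parsing (the helper name is illustrative, not part of the codebase):

```typescript
// Hypothetical helper mirroring cleanupGCSFiles' URI handling: split a
// gs://bucket/object path into its bucket and object components.
function parseGcsUri(uri: string): { bucket: string; object: string } {
  if (!uri.startsWith('gs://')) {
    throw new Error(`Not a GCS URI: ${uri}`);
  }
  const withoutScheme = uri.slice('gs://'.length);
  const slash = withoutScheme.indexOf('/');
  if (slash === -1) {
    // Bucket-only URI such as gs://my-bucket
    return { bucket: withoutScheme, object: '' };
  }
  return {
    bucket: withoutScheme.slice(0, slash),
    object: withoutScheme.slice(slash + 1),
  };
}
```

Unlike the `replace`-based version, this fails loudly on inputs that are not `gs://` URIs.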
@@ -83,9 +83,19 @@ export class OptimizedAgenticRAGProcessor {

       logger.info(`Optimized processing completed for document: ${documentId}`, result);

+      console.log('✅ Optimized agentic RAG processing completed successfully for document:', documentId);
+      console.log('✅ Total chunks processed:', result.processedChunks);
+      console.log('✅ Processing time:', result.processingTime, 'ms');
+      console.log('✅ Memory usage:', result.memoryUsage, 'MB');
+      console.log('✅ Summary length:', result.summary?.length || 0);

       return result;
     } catch (error) {
       logger.error(`Optimized processing failed for document: ${documentId}`, error);

+      console.log('❌ Optimized agentic RAG processing failed for document:', documentId);
+      console.log('❌ Error:', error instanceof Error ? error.message : String(error));

       throw error;
     }
   }
@@ -169,6 +169,9 @@ class UnifiedDocumentProcessor {
     } catch (error) {
       logger.error('Optimized agentic RAG processing failed', { documentId, error });

+      console.log('❌ Unified document processor - optimized agentic RAG failed for document:', documentId);
+      console.log('❌ Error:', error instanceof Error ? error.message : String(error));

       return {
         success: false,
         summary: '',
@@ -188,33 +191,60 @@ class UnifiedDocumentProcessor {
     documentId: string,
     userId: string,
     text: string,
-    _options: any
+    options: any
   ): Promise<ProcessingResult> {
     logger.info('Using Document AI + Genkit processing strategy', { documentId });

     const startTime = Date.now();

     try {
-      // For now, we'll use the existing text extraction
-      // In a full implementation, this would use the Document AI processor
+      // Get the file buffer from options if available, otherwise use text
+      const fileBuffer = options.fileBuffer || Buffer.from(text);
+      const fileName = options.fileName || `document-${documentId}.pdf`;
+      const mimeType = options.mimeType || 'application/pdf';
+
+      logger.info('Document AI processing with file data', {
+        documentId,
+        fileSize: fileBuffer.length,
+        fileName,
+        mimeType
+      });

       const result = await documentAiGenkitProcessor.processDocument(
         documentId,
         userId,
-        Buffer.from(text), // Convert text to buffer for processing
-        `document-${documentId}.txt`,
-        'text/plain'
+        fileBuffer,
+        fileName,
+        mimeType
       );

+      if (!result.success) {
+        logger.error('Document AI processing failed', {
+          documentId,
+          error: result.error,
+          metadata: result.metadata
+        });
+      }

       return {
         success: result.success,
         summary: result.content || '',
-        analysisData: (result.metadata?.analysisData as CIMReview) || {} as CIMReview,
+        analysisData: (result.metadata?.agenticRagResult?.analysisData as CIMReview) || {} as CIMReview,
         processingStrategy: 'document_ai_genkit',
         processingTime: Date.now() - startTime,
-        apiCalls: 1, // Document AI + Genkit typically uses fewer API calls
+        apiCalls: 1, // Document AI + Agentic RAG typically uses fewer API calls
         error: result.error || undefined
       };
     } catch (error) {
       const errorMessage = error instanceof Error ? error.message : String(error);
       const errorStack = error instanceof Error ? error.stack : undefined;

       logger.error('Document AI + Genkit processing failed with exception', {
         documentId,
         error: errorMessage,
         stack: errorStack
       });

       return {
         success: false,
         summary: '',
@@ -222,7 +252,7 @@ class UnifiedDocumentProcessor {
         processingStrategy: 'document_ai_genkit',
         processingTime: Date.now() - startTime,
         apiCalls: 0,
-        error: error instanceof Error ? error.message : 'Unknown error'
+        error: errorMessage
       };
     }
   }
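Several places in this commit normalize unknown `catch` values with `error instanceof Error ? error.message : String(error)`. Extracted as a tiny helper (the name is hypothetical, not in the repo):

```typescript
// Normalize an unknown catch value into a log-safe message string.
function toErrorMessage(error: unknown): string {
  return error instanceof Error ? error.message : String(error);
}
```

Since TypeScript types `catch` variables as `unknown`, routing every log call through one helper keeps the guard in a single place.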
@@ -241,13 +241,14 @@ class VectorDatabaseService {
    * Store document chunks with embeddings
    */
   async storeDocumentChunks(chunks: DocumentChunk[]): Promise<void> {
+    const initialized = await this.ensureInitialized();
+    if (!initialized) {
+      logger.warn('Vector database not available, skipping chunk storage');
+      return;
+    }

     try {
-      const isInitialized = await this.ensureInitialized();
-
-      if (!isInitialized) {
-        logger.warn('Vector database not initialized, skipping chunk storage');
-        return;
-      }

       switch (this.provider) {
         case 'pinecone':
           await this.storeInPinecone(chunks);
@@ -261,11 +262,14 @@ class VectorDatabaseService {
         case 'supabase':
           await this.storeInSupabase(chunks);
           break;
         default:
           logger.warn(`Vector database provider ${this.provider} not supported for storage`);
       }
       logger.info(`Stored ${chunks.length} document chunks in vector database`);
     } catch (error) {
-      logger.error('Failed to store document chunks', error);
-      throw new Error('Vector storage failed');
+      // Log the error but don't fail the entire upload process
+      logger.error('Failed to store document chunks in vector database:', error);
+      logger.warn('Continuing with upload process without vector storage');
+      // Don't throw the error - let the upload continue
     }
   }
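The change above makes vector storage best-effort: failures are logged and the upload continues. That policy can be captured in a small wrapper (a sketch; function and parameter names are illustrative, not from the codebase):

```typescript
// Run an optional pipeline step; log on failure but never let it
// propagate and fail the surrounding upload flow.
async function bestEffort(
  step: () => Promise<void>,
  label: string,
  warn: (msg: string) => void,
): Promise<boolean> {
  try {
    await step();
    return true;
  } catch (error) {
    warn(`${label} failed: ${error instanceof Error ? error.message : String(error)}`);
    return false;
  }
}
```

Returning a boolean (instead of throwing) lets callers record that the step was skipped without branching on exception types.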
@@ -422,7 +426,6 @@ class VectorDatabaseService {
   async getVectorDatabaseStats(): Promise<{
     totalChunks: number;
     totalDocuments: number;
-    totalSearches: number;
     averageSimilarity: number;
   }> {
     try {
@@ -521,13 +524,19 @@ class VectorDatabaseService {
         .upsert(supabaseRows);

       if (error) {
+        // Check if it's a table/column missing error
+        if (error.message && (error.message.includes('chunkIndex') || error.message.includes('document_chunks'))) {
+          logger.warn('Vector database table/columns not available, skipping vector storage:', error.message);
+          return; // Don't throw, just skip vector storage
+        }
         throw error;
       }

       logger.info(`Successfully stored ${chunks.length} chunks in Supabase`);
     } catch (error) {
       logger.error('Failed to store chunks in Supabase:', error);
-      throw error;
+      // Don't throw the error - let the upload continue without vector storage
+      logger.warn('Continuing upload process without vector storage');
     }
   }
@@ -581,4 +590,4 @@ class VectorDatabaseService {
   }
 }

-export const vectorDatabaseService = new VectorDatabaseService();
+export const vectorDatabaseService = new VectorDatabaseService();
@@ -1,89 +1,76 @@
 -- Enable the pgvector extension
 CREATE EXTENSION IF NOT EXISTS vector;

--- Create document_chunks table with vector support
+-- Create the document_chunks table
 CREATE TABLE IF NOT EXISTS document_chunks (
   id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
-  document_id VARCHAR(255) NOT NULL,
-  chunk_index INTEGER NOT NULL,
-  content TEXT NOT NULL,
-  embedding vector(1536), -- OpenAI embeddings are 1536 dimensions
-  metadata JSONB DEFAULT '{}',
-  created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
-  updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
-);
-
--- Create indexes for better performance
-CREATE INDEX IF NOT EXISTS document_chunks_document_id_idx ON document_chunks(document_id);
-CREATE INDEX IF NOT EXISTS document_chunks_embedding_idx ON document_chunks USING ivfflat (embedding vector_cosine_ops);
-
--- Create function to enable pgvector (for RPC calls)
-CREATE OR REPLACE FUNCTION enable_pgvector()
-RETURNS VOID AS $$
-BEGIN
-  CREATE EXTENSION IF NOT EXISTS vector;
-END;
-$$ LANGUAGE plpgsql;
-
--- Create function to create document_chunks table (for RPC calls)
-CREATE OR REPLACE FUNCTION create_document_chunks_table()
-RETURNS VOID AS $$
-BEGIN
-  CREATE TABLE IF NOT EXISTS document_chunks (
-    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
-    document_id VARCHAR(255) NOT NULL,
-    chunk_index INTEGER NOT NULL,
-    content TEXT NOT NULL,
-    embedding vector(1536),
-    metadata JSONB DEFAULT '{}',
-    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
-    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
-  );
-
-  CREATE INDEX IF NOT EXISTS document_chunks_document_id_idx ON document_chunks(document_id);
-  CREATE INDEX IF NOT EXISTS document_chunks_embedding_idx ON document_chunks USING ivfflat (embedding vector_cosine_ops);
-END;
-$$ LANGUAGE plpgsql;
-
--- Create function to match documents based on vector similarity
-CREATE OR REPLACE FUNCTION match_documents(
-  query_embedding vector(1536),
-  match_threshold float DEFAULT 0.7,
-  match_count int DEFAULT 10
-)
-RETURNS TABLE(
-  id UUID,
-  content TEXT,
-  metadata JSONB,
-  document_id VARCHAR(255),
-  similarity FLOAT
-) AS $$
-BEGIN
-  RETURN QUERY
-  SELECT
-    document_chunks.id,
-    document_chunks.content,
-    document_chunks.metadata,
-    document_chunks.document_id,
-    1 - (document_chunks.embedding <=> query_embedding) AS similarity
-  FROM document_chunks
-  WHERE 1 - (document_chunks.embedding <=> query_embedding) > match_threshold
-  ORDER BY document_chunks.embedding <=> query_embedding
-  LIMIT match_count;
-END;
-$$ LANGUAGE plpgsql;
-
--- Enable Row Level Security (RLS) if needed
--- ALTER TABLE document_chunks ENABLE ROW LEVEL SECURITY;
-
--- Create policies for RLS (adjust as needed for your auth requirements)
--- CREATE POLICY "Users can view all document chunks" ON document_chunks FOR SELECT USING (true);
--- CREATE POLICY "Users can insert document chunks" ON document_chunks FOR INSERT WITH CHECK (true);
--- CREATE POLICY "Users can update document chunks" ON document_chunks FOR UPDATE USING (true);
--- CREATE POLICY "Users can delete document chunks" ON document_chunks FOR DELETE USING (true);
-
--- Grant necessary permissions
-GRANT ALL ON document_chunks TO authenticated;
-GRANT ALL ON document_chunks TO anon;
-GRANT EXECUTE ON FUNCTION match_documents TO authenticated;
-GRANT EXECUTE ON FUNCTION match_documents TO anon;
+  document_id UUID NOT NULL,
+  content TEXT,
+  embedding VECTOR(1536),
+  chunk_index INTEGER,
+  section TEXT,
+  page_number INTEGER,
+  created_at TIMESTAMPTZ DEFAULT NOW(),
+  updated_at TIMESTAMPTZ DEFAULT NOW()
+);
+
+-- Create the vector_similarity_searches table
+CREATE TABLE IF NOT EXISTS vector_similarity_searches (
+  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
+  user_id UUID,
+  query_text TEXT,
+  query_embedding VECTOR(1536),
+  search_results JSONB,
+  filters JSONB,
+  limit_count INTEGER,
+  similarity_threshold REAL,
+  processing_time_ms INTEGER,
+  created_at TIMESTAMPTZ DEFAULT NOW()
+);
+
+-- Create the function to count distinct documents
+CREATE OR REPLACE FUNCTION count_distinct_documents()
+RETURNS INTEGER AS $$
+BEGIN
+  RETURN (SELECT COUNT(DISTINCT document_id) FROM document_chunks);
+END;
+$$ LANGUAGE plpgsql;
+
+-- Create the function to get the average chunk size
+CREATE OR REPLACE FUNCTION average_chunk_size()
+RETURNS INTEGER AS $$
+BEGIN
+  RETURN (SELECT AVG(LENGTH(content)) FROM document_chunks);
+END;
+$$ LANGUAGE plpgsql;
+
+-- Create the function to get search analytics
+CREATE OR REPLACE FUNCTION get_search_analytics(user_id_param UUID, days_param INTEGER)
+RETURNS TABLE(query_text TEXT, search_count BIGINT) AS $$
+BEGIN
+  RETURN QUERY
+  SELECT
+    vs.query_text,
+    COUNT(*) as search_count
+  FROM
+    vector_similarity_searches vs
+  WHERE
+    vs.user_id = user_id_param AND
+    vs.created_at >= NOW() - (days_param * INTERVAL '1 day')
+  GROUP BY
+    vs.query_text
+  ORDER BY
+    search_count DESC
+  LIMIT 20;
+END;
+$$ LANGUAGE plpgsql;
+
+-- Create the function to get vector database stats
+CREATE OR REPLACE FUNCTION get_vector_database_stats()
+RETURNS TABLE(total_chunks BIGINT, total_documents BIGINT, average_similarity REAL) AS $$
+BEGIN
+  RETURN QUERY
+  SELECT
+    (SELECT COUNT(*) FROM document_chunks),
+    (SELECT COUNT(DISTINCT document_id) FROM document_chunks),
+    (SELECT AVG(similarity_score) FROM document_similarities WHERE similarity_score > 0);
+END;
+$$ LANGUAGE plpgsql;
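In the `match_documents` function above, pgvector's `<=>` operator returns cosine distance, so `1 - (embedding <=> query_embedding)` is cosine similarity. A minimal in-memory sketch of that score (for intuition only; the real computation happens inside Postgres):

```typescript
// Cosine similarity: dot(a, b) / (|a| * |b|). Scores range from -1 to 1;
// match_documents keeps rows whose score exceeds match_threshold.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

Note the ivfflat index in the removed script uses `vector_cosine_ops`, which matches this ordering by `<=>`.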
currrent_output.json (374 lines, new file)
File diff suppressed because one or more lines are too long
@@ -10,7 +10,7 @@ import Analytics from './components/Analytics';
 import UploadMonitoringDashboard from './components/UploadMonitoringDashboard';
 import LogoutButton from './components/LogoutButton';
 import { documentService, GCSErrorHandler, GCSError } from './services/documentService';
-import { debugAuth, testAPIAuth } from './utils/authDebug';
+// import { debugAuth, testAPIAuth } from './utils/authDebug';

 import {
   Home,
@@ -75,13 +75,14 @@ const Dashboard: React.FC = () => {

       if (response.ok) {
         const result = await response.json();
-        // The API returns an array directly, not wrapped in success/data
-        if (Array.isArray(result)) {
+        // The API returns documents wrapped in a documents property
+        const documentsArray = result.documents || result;
+        if (Array.isArray(documentsArray)) {
           // Transform backend data to frontend format
-          const transformedDocs = result.map((doc: any) => ({
+          const transformedDocs = documentsArray.map((doc: any) => ({
             id: doc.id,
-            name: doc.name || doc.originalName,
-            originalName: doc.originalName,
+            name: doc.name || doc.originalName || 'Unknown',
+            originalName: doc.originalName || doc.name || 'Unknown',
             status: mapBackendStatus(doc.status),
             uploadedAt: doc.uploadedAt,
             processedAt: doc.processedAt,
@@ -216,10 +217,22 @@ const Dashboard: React.FC = () => {
     return () => clearInterval(refreshInterval);
   }, [fetchDocuments]);

-  const handleUploadComplete = (fileId: string) => {
-    console.log('Upload completed:', fileId);
-    // Refresh documents list after upload
-    fetchDocuments();
+  const handleUploadComplete = (documentId: string) => {
+    console.log('Upload completed:', documentId);
+    // Add the new document to the list with a "processing" status
+    // Since we only have the ID, we'll create a minimal document object
+    const newDocument = {
+      id: documentId,
+      status: 'processing',
+      name: 'Processing...',
+      originalName: 'Processing...',
+      uploadedAt: new Date().toISOString(),
+      fileSize: 0,
+      user_id: user?.id || '',
+      created_at: new Date().toISOString(),
+      updated_at: new Date().toISOString()
+    };
+    setDocuments(prev => [...prev, newDocument]);
   };

   const handleUploadError = (error: string) => {
@@ -291,18 +304,18 @@ const Dashboard: React.FC = () => {
     setViewingDocument(null);
   };

-  // Debug functions
-  const handleDebugAuth = async () => {
-    await debugAuth();
-  };
+  // Debug functions (commented out for now)
+  // const handleDebugAuth = async () => {
+  //   await debugAuth();
+  // };

-  const handleTestAPIAuth = async () => {
-    await testAPIAuth();
-  };
+  // const handleTestAPIAuth = async () => {
+  //   await testAPIAuth();
+  // };

   const filteredDocuments = documents.filter(doc =>
-    doc.name.toLowerCase().includes(searchTerm.toLowerCase()) ||
-    doc.originalName.toLowerCase().includes(searchTerm.toLowerCase())
+    (doc.name?.toLowerCase() || '').includes(searchTerm.toLowerCase()) ||
+    (doc.originalName?.toLowerCase() || '').includes(searchTerm.toLowerCase())
   );

   const stats = {
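The Dashboard fix above accepts either a bare array or a `{ documents: [...] }` wrapper from the API. Stripped of the React state updates, that tolerant unwrapping can be sketched as (helper name hypothetical; the `documents` property name mirrors the diff):

```typescript
// Accept either a bare array response or an object wrapping the array in
// a `documents` property; anything else yields an empty list.
function extractDocuments(result: unknown): any[] {
  const documentsArray = (result as any)?.documents ?? result;
  return Array.isArray(documentsArray) ? documentsArray : [];
}
```

Returning `[]` for unexpected shapes keeps the subsequent `.map` call safe.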
@@ -21,7 +21,7 @@ interface UploadedFile {
 }

 interface DocumentUploadProps {
-  onUploadComplete?: (fileId: string) => void;
+  onUploadComplete?: (documentId: string) => void;
   onUploadError?: (error: string) => void;
 }
@@ -104,15 +104,15 @@ const DocumentUpload: React.FC<DocumentUploadProps> = ({
         abortController.signal
       );

-      // Upload completed - update status to "uploaded"
+      // Upload completed - update status to "processing" immediately
       setUploadedFiles(prev =>
         prev.map(f =>
           f.id === uploadedFile.id
             ? {
                 ...f,
-                id: document.id,
-                documentId: document.id,
-                status: 'uploaded',
+                id: result.id,
+                documentId: result.id,
+                status: 'processing', // Changed from 'uploaded' to 'processing'
                 progress: 100
               }
             : f
@@ -120,10 +120,10 @@ const DocumentUpload: React.FC<DocumentUploadProps> = ({
       );

       // Call the completion callback with the document ID
-      onUploadComplete?.(document.id);
+      onUploadComplete?.(result.id);

-      // Start monitoring processing progress
-      monitorProcessingProgress(document.id, uploadedFile.id);
+      // Start monitoring processing progress immediately
+      monitorProcessingProgress(result.id, uploadedFile.id);

     } catch (error) {
       // Check if this was an abort error
@@ -189,8 +189,29 @@ const DocumentUpload: React.FC<DocumentUploadProps> = ({
       console.warn('Attempted to monitor progress for document with invalid UUID format:', documentId);
       return;
     }

+    // Add timeout to prevent infinite polling (30 minutes max)
+    const startTime = Date.now();
+    const maxPollingTime = 30 * 60 * 1000; // 30 minutes

     const checkProgress = async () => {
+      // Check if we've exceeded the maximum polling time
+      if (Date.now() - startTime > maxPollingTime) {
+        console.warn(`Polling timeout for document ${documentId} after ${maxPollingTime / 1000 / 60} minutes`);
+        setUploadedFiles(prev =>
+          prev.map(f =>
+            f.id === fileId
+              ? {
+                  ...f,
+                  status: 'error',
+                  error: 'Processing timeout - please check document status manually'
+                }
+              : f
+          )
+        );
+        return;
+      }

       try {
         const response = await fetch(`${import.meta.env.VITE_API_BASE_URL}/documents/${documentId}/progress`, {
           headers: {
@@ -203,8 +224,10 @@ const DocumentUpload: React.FC<DocumentUploadProps> = ({
         const progress = await response.json();

         // Update status based on progress
-        let newStatus: UploadedFile['status'] = 'uploaded';
-        if (progress.status === 'processing' || progress.status === 'extracting_text' || progress.status === 'processing_llm' || progress.status === 'generating_pdf') {
+        let newStatus: UploadedFile['status'] = 'processing'; // Default to processing
+        if (progress.status === 'uploading' || progress.status === 'uploaded') {
+          newStatus = 'processing'; // Still processing
+        } else if (progress.status === 'processing' || progress.status === 'extracting_text' || progress.status === 'processing_llm' || progress.status === 'generating_pdf') {
           newStatus = 'processing';
         } else if (progress.status === 'completed') {
           newStatus = 'completed';
@@ -242,12 +265,12 @@ const DocumentUpload: React.FC<DocumentUploadProps> = ({
         // Don't stop monitoring on network errors, just log and continue
       }

-      // Continue monitoring
-      setTimeout(checkProgress, 2000);
+      // Continue monitoring with shorter intervals for better responsiveness
+      setTimeout(checkProgress, 3000); // Check every 3 seconds
     };

-    // Start monitoring
-    setTimeout(checkProgress, 1000);
+    // Start monitoring immediately
+    setTimeout(checkProgress, 500); // Start checking after 500ms
   }, [token]);

   const { getRootProps, getInputProps, isDragActive } = useDropzone({
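The progress monitor above is a polling loop with a hard deadline. Stripped of the React state updates, the pattern looks like this (function and callback names are illustrative, not from the codebase):

```typescript
// Poll check() on an interval until it reports completion, giving up
// once maxMs has elapsed since the first check.
async function pollUntil(
  check: () => Promise<boolean>, // resolves true when processing is done
  intervalMs: number,
  maxMs: number,
): Promise<boolean> {
  const start = Date.now();
  while (Date.now() - start <= maxMs) {
    if (await check()) return true;
    await new Promise(resolve => setTimeout(resolve, intervalMs));
  }
  return false; // timed out
}
```

The deadline guard is what the commit adds: without it, a document stuck in `processing` would be polled forever.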
@@ -378,7 +401,7 @@ const DocumentUpload: React.FC<DocumentUploadProps> = ({
                 <h4 className="text-sm font-medium text-success-800">Upload Complete</h4>
                 <p className="text-sm text-success-700 mt-1">
                   Files have been uploaded successfully to Firebase Storage! You can now navigate away from this page.
-                  Processing will continue in the background using Document AI + Optimized Agentic RAG. PDFs will be automatically deleted after processing to save costs.
+                  Processing will continue in the background using Document AI + Optimized Agentic RAG. This can take several minutes. PDFs will be automatically deleted after processing to save costs.
                 </p>
               </div>
             </div>
@@ -7,7 +7,7 @@ const API_BASE_URL = config.apiBaseUrl;
 // Create axios instance with auth interceptor
 const apiClient = axios.create({
   baseURL: API_BASE_URL,
-  timeout: 30000, // 30 seconds
+  timeout: 300000, // 5 minutes
 });

 // Add auth token to requests
@@ -263,14 +263,46 @@ class DocumentService {
       // Step 3: Confirm upload and trigger processing
       onProgress?.(95); // 95% - Confirming upload

-      const confirmResponse = await apiClient.post(`/documents/${documentId}/confirm-upload`, {}, { signal });
+      console.log('🔄 Making confirm-upload request for document:', documentId);
+      console.log('🔄 Confirm-upload URL:', `/documents/${documentId}/confirm-upload`);
+
+      // Add retry logic for confirm-upload (based on Google Cloud best practices)
+      let confirmResponse;
+      let lastError;
+
+      for (let attempt = 1; attempt <= 3; attempt++) {
+        try {
+          console.log(`🔄 Confirm-upload attempt ${attempt}/3`);
+          confirmResponse = await apiClient.post(`/documents/${documentId}/confirm-upload`, {}, {
+            signal,
+            timeout: 60000 // 60 second timeout for confirm-upload
+          });
+          console.log('✅ Confirm-upload response received:', confirmResponse.status);
+          console.log('✅ Confirm-upload response data:', confirmResponse.data);
+          break; // Success, exit retry loop
+        } catch (error: any) {
+          lastError = error;
+          console.log(`❌ Confirm-upload attempt ${attempt} failed:`, error.message);
+
+          if (attempt < 3) {
+            // Wait before retry (exponential backoff)
+            const delay = Math.pow(2, attempt) * 1000; // 2s, 4s
+            console.log(`⏳ Waiting ${delay}ms before retry...`);
+            await new Promise(resolve => setTimeout(resolve, delay));
+          }
+        }
+      }
+
+      if (!confirmResponse) {
+        throw lastError || new Error('Confirm-upload failed after 3 attempts');
+      }
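The confirm-upload retry above hard-codes three attempts with 2s/4s exponential backoff. The same pattern as a generic, reusable sketch (helper name and parameters are hypothetical, not from the codebase):

```typescript
// Retry an async operation up to maxAttempts times, waiting
// baseDelayMs * 2^attempt between failures, and rethrowing the last
// error if every attempt fails.
async function withRetry<T>(
  fn: () => Promise<T>,
  maxAttempts = 3,
  baseDelayMs = 1000,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (error) {
      lastError = error;
      if (attempt < maxAttempts) {
        const delay = Math.pow(2, attempt) * baseDelayMs;
        await new Promise(resolve => setTimeout(resolve, delay));
      }
    }
  }
  throw lastError;
}
```

A production version would typically also add jitter and skip retries on non-transient errors (e.g. HTTP 4xx).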
       onProgress?.(100); // 100% - Complete
       console.log('✅ Upload confirmed and processing started');

       return {
         id: documentId,
-        ...confirmResponse.data
+        ...confirmResponse.data.document
       };

     } catch (error: any) {
@@ -281,6 +313,16 @@ class DocumentService {
         throw new Error('Upload was cancelled.');
       }

+      // Handle network timeouts
+      if (error.code === 'ECONNABORTED' || error.message?.includes('timeout')) {
+        throw new Error('Request timed out. Please check your connection and try again.');
+      }
+
+      // Handle network errors
+      if (error.code === 'ERR_NETWORK' || error.message?.includes('Network Error')) {
+        throw new Error('Network error. Please check your connection and try again.');
+      }

       if (error.response?.status === 401) {
         throw new Error('Authentication required. Please log in again.');
       }