Pre-cleanup commit: Current state before service layer consolidation

This commit is contained in:
Jon
2025-08-01 14:57:56 -04:00
parent 95c92946de
commit f453efb0f8
21 changed files with 2560 additions and 363 deletions

APP_DESIGN_DOCUMENTATION.md Normal file

@@ -0,0 +1,533 @@
# CIM Document Processor - Application Design Documentation
## Overview
The CIM Document Processor is a web application that processes Confidential Information Memorandums (CIMs) using AI to extract key business information and generate structured analysis reports. The system uses Google Document AI for text extraction and an optimized Agentic RAG (Retrieval-Augmented Generation) approach for intelligent document analysis.
## Architecture Overview
```
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ Frontend │ │ Backend │ │ External │
│ (React) │◄──►│ (Node.js) │◄──►│ Services │
└─────────────────┘ └─────────────────┘ └─────────────────┘
│ │
▼ ▼
┌─────────────────┐ ┌─────────────────┐
│ Database │ │ Google Cloud │
│ (Supabase) │ │ Services │
└─────────────────┘ └─────────────────┘
```
## Core Components
### 1. Frontend (React + TypeScript)
**Location**: `frontend/src/`
**Key Components**:
- **App.tsx**: Main application with tabbed interface
- **DocumentUpload**: File upload with Firebase Storage integration
- **DocumentList**: Display and manage uploaded documents
- **DocumentViewer**: View processed documents and analysis
- **Analytics**: Dashboard for processing statistics
- **UploadMonitoringDashboard**: Real-time upload monitoring
**Authentication**: Firebase Authentication with protected routes
### 2. Backend (Node.js + Express + TypeScript)
**Location**: `backend/src/`
**Key Services**:
- **unifiedDocumentProcessor**: Main orchestrator for document processing
- **optimizedAgenticRAGProcessor**: Core AI processing engine
- **llmService**: LLM interaction service (Claude AI/OpenAI)
- **pdfGenerationService**: PDF report generation using Puppeteer
- **fileStorageService**: Google Cloud Storage operations
- **uploadMonitoringService**: Real-time upload tracking
- **agenticRAGDatabaseService**: Analytics and session management
- **sessionService**: User session management
- **jobQueueService**: Background job processing
- **uploadProgressService**: Upload progress tracking
## Data Flow
### 1. Document Upload Process
```
User Uploads PDF
         │
         ▼
┌─────────────────┐
│ 1. Get Upload   │ ──► Generate signed URL from Google Cloud Storage
│    URL          │
└─────────┬───────┘
          │
          ▼
┌─────────────────┐
│ 2. Upload to    │ ──► Direct upload to GCS bucket
│    GCS          │
└─────────┬───────┘
          │
          ▼
┌─────────────────┐
│ 3. Confirm      │ ──► Update database, create processing job
│    Upload       │
└─────────────────┘
```
### 2. Document Processing Pipeline
```
Document Uploaded
         │
         ▼
┌─────────────────┐
│ 1. Text         │ ──► Google Document AI extracts text from PDF
│    Extraction   │     (documentAiGenkitProcessor or direct Document AI)
└─────────┬───────┘
          │
          ▼
┌─────────────────┐
│ 2. Intelligent  │ ──► Split text into semantic chunks (4000 chars)
│    Chunking     │     with 200 char overlap
└─────────┬───────┘
          │
          ▼
┌─────────────────┐
│ 3. Vector       │ ──► Generate embeddings for each chunk
│    Embedding    │     (rate-limited to 5 concurrent calls)
└─────────┬───────┘
          │
          ▼
┌─────────────────┐
│ 4. LLM Analysis │ ──► llmService → Claude AI analyzes chunks
│                 │     and generates structured CIM review data
└─────────┬───────┘
          │
          ▼
┌─────────────────┐
│ 5. PDF          │ ──► pdfGenerationService generates summary PDF
│    Generation   │     using Puppeteer
└─────────┬───────┘
          │
          ▼
┌─────────────────┐
│ 6. Database     │ ──► Store analysis data, update document status
│    Storage      │
└─────────┬───────┘
          │
          ▼
┌─────────────────┐
│ 7. Complete     │ ──► Update session, notify user, cleanup
│    Processing   │
└─────────────────┘
```
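The chunking step above (4000-character chunks, 200-character overlap, semantic boundaries) can be sketched as follows. This is an illustrative TypeScript sketch; the constants and boundary heuristic are assumptions, not the actual `optimizedAgenticRAGProcessor` code:

```typescript
// Illustrative sketch of the intelligent chunking step: ~4000-character
// chunks with a 200-character overlap, preferring to cut at paragraph
// breaks. Constants and the boundary heuristic are assumptions.
const CHUNK_SIZE = 4000;
const CHUNK_OVERLAP = 200;

function createIntelligentChunks(text: string): string[] {
  const chunks: string[] = [];
  let start = 0;
  while (start < text.length) {
    let end = Math.min(start + CHUNK_SIZE, text.length);
    if (end < text.length) {
      // Prefer a semantic boundary: the last paragraph break in the window.
      const boundary = text.lastIndexOf("\n\n", end);
      if (boundary > start + CHUNK_SIZE / 2) end = boundary;
    }
    chunks.push(text.slice(start, end));
    if (end === text.length) break;
    start = end - CHUNK_OVERLAP; // the overlap preserves context across chunks
  }
  return chunks;
}
```

The overlap means neighbouring chunks share context, so a fact that straddles a chunk boundary still appears whole in at least one chunk.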
### 3. Error Handling Flow
```
Processing Error
         │
         ▼
┌─────────────────┐
│ Error Logging   │ ──► Log error with correlation ID
└─────────┬───────┘
          │
          ▼
┌─────────────────┐
│ Retry Logic     │ ──► Retry failed operation (up to 3 times)
└─────────┬───────┘
          │
          ▼
┌─────────────────┐
│ Graceful        │ ──► Return partial results or error message
│ Degradation     │
└─────────────────┘
```
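The retry stage can be sketched as a small helper: up to 3 attempts with exponential backoff between failures. A minimal sketch with an illustrative delay base, not the backend's actual implementation:

```typescript
// Minimal sketch of the retry pattern: up to 3 attempts with exponential
// backoff between failures. The delay base is an illustrative value.
async function withRetry<T>(
  op: () => Promise<T>,
  maxAttempts = 3,
  baseDelayMs = 500
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await op();
    } catch (err) {
      lastError = err;
      if (attempt < maxAttempts) {
        // Wait 500 ms, then 1000 ms, then 2000 ms, ...
        await new Promise((resolve) =>
          setTimeout(resolve, baseDelayMs * 2 ** (attempt - 1))
        );
      }
    }
  }
  throw lastError;
}
```

If every attempt fails, the last error is rethrown so the graceful-degradation stage can return a partial result or an error message.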
## Key Services Explained
### 1. Unified Document Processor (`unifiedDocumentProcessor.ts`)
**Purpose**: Main orchestrator that routes documents to the appropriate processing strategy.
**Current Strategy**: `optimized_agentic_rag` (only active strategy)
**Methods**:
- `processDocument()`: Main processing entry point
- `processWithOptimizedAgenticRAG()`: Current active processing method
- `getProcessingStats()`: Returns processing statistics
### 2. Optimized Agentic RAG Processor (`optimizedAgenticRAGProcessor.ts`)
**Purpose**: Core AI processing engine that handles large documents efficiently.
**Key Features**:
- **Intelligent Chunking**: Splits text at semantic boundaries (sections, paragraphs)
- **Batch Processing**: Processes chunks in batches of 10 to manage memory
- **Rate Limiting**: Limits concurrent API calls to 5
- **Memory Optimization**: Tracks memory usage and processes efficiently
**Processing Steps**:
1. **Create Intelligent Chunks**: Split text into 4000-char chunks with semantic boundaries
2. **Process Chunks in Batches**: Generate embeddings and metadata for each chunk
3. **Store Chunks Optimized**: Save to vector database with batching
4. **Generate LLM Analysis**: Use llmService to analyze and create structured data
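The batch and rate-limit behaviour in steps 2–3 can be sketched as a small promise pool that keeps at most 5 tasks in flight. Names here are illustrative, not the service's actual code:

```typescript
// Sketch of the rate-limiting idea: process chunk tasks with at most
// `limit` (e.g. 5) in flight at once via a simple promise pool.
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  fn: (item: T, index: number) => Promise<R>
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;
  // Each worker repeatedly claims the next unprocessed index; the event
  // loop is single-threaded, so the shared counter is safe here.
  async function worker(): Promise<void> {
    while (next < items.length) {
      const i = next++;
      results[i] = await fn(items[i], i);
    }
  }
  await Promise.all(
    Array.from({ length: Math.min(limit, items.length) }, () => worker())
  );
  return results;
}
```

The pool bounds both memory (only a few chunks are materialized at once) and API pressure (never more than `limit` concurrent embedding calls).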
### 3. LLM Service (`llmService.ts`)
**Purpose**: Handles all LLM interactions with Claude AI and OpenAI.
**Key Features**:
- **Model Selection**: Automatically selects optimal model based on task complexity
- **Retry Logic**: Implements retry mechanism for failed API calls
- **Cost Tracking**: Tracks token usage and API costs
- **Error Handling**: Graceful error handling with fallback options
**Methods**:
- `processCIMDocument()`: Main CIM analysis method
- `callLLM()`: Generic LLM call method
- `callAnthropic()`: Claude AI specific calls
- `callOpenAI()`: OpenAI specific calls
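The fallback behaviour between the two providers can be sketched like this. The real `callAnthropic()`/`callOpenAI()` use the vendor SDKs; here they are injected as plain functions so the control flow is visible without either SDK:

```typescript
// Sketch of provider fallback: try Claude first, fall back to OpenAI if
// the call fails. Caller functions are injected; names are illustrative.
type LLMCall = (prompt: string) => Promise<string>;

async function callLLMWithFallback(
  prompt: string,
  callAnthropic: LLMCall,
  callOpenAI: LLMCall
): Promise<{ provider: "anthropic" | "openai"; text: string }> {
  try {
    return { provider: "anthropic", text: await callAnthropic(prompt) };
  } catch {
    // Primary provider failed; degrade gracefully to the fallback.
    return { provider: "openai", text: await callOpenAI(prompt) };
  }
}
```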
### 4. PDF Generation Service (`pdfGenerationService.ts`)
**Purpose**: Generates PDF reports from analysis data using Puppeteer.
**Key Features**:
- **HTML to PDF**: Converts HTML content to PDF using Puppeteer
- **Markdown Support**: Converts markdown to HTML then to PDF
- **Custom Styling**: Professional PDF formatting with CSS
- **CIM Review Templates**: Specialized templates for CIM analysis reports
**Methods**:
- `generateCIMReviewPDF()`: Generate CIM review PDF from analysis data
- `generatePDFFromMarkdown()`: Convert markdown to PDF
- `generatePDFBuffer()`: Generate PDF as buffer for immediate download
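The first stage of this pipeline (analysis data → HTML, before Puppeteer renders the PDF) might look roughly like the following hypothetical sketch; the real templates add CSS styling and should escape untrusted content:

```typescript
// Hypothetical sketch of the HTML-generation stage that precedes the
// Puppeteer render. Real templates include CSS and content escaping.
function generateCIMReviewHTML(sections: Record<string, string>): string {
  const body = Object.entries(sections)
    .map(
      ([title, content]) =>
        `<section><h2>${title}</h2><p>${content}</p></section>`
    )
    .join("\n");
  return `<!DOCTYPE html><html><body><h1>CIM Review</h1>\n${body}\n</body></html>`;
}
```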
### 5. File Storage Service (`fileStorageService.ts`)
**Purpose**: Handles all Google Cloud Storage operations.
**Key Operations**:
- `generateSignedUploadUrl()`: Creates secure upload URLs
- `getFile()`: Downloads files from GCS
- `uploadFile()`: Uploads files to GCS
- `deleteFile()`: Removes files from GCS
### 6. Upload Monitoring Service (`uploadMonitoringService.ts`)
**Purpose**: Tracks upload progress and provides real-time monitoring.
**Key Features**:
- Real-time upload tracking
- Error analysis and reporting
- Performance metrics
- Health status monitoring
### 7. Session Service (`sessionService.ts`)
**Purpose**: Manages user sessions and authentication state.
**Key Features**:
- Session storage and retrieval
- Token management
- Session cleanup
- Security token blacklisting
### 8. Job Queue Service (`jobQueueService.ts`)
**Purpose**: Manages background job processing and queuing.
**Key Features**:
- Job queuing and scheduling
- Background processing
- Job status tracking
- Error recovery
## Service Dependencies
```
unifiedDocumentProcessor
├── optimizedAgenticRAGProcessor
│ ├── llmService (for AI processing)
│ ├── vectorDatabaseService (for embeddings)
│ └── fileStorageService (for file operations)
├── pdfGenerationService (for PDF creation)
├── uploadMonitoringService (for tracking)
├── sessionService (for session management)
└── jobQueueService (for background processing)
```
## Database Schema
### Core Tables
#### 1. Documents Table
```sql
CREATE TABLE documents (
id UUID PRIMARY KEY,
user_id TEXT NOT NULL,
original_file_name TEXT NOT NULL,
file_path TEXT NOT NULL,
file_size INTEGER NOT NULL,
status TEXT NOT NULL,
extracted_text TEXT,
generated_summary TEXT,
summary_pdf_path TEXT,
analysis_data JSONB,
created_at TIMESTAMP DEFAULT NOW(),
updated_at TIMESTAMP DEFAULT NOW()
);
```
#### 2. Agentic RAG Sessions Table
```sql
CREATE TABLE agentic_rag_sessions (
id UUID PRIMARY KEY,
document_id UUID REFERENCES documents(id),
strategy TEXT NOT NULL,
status TEXT NOT NULL,
total_agents INTEGER,
completed_agents INTEGER,
failed_agents INTEGER,
overall_validation_score DECIMAL,
processing_time_ms INTEGER,
api_calls_count INTEGER,
total_cost DECIMAL,
created_at TIMESTAMP DEFAULT NOW(),
completed_at TIMESTAMP
);
```
#### 3. Vector Database Tables
```sql
CREATE TABLE document_chunks (
id UUID PRIMARY KEY,
document_id UUID REFERENCES documents(id),
content TEXT NOT NULL,
embedding VECTOR(1536),
chunk_index INTEGER,
metadata JSONB,
created_at TIMESTAMP DEFAULT NOW()
);
```
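Given this schema, the retrieval side of the RAG step can be expressed as a similarity query. A sketch assuming the pgvector extension and its cosine-distance operator; the actual query used by the vector database service may differ:

```sql
-- Top 5 most similar chunks for one document, by cosine distance.
-- $1 = document id, $2 = query embedding (vector(1536))
SELECT id, chunk_index, content
FROM document_chunks
WHERE document_id = $1
ORDER BY embedding <=> $2
LIMIT 5;
```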
## API Endpoints
### Active Endpoints
#### Document Management
- `POST /documents/upload-url` - Get signed upload URL
- `POST /documents/:id/confirm-upload` - Confirm upload and start processing
- `POST /documents/:id/process-optimized-agentic-rag` - Trigger AI processing
- `GET /documents/:id/download` - Download processed PDF
- `DELETE /documents/:id` - Delete document
#### Analytics & Monitoring
- `GET /documents/analytics` - Get processing analytics
- `GET /documents/:id/agentic-rag-sessions` - Get processing sessions
- `GET /monitoring/dashboard` - Get monitoring dashboard
- `GET /vector/stats` - Get vector database statistics
### Legacy Endpoints (Kept for Backward Compatibility)
- `POST /documents/upload` - Multipart file upload (legacy)
- `GET /documents` - List documents (basic CRUD)
## Configuration
### Environment Variables
**Backend** (`backend/src/config/env.ts`):
```typescript
// Google Cloud
GOOGLE_CLOUD_PROJECT_ID
GOOGLE_CLOUD_STORAGE_BUCKET
GOOGLE_APPLICATION_CREDENTIALS
// Document AI
GOOGLE_DOCUMENT_AI_LOCATION
GOOGLE_DOCUMENT_AI_PROCESSOR_ID
// Database
DATABASE_URL
SUPABASE_URL
SUPABASE_ANON_KEY
// AI Services
ANTHROPIC_API_KEY
OPENAI_API_KEY
// Processing
AGENTIC_RAG_ENABLED=true
PROCESSING_STRATEGY=optimized_agentic_rag
// LLM Configuration
LLM_PROVIDER=anthropic
LLM_MODEL=claude-3-opus-20240229
LLM_MAX_TOKENS=4000
LLM_TEMPERATURE=0.1
```
**Frontend** (`frontend/src/config/env.ts`):
```typescript
// API
VITE_API_BASE_URL
VITE_FIREBASE_API_KEY
VITE_FIREBASE_AUTH_DOMAIN
```
## Processing Strategy Details
### Current Strategy: Optimized Agentic RAG
**Why This Strategy**:
- Handles large documents efficiently
- Provides structured analysis output
- Optimizes memory usage and API costs
- Generates high-quality summaries
**How It Works**:
1. **Text Extraction**: Google Document AI extracts text from PDF
2. **Semantic Chunking**: Splits text at natural boundaries (sections, paragraphs)
3. **Vector Embedding**: Creates embeddings for each chunk
4. **LLM Analysis**: llmService calls Claude AI to analyze chunks and generate structured data
5. **PDF Generation**: pdfGenerationService creates summary PDF with analysis results
**Output Format**: Structured CIM Review data including:
- Deal Overview
- Business Description
- Market Analysis
- Financial Summary
- Management Team
- Investment Thesis
- Key Questions & Next Steps
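As a rough TypeScript shape, the structured output above might look like the following. The field names are assumptions derived from the section titles, not the actual schema, and the example value is purely illustrative:

```typescript
// Illustrative shape of the structured CIM review output; field names
// are assumptions based on the section titles above.
interface CIMReviewData {
  dealOverview: string;
  businessDescription: string;
  marketAnalysis: string;
  financialSummary: string;
  managementTeam: string;
  investmentThesis: string;
  keyQuestionsAndNextSteps: string[];
}

// Hypothetical example value, for illustration only.
const exampleReview: CIMReviewData = {
  dealOverview: "Sale of a founder-owned software business",
  businessDescription: "Recurring-revenue SaaS for field operations",
  marketAnalysis: "Fragmented market with low penetration",
  financialSummary: "$12M revenue, 20% EBITDA margin",
  managementTeam: "Founder-led; key staff expected to stay post-close",
  investmentThesis: "Platform for add-on acquisitions",
  keyQuestionsAndNextSteps: [
    "What is the customer concentration?",
    "Schedule a management call",
  ],
};
```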
## Error Handling
### Frontend Error Handling
- **Network Errors**: Automatic retry with exponential backoff
- **Authentication Errors**: Automatic token refresh or redirect to login
- **Upload Errors**: User-friendly error messages with retry options
- **Processing Errors**: Real-time error display with retry functionality
### Backend Error Handling
- **Validation Errors**: Input validation with detailed error messages
- **Processing Errors**: Graceful degradation with error logging
- **Storage Errors**: Retry logic for transient failures
- **Database Errors**: Connection pooling and retry mechanisms
- **LLM API Errors**: Retry logic with exponential backoff
- **PDF Generation Errors**: Fallback to text-only output
### Error Recovery Mechanisms
- **LLM API Failures**: Up to 3 retry attempts with different models
- **Processing Timeouts**: Graceful timeout handling with partial results
- **Memory Issues**: Automatic garbage collection and memory cleanup
- **File Storage Errors**: Retry with exponential backoff
## Monitoring & Analytics
### Real-time Monitoring
- Upload progress tracking
- Processing status updates
- Error rate monitoring
- Performance metrics
- API usage tracking
- Cost monitoring
### Analytics Dashboard
- Processing success rates
- Average processing times
- API usage statistics
- Cost tracking
- User activity metrics
- Error analysis reports
## Security
### Authentication
- Firebase Authentication
- JWT token validation
- Protected API endpoints
- User-specific data isolation
- Session management with secure token handling
### File Security
- Signed URLs for secure uploads
- File type validation (PDF only)
- File size limits (50MB max)
- User-specific file storage paths
- Secure file deletion
### API Security
- Rate limiting (1000 requests per 15 minutes)
- CORS configuration
- Input validation
- SQL injection prevention
- Request correlation IDs for tracking
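The rate limit (1000 requests per 15 minutes) is provided by the `express-rate-limit` middleware listed in the backend dependencies; the bookkeeping underneath can be sketched as a fixed window per client:

```typescript
// Minimal fixed-window sketch of the rate limit described above
// (1000 requests per 15 minutes per client). Illustrative only; the
// backend uses the express-rate-limit middleware.
const WINDOW_MS = 15 * 60 * 1000;
const MAX_REQUESTS = 1000;

const windows = new Map<string, { start: number; count: number }>();

function allowRequest(clientId: string, now = Date.now()): boolean {
  const w = windows.get(clientId);
  if (!w || now - w.start >= WINDOW_MS) {
    // First request in a fresh window resets the counter.
    windows.set(clientId, { start: now, count: 1 });
    return true;
  }
  w.count++;
  return w.count <= MAX_REQUESTS;
}
```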
## Performance Optimization
### Memory Management
- Batch processing to limit memory usage
- Garbage collection optimization
- Connection pooling for database
- Efficient chunking to minimize memory footprint
### API Optimization
- Rate limiting to prevent API quota exhaustion
- Caching for frequently accessed data
- Efficient chunking to minimize API calls
- Model selection based on task complexity
### Processing Optimization
- Concurrent processing with limits
- Intelligent chunking for optimal processing
- Background job processing
- Progress tracking for user feedback
## Deployment
### Backend Deployment
- **Firebase Functions**: Serverless deployment
- **Google Cloud Run**: Containerized deployment
- **Docker**: Container support
### Frontend Deployment
- **Firebase Hosting**: Static hosting
- **Vite**: Build tool
- **TypeScript**: Type safety
## Development Workflow
### Local Development
1. **Backend**: `npm run dev` (runs on port 5001)
2. **Frontend**: `npm run dev` (runs on port 5173)
3. **Database**: Supabase local development
4. **Storage**: Google Cloud Storage (development bucket)
### Testing
- **Unit Tests**: Jest for backend, Vitest for frontend
- **Integration Tests**: End-to-end testing
- **API Tests**: Supertest for backend endpoints
## Troubleshooting
### Common Issues
1. **Upload Failures**: Check GCS permissions and bucket configuration
2. **Processing Timeouts**: Increase timeout limits for large documents
3. **Memory Issues**: Monitor memory usage and adjust batch sizes
4. **API Quotas**: Check API usage and implement rate limiting
5. **PDF Generation Failures**: Check Puppeteer installation and memory
6. **LLM API Errors**: Verify API keys and check rate limits
### Debug Tools
- Real-time logging with correlation IDs
- Upload monitoring dashboard
- Processing session details
- Error analysis reports
- Performance metrics dashboard
This documentation provides a comprehensive overview of the CIM Document Processor architecture, helping junior programmers understand the system's design, data flow, and key components.

ARCHITECTURE_DIAGRAMS.md Normal file

@@ -0,0 +1,463 @@
# CIM Document Processor - Architecture Diagrams
## System Architecture Overview
```
┌─────────────────────────────────────────────────────────────────────────────┐
│ FRONTEND (React) │
├─────────────────────────────────────────────────────────────────────────────┤
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Login │ │ Document │ │ Document │ │ Analytics │ │
│ │ Form │ │ Upload │ │ List │ │ Dashboard │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘ │
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Document │ │ Upload │ │ Protected │ │ Auth │ │
│ │ Viewer │ │ Monitoring │ │ Route │ │ Context │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘ │
└─────────────────────────────────────────────────────────────────────────────┘
▼ HTTP/HTTPS
┌─────────────────────────────────────────────────────────────────────────────┐
│ BACKEND (Node.js) │
├─────────────────────────────────────────────────────────────────────────────┤
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Document │ │ Vector │ │ Monitoring │ │ Auth │ │
│ │ Routes │ │ Routes │ │ Routes │ │ Middleware │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘ │
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Unified │ │ Optimized │ │ LLM │ │ PDF │ │
│ │ Document │ │ Agentic │ │ Service │ │ Generation │ │
│ │ Processor │ │ RAG │ │ │ │ Service │ │
│ │ │ │ Processor │ │ │ │ │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘ │
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ File │ │ Upload │ │ Session │ │ Job Queue │ │
│ │ Storage │ │ Monitoring │ │ Service │ │ Service │ │
│ │ Service │ │ Service │ │ │ │ │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘ │
└─────────────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────────────┐
│ EXTERNAL SERVICES │
├─────────────────────────────────────────────────────────────────────────────┤
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Google │ │ Google │ │ Anthropic │ │ Firebase │ │
│ │ Document AI │ │ Cloud │ │ Claude AI │ │ Auth │ │
│ │ │ │ Storage │ │ │ │ │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘ │
└─────────────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────────────┐
│ DATABASE (Supabase) │
├─────────────────────────────────────────────────────────────────────────────┤
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Documents │ │ Agentic │ │ Document │ │ Vector │ │
│ │ Table │ │ RAG │ │ Chunks │ │ Embeddings │ │
│ │ │ │ Sessions │ │ Table │ │ Table │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘ │
└─────────────────────────────────────────────────────────────────────────────┘
```
## Document Processing Flow
```
┌─────────────────┐
│ User Uploads    │
│ PDF Document    │
└─────────┬───────┘
          │
          ▼
┌─────────────────┐
│ 1. Get Upload   │ ──► Generate signed URL from Google Cloud Storage
│    URL          │
└─────────┬───────┘
          │
          ▼
┌─────────────────┐
│ 2. Upload to    │ ──► Direct upload to GCS bucket
│    GCS          │
└─────────┬───────┘
          │
          ▼
┌─────────────────┐
│ 3. Confirm      │ ──► Update database, create processing job
│    Upload       │
└─────────┬───────┘
          │
          ▼
┌─────────────────┐
│ 4. Text         │ ──► Google Document AI extracts text from PDF
│    Extraction   │     (documentAiGenkitProcessor or direct Document AI)
└─────────┬───────┘
          │
          ▼
┌─────────────────┐
│ 5. Intelligent  │ ──► Split text into semantic chunks (4000 chars)
│    Chunking     │     with 200 char overlap
└─────────┬───────┘
          │
          ▼
┌─────────────────┐
│ 6. Vector       │ ──► Generate embeddings for each chunk
│    Embedding    │     (rate-limited to 5 concurrent calls)
└─────────┬───────┘
          │
          ▼
┌─────────────────┐
│ 7. LLM Analysis │ ──► llmService → Claude AI analyzes chunks
│                 │     and generates structured CIM review data
└─────────┬───────┘
          │
          ▼
┌─────────────────┐
│ 8. PDF          │ ──► pdfGenerationService generates summary PDF
│    Generation   │     using Puppeteer
└─────────┬───────┘
          │
          ▼
┌─────────────────┐
│ 9. Database     │ ──► Store analysis data, update document status
│    Storage      │
└─────────┬───────┘
          │
          ▼
┌─────────────────┐
│ 10. Complete    │ ──► Update session, notify user, cleanup
│     Processing  │
└─────────────────┘
```
## Error Handling Flow
```
Processing Error
         │
         ▼
┌─────────────────┐
│ Error Logging   │ ──► Log error with correlation ID
└─────────┬───────┘
          │
          ▼
┌─────────────────┐
│ Retry Logic     │ ──► Retry failed operation (up to 3 times)
└─────────┬───────┘
          │
          ▼
┌─────────────────┐
│ Graceful        │ ──► Return partial results or error message
│ Degradation     │
└─────────────────┘
```
## Component Dependency Map
### Backend Services
```
┌─────────────────────────────────────────────────────────────────────────────┐
│ CORE SERVICES │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │
│ │ Unified │ │ Optimized │ │ LLM Service │ │
│ │ Document │───►│ Agentic RAG │───►│ │ │
│ │ Processor │ │ Processor │ │ (Claude AI/ │ │
│ │ (Orchestrator) │ │ (Core AI) │ │ OpenAI) │ │
│ └─────────────────┘ └─────────────────┘ └─────────────────┘ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │
│ │ PDF Generation │ │ File Storage │ │ Upload │ │
│ │ Service │ │ Service │ │ Monitoring │ │
│ │ (Puppeteer) │ │ (GCS) │ │ Service │ │
│ └─────────────────┘ └─────────────────┘ └─────────────────┘ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │
│ │ Session │ │ Job Queue │ │ Upload │ │
│ │ Service │ │ Service │ │ Progress │ │
│ │ (Auth Mgmt) │ │ (Background) │ │ Service │ │
│ └─────────────────┘ └─────────────────┘ └─────────────────┘ │
└─────────────────────────────────────────────────────────────────────────────┘
```
### Frontend Components
```
┌─────────────────────────────────────────────────────────────────────────────┐
│ FRONTEND COMPONENTS │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │
│ │ App.tsx │ │ AuthContext │ │ ProtectedRoute │ │
│ │ (Main App) │───►│ (Auth State) │───►│ (Route Guard) │ │
│ └─────────────────┘ └─────────────────┘ └─────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │
│ │ DocumentUpload │ │ DocumentList │ │ DocumentViewer │ │
│ │ (File Upload) │ │ (Document Mgmt) │ │ (View Results) │ │
│ └─────────────────┘ └─────────────────┘ └─────────────────┘ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │
│ │ Analytics │ │ Upload │ │ LoginForm │ │
│ │ (Dashboard) │ │ Monitoring │ │ (Auth) │ │
│ │ │ │ Dashboard │ │ │ │
│ └─────────────────┘ └─────────────────┘ └─────────────────┘ │
└─────────────────────────────────────────────────────────────────────────────┘
```
## Service Dependencies Map
```
┌─────────────────────────────────────────────────────────────────────────────┐
│ SERVICE DEPENDENCIES │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────┐ │
│ │ unifiedDocumentProcessor (Main Orchestrator) │
│ └─────────┬───────┘ │
│ │ │
│ ├───► optimizedAgenticRAGProcessor │
│ │ ├───► llmService (AI Processing) │
│ │ ├───► vectorDatabaseService (Embeddings) │
│ │ └───► fileStorageService (File Operations) │
│ │ │
│ ├───► pdfGenerationService (PDF Creation) │
│ │ └───► Puppeteer (PDF Generation) │
│ │ │
│ ├───► uploadMonitoringService (Real-time Tracking) │
│ │ │
│ ├───► sessionService (Session Management) │
│ │ │
│ └───► jobQueueService (Background Processing) │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
```
## API Endpoint Map
```
┌─────────────────────────────────────────────────────────────────────────────┐
│ API ENDPOINTS │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────────────────────────────────────────────┐ │
│ │ DOCUMENT ROUTES │ │
│ │ │ │
│ │ POST /documents/upload-url ──► Get signed upload URL │ │
│ │ POST /documents/:id/confirm-upload ──► Confirm upload & process │ │
│ │ POST /documents/:id/process-optimized-agentic-rag ──► AI processing │ │
│ │ GET /documents/:id/download ──► Download PDF │ │
│ │ DELETE /documents/:id ──► Delete document │ │
│ │ GET /documents/analytics ──► Get analytics │ │
│ │ GET /documents/:id/agentic-rag-sessions ──► Get sessions │ │
│ └─────────────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────────────────────────┐ │
│ │ MONITORING ROUTES │ │
│ │ │ │
│ │ GET /monitoring/dashboard ──► Get monitoring dashboard │ │
│ │ GET /monitoring/upload-metrics ──► Get upload metrics │ │
│ │ GET /monitoring/upload-health ──► Get health status │ │
│ │ GET /monitoring/real-time-stats ──► Get real-time stats │ │
│ │ GET /monitoring/error-analysis ──► Get error analysis │ │
│ └─────────────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────────────────────────┐ │
│ │ VECTOR ROUTES │ │
│ │ │ │
│ │ GET /vector/document-chunks/:documentId ──► Get document chunks │ │
│ │ GET /vector/analytics ──► Get vector analytics │ │
│ │ GET /vector/stats ──► Get vector stats │ │
│ └─────────────────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────────┘
```
## Database Schema Map
```
┌─────────────────────────────────────────────────────────────────────────────┐
│ DATABASE SCHEMA │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────────────────────────────────────────────┐ │
│ │ DOCUMENTS TABLE │ │
│ │ │ │
│ │ id (UUID) ──► Primary key │ │
│ │ user_id (TEXT) ──► User identifier │ │
│ │ original_file_name (TEXT) ──► Original filename │ │
│ │ file_path (TEXT) ──► GCS file path │ │
│ │ file_size (INTEGER) ──► File size in bytes │ │
│ │ status (TEXT) ──► Processing status │ │
│ │ extracted_text (TEXT) ──► Extracted text content │ │
│ │ generated_summary (TEXT) ──► Generated summary │ │
│ │ summary_pdf_path (TEXT) ──► PDF summary path │ │
│ │ analysis_data (JSONB) ──► Structured analysis data │ │
│ │ created_at (TIMESTAMP) ──► Creation timestamp │ │
│ │ updated_at (TIMESTAMP) ──► Last update timestamp │ │
│ └─────────────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────────────────────────┐ │
│ │ AGENTIC RAG SESSIONS TABLE │ │
│ │ │ │
│ │ id (UUID) ──► Primary key │ │
│ │ document_id (UUID) ──► Foreign key to documents │ │
│ │ strategy (TEXT) ──► Processing strategy used │ │
│ │ status (TEXT) ──► Session status │ │
│ │ total_agents (INTEGER) ──► Total agents in session │ │
│ │ completed_agents (INTEGER) ──► Completed agents │ │
│ │ failed_agents (INTEGER) ──► Failed agents │ │
│ │ overall_validation_score (DECIMAL) ──► Quality score │ │
│ │ processing_time_ms (INTEGER) ──► Processing time │ │
│ │ api_calls_count (INTEGER) ──► Number of API calls │ │
│ │ total_cost (DECIMAL) ──► Total processing cost │ │
│ │ created_at (TIMESTAMP) ──► Creation timestamp │ │
│ │ completed_at (TIMESTAMP) ──► Completion timestamp │ │
│ └─────────────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────────────────────────┐ │
│ │ DOCUMENT CHUNKS TABLE │ │
│ │ │ │
│ │ id (UUID) ──► Primary key │ │
│ │ document_id (UUID) ──► Foreign key to documents │ │
│ │ content (TEXT) ──► Chunk content │ │
│ │ embedding (VECTOR(1536)) ──► Vector embedding │ │
│ │ chunk_index (INTEGER) ──► Chunk order │ │
│ │ metadata (JSONB) ──► Chunk metadata │ │
│ │ created_at (TIMESTAMP) ──► Creation timestamp │ │
│ └─────────────────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────────┘
```
## File Structure Map
```
cim_summary/
├── backend/
│ ├── src/
│ │ ├── config/ # Configuration files
│ │ ├── controllers/ # Request handlers
│ │ ├── middleware/ # Express middleware
│ │ ├── models/ # Database models
│ │ ├── routes/ # API route definitions
│ │ ├── services/ # Business logic services
│ │ │ ├── unifiedDocumentProcessor.ts # Main orchestrator
│ │ │ ├── optimizedAgenticRAGProcessor.ts # Core AI processing
│ │ │ ├── llmService.ts # LLM interactions
│ │ │ ├── pdfGenerationService.ts # PDF generation
│ │ │ ├── fileStorageService.ts # GCS operations
│ │ │ ├── uploadMonitoringService.ts # Real-time tracking
│ │ │ ├── sessionService.ts # Session management
│ │ │ ├── jobQueueService.ts # Background processing
│ │ │ └── uploadProgressService.ts # Progress tracking
│ │ ├── utils/ # Utility functions
│ │ └── index.ts # Main entry point
│ ├── scripts/ # Setup and utility scripts
│ └── package.json # Backend dependencies
├── frontend/
│ ├── src/
│ │ ├── components/ # React components
│ │ ├── contexts/ # React contexts
│ │ ├── services/ # API service layer
│ │ ├── utils/ # Utility functions
│ │ ├── config/ # Frontend configuration
│ │ ├── App.tsx # Main app component
│ │ └── main.tsx # App entry point
│ └── package.json # Frontend dependencies
└── README.md # Project documentation
```
## Key Data Flow Sequences
### 1. User Authentication Flow
```
User → LoginForm → Firebase Auth → AuthContext → ProtectedRoute → Dashboard
```
### 2. Document Upload Flow
```
User → DocumentUpload → documentService.uploadDocument() →
Backend /upload-url → GCS signed URL → Frontend upload →
Backend /confirm-upload → Database update → Processing trigger
```
### 3. Document Processing Flow
```
Processing trigger → unifiedDocumentProcessor →
optimizedAgenticRAGProcessor → Document AI →
Chunking → Embeddings → llmService → Claude AI →
pdfGenerationService → PDF Generation →
Database update → User notification
```
### 4. Analytics Flow
```
User → Analytics component → documentService.getAnalytics() →
Backend /analytics → agenticRAGDatabaseService →
Database queries → Structured analytics data → Frontend display
```
### 5. Error Handling Flow
```
Error occurs → Error logging with correlation ID →
Retry logic (up to 3 attempts) →
Graceful degradation → User notification
```
## Processing Pipeline Details
### LLM Service Integration
```
optimizedAgenticRAGProcessor
         │
         ▼
┌─────────────────┐
│ llmService      │ ──► Model selection based on task complexity
└─────────┬───────┘
          │
          ▼
┌─────────────────┐
│ Claude AI       │ ──► Primary model (claude-3-opus-20240229)
│ (Anthropic)     │
└─────────┬───────┘
          │
          ▼
┌─────────────────┐
│ OpenAI          │ ──► Fallback model (if Claude fails)
│ (GPT-4)         │
└─────────────────┘
```
### PDF Generation Pipeline
```
Analysis Data
         │
         ▼
┌──────────────────────────────────────────────┐
│ pdfGenerationService.generateCIMReviewPDF()  │
└──────────────────────┬───────────────────────┘
                       │
                       ▼
             ┌─────────────────┐
             │ HTML Generation │ ──► Convert analysis data to HTML
             └─────────┬───────┘
                       │
                       ▼
             ┌─────────────────┐
             │ Puppeteer       │ ──► Convert HTML to PDF
             └─────────┬───────┘
                       │
                       ▼
             ┌─────────────────┐
             │ PDF Buffer      │ ──► Return PDF as buffer for download
             └─────────────────┘
```
This architecture provides a clear separation of concerns, scalable design, and comprehensive monitoring capabilities for the CIM Document Processor application.


@@ -0,0 +1,325 @@
# Dependency Analysis Report - CIM Document Processor
## Executive Summary
This report analyzes the dependencies in both backend and frontend packages to identify:
- Unused dependencies that can be removed
- Outdated packages that should be updated
- Consolidation opportunities
- Dependencies that are actually being used vs. placeholder implementations
## Backend Dependencies Analysis
### Core Dependencies (Actively Used)
#### ✅ **Essential Dependencies**
- `express` - Main web framework
- `cors` - CORS middleware
- `helmet` - Security middleware
- `morgan` - HTTP request logging
- `express-rate-limit` - Rate limiting
- `dotenv` - Environment variable management
- `winston` - Logging framework
- `@supabase/supabase-js` - Database client
- `@google-cloud/storage` - Google Cloud Storage
- `@google-cloud/documentai` - Document AI processing
- `@anthropic-ai/sdk` - Claude AI integration
- `openai` - OpenAI integration
- `puppeteer` - PDF generation
- `uuid` - UUID generation
- `axios` - HTTP client
#### ✅ **Conditionally Used Dependencies**
- `bcryptjs` - Used in auth.ts and seed.ts (legacy auth system)
- `jsonwebtoken` - Used in auth.ts (legacy JWT system)
- `joi` - Used for environment validation and middleware validation
- `zod` - Used in llmSchemas.ts and llmService.ts for schema validation
- `multer` - Used in upload middleware (legacy multipart upload)
- `pdf-parse` - Used in documentAiGenkitProcessor.ts (legacy processor)
#### ⚠️ **Potentially Unused Dependencies**
- `redis` - Only imported in sessionService.ts but may not be actively used
- `pg` - PostgreSQL client (may be redundant with Supabase)
### Development Dependencies (Actively Used)
#### ✅ **Essential Dev Dependencies**
- `typescript` - TypeScript compiler
- `ts-node-dev` - Development server
- `jest` - Testing framework
- `supertest` - API testing
- `@types/*` - TypeScript type definitions
- `eslint` - Code linting
- `@typescript-eslint/*` - TypeScript ESLint rules
### Unused Dependencies Analysis
#### ❌ **Confirmed Unused**
None identified: all backend dependencies appear to be used somewhere in the codebase. (The unused frontend dependencies are listed in the frontend section below.)
#### ⚠️ **Potentially Redundant**
1. **Validation Libraries**: Both `joi` and `zod` are used for validation
- `joi`: Environment validation, middleware validation
- `zod`: LLM schemas, service validation
- **Recommendation**: Consider consolidating to just `zod` for consistency
2. **Database Clients**: Both `pg` and `@supabase/supabase-js`
- `pg`: Direct PostgreSQL client
- `@supabase/supabase-js`: Supabase client (includes PostgreSQL)
- **Recommendation**: Remove `pg` if only using Supabase
3. **Authentication**: Both `bcryptjs`/`jsonwebtoken` and Firebase Auth
- Legacy JWT system vs. Firebase Authentication
- **Recommendation**: Remove legacy auth dependencies if fully migrated to Firebase
## Frontend Dependencies Analysis
### Core Dependencies (Actively Used)
#### ✅ **Essential Dependencies**
- `react` - React framework
- `react-dom` - React DOM rendering
- `react-router-dom` - Client-side routing
- `axios` - HTTP client for API calls
- `firebase` - Firebase Authentication
- `lucide-react` - Icon library (used in 6 components)
- `react-dropzone` - File upload component
#### ❌ **Unused Dependencies**
- `clsx` - Not imported anywhere
- `tailwind-merge` - Not imported anywhere
### Development Dependencies (Actively Used)
#### ✅ **Essential Dev Dependencies**
- `typescript` - TypeScript compiler
- `vite` - Build tool and dev server
- `@vitejs/plugin-react` - React plugin for Vite
- `tailwindcss` - CSS framework
- `postcss` - CSS processing
- `autoprefixer` - CSS vendor prefixing
- `eslint` - Code linting
- `@typescript-eslint/*` - TypeScript ESLint rules
- `vitest` - Testing framework
- `@testing-library/*` - React testing utilities
## Processing Strategy Analysis
### Current Active Strategy
Based on the code analysis, the current processing strategy is:
- **Primary**: `optimized_agentic_rag` (most actively used)
- **Fallback**: `document_ai_genkit` (legacy implementation)
### Unused Processing Strategies
The following strategies are implemented but not actively used:
1. `chunking` - Legacy chunking strategy
2. `rag` - Basic RAG strategy
3. `agentic_rag` - Basic agentic RAG (superseded by optimized version)
### Services Analysis
#### ✅ **Actively Used Services**
- `unifiedDocumentProcessor` - Main orchestrator
- `optimizedAgenticRAGProcessor` - Core AI processing
- `llmService` - LLM interactions
- `pdfGenerationService` - PDF generation
- `fileStorageService` - GCS operations
- `uploadMonitoringService` - Real-time tracking
- `sessionService` - Session management
- `jobQueueService` - Background processing
#### ⚠️ **Legacy Services (Can be removed)**
- `documentProcessingService` - Legacy chunking service
- `documentAiGenkitProcessor` - Legacy Document AI processor
- `ragDocumentProcessor` - Basic RAG processor
## Outdated Packages Analysis
### Backend Outdated Packages
- `@types/express`: 4.17.23 → 5.0.3 (major version update)
- `@types/jest`: 29.5.14 → 30.0.0 (major version update)
- `@types/multer`: 1.4.13 → 2.0.0 (major version update)
- `@types/node`: 20.19.9 → 24.1.0 (major version update)
- `@types/pg`: 8.15.4 → 8.15.5 (patch update)
- `@types/supertest`: 2.0.16 → 6.0.3 (major version update)
- `@typescript-eslint/*`: 6.21.0 → 8.38.0 (major version update)
- `bcryptjs`: 2.4.3 → 3.0.2 (major version update)
- `dotenv`: 16.6.1 → 17.2.1 (major version update)
- `eslint`: 8.57.1 → 9.32.0 (major version update)
- `express`: 4.21.2 → 5.1.0 (major version update)
- `express-rate-limit`: 7.5.1 → 8.0.1 (major version update)
- `helmet`: 7.2.0 → 8.1.0 (major version update)
- `jest`: 29.7.0 → 30.0.5 (major version update)
- `multer`: 1.4.5-lts.2 → 2.0.2 (major version update)
- `openai`: 5.10.2 → 5.11.0 (minor update)
- `puppeteer`: 21.11.0 → 24.15.0 (major version update)
- `redis`: 4.7.1 → 5.7.0 (major version update)
- `supertest`: 6.3.4 → 7.1.4 (major version update)
- `typescript`: 5.8.3 → 5.9.2 (minor update)
- `zod`: 3.25.76 → 4.0.14 (major version update)
### Frontend Outdated Packages
- `@testing-library/jest-dom`: 6.6.3 → 6.6.4 (patch update)
- `@testing-library/react`: 13.4.0 → 16.3.0 (major version update)
- `@types/react`: 18.3.23 → 19.1.9 (major version update)
- `@types/react-dom`: 18.3.7 → 19.1.7 (major version update)
- `@typescript-eslint/*`: 6.21.0 → 8.38.0 (major version update)
- `eslint`: 8.57.1 → 9.32.0 (major version update)
- `eslint-plugin-react-hooks`: 4.6.2 → 5.2.0 (major version update)
- `lucide-react`: 0.294.0 → 0.536.0 (major version update)
- `react`: 18.3.1 → 19.1.1 (major version update)
- `react-dom`: 18.3.1 → 19.1.1 (major version update)
- `react-router-dom`: 6.30.1 → 7.7.1 (major version update)
- `tailwind-merge`: 2.6.0 → 3.3.1 (major version update)
- `tailwindcss`: 3.4.17 → 4.1.11 (major version update)
- `typescript`: 5.8.3 → 5.9.2 (minor update)
- `vite`: 4.5.14 → 7.0.6 (major version update)
- `vitest`: 0.34.6 → 3.2.4 (major version update)
### Update Strategy
**⚠️ Warning**: Many packages have major version updates that may include breaking changes. Update strategy:
1. **Immediate Updates** (Low Risk):
- `@types/pg`: 8.15.4 → 8.15.5 (patch update)
- `openai`: 5.10.2 → 5.11.0 (minor update)
- `typescript`: 5.8.3 → 5.9.2 (minor update)
- `@testing-library/jest-dom`: 6.6.3 → 6.6.4 (patch update)
2. **Major Version Updates** (Require Testing):
- React ecosystem updates (React 18 → 19)
- Express updates (Express 4 → 5)
- Testing framework updates (Jest 29 → 30, Vitest 0.34 → 3.2)
- Build tool updates (Vite 4 → 7)
3. **Recommendation**: Update major versions after dependency cleanup to minimize risk
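The triage above can be done mechanically by comparing each package's current and latest version strings and bucketing by which semver component changes. A small sketch (the `classifyUpdate` helper is illustrative, not part of the codebase), assuming plain `major.minor.patch` versions:

```typescript
type UpdateRisk = 'major' | 'minor' | 'patch' | 'none';

// Parse the leading numeric components of a semver-ish version string.
// Suffixes like the "-lts.2" in "1.4.5-lts.2" are truncated at the first dash.
function parseVersion(v: string): number[] {
  return v.split('-')[0].split('.').map(n => parseInt(n, 10) || 0);
}

// Bucket an upgrade by the first semver component that changes.
function classifyUpdate(current: string, latest: string): UpdateRisk {
  const [cMaj, cMin, cPat] = parseVersion(current);
  const [lMaj, lMin, lPat] = parseVersion(latest);
  if (lMaj !== cMaj) return 'major';
  if (lMin !== cMin) return 'minor';
  if (lPat !== cPat) return 'patch';
  return 'none';
}
```

Running this over the tables above reproduces the risk labels: `express` 4.21.2 → 5.1.0 is `major`, `openai` 5.10.2 → 5.11.0 is `minor`, `@types/pg` 8.15.4 → 8.15.5 is `patch`.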
## Recommendations
### Phase 1: Immediate Cleanup (Low Risk)
#### Backend
1. **Consolidate validation libraries**:
   - Migrate from `joi` to `zod` for consistency
   - Remove `joi` dependency
2. **Remove legacy auth dependencies** (if Firebase auth is fully implemented):
```bash
npm uninstall bcryptjs jsonwebtoken
npm uninstall @types/bcryptjs @types/jsonwebtoken
```
#### Frontend
1. **Remove unused dependencies**:
```bash
npm uninstall clsx tailwind-merge
```
### Phase 2: Service Consolidation (Medium Risk)
1. **Remove legacy processing services**:
- `documentProcessingService.ts`
- `documentAiGenkitProcessor.ts`
- `ragDocumentProcessor.ts`
2. **Simplify unifiedDocumentProcessor**:
- Remove unused strategy methods
- Keep only `optimized_agentic_rag` strategy
3. **Remove unused database client**:
- Remove `pg` if only using Supabase
### Phase 3: Configuration Cleanup (Low Risk)
1. **Remove unused environment variables**:
- Legacy auth configuration
- Unused processing strategy configs
- Unused LLM configurations
2. **Update configuration validation**:
- Remove validation for unused configs
- Simplify environment schema
### Phase 4: Route Cleanup (Medium Risk)
1. **Remove legacy upload endpoints**:
- Keep only `/upload-url` and `/confirm-upload`
- Remove multipart upload endpoints
2. **Remove unused analytics endpoints**:
- Keep only actively used monitoring endpoints
## Impact Assessment
### Risk Levels
- **Low Risk**: Removing unused dependencies, updating packages
- **Medium Risk**: Removing legacy services, consolidating routes
- **High Risk**: Changing core processing logic
### Testing Requirements
- Unit tests for all active services
- Integration tests for upload flow
- End-to-end tests for document processing
- Performance testing for optimized agentic RAG
### Rollback Plan
- Keep backup of removed files for 1-2 weeks
- Maintain feature flags for major changes
- Document all changes for easy rollback
## Next Steps
1. **Start with Phase 1** (unused dependencies)
2. **Test thoroughly** after each phase
3. **Document changes** for team reference
4. **Update deployment scripts** if needed
5. **Monitor performance** after cleanup
## Estimated Savings
### Bundle Size Reduction
- **Frontend**: ~50KB (removing unused dependencies)
- **Backend**: ~200KB (removing legacy services and dependencies)
### Maintenance Reduction
- **Fewer dependencies** to maintain and update
- **Simplified codebase** with fewer moving parts
- **Reduced security vulnerabilities** from unused packages
### Performance Improvement
- **Faster builds** with fewer dependencies
- **Reduced memory usage** from removed services
- **Simplified deployment** with fewer configuration options
## Summary
### Key Findings
1. **Unused Dependencies**: 2 frontend dependencies (`clsx`, `tailwind-merge`) are completely unused
2. **Legacy Services**: 3 processing services can be removed (`documentProcessingService`, `documentAiGenkitProcessor`, `ragDocumentProcessor`)
3. **Redundant Dependencies**: Both `joi` and `zod` for validation, both `pg` and Supabase for database
4. **Outdated Packages**: 21 backend and 15 frontend packages have updates available
5. **Major Version Updates**: Many packages require major version updates with potential breaking changes
### Immediate Actions (Step 2 Complete)
1. ✅ **Dependency Analysis Complete** - All dependencies mapped and usage identified
2. ✅ **Outdated Packages Identified** - Version updates documented with risk assessment
3. ✅ **Cleanup Strategy Defined** - Phased approach with risk levels assigned
4. ✅ **Impact Assessment Complete** - Bundle size and maintenance savings estimated
### Next Steps (Step 3 - Service Layer Consolidation)
1. Remove unused frontend dependencies (`clsx`, `tailwind-merge`)
2. Remove legacy processing services
3. Consolidate validation libraries (migrate from `joi` to `zod`)
4. Remove redundant database client (`pg` if only using Supabase)
5. Update low-risk package versions
### Risk Assessment
- **Low Risk**: Removing unused dependencies, updating minor/patch versions
- **Medium Risk**: Removing legacy services, consolidating libraries
- **High Risk**: Major version updates, core processing logic changes
This dependency analysis provides a clear roadmap for cleaning up the codebase while maintaining functionality and minimizing risk.


```diff
@@ -19,7 +19,7 @@
     "lint:fix": "eslint src --ext .ts --fix",
     "db:migrate": "ts-node src/scripts/setup-database.ts",
     "db:seed": "ts-node src/models/seed.ts",
-    "db:setup": "npm run db:migrate",
+    "db:setup": "npm run db:migrate && node scripts/setup_supabase.js",
     "deploy:firebase": "npm run build && firebase deploy --only functions",
     "deploy:cloud-run": "npm run build && gcloud run deploy cim-processor-backend --source . --region us-central1 --platform managed --allow-unauthenticated",
     "deploy:docker": "npm run build && docker build -t cim-processor-backend . && docker run -p 8080:8080 cim-processor-backend",
@@ -77,4 +77,4 @@
     "ts-node-dev": "^2.0.0",
     "typescript": "^5.2.2"
   }
 }
```


@@ -0,0 +1,23 @@
```js
const { createClient } = require('@supabase/supabase-js');
const fs = require('fs');
const path = require('path');

const supabaseUrl = process.env.SUPABASE_URL;
const supabaseKey = process.env.SUPABASE_SERVICE_KEY;
const supabase = createClient(supabaseUrl, supabaseKey);

async function setupDatabase() {
  try {
    const sql = fs.readFileSync(path.join(__dirname, 'supabase_setup.sql'), 'utf8');
    const { error } = await supabase.rpc('exec', { sql });
    if (error) {
      console.error('Error setting up database:', error);
    } else {
      console.log('Database setup complete.');
    }
  } catch (error) {
    console.error('Error reading setup file:', error);
  }
}

setupDatabase();
```


@@ -0,0 +1,21 @@
```js
require('dotenv').config();
const { createClient } = require('@supabase/supabase-js');

const supabaseUrl = process.env.SUPABASE_URL;
const supabaseKey = process.env.SUPABASE_SERVICE_KEY;
const supabase = createClient(supabaseUrl, supabaseKey);

async function testFunction() {
  try {
    const { error } = await supabase.rpc('exec_sql', { sql: 'SELECT 1' });
    if (error) {
      console.error('Error calling exec_sql:', error);
    } else {
      console.log('Successfully called exec_sql.');
    }
  } catch (error) {
    console.error('Error:', error);
  }
}

testFunction();
```


```diff
@@ -93,6 +93,13 @@ export const documentController = {
   },

   async confirmUpload(req: Request, res: Response): Promise<void> {
+    console.log('🔄 CONFIRM UPLOAD ENDPOINT CALLED');
+    console.log('🔄 Request method:', req.method);
+    console.log('🔄 Request path:', req.path);
+    console.log('🔄 Request params:', req.params);
+    console.log('🔄 Request body:', req.body);
+    console.log('🔄 Request headers:', Object.keys(req.headers));
     try {
       const userId = req.user?.uid;
       if (!userId) {
@@ -138,36 +145,50 @@ export const documentController = {
         status: 'processing_llm'
       });

-      // Acknowledge the request immediately
+      console.log('✅ Document status updated to processing_llm');
+
+      // Acknowledge the request immediately and return the document
       res.status(202).json({
         message: 'Upload confirmed, processing has started.',
-        documentId: documentId,
+        document: document,
         status: 'processing'
       });

+      console.log('✅ Response sent, starting background processing...');
+
       // Process in the background
       (async () => {
         try {
+          console.log('Background processing started.');
           // Download file from Firebase Storage for Document AI processing
           const { fileStorageService } = await import('../services/fileStorageService');
           let fileBuffer: Buffer | null = null;
+          let downloadError: string | null = null;
           for (let i = 0; i < 3; i++) {
-            await new Promise(resolve => setTimeout(resolve, 2000)); // 2 second delay
-            fileBuffer = await fileStorageService.getFile(document.file_path);
-            if (fileBuffer) {
-              break;
+            try {
+              await new Promise(resolve => setTimeout(resolve, 2000 * (i + 1)));
+              fileBuffer = await fileStorageService.getFile(document.file_path);
+              if (fileBuffer) {
+                console.log(`✅ File downloaded from storage on attempt ${i + 1}`);
+                break;
+              }
+            } catch (err) {
+              downloadError = err instanceof Error ? err.message : String(err);
+              console.log(`❌ File download attempt ${i + 1} failed:`, downloadError);
             }
           }

           if (!fileBuffer) {
+            const errMsg = downloadError || 'Failed to download uploaded file';
+            console.log('Failed to download file from storage:', errMsg);
             await DocumentModel.updateById(documentId, {
               status: 'failed',
-              error_message: 'Failed to download uploaded file'
+              error_message: `Failed to download uploaded file: ${errMsg}`
             });
             return;
           }

+          console.log('File downloaded, starting unified processor.');
           // Process with Unified Document Processor
           const { unifiedDocumentProcessor } = await import('../services/unifiedDocumentProcessor');
@@ -175,17 +196,28 @@ export const documentController = {
             documentId,
             userId,
             '', // Text is not needed for this strategy
-            { strategy: 'optimized_agentic_rag' }
+            {
+              strategy: 'document_ai_genkit',
+              fileBuffer: fileBuffer,
+              fileName: document.original_file_name,
+              mimeType: 'application/pdf'
+            }
           );

           if (result.success) {
+            console.log('✅ Processing successful.');
             // Update document with results
             await DocumentModel.updateById(documentId, {
               status: 'completed',
               generated_summary: result.summary,
+              analysis_data: result.analysisData,
               processing_completed_at: new Date()
             });

+            console.log('✅ Document AI processing completed successfully for document:', documentId);
+            console.log('✅ Summary length:', result.summary?.length || 0);
+            console.log('✅ Processing time:', new Date().toISOString());
+
             // 🗑️ DELETE PDF after successful processing
             try {
               await fileStorageService.deleteFile(document.file_path);
@@ -201,11 +233,15 @@ export const documentController = {
             console.log('✅ Document AI processing completed successfully');
           } else {
+            console.log('❌ Processing failed:', result.error);
             await DocumentModel.updateById(documentId, {
               status: 'failed',
               error_message: result.error
             });

+            console.log('❌ Document AI processing failed for document:', documentId);
+            console.log('❌ Error:', result.error);
+
             // Also delete PDF on processing failure to avoid storage costs
             try {
               await fileStorageService.deleteFile(document.file_path);
@@ -215,14 +251,30 @@ export const documentController = {
             }
           }
         } catch (error) {
-          console.log('❌ Background processing error:', error);
+          const errorMessage = error instanceof Error ? error.message : 'Unknown error';
+          const errorStack = error instanceof Error ? error.stack : undefined;
+          const errorDetails = error instanceof Error ? {
+            name: error.name,
+            message: error.message,
+            stack: error.stack
+          } : {
+            type: typeof error,
+            value: error
+          };
+          console.log('❌ Background processing error:', errorMessage);
+          console.log('❌ Error details:', errorDetails);
+          console.log('❌ Error stack:', errorStack);
           logger.error('Background processing failed', {
-            error,
-            documentId
+            error: errorMessage,
+            errorDetails,
+            documentId,
+            stack: errorStack
           });
           await DocumentModel.updateById(documentId, {
             status: 'failed',
-            error_message: 'Background processing failed'
+            error_message: `Background processing failed: ${errorMessage}`
           });
         }
       })();
```


```diff
@@ -20,7 +20,11 @@ const app = express();

 // Add this middleware to log all incoming requests
 app.use((req, res, next) => {
-  console.log(`Incoming request: ${req.method} ${req.path}`);
+  console.log(`🚀 Incoming request: ${req.method} ${req.path}`);
+  console.log(`🚀 Request headers:`, Object.keys(req.headers));
+  console.log(`🚀 Request body size:`, req.headers['content-length'] || 'unknown');
+  console.log(`🚀 Origin:`, req.headers['origin']);
+  console.log(`🚀 User-Agent:`, req.headers['user-agent']);
   next();
 });
@@ -40,9 +44,12 @@ const allowedOrigins = [
 app.use(cors({
   origin: function (origin, callback) {
+    console.log(`🌐 CORS check for origin: ${origin}`);
     if (!origin || allowedOrigins.indexOf(origin) !== -1) {
+      console.log(`✅ CORS allowed for origin: ${origin}`);
       callback(null, true);
     } else {
+      console.log(`❌ CORS blocked for origin: ${origin}`);
       logger.warn(`CORS blocked for origin: ${origin}`);
       callback(new Error('Not allowed by CORS'));
     }
@@ -117,7 +124,7 @@ app.use(errorHandler);

 // Configure Firebase Functions v2 for larger uploads
 export const api = onRequest({
-  timeoutSeconds: 540, // 9 minutes
+  timeoutSeconds: 1800, // 30 minutes (increased from 9 minutes)
   memory: '2GiB',
   cpu: 1,
   maxInstances: 10,
```


```diff
@@ -15,14 +15,21 @@ export interface DocumentChunk {
   updatedAt: Date;
 }

+export interface VectorSearchResult {
+  documentId: string;
+  similarityScore: number;
+  chunkContent: string;
+  metadata: Record<string, any>;
+}
+
 export class VectorDatabaseModel {
   static async storeDocumentChunks(chunks: Omit<DocumentChunk, 'id' | 'createdAt' | 'updatedAt'>[]): Promise<void> {
     const supabase = getSupabaseServiceClient();
-    const { data, error } = await supabase
+    const { error } = await supabase
       .from('document_chunks')
       .insert(chunks.map(chunk => ({
         ...chunk,
-        embedding: `[${chunk.embedding.join(',')}]` // Format for pgvector
+        embedding: `[${chunk.embedding.join(',')}]`
       })));

     if (error) {
@@ -32,4 +39,104 @@ export class VectorDatabaseModel {
     logger.info(`Stored ${chunks.length} document chunks in vector database`);
   }

+  static async getDocumentChunks(documentId: string): Promise<DocumentChunk[]> {
+    const supabase = getSupabaseServiceClient();
+    const { data, error } = await supabase
+      .from('document_chunks')
+      .select('*')
+      .eq('document_id', documentId)
+      .order('chunk_index');
+    if (error) {
+      logger.error('Failed to get document chunks', error);
+      throw error;
+    }
+    return data || [];
+  }
+
+  static async getAllChunks(): Promise<DocumentChunk[]> {
+    const supabase = getSupabaseServiceClient();
+    const { data, error } = await supabase
+      .from('document_chunks')
+      .select('*')
+      .limit(1000);
+    if (error) {
+      logger.error('Failed to get all chunks', error);
+      throw error;
+    }
+    return data || [];
+  }
+
+  static async getTotalChunkCount(): Promise<number> {
+    const supabase = getSupabaseServiceClient();
+    const { count, error } = await supabase
+      .from('document_chunks')
+      .select('*', { count: 'exact', head: true });
+    if (error) {
+      logger.error('Failed to get total chunk count', error);
+      throw error;
+    }
+    return count || 0;
+  }
+
+  static async getTotalDocumentCount(): Promise<number> {
+    const supabase = getSupabaseServiceClient();
+    const { data, error } = await supabase.rpc('count_distinct_documents');
+    if (error) {
+      logger.error('Failed to get total document count', error);
+      throw error;
+    }
+    return data || 0;
+  }
+
+  static async getAverageChunkSize(): Promise<number> {
+    const supabase = getSupabaseServiceClient();
+    const { data, error } = await supabase.rpc('average_chunk_size');
+    if (error) {
+      logger.error('Failed to get average chunk size', error);
+      throw error;
+    }
+    return data || 0;
+  }
+
+  static async getSearchAnalytics(userId: string, days: number = 30): Promise<any[]> {
+    const supabase = getSupabaseServiceClient();
+    const { data, error } = await supabase.rpc('get_search_analytics', {
+      user_id_param: userId,
+      days_param: days
+    });
+    if (error) {
+      logger.error('Failed to get search analytics', error);
+      throw error;
+    }
+    return data || [];
+  }
+
+  static async getVectorDatabaseStats(): Promise<{
+    totalChunks: number;
+    totalDocuments: number;
+    averageSimilarity: number;
+  }> {
+    const supabase = getSupabaseServiceClient();
+    const { data, error } = await supabase.rpc('get_vector_database_stats');
+    if (error) {
+      logger.error('Failed to get vector database stats', error);
+      throw error;
+    }
+    return data[0] || { totalChunks: 0, totalDocuments: 0, averageSimilarity: 0 };
+  }
 }
```


```diff
@@ -1,6 +1,6 @@
 import fs from 'fs';
 import path from 'path';
-import pool from '../config/database';
+import { getSupabaseServiceClient } from '../config/supabase';
 import logger from '../utils/logger';

 interface Migration {
@@ -16,24 +16,18 @@ class DatabaseMigrator {
     this.migrationsDir = path.join(__dirname, 'migrations');
   }

-  /**
-   * Get all migration files
-   */
   private async getMigrationFiles(): Promise<string[]> {
     try {
       const files = await fs.promises.readdir(this.migrationsDir);
       return files
         .filter(file => file.endsWith('.sql'))
-        .sort(); // Sort to ensure proper order
+        .sort();
     } catch (error) {
       logger.error('Error reading migrations directory:', error);
       throw error;
     }
   }

-  /**
-   * Load migration content
-   */
   private async loadMigration(fileName: string): Promise<Migration> {
     const filePath = path.join(this.migrationsDir, fileName);
     const sql = await fs.promises.readFile(filePath, 'utf-8');
@@ -45,68 +39,66 @@ class DatabaseMigrator {
     };
   }

-  /**
-   * Create migrations table if it doesn't exist
-   */
   private async createMigrationsTable(): Promise<void> {
-    const query = `
-      CREATE TABLE IF NOT EXISTS migrations (
-        id VARCHAR(255) PRIMARY KEY,
-        name VARCHAR(255) NOT NULL,
-        executed_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP
-      );
-    `;
-
-    try {
-      await pool.query(query);
-      logger.info('Migrations table created or already exists');
-    } catch (error) {
+    const supabase = getSupabaseServiceClient();
+    const { error } = await supabase.rpc('exec_sql', {
+      sql: `
+        CREATE TABLE IF NOT EXISTS migrations (
+          id VARCHAR(255) PRIMARY KEY,
+          name VARCHAR(255) NOT NULL,
+          executed_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP
+        );
+      `
+    });
+
+    if (error) {
       logger.error('Error creating migrations table:', error);
       throw error;
     }
+    logger.info('Migrations table created or already exists');
   }

-  /**
-   * Check if migration has been executed
-   */
   private async isMigrationExecuted(migrationId: string): Promise<boolean> {
-    const query = 'SELECT id FROM migrations WHERE id = $1';
-
-    try {
-      const result = await pool.query(query, [migrationId]);
-      return result.rows.length > 0;
-    } catch (error) {
+    const supabase = getSupabaseServiceClient();
+    const { data, error } = await supabase
+      .from('migrations')
+      .select('id')
+      .eq('id', migrationId);
+
+    if (error) {
       logger.error('Error checking migration status:', error);
       throw error;
     }
+    return data.length > 0;
   }

-  /**
-   * Mark migration as executed
-   */
   private async markMigrationExecuted(migrationId: string, name: string): Promise<void> {
-    const query = 'INSERT INTO migrations (id, name) VALUES ($1, $2)';
-
-    try {
-      await pool.query(query, [migrationId, name]);
-      logger.info(`Migration marked as executed: ${name}`);
-    } catch (error) {
+    const supabase = getSupabaseServiceClient();
+    const { error } = await supabase
+      .from('migrations')
+      .insert([{ id: migrationId, name }]);
+
+    if (error) {
       logger.error('Error marking migration as executed:', error);
       throw error;
     }
+    logger.info(`Migration marked as executed: ${name}`);
   }

-  /**
-   * Execute a single migration
-   */
   private async executeMigration(migration: Migration): Promise<void> {
     try {
       logger.info(`Executing migration: ${migration.name}`);

-      // Execute the migration SQL
-      await pool.query(migration.sql);
+      const supabase = getSupabaseServiceClient();
+      const { error } = await supabase.rpc('exec_sql', { sql: migration.sql });
+      if (error) {
+        throw error;
+      }

-      // Mark as executed
       await this.markMigrationExecuted(migration.id, migration.name);

       logger.info(`Migration completed: ${migration.name}`);
@@ -116,25 +108,18 @@ class DatabaseMigrator {
     }
   }

-  /**
-   * Run all pending migrations
-   */
   async migrate(): Promise<void> {
     try {
       logger.info('Starting database migration...');

-      // Create migrations table
       await this.createMigrationsTable();

-      // Get all migration files
       const migrationFiles = await this.getMigrationFiles();
       logger.info(`Found ${migrationFiles.length} migration files`);

-      // Execute each migration
       for (const fileName of migrationFiles) {
         const migration = await this.loadMigration(fileName);

-        // Check if already executed
         const isExecuted = await this.isMigrationExecuted(migration.id);

         if (!isExecuted) {
@@ -150,21 +135,6 @@ class DatabaseMigrator {
       throw error;
     }
   }

-  /**
-   * Get migration status
-   */
-  async getMigrationStatus(): Promise<{ id: string; name: string; executed_at: Date }[]> {
-    const query = 'SELECT id, name, executed_at FROM migrations ORDER BY executed_at';
-
-    try {
-      const result = await pool.query(query);
-      return result.rows;
-    } catch (error) {
-      logger.error('Error getting migration status:', error);
-      throw error;
-    }
-  }
 }

 export default DatabaseMigrator;
```


@@ -1,26 +1,19 @@
+import { v4 as uuidv4 } from 'uuid';
 import bcrypt from 'bcryptjs';
 import { UserModel } from './UserModel';
 import { DocumentModel } from './DocumentModel';
 import { ProcessingJobModel } from './ProcessingJobModel';
 import logger from '../utils/logger';
 import { config } from '../config/env';
-import pool from '../config/database';
+import { getSupabaseServiceClient } from '../config/supabase';
 class DatabaseSeeder {
-  /**
-   * Seed the database with initial data
-   */
   async seed(): Promise<void> {
     try {
       logger.info('Starting database seeding...');
-      // Seed users
       await this.seedUsers();
-      // Seed documents (if any users were created)
       await this.seedDocuments();
-      // Seed processing jobs
       await this.seedProcessingJobs();
       logger.info('Database seeding completed successfully');
@@ -30,9 +23,6 @@ class DatabaseSeeder {
     }
   }
-  /**
-   * Seed users
-   */
   private async seedUsers(): Promise<void> {
     const users = [
       {
@@ -57,14 +47,11 @@ class DatabaseSeeder {
     for (const userData of users) {
       try {
-        // Check if user already exists
         const existingUser = await UserModel.findByEmail(userData.email);
         if (!existingUser) {
-          // Hash password
           const hashedPassword = await bcrypt.hash(userData.password, config.security.bcryptRounds);
-          // Create user
           await UserModel.create({
             ...userData,
             password: hashedPassword
@@ -80,12 +67,8 @@ class DatabaseSeeder {
     }
   }
-  /**
-   * Seed documents
-   */
   private async seedDocuments(): Promise<void> {
     try {
-      // Get a user to associate documents with
       const user = await UserModel.findByEmail('user1@example.com');
       if (!user) {
@@ -98,28 +81,27 @@ class DatabaseSeeder {
         user_id: user.id,
         original_file_name: 'sample_cim_1.pdf',
         file_path: '/uploads/sample_cim_1.pdf',
-        file_size: 2048576, // 2MB
+        file_size: 2048576,
         status: 'completed' as const
       },
       {
         user_id: user.id,
         original_file_name: 'sample_cim_2.pdf',
         file_path: '/uploads/sample_cim_2.pdf',
-        file_size: 3145728, // 3MB
+        file_size: 3145728,
         status: 'processing_llm' as const
       },
       {
         user_id: user.id,
         original_file_name: 'sample_cim_3.pdf',
         file_path: '/uploads/sample_cim_3.pdf',
-        file_size: 1048576, // 1MB
+        file_size: 1048576,
         status: 'uploaded' as const
       }
     ];
     for (const docData of documents) {
       try {
-        // Check if document already exists (by file path)
         const existingDocs = await DocumentModel.findByUserId(user.id);
         const exists = existingDocs.some(doc => doc.file_path === docData.file_path);
@@ -138,12 +120,8 @@ class DatabaseSeeder {
     }
   }
-  /**
-   * Seed processing jobs
-   */
   private async seedProcessingJobs(): Promise<void> {
     try {
-      // Get a document to associate jobs with
       const user = await UserModel.findByEmail('user1@example.com');
       if (!user) {
         logger.warn('No user found for seeding processing jobs');
@@ -157,7 +135,7 @@ class DatabaseSeeder {
         return;
       }
-      const document = documents[0]; // Use first document
+      const document = documents[0];
       if (!document) {
         logger.warn('No document found for seeding processing jobs');
@@ -187,7 +165,6 @@ class DatabaseSeeder {
       for (const jobData of jobs) {
         try {
-          // Check if job already exists
           const existingJobs = await ProcessingJobModel.findByDocumentId(document.id);
           const exists = existingJobs.some(job => job.type === jobData.type);
@@ -197,7 +174,6 @@ class DatabaseSeeder {
             type: jobData.type
           });
-          // Update status and progress
           await ProcessingJobModel.updateStatus(job.id, jobData.status);
           await ProcessingJobModel.updateProgress(job.id, jobData.progress);
@@ -214,23 +190,16 @@ class DatabaseSeeder {
     }
   }
-  /**
-   * Clear all seeded data
-   */
   async clear(): Promise<void> {
     try {
       logger.info('Clearing seeded data...');
-      // Clear in reverse order to respect foreign key constraints
-      await pool.query('DELETE FROM processing_jobs');
-      await pool.query('DELETE FROM document_versions');
-      await pool.query('DELETE FROM document_feedback');
-      await pool.query('DELETE FROM documents');
-      await pool.query('DELETE FROM users WHERE email IN ($1, $2, $3)', [
-        'admin@example.com',
-        'user1@example.com',
-        'user2@example.com'
-      ]);
+      const supabase = getSupabaseServiceClient();
+      await supabase.from('processing_jobs').delete().neq('id', uuidv4());
+      await supabase.from('document_versions').delete().neq('id', uuidv4());
+      await supabase.from('document_feedback').delete().neq('id', uuidv4());
+      await supabase.from('documents').delete().neq('id', uuidv4());
+      await supabase.from('users').delete().in('email', ['admin@example.com', 'user1@example.com', 'user2@example.com']);
       logger.info('Seeded data cleared successfully');
     } catch (error) {
@@ -240,4 +209,4 @@ class DatabaseSeeder {
   }
 }
 export default DatabaseSeeder;
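The rewritten `clear()` trades raw SQL `DELETE` statements for Supabase query-builder calls. The `.neq('id', uuidv4())` filter looks odd: it exists because PostgREST refuses an unfiltered delete, so a filter that every row satisfies (id not equal to a UUID generated this instant, which no stored row can share) deletes the whole table. The pattern simulated locally (a sketch; the object shape is illustrative, not the supabase-js internals):

```typescript
import { randomUUID } from "crypto";

// A "match everything" delete filter: no existing row's id can equal a UUID
// generated right now, so `id != <fresh UUID>` matches every row.
function matchAllDeleteFilter(): { column: string; op: "neq"; value: string } {
  return { column: "id", op: "neq", value: randomUUID() };
}

// Applying the filter to an in-memory table shows nothing survives it:
const rows = [{ id: "a1" }, { id: "b2" }, { id: "c3" }];
const f = matchAllDeleteFilter();
const surviving = rows.filter((r) => !(r.id !== f.value)); // rows NOT matched by neq
console.log(surviving.length); // 0 — every row is selected for deletion
```

Note the original raw-SQL version deleted tables in reverse dependency order to respect foreign keys; the Supabase version above preserves that ordering, which still matters unless the constraints cascade.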

View File

@@ -23,16 +23,13 @@ const router = express.Router();
 router.use(verifyFirebaseToken);
 router.use(addCorrelationId);
-// NEW Firebase Storage direct upload routes
-router.post('/upload-url', documentController.getUploadUrl);
-router.post('/:id/confirm-upload', validateUUID('id'), documentController.confirmUpload);
+// Add logging middleware for document routes
+router.use((req, res, next) => {
+  console.log(`📄 Document route accessed: ${req.method} ${req.path}`);
+  next();
+});
-// LEGACY multipart upload routes (keeping for backward compatibility)
-router.post('/upload', handleFileUpload, documentController.uploadDocument);
-router.post('/', handleFileUpload, documentController.uploadDocument);
-router.get('/', documentController.getDocuments);
-// Analytics endpoints (MUST come before /:id routes to avoid conflicts)
+// Analytics endpoints (MUST come before ANY routes with :id parameters)
 router.get('/analytics', async (req, res) => {
   try {
     const userId = req.user?.uid;
@@ -44,11 +41,9 @@ router.get('/analytics', async (req, res) => {
     }
     const days = parseInt(req.query['days'] as string) || 30;
     // Import the service here to avoid circular dependencies
     const { agenticRAGDatabaseService } = await import('../services/agenticRAGDatabaseService');
     const analytics = await agenticRAGDatabaseService.getAnalyticsData(days);
     return res.json({
       ...analytics,
       correlationId: req.correlationId || undefined
@@ -84,6 +79,15 @@ router.get('/processing-stats', async (req, res) => {
   }
 });
+// NEW Firebase Storage direct upload routes
+router.post('/upload-url', documentController.getUploadUrl);
+router.post('/:id/confirm-upload', validateUUID('id'), documentController.confirmUpload);
+// LEGACY multipart upload routes (keeping for backward compatibility)
+router.post('/upload', handleFileUpload, documentController.uploadDocument);
+router.post('/', handleFileUpload, documentController.uploadDocument);
+router.get('/', documentController.getDocuments);
 // Document-specific routes with UUID validation
 router.get('/:id', validateUUID('id'), documentController.getDocument);
 router.get('/:id/progress', validateUUID('id'), documentController.getDocumentProgress);
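The "MUST come before ANY routes with :id parameters" comment reflects how Express dispatches: routes are tried in registration order, so if `/:id` were registered first, a request for `/analytics` would match it with `id = "analytics"` and fail UUID validation. A minimal matcher (not Express itself, just an illustration of the ordering rule):

```typescript
// Simplified first-match dispatch: literal patterns match exactly, and a
// parameter segment like "/:id" matches any single path segment.
type Route = { pattern: string };

function firstMatch(routes: Route[], path: string): string {
  for (const r of routes) {
    if (r.pattern === path) return r.pattern;
    if (r.pattern.startsWith("/:")) return r.pattern; // param swallows anything
  }
  return "none";
}

const wrongOrder: Route[] = [{ pattern: "/:id" }, { pattern: "/analytics" }];
const rightOrder: Route[] = [{ pattern: "/analytics" }, { pattern: "/:id" }];

console.log(firstMatch(wrongOrder, "/analytics")); // "/:id" — analytics swallowed
console.log(firstMatch(rightOrder, "/analytics")); // "/analytics"
```

This is why the analytics and upload routes are grouped ahead of the `validateUUID('id')` routes in the file above.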

View File

@@ -1,4 +1,8 @@
 import { logger } from '../utils/logger';
+import { DocumentProcessorServiceClient } from '@google-cloud/documentai';
+import { Storage } from '@google-cloud/storage';
+import { config } from '../config/env';
+import pdf from 'pdf-parse';
 interface ProcessingResult {
   success: boolean;
@@ -7,11 +11,46 @@ interface ProcessingResult {
   error?: string;
 }
+interface DocumentAIOutput {
+  text: string;
+  entities: Array<{
+    type: string;
+    mentionText: string;
+    confidence: number;
+  }>;
+  tables: Array<any>;
+  pages: Array<any>;
+  mimeType: string;
+}
+interface PageChunk {
+  startPage: number;
+  endPage: number;
+  buffer: Buffer;
+}
 export class DocumentAiGenkitProcessor {
   private gcsBucketName: string;
+  private documentAiClient: DocumentProcessorServiceClient;
+  private storageClient: Storage;
+  private processorName: string;
+  private readonly MAX_PAGES_PER_CHUNK = 30;
   constructor() {
-    this.gcsBucketName = process.env['GCS_BUCKET_NAME'] || 'cim-summarizer-uploads';
+    this.gcsBucketName = config.googleCloud.gcsBucketName;
+    this.documentAiClient = new DocumentProcessorServiceClient();
+    this.storageClient = new Storage();
+    // Construct the processor name
+    this.processorName = `projects/${config.googleCloud.projectId}/locations/${config.googleCloud.documentAiLocation}/processors/${config.googleCloud.documentAiProcessorId}`;
+    logger.info('Document AI + Genkit processor initialized', {
+      projectId: config.googleCloud.projectId,
+      location: config.googleCloud.documentAiLocation,
+      processorId: config.googleCloud.documentAiProcessorId,
+      processorName: this.processorName,
+      maxPagesPerChunk: this.MAX_PAGES_PER_CHUNK
+    });
   }
   async processDocument(
@@ -19,135 +58,331 @@ export class DocumentAiGenkitProcessor {
     userId: string,
     fileBuffer: Buffer,
     fileName: string,
-    _mimeType: string
+    mimeType: string
   ): Promise<ProcessingResult> {
     const startTime = Date.now();
     try {
-      logger.info('Starting Document AI + Genkit processing', {
+      logger.info('Starting Document AI + Agentic RAG processing', {
         documentId,
         userId,
         fileName,
-        fileSize: fileBuffer.length
+        fileSize: fileBuffer.length,
+        mimeType
       });
-      // Step 1: Upload file to GCS
-      const gcsFilePath = await this.uploadToGCS(fileBuffer, fileName);
-      logger.info('File uploaded to GCS', { gcsFilePath });
+      // Step 1: Extract text using Document AI or fallback
+      const extractedText = await this.extractTextFromDocument(fileBuffer, fileName, mimeType);
+      if (!extractedText) {
+        throw new Error('Failed to extract text from document');
+      }
-      // Step 2: Process with Document AI
-      const documentAiOutput = await this.processWithDocumentAI(gcsFilePath);
-      logger.info('Document AI processing completed', {
-        textLength: documentAiOutput?.text?.length || 0,
-        entitiesCount: documentAiOutput?.entities?.length || 0
-      });
+      logger.info('Text extraction completed', {
+        textLength: extractedText.length
+      });
-      // Step 3: Process with Genkit
-      const genkitOutput = await this.processWithGenkit(fileName);
-      logger.info('Genkit processing completed', {
-        outputLength: genkitOutput?.markdownOutput?.length || 0
-      });
-      // Step 4: Cleanup GCS files
-      await this.cleanupGCSFiles(gcsFilePath);
-      logger.info('GCS cleanup completed');
+      // Step 2: Process extracted text through Agentic RAG
+      const agenticRagResult = await this.processWithAgenticRAG(documentId, extractedText);
       const processingTime = Date.now() - startTime;
       return {
         success: true,
-        content: genkitOutput?.markdownOutput || 'No analysis generated',
+        content: agenticRagResult.summary || extractedText,
         metadata: {
-          processingStrategy: 'document_ai_genkit',
+          processingStrategy: 'document_ai_agentic_rag',
           processingTime,
-          documentAiOutput,
-          genkitOutput,
+          extractedTextLength: extractedText.length,
+          agenticRagResult,
           fileSize: fileBuffer.length,
-          fileName
+          fileName,
+          mimeType
         }
       };
     } catch (error) {
       const processingTime = Date.now() - startTime;
-      logger.error('Document AI + Genkit processing failed', {
+      const errorMessage = error instanceof Error ? error.message : String(error);
+      const errorStack = error instanceof Error ? error.stack : undefined;
+      const errorDetails = error instanceof Error ? {
+        name: error.name,
+        message: error.message,
+        stack: error.stack
+      } : {
+        type: typeof error,
+        value: error
+      };
+      logger.error('Document AI + Agentic RAG processing failed', {
         documentId,
-        error: error instanceof Error ? error.message : String(error),
-        stack: error instanceof Error ? error.stack : undefined
+        error: errorMessage,
+        errorDetails,
+        stack: errorStack,
+        processingTime
       });
       return {
         success: false,
         content: '',
-        error: `Document AI + Genkit processing failed: ${error instanceof Error ? error.message : String(error)}`,
+        error: `Document AI + Agentic RAG processing failed: ${errorMessage}`,
         metadata: {
-          processingStrategy: 'document_ai_genkit',
+          processingStrategy: 'document_ai_agentic_rag',
           processingTime,
-          error: error instanceof Error ? error.message : String(error)
+          error: errorMessage,
+          errorDetails,
+          stack: errorStack
         }
       };
     }
   }
+  private async extractTextFromDocument(fileBuffer: Buffer, fileName: string, mimeType: string): Promise<string> {
+    try {
+      // Check document size first
+      const pdfData = await pdf(fileBuffer);
+      const totalPages = pdfData.numpages;
+      logger.info('PDF analysis completed', {
+        totalPages,
+        textLength: pdfData.text?.length || 0
+      });
+      // If document has more than 30 pages, use pdf-parse fallback
+      if (totalPages > this.MAX_PAGES_PER_CHUNK) {
+        logger.warn('Document exceeds Document AI page limit, using pdf-parse fallback', {
+          totalPages,
+          maxPagesPerChunk: this.MAX_PAGES_PER_CHUNK
+        });
+        return pdfData.text || '';
+      }
+      // For documents <= 30 pages, use Document AI
+      logger.info('Using Document AI for text extraction', {
+        totalPages,
+        maxPagesPerChunk: this.MAX_PAGES_PER_CHUNK
+      });
+      // Upload file to GCS
+      const gcsFilePath = await this.uploadToGCS(fileBuffer, fileName);
+      // Process with Document AI
+      const documentAiOutput = await this.processWithDocumentAI(gcsFilePath, mimeType);
+      // Cleanup GCS file
+      await this.cleanupGCSFiles(gcsFilePath);
+      return documentAiOutput.text;
+    } catch (error) {
+      logger.error('Text extraction failed, using pdf-parse fallback', {
+        error: error instanceof Error ? error.message : String(error)
+      });
+      // Fallback to pdf-parse
+      try {
+        const pdfData = await pdf(fileBuffer);
+        return pdfData.text || '';
+      } catch (fallbackError) {
+        logger.error('Both Document AI and pdf-parse failed', {
+          originalError: error instanceof Error ? error.message : String(error),
+          fallbackError: fallbackError instanceof Error ? fallbackError.message : String(fallbackError)
+        });
+        throw new Error('Failed to extract text from document using any method');
+      }
+    }
+  }
+  private async processWithAgenticRAG(documentId: string, extractedText: string): Promise<any> {
+    try {
+      logger.info('Processing extracted text with Agentic RAG', {
+        documentId,
+        textLength: extractedText.length
+      });
+      // Import and use the optimized agentic RAG processor
+      logger.info('Importing optimized agentic RAG processor...');
+      const { optimizedAgenticRAGProcessor } = await import('./optimizedAgenticRAGProcessor');
+      logger.info('Agentic RAG processor imported successfully', {
+        processorType: typeof optimizedAgenticRAGProcessor,
+        hasProcessLargeDocument: typeof optimizedAgenticRAGProcessor?.processLargeDocument === 'function'
+      });
+      logger.info('Calling processLargeDocument...');
+      const result = await optimizedAgenticRAGProcessor.processLargeDocument(
+        documentId,
+        extractedText,
+        {}
+      );
+      logger.info('Agentic RAG processing completed', {
+        success: result.success,
+        summaryLength: result.summary?.length || 0,
+        analysisDataKeys: result.analysisData ? Object.keys(result.analysisData) : [],
+        resultType: typeof result
+      });
+      return result;
+    } catch (error) {
+      const errorMessage = error instanceof Error ? error.message : String(error);
+      const errorStack = error instanceof Error ? error.stack : undefined;
+      const errorDetails = error instanceof Error ? {
+        name: error.name,
+        message: error.message,
+        stack: error.stack
+      } : {
+        type: typeof error,
+        value: error
+      };
+      logger.error('Agentic RAG processing failed', {
+        documentId,
+        error: errorMessage,
+        errorDetails,
+        stack: errorStack
+      });
+      throw error;
+    }
+  }
   private async uploadToGCS(fileBuffer: Buffer, fileName: string): Promise<string> {
-    // This is a placeholder implementation
-    // In production, this would upload to Google Cloud Storage
-    logger.info('Uploading file to GCS (placeholder)', { fileName, fileSize: fileBuffer.length });
-    // Simulate upload delay
-    await new Promise(resolve => setTimeout(resolve, 100));
-    return `gs://${this.gcsBucketName}/uploads/${fileName}`;
+    try {
+      const bucket = this.storageClient.bucket(this.gcsBucketName);
+      const file = bucket.file(`uploads/${Date.now()}_${fileName}`);
+      logger.info('Uploading file to GCS', {
+        fileName,
+        fileSize: fileBuffer.length,
+        bucket: this.gcsBucketName,
+        destination: file.name
+      });
+      await file.save(fileBuffer, {
+        metadata: {
+          contentType: 'application/pdf'
+        }
+      });
+      logger.info('File uploaded successfully to GCS', {
+        gcsPath: `gs://${this.gcsBucketName}/${file.name}`
+      });
+      return `gs://${this.gcsBucketName}/${file.name}`;
+    } catch (error) {
+      logger.error('Failed to upload file to GCS', {
+        fileName,
+        error: error instanceof Error ? error.message : String(error)
+      });
+      throw error;
+    }
   }
-  private async processWithDocumentAI(gcsFilePath: string): Promise<any> {
-    // This is a placeholder implementation
-    // In production, this would call Google Cloud Document AI
-    logger.info('Processing with Document AI (placeholder)', { gcsFilePath });
-    // Simulate Document AI processing
-    await new Promise(resolve => setTimeout(resolve, 200));
-    return {
-      text: 'Sample extracted text from Document AI',
-      entities: [
-        { type: 'COMPANY_NAME', mentionText: 'Sample Company', confidence: 0.95 },
-        { type: 'MONEY', mentionText: '$10M', confidence: 0.90 }
-      ],
-      tables: []
-    };
-  }
-  private async processWithGenkit(fileName: string): Promise<any> {
-    // This is a placeholder implementation
-    // In production, this would call Genkit for AI analysis
-    logger.info('Processing with Genkit (placeholder)', { fileName });
-    // Simulate Genkit processing
-    await new Promise(resolve => setTimeout(resolve, 300));
-    return {
-      markdownOutput: `# CIM Analysis: ${fileName}
-## Executive Summary
-Sample analysis generated by Document AI + Genkit integration.
-## Key Findings
-- Document processed successfully
-- AI analysis completed
-- Integration working as expected
----
-*Generated by Document AI + Genkit integration*`
-    };
-  }
+  private async processWithDocumentAI(gcsFilePath: string, mimeType: string): Promise<DocumentAIOutput> {
+    try {
+      logger.info('Processing with Document AI', {
+        gcsFilePath,
+        processorName: this.processorName,
+        mimeType
+      });
+      // Create the request
+      const request = {
+        name: this.processorName,
+        rawDocument: {
+          content: '', // We'll use GCS source instead
+          mimeType: mimeType
+        },
+        gcsDocument: {
+          gcsUri: gcsFilePath,
+          mimeType: mimeType
+        }
+      };
+      logger.info('Sending Document AI request', {
+        processorName: this.processorName,
+        gcsUri: gcsFilePath
+      });
+      // Process the document
+      const [result] = await this.documentAiClient.processDocument(request);
+      const { document } = result;
+      if (!document) {
+        throw new Error('Document AI returned no document');
+      }
+      logger.info('Document AI processing successful', {
+        textLength: document.text?.length || 0,
+        pagesCount: document.pages?.length || 0,
+        entitiesCount: document.entities?.length || 0
+      });
+      // Extract text
+      const text = document.text || '';
+      // Extract entities
+      const entities = document.entities?.map(entity => ({
+        type: entity.type || 'UNKNOWN',
+        mentionText: entity.mentionText || '',
+        confidence: entity.confidence || 0
+      })) || [];
+      // Extract tables
+      const tables = document.pages?.flatMap(page =>
+        page.tables?.map(table => ({
+          rows: table.headerRows?.length || 0,
+          columns: table.bodyRows?.[0]?.cells?.length || 0
+        })) || []
+      ) || [];
+      // Extract pages info
+      const pages = document.pages?.map(page => ({
+        pageNumber: page.pageNumber || 0,
+        blocksCount: page.blocks?.length || 0
+      })) || [];
+      return {
+        text,
+        entities,
+        tables,
+        pages,
+        mimeType: document.mimeType || mimeType
+      };
+    } catch (error) {
+      logger.error('Document AI processing failed', {
+        gcsFilePath,
+        processorName: this.processorName,
+        error: error instanceof Error ? error.message : String(error),
+        stack: error instanceof Error ? error.stack : undefined
+      });
+      throw error;
+    }
+  }
   private async cleanupGCSFiles(gcsFilePath: string): Promise<void> {
-    // This is a placeholder implementation
-    // In production, this would delete files from Google Cloud Storage
-    logger.info('Cleaning up GCS files (placeholder)', { gcsFilePath });
-    // Simulate cleanup delay
-    await new Promise(resolve => setTimeout(resolve, 50));
+    try {
+      const bucketName = gcsFilePath.replace('gs://', '').split('/')[0];
+      const fileName = gcsFilePath.replace(`gs://${bucketName}/`, '');
+      logger.info('Cleaning up GCS files', { gcsFilePath, bucketName, fileName });
+      const bucket = this.storageClient.bucket(bucketName);
+      const file = bucket.file(fileName);
+      await file.delete();
+      logger.info('GCS file cleanup completed', { gcsFilePath });
+    } catch (error) {
+      logger.warn('Failed to cleanup GCS files', {
+        gcsFilePath,
+        error: error instanceof Error ? error.message : String(error)
+      });
+      // Don't throw error for cleanup failures
+    }
   }
 }
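The new `extractTextFromDocument` routes on page count: `pdf-parse` cheaply reads the page total, and only documents within the processor's synchronous page limit go through Document AI, while larger ones keep the locally parsed text. The decision in isolation (a sketch of the routing only; the real method also uploads to GCS, calls the processor, and cleans up):

```typescript
// Route by page count, mirroring the MAX_PAGES_PER_CHUNK = 30 limit above.
const MAX_PAGES_PER_CHUNK = 30;

type ExtractionRoute = "document_ai" | "pdf_parse_fallback";

function chooseExtractionRoute(totalPages: number): ExtractionRoute {
  // At or under the limit: send to Document AI; over it: keep local text.
  return totalPages > MAX_PAGES_PER_CHUNK ? "pdf_parse_fallback" : "document_ai";
}

console.log(chooseExtractionRoute(12));  // → "document_ai"
console.log(chooseExtractionRoute(250)); // → "pdf_parse_fallback"
```

Note the method also falls back to `pdf-parse` when the Document AI path throws, so the fallback serves double duty: size limit and error recovery.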

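`cleanupGCSFiles` splits a `gs://` URI back into bucket and object name via string replacement before deleting. The same parsing as a standalone helper (a sketch mirroring the method's logic; the sample path is hypothetical):

```typescript
// Split "gs://<bucket>/<object>" into its bucket and object-name parts,
// the same way cleanupGCSFiles does with replace/split.
function parseGcsUri(gcsUri: string): { bucket: string; objectName: string } {
  const withoutScheme = gcsUri.replace("gs://", "");
  const bucket = withoutScheme.split("/")[0];
  const objectName = withoutScheme.slice(bucket.length + 1);
  return { bucket, objectName };
}

const parsed = parseGcsUri("gs://cim-summarizer-uploads/uploads/1722530276_sample.pdf");
console.log(parsed.bucket);     // → "cim-summarizer-uploads"
console.log(parsed.objectName); // → "uploads/1722530276_sample.pdf"
```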
View File

@@ -83,9 +83,19 @@ export class OptimizedAgenticRAGProcessor {
       logger.info(`Optimized processing completed for document: ${documentId}`, result);
+      console.log('✅ Optimized agentic RAG processing completed successfully for document:', documentId);
+      console.log('✅ Total chunks processed:', result.processedChunks);
+      console.log('✅ Processing time:', result.processingTime, 'ms');
+      console.log('✅ Memory usage:', result.memoryUsage, 'MB');
+      console.log('✅ Summary length:', result.summary?.length || 0);
       return result;
     } catch (error) {
       logger.error(`Optimized processing failed for document: ${documentId}`, error);
+      console.log('❌ Optimized agentic RAG processing failed for document:', documentId);
+      console.log('❌ Error:', error instanceof Error ? error.message : String(error));
       throw error;
     }
   }

View File

@@ -169,6 +169,9 @@ class UnifiedDocumentProcessor {
     } catch (error) {
       logger.error('Optimized agentic RAG processing failed', { documentId, error });
+      console.log('❌ Unified document processor - optimized agentic RAG failed for document:', documentId);
+      console.log('❌ Error:', error instanceof Error ? error.message : String(error));
       return {
         success: false,
         summary: '',
@@ -188,33 +191,60 @@
     documentId: string,
     userId: string,
     text: string,
-    _options: any
+    options: any
   ): Promise<ProcessingResult> {
     logger.info('Using Document AI + Genkit processing strategy', { documentId });
     const startTime = Date.now();
     try {
-      // For now, we'll use the existing text extraction
-      // In a full implementation, this would use the Document AI processor
+      // Get the file buffer from options if available, otherwise use text
+      const fileBuffer = options.fileBuffer || Buffer.from(text);
+      const fileName = options.fileName || `document-${documentId}.pdf`;
+      const mimeType = options.mimeType || 'application/pdf';
+      logger.info('Document AI processing with file data', {
+        documentId,
+        fileSize: fileBuffer.length,
+        fileName,
+        mimeType
+      });
       const result = await documentAiGenkitProcessor.processDocument(
         documentId,
         userId,
-        Buffer.from(text), // Convert text to buffer for processing
-        `document-${documentId}.txt`,
-        'text/plain'
+        fileBuffer,
+        fileName,
+        mimeType
       );
+      if (!result.success) {
+        logger.error('Document AI processing failed', {
+          documentId,
+          error: result.error,
+          metadata: result.metadata
+        });
+      }
       return {
         success: result.success,
         summary: result.content || '',
-        analysisData: (result.metadata?.analysisData as CIMReview) || {} as CIMReview,
+        analysisData: (result.metadata?.agenticRagResult?.analysisData as CIMReview) || {} as CIMReview,
         processingStrategy: 'document_ai_genkit',
         processingTime: Date.now() - startTime,
-        apiCalls: 1, // Document AI + Genkit typically uses fewer API calls
+        apiCalls: 1, // Document AI + Agentic RAG typically uses fewer API calls
         error: result.error || undefined
       };
     } catch (error) {
+      const errorMessage = error instanceof Error ? error.message : String(error);
+      const errorStack = error instanceof Error ? error.stack : undefined;
+      logger.error('Document AI + Genkit processing failed with exception', {
+        documentId,
+        error: errorMessage,
+        stack: errorStack
+      });
       return {
         success: false,
         summary: '',
@@ -222,7 +252,7 @@
         processingStrategy: 'document_ai_genkit',
         processingTime: Date.now() - startTime,
         apiCalls: 0,
-        error: error instanceof Error ? error.message : 'Unknown error'
+        error: errorMessage
       };
     }
   }

View File

@@ -241,13 +241,14 @@ class VectorDatabaseService {
    * Store document chunks with embeddings
    */
   async storeDocumentChunks(chunks: DocumentChunk[]): Promise<void> {
-    const initialized = await this.ensureInitialized();
-    if (!initialized) {
-      logger.warn('Vector database not available, skipping chunk storage');
-      return;
-    }
     try {
+      const isInitialized = await this.ensureInitialized();
+      if (!isInitialized) {
+        logger.warn('Vector database not initialized, skipping chunk storage');
+        return;
+      }
       switch (this.provider) {
         case 'pinecone':
           await this.storeInPinecone(chunks);
@@ -261,11 +262,14 @@
         case 'supabase':
           await this.storeInSupabase(chunks);
           break;
+        default:
+          logger.warn(`Vector database provider ${this.provider} not supported for storage`);
       }
-      logger.info(`Stored ${chunks.length} document chunks in vector database`);
     } catch (error) {
-      logger.error('Failed to store document chunks', error);
-      throw new Error('Vector storage failed');
+      // Log the error but don't fail the entire upload process
+      logger.error('Failed to store document chunks in vector database:', error);
+      logger.warn('Continuing with upload process without vector storage');
+      // Don't throw the error - let the upload continue
     }
   }
@@ -422,7 +426,6 @@
   async getVectorDatabaseStats(): Promise<{
     totalChunks: number;
     totalDocuments: number;
-    totalSearches: number;
     averageSimilarity: number;
   }> {
     try {
@@ -521,13 +524,19 @@
         .upsert(supabaseRows);
       if (error) {
+        // Check if it's a table/column missing error
+        if (error.message && (error.message.includes('chunkIndex') || error.message.includes('document_chunks'))) {
+          logger.warn('Vector database table/columns not available, skipping vector storage:', error.message);
+          return; // Don't throw, just skip vector storage
+        }
        throw error;
      }
      logger.info(`Successfully stored ${chunks.length} chunks in Supabase`);
    } catch (error) {
      logger.error('Failed to store chunks in Supabase:', error);
-      throw error;
+      // Don't throw the error - let the upload continue without vector storage
+      logger.warn('Continuing upload process without vector storage');
    }
  }
@@ -581,4 +590,4 @@
   }
 }
 export const vectorDatabaseService = new VectorDatabaseService();
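The change to `storeDocumentChunks` makes vector storage best-effort: a failure is logged and the upload pipeline continues, instead of aborting the whole document. The pattern in isolation (a sketch with hypothetical names; the real code logs via its `logger` and proceeds without the chunks):

```typescript
// Wrap an operation so a failure is reported but never propagated;
// the caller proceeds with `undefined` in place of the result.
async function bestEffort<T>(
  op: () => Promise<T>,
  onFailure: (err: unknown) => void
): Promise<T | undefined> {
  try {
    return await op();
  } catch (err) {
    onFailure(err); // log and swallow
    return undefined;
  }
}

// A failing store no longer aborts the surrounding flow:
async function demo(): Promise<{ result: unknown; failures: number }> {
  const failures: unknown[] = [];
  const result = await bestEffort(
    async () => { throw new Error("vector db unavailable"); },
    (e) => failures.push(e)
  );
  return { result, failures: failures.length };
}

demo().then(({ result, failures }) => console.log(result, failures)); // undefined 1
```

The trade-off is silent degradation: semantic search over the document won't work until the chunks are re-stored, which is why the code logs a warning rather than nothing.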

View File

@@ -1,89 +1,76 @@
--- Enable the pgvector extension
-CREATE EXTENSION IF NOT EXISTS vector;
--- Create document_chunks table with vector support
-CREATE TABLE IF NOT EXISTS document_chunks (
-  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
-  document_id VARCHAR(255) NOT NULL,
-  chunk_index INTEGER NOT NULL,
-  content TEXT NOT NULL,
-  embedding vector(1536), -- OpenAI embeddings are 1536 dimensions
-  metadata JSONB DEFAULT '{}',
-  created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
-  updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
-);
--- Create indexes for better performance
-CREATE INDEX IF NOT EXISTS document_chunks_document_id_idx ON document_chunks(document_id);
-CREATE INDEX IF NOT EXISTS document_chunks_embedding_idx ON document_chunks USING ivfflat (embedding vector_cosine_ops);
--- Create function to enable pgvector (for RPC calls)
-CREATE OR REPLACE FUNCTION enable_pgvector()
-RETURNS VOID AS $$
-BEGIN
-  CREATE EXTENSION IF NOT EXISTS vector;
-END;
-$$ LANGUAGE plpgsql;
--- Create function to create document_chunks table (for RPC calls)
-CREATE OR REPLACE FUNCTION create_document_chunks_table()
-RETURNS VOID AS $$
-BEGIN
-  CREATE TABLE IF NOT EXISTS document_chunks (
-    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
-    document_id VARCHAR(255) NOT NULL,
-    chunk_index INTEGER NOT NULL,
-    content TEXT NOT NULL,
-    embedding vector(1536),
-    metadata JSONB DEFAULT '{}',
-    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
-    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
-  );
-  CREATE INDEX IF NOT EXISTS document_chunks_document_id_idx ON document_chunks(document_id);
-  CREATE INDEX IF NOT EXISTS document_chunks_embedding_idx ON document_chunks USING ivfflat (embedding vector_cosine_ops);
-END;
-$$ LANGUAGE plpgsql;
--- Create function to match documents based on vector similarity
-CREATE OR REPLACE FUNCTION match_documents(
-  query_embedding vector(1536),
-  match_threshold float DEFAULT 0.7,
-  match_count int DEFAULT 10
-)
-RETURNS TABLE(
-  id UUID,
-  content TEXT,
-  metadata JSONB,
-  document_id VARCHAR(255),
-  similarity FLOAT
-) AS $$
-BEGIN
-  RETURN QUERY
-  SELECT
-    document_chunks.id,
-    document_chunks.content,
-    document_chunks.metadata,
-    document_chunks.document_id,
-    1 - (document_chunks.embedding <=> query_embedding) AS similarity
-  FROM document_chunks
-  WHERE 1 - (document_chunks.embedding <=> query_embedding) > match_threshold
-  ORDER BY document_chunks.embedding <=> query_embedding
-  LIMIT match_count;
-END;
-$$ LANGUAGE plpgsql;
--- Enable Row Level Security (RLS) if needed
--- ALTER TABLE document_chunks ENABLE ROW LEVEL SECURITY;
+-- Create the document_chunks table
+CREATE TABLE IF NOT EXISTS document_chunks (
+  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
+  document_id UUID NOT NULL,
+  content TEXT,
+  metadata JSONB,
+  embedding VECTOR(1536),
+  chunk_index INTEGER,
+  section TEXT,
+  page_number INTEGER,
+  created_at TIMESTAMPTZ DEFAULT NOW(),
+  updated_at TIMESTAMPTZ DEFAULT NOW()
+);
+-- Create the vector_similarity_searches table
+CREATE TABLE IF NOT EXISTS vector_similarity_searches (
+  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
+  user_id UUID,
+  query_text TEXT,
+  query_embedding VECTOR(1536),
+  search_results JSONB,
+  filters JSONB,
+  limit_count INTEGER,
+  similarity_threshold REAL,
+  processing_time_ms INTEGER,
+  created_at TIMESTAMPTZ DEFAULT NOW()
+);
+-- Create the function to count distinct documents
+CREATE OR REPLACE FUNCTION count_distinct_documents()
+RETURNS INTEGER AS $$
+BEGIN
+  RETURN (SELECT COUNT(DISTINCT document_id) FROM document_chunks);
+END;
+$$ LANGUAGE plpgsql;
+-- Create the function to get the average chunk size
+CREATE OR REPLACE FUNCTION average_chunk_size()
+RETURNS INTEGER AS $$
+BEGIN
+  RETURN (SELECT AVG(LENGTH(content)) FROM document_chunks);
+END;
+$$ LANGUAGE plpgsql;
+-- Create the function to get search analytics
+CREATE OR REPLACE FUNCTION get_search_analytics(user_id_param UUID, days_param INTEGER)
+RETURNS TABLE(query_text TEXT, search_count BIGINT) AS $$
+BEGIN
+  RETURN QUERY
+  SELECT
+    vs.query_text,
+    COUNT(*) as search_count
+  FROM
+    vector_similarity_searches vs
+  WHERE
+    vs.user_id = user_id_param AND
+    vs.created_at >= NOW() - (days_param * INTERVAL '1 day')
+  GROUP BY
+    vs.query_text
+  ORDER BY
+    search_count DESC
+  LIMIT 20;
+END;
+$$ LANGUAGE plpgsql;
+-- Create the function to get vector database stats
+CREATE OR REPLACE FUNCTION get_vector_database_stats()
RETURNS TABLE(total_chunks BIGINT, total_documents BIGINT, average_similarity REAL) AS $$
-- Create policies for RLS (adjust as needed for your auth requirements) BEGIN
-- CREATE POLICY "Users can view all document chunks" ON document_chunks FOR SELECT USING (true); RETURN QUERY
-- CREATE POLICY "Users can insert document chunks" ON document_chunks FOR INSERT WITH CHECK (true); SELECT
-- CREATE POLICY "Users can update document chunks" ON document_chunks FOR UPDATE USING (true); (SELECT COUNT(*) FROM document_chunks),
-- CREATE POLICY "Users can delete document chunks" ON document_chunks FOR DELETE USING (true); (SELECT COUNT(DISTINCT document_id) FROM document_chunks),
(SELECT AVG(similarity_score) FROM document_similarities WHERE similarity_score > 0);
-- Grant necessary permissions END;
GRANT ALL ON document_chunks TO authenticated; $$ LANGUAGE plpgsql;
GRANT ALL ON document_chunks TO anon;
GRANT EXECUTE ON FUNCTION match_documents TO authenticated;
GRANT EXECUTE ON FUNCTION match_documents TO anon;
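For intuition, the ranking that `match_documents` performs in SQL can be sketched in TypeScript. This is a hypothetical in-memory illustration, not code from the repo; pgvector's `<=>` operator is cosine distance, so similarity is computed as `1 - distance`:

```typescript
// Hypothetical in-memory version of match_documents: rank chunks by cosine
// similarity to a query embedding, keep those above a threshold, return top-k.
interface Chunk {
  id: string;
  content: string;
  embedding: number[];
}

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Mirrors: WHERE 1 - (embedding <=> query) > match_threshold
//          ORDER BY embedding <=> query LIMIT match_count
function matchChunks(
  chunks: Chunk[],
  queryEmbedding: number[],
  matchThreshold = 0.7,
  matchCount = 10
): { chunk: Chunk; similarity: number }[] {
  return chunks
    .map(chunk => ({ chunk, similarity: cosineSimilarity(chunk.embedding, queryEmbedding) }))
    .filter(r => r.similarity > matchThreshold)
    .sort((x, y) => y.similarity - x.similarity)
    .slice(0, matchCount);
}
```

Postgres does the same work with an ivfflat index instead of a full scan, which is the point of keeping the search in the database.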
currrent_output.json (new file, 374 lines; diff suppressed because one or more lines are too long)
@@ -10,7 +10,7 @@ import Analytics from './components/Analytics';
 import UploadMonitoringDashboard from './components/UploadMonitoringDashboard';
 import LogoutButton from './components/LogoutButton';
 import { documentService, GCSErrorHandler, GCSError } from './services/documentService';
-import { debugAuth, testAPIAuth } from './utils/authDebug';
+// import { debugAuth, testAPIAuth } from './utils/authDebug';
 import {
   Home,
@@ -75,13 +75,14 @@ const Dashboard: React.FC = () => {
       if (response.ok) {
         const result = await response.json();
-        // The API returns an array directly, not wrapped in success/data
-        if (Array.isArray(result)) {
+        // The API returns documents wrapped in a documents property
+        const documentsArray = result.documents || result;
+        if (Array.isArray(documentsArray)) {
           // Transform backend data to frontend format
-          const transformedDocs = result.map((doc: any) => ({
+          const transformedDocs = documentsArray.map((doc: any) => ({
             id: doc.id,
-            name: doc.name || doc.originalName,
-            originalName: doc.originalName,
+            name: doc.name || doc.originalName || 'Unknown',
+            originalName: doc.originalName || doc.name || 'Unknown',
             status: mapBackendStatus(doc.status),
             uploadedAt: doc.uploadedAt,
             processedAt: doc.processedAt,
@@ -216,10 +217,22 @@ const Dashboard: React.FC = () => {
     return () => clearInterval(refreshInterval);
   }, [fetchDocuments]);

-  const handleUploadComplete = (fileId: string) => {
-    console.log('Upload completed:', fileId);
-    // Refresh documents list after upload
-    fetchDocuments();
+  const handleUploadComplete = (documentId: string) => {
+    console.log('Upload completed:', documentId);
+    // Add the new document to the list with a "processing" status
+    // Since we only have the ID, we'll create a minimal document object
+    const newDocument = {
+      id: documentId,
+      status: 'processing',
+      name: 'Processing...',
+      originalName: 'Processing...',
+      uploadedAt: new Date().toISOString(),
+      fileSize: 0,
+      user_id: user?.id || '',
+      created_at: new Date().toISOString(),
+      updated_at: new Date().toISOString()
+    };
+    setDocuments(prev => [...prev, newDocument]);
   };

   const handleUploadError = (error: string) => {
@@ -291,18 +304,18 @@ const Dashboard: React.FC = () => {
     setViewingDocument(null);
   };

-  // Debug functions
-  const handleDebugAuth = async () => {
-    await debugAuth();
-  };
-
-  const handleTestAPIAuth = async () => {
-    await testAPIAuth();
-  };
+  // Debug functions (commented out for now)
+  // const handleDebugAuth = async () => {
+  //   await debugAuth();
+  // };
+
+  // const handleTestAPIAuth = async () => {
+  //   await testAPIAuth();
+  // };

   const filteredDocuments = documents.filter(doc =>
-    doc.name.toLowerCase().includes(searchTerm.toLowerCase()) ||
-    doc.originalName.toLowerCase().includes(searchTerm.toLowerCase())
+    (doc.name?.toLowerCase() || '').includes(searchTerm.toLowerCase()) ||
+    (doc.originalName?.toLowerCase() || '').includes(searchTerm.toLowerCase())
   );

   const stats = {
@@ -21,7 +21,7 @@ interface UploadedFile {
 }

 interface DocumentUploadProps {
-  onUploadComplete?: (fileId: string) => void;
+  onUploadComplete?: (documentId: string) => void;
   onUploadError?: (error: string) => void;
 }
@@ -104,15 +104,15 @@ const DocumentUpload: React.FC<DocumentUploadProps> = ({
         abortController.signal
       );

-      // Upload completed - update status to "uploaded"
+      // Upload completed - update status to "processing" immediately
       setUploadedFiles(prev =>
         prev.map(f =>
           f.id === uploadedFile.id
             ? {
                 ...f,
-                id: document.id,
-                documentId: document.id,
-                status: 'uploaded',
+                id: result.id,
+                documentId: result.id,
+                status: 'processing', // Changed from 'uploaded' to 'processing'
                 progress: 100
               }
             : f
@@ -120,10 +120,10 @@ const DocumentUpload: React.FC<DocumentUploadProps> = ({
       );

       // Call the completion callback with the document ID
-      onUploadComplete?.(document.id);
+      onUploadComplete?.(result.id);

-      // Start monitoring processing progress
-      monitorProcessingProgress(document.id, uploadedFile.id);
+      // Start monitoring processing progress immediately
+      monitorProcessingProgress(result.id, uploadedFile.id);
     } catch (error) {
       // Check if this was an abort error
@@ -189,8 +189,29 @@ const DocumentUpload: React.FC<DocumentUploadProps> = ({
       console.warn('Attempted to monitor progress for document with invalid UUID format:', documentId);
       return;
     }

+    // Add timeout to prevent infinite polling (30 minutes max)
+    const startTime = Date.now();
+    const maxPollingTime = 30 * 60 * 1000; // 30 minutes
+
     const checkProgress = async () => {
+      // Check if we've exceeded the maximum polling time
+      if (Date.now() - startTime > maxPollingTime) {
+        console.warn(`Polling timeout for document ${documentId} after ${maxPollingTime / 1000 / 60} minutes`);
+        setUploadedFiles(prev =>
+          prev.map(f =>
+            f.id === fileId
+              ? {
+                  ...f,
+                  status: 'error',
+                  error: 'Processing timeout - please check document status manually'
+                }
+              : f
+          )
+        );
+        return;
+      }
+
       try {
         const response = await fetch(`${import.meta.env.VITE_API_BASE_URL}/documents/${documentId}/progress`, {
           headers: {
@@ -203,8 +224,10 @@ const DocumentUpload: React.FC<DocumentUploadProps> = ({
           const progress = await response.json();

           // Update status based on progress
-          let newStatus: UploadedFile['status'] = 'uploaded';
-          if (progress.status === 'processing' || progress.status === 'extracting_text' || progress.status === 'processing_llm' || progress.status === 'generating_pdf') {
+          let newStatus: UploadedFile['status'] = 'processing'; // Default to processing
+          if (progress.status === 'uploading' || progress.status === 'uploaded') {
+            // Still processing
+          } else if (progress.status === 'processing' || progress.status === 'extracting_text' || progress.status === 'processing_llm' || progress.status === 'generating_pdf') {
             newStatus = 'processing';
           } else if (progress.status === 'completed') {
             newStatus = 'completed';
@@ -242,12 +265,12 @@ const DocumentUpload: React.FC<DocumentUploadProps> = ({
         // Don't stop monitoring on network errors, just log and continue
       }

-      // Continue monitoring
-      setTimeout(checkProgress, 2000);
+      // Continue monitoring with shorter intervals for better responsiveness
+      setTimeout(checkProgress, 3000); // Check every 3 seconds
     };

-    // Start monitoring
-    setTimeout(checkProgress, 1000);
+    // Start monitoring immediately
+    setTimeout(checkProgress, 500); // Start checking after 500ms
   }, [token]);

   const { getRootProps, getInputProps, isDragActive } = useDropzone({
@@ -378,7 +401,7 @@ const DocumentUpload: React.FC<DocumentUploadProps> = ({
               <h4 className="text-sm font-medium text-success-800">Upload Complete</h4>
               <p className="text-sm text-success-700 mt-1">
                 Files have been uploaded successfully to Firebase Storage! You can now navigate away from this page.
-                Processing will continue in the background using Document AI + Optimized Agentic RAG. PDFs will be automatically deleted after processing to save costs.
+                Processing will continue in the background using Document AI + Optimized Agentic RAG. This can take several minutes. PDFs will be automatically deleted after processing to save costs.
               </p>
             </div>
           </div>
@@ -7,7 +7,7 @@ const API_BASE_URL = config.apiBaseUrl;
 // Create axios instance with auth interceptor
 const apiClient = axios.create({
   baseURL: API_BASE_URL,
-  timeout: 30000, // 30 seconds
+  timeout: 300000, // 5 minutes
 });

 // Add auth token to requests
// Add auth token to requests // Add auth token to requests
@@ -263,14 +263,46 @@ class DocumentService {
       // Step 3: Confirm upload and trigger processing
       onProgress?.(95); // 95% - Confirming upload

-      const confirmResponse = await apiClient.post(`/documents/${documentId}/confirm-upload`, {}, { signal });
+      console.log('🔄 Making confirm-upload request for document:', documentId);
+      console.log('🔄 Confirm-upload URL:', `/documents/${documentId}/confirm-upload`);
+
+      // Add retry logic for confirm-upload (based on Google Cloud best practices)
+      let confirmResponse;
+      let lastError;
+
+      for (let attempt = 1; attempt <= 3; attempt++) {
+        try {
+          console.log(`🔄 Confirm-upload attempt ${attempt}/3`);
+          confirmResponse = await apiClient.post(`/documents/${documentId}/confirm-upload`, {}, {
+            signal,
+            timeout: 60000 // 60 second timeout for confirm-upload
+          });
+          console.log('✅ Confirm-upload response received:', confirmResponse.status);
+          console.log('✅ Confirm-upload response data:', confirmResponse.data);
+          break; // Success, exit retry loop
+        } catch (error: any) {
+          lastError = error;
+          console.log(`❌ Confirm-upload attempt ${attempt} failed:`, error.message);
+
+          if (attempt < 3) {
+            // Wait before retry (exponential backoff)
+            const delay = Math.pow(2, attempt) * 1000; // 2s, 4s
+            console.log(`⏳ Waiting ${delay}ms before retry...`);
+            await new Promise(resolve => setTimeout(resolve, delay));
+          }
+        }
+      }
+
+      if (!confirmResponse) {
+        throw lastError || new Error('Confirm-upload failed after 3 attempts');
+      }

       onProgress?.(100); // 100% - Complete
       console.log('✅ Upload confirmed and processing started');

       return {
         id: documentId,
-        ...confirmResponse.data
+        ...confirmResponse.data.document
       };
     } catch (error: any) {
@@ -281,6 +313,16 @@ class DocumentService {
         throw new Error('Upload was cancelled.');
       }

+      // Handle network timeouts
+      if (error.code === 'ECONNABORTED' || error.message?.includes('timeout')) {
+        throw new Error('Request timed out. Please check your connection and try again.');
+      }
+
+      // Handle network errors
+      if (error.code === 'ERR_NETWORK' || error.message?.includes('Network Error')) {
+        throw new Error('Network error. Please check your connection and try again.');
+      }
+
       if (error.response?.status === 401) {
         throw new Error('Authentication required. Please log in again.');
       }