Major release with significant performance improvements and new processing strategy.

## Core Changes
- Implemented simple_full_document processing strategy (default)
- Full document → LLM approach: 1-2 passes, ~5-6 minutes processing time
- Achieved 100% completeness with 2 API calls (down from 5+)
- Removed redundant Document AI passes for faster processing

## Financial Data Extraction
- Enhanced deterministic financial table parser
- Improved FY3/FY2/FY1/LTM identification from varying CIM formats
- Automatic merging of parser results with LLM extraction

## Code Quality & Infrastructure
- Cleaned up debug logging (removed emoji markers from production code)
- Fixed Firebase Secrets configuration (using modern defineSecret approach)
- Updated OpenAI API key
- Resolved deployment conflicts (secrets vs environment variables)
- Added .env files to Firebase ignore list

## Deployment
- Firebase Functions v2 deployment successful
- All 7 required secrets verified and configured
- Function URL: https://api-y56ccs6wva-uc.a.run.app

## Performance Improvements
- Processing time: ~5-6 minutes (down from 23+ minutes)
- API calls: 1-2 (down from 5+)
- Completeness: 100% achievable
- LLM Model: claude-3-7-sonnet-latest

## Breaking Changes
- Default processing strategy changed to 'simple_full_document'
- RAG processor available as alternative strategy 'document_ai_agentic_rag'

## Files Changed
- 36 files changed, 5642 insertions(+), 4451 deletions(-)
- Removed deprecated documentation files
- Cleaned up unused services and models

This release represents a major refactoring focused on speed, accuracy, and maintainability.
# CIM Summary Codebase Architecture Summary

**Last Updated**: December 2024

**Purpose**: Comprehensive technical reference for senior developers optimizing and debugging the codebase

---

## Table of Contents

1. [System Overview](#1-system-overview)
2. [Application Entry Points](#2-application-entry-points)
3. [Request Flow & API Architecture](#3-request-flow--api-architecture)
4. [Document Processing Pipeline (Critical Path)](#4-document-processing-pipeline-critical-path)
5. [Core Services Deep Dive](#5-core-services-deep-dive)
6. [Data Models & Database Schema](#6-data-models--database-schema)
7. [Component Handoffs & Integration Points](#7-component-handoffs--integration-points)
8. [Error Handling & Resilience](#8-error-handling--resilience)
9. [Performance Optimization Points](#9-performance-optimization-points)
10. [Background Processing Architecture](#10-background-processing-architecture)
11. [Frontend Architecture](#11-frontend-architecture)
12. [Configuration & Environment](#12-configuration--environment)
13. [Debugging Guide](#13-debugging-guide)
14. [Optimization Opportunities](#14-optimization-opportunities)

---

## 1. System Overview

### High-Level Architecture

```
┌─────────────────────────────────────────────────────────────────┐
│                        Frontend (React)                         │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐           │
│  │DocumentUpload│  │ DocumentList │  │  Analytics   │           │
│  └──────┬───────┘  └──────┬───────┘  └──────┬───────┘           │
│         │                 │                 │                   │
│         └─────────────────┴─────────────────┘                   │
│                           │                                     │
│                  ┌────────▼────────┐                            │
│                  │ documentService │                            │
│                  │ (Axios Client)  │                            │
│                  └────────┬────────┘                            │
└───────────────────────────┼─────────────────────────────────────┘
                            │ HTTPS + JWT
┌───────────────────────────▼─────────────────────────────────────┐
│                  Backend (Express + Node.js)                    │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ Middleware Chain: CORS → Auth → Validation → Error Handler  │ │
│ └─────────────────────────────────────────────────────────────┘ │
│                           │                                     │
│        ┌──────────────────┼──────────────────┐                  │
│        │                  │                  │                  │
│  ┌─────▼─────┐   ┌────────▼────────┐   ┌────▼───────┐           │
│  │  Routes   │   │   Controllers   │   │  Services  │           │
│  └─────┬─────┘   └────────┬────────┘   └────┬───────┘           │
│        │                  │                  │                  │
│        └──────────────────┴──────────────────┘                  │
└───────────────────────────┬─────────────────────────────────────┘
                            │
       ┌────────────────────┼────────────────────┐
       │                    │                    │
  ┌────▼─────┐       ┌──────▼──────┐     ┌───────▼───────┐
  │ Supabase │       │Google Cloud │     │   LLM APIs    │
  │(Postgres)│       │   Storage   │     │(Claude/OpenAI)│
  └──────────┘       └─────────────┘     └───────────────┘
```


### Technology Stack

**Frontend:**
- React 18 + TypeScript
- Vite (build tool)
- Axios (HTTP client)
- Firebase Auth (authentication)
- React Router (routing)

**Backend:**
- Node.js + Express + TypeScript
- Firebase Functions v2 (deployment)
- Supabase (PostgreSQL + Vector DB)
- Google Cloud Storage (file storage)
- Google Document AI (PDF text extraction)
- Puppeteer (PDF generation)

**AI/ML Services:**
- Anthropic Claude (primary LLM)
- OpenAI (fallback LLM)
- OpenRouter (LLM routing)
- OpenAI Embeddings (vector embeddings)

### Core Purpose

Automated processing and analysis of Confidential Information Memorandums (CIMs) using:

1. **Text Extraction**: Google Document AI extracts text from PDFs
2. **Semantic Chunking**: Split text into 4000-char chunks with overlap
3. **Vector Embeddings**: Generate embeddings for semantic search
4. **LLM Analysis**: Claude AI analyzes chunks and generates structured CIMReview data
5. **PDF Generation**: Create summary PDF with analysis results

---

## 2. Application Entry Points

### Backend Entry Point

**File**: `backend/src/index.ts`

```1:22:backend/src/index.ts
// Initialize Firebase Admin SDK first
import './config/firebase';

import express from 'express';
import cors from 'cors';
import helmet from 'helmet';
import morgan from 'morgan';
import rateLimit from 'express-rate-limit';
import { config } from './config/env';
import { logger } from './utils/logger';
import documentRoutes from './routes/documents';
import vectorRoutes from './routes/vector';
import monitoringRoutes from './routes/monitoring';
import auditRoutes from './routes/documentAudit';
import { jobQueueService } from './services/jobQueueService';

import { errorHandler, correlationIdMiddleware } from './middleware/errorHandler';
import { notFoundHandler } from './middleware/notFoundHandler';

// Start the job queue service for background processing
jobQueueService.start();
```

**Key Initialization Steps:**
1. Firebase Admin SDK initialization (`./config/firebase`)
2. Express app setup with middleware chain
3. Route registration (`/documents`, `/vector`, `/monitoring`, `/api/audit`)
4. Job queue service startup (legacy in-memory queue)
5. Firebase Functions export for Cloud deployment

**Scheduled Function**: `processDocumentJobs` (```210:267:backend/src/index.ts```)
- Runs every minute via Firebase Cloud Scheduler
- Processes pending/retrying jobs from the database
- Detects and resets stuck jobs

### Frontend Entry Point

**File**: `frontend/src/main.tsx`

```1:10:frontend/src/main.tsx
import React from 'react';
import ReactDOM from 'react-dom/client';
import App from './App';
import './index.css';

ReactDOM.createRoot(document.getElementById('root')!).render(
  <React.StrictMode>
    <App />
  </React.StrictMode>
);
```

**Main App Component**: `frontend/src/App.tsx`
- Sets up React Router
- Provides AuthContext
- Renders protected routes and dashboard

---

## 3. Request Flow & API Architecture

### Request Lifecycle

```
Client Request
       │
       ▼
┌─────────────────────────────────────┐
│ 1. CORS Middleware                  │
│    - Validates origin               │
│    - Sets CORS headers              │
└──────────────┬──────────────────────┘
               │
               ▼
┌─────────────────────────────────────┐
│ 2. Correlation ID Middleware        │
│    - Generates/reads correlation ID │
│      (X-Correlation-ID header)      │
│    - Adds to request object         │
└──────────────┬──────────────────────┘
               │
               ▼
┌─────────────────────────────────────┐
│ 3. Firebase Auth Middleware         │
│    - Verifies JWT token             │
│    - Attaches user to req.user      │
└──────────────┬──────────────────────┘
               │
               ▼
┌─────────────────────────────────────┐
│ 4. Rate Limiting                    │
│    - 1000 requests per 15 minutes   │
└──────────────┬──────────────────────┘
               │
               ▼
┌─────────────────────────────────────┐
│ 5. Body Parsing                     │
│    - JSON (10MB limit)              │
│    - URL-encoded (10MB limit)       │
└──────────────┬──────────────────────┘
               │
               ▼
┌─────────────────────────────────────┐
│ 6. Route Handler                    │
│    - Matches route pattern          │
│    - Calls controller method        │
└──────────────┬──────────────────────┘
               │
               ▼
┌─────────────────────────────────────┐
│ 7. Controller                       │
│    - Validates input                │
│    - Calls service methods          │
│    - Returns response               │
└──────────────┬──────────────────────┘
               │
               ▼
┌─────────────────────────────────────┐
│ 8. Service Layer                    │
│    - Business logic                 │
│    - Database operations            │
│    - External API calls             │
└──────────────┬──────────────────────┘
               │
               ▼
┌─────────────────────────────────────┐
│ 9. Error Handler (if error)         │
│    - Categorizes error              │
│    - Logs with correlation ID       │
│    - Returns structured response    │
└─────────────────────────────────────┘
```
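The ordering above is significant: any middleware that short-circuits (auth, rate limiting) prevents everything after it from running. A minimal model of such a chain, with illustrative handler names rather than the project's actual code:

```typescript
// Minimal model of an ordered middleware chain: each middleware either
// calls next() to pass control along, or short-circuits with a response.
type Ctx = { headers: Record<string, string>; status?: number };
type Middleware = (ctx: Ctx, next: () => void) => void;

function runChain(ctx: Ctx, chain: Middleware[]): Ctx {
  let i = -1;
  const next = () => {
    i += 1;
    if (i < chain.length) chain[i](ctx, next);
  };
  next();
  return ctx;
}

// Illustrative auth middleware: rejects with 401 unless a Bearer token
// is present; later middleware never runs on rejection.
const auth: Middleware = (ctx, next) =>
  ctx.headers['authorization']?.startsWith('Bearer ')
    ? next()
    : void (ctx.status = 401);

// Illustrative terminal handler.
const handler: Middleware = (ctx) => { ctx.status = 200; };
```

Express implements exactly this dispatch pattern internally, which is why `router.use()` registration order determines the lifecycle steps shown above.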
### Authentication Flow

**Middleware**: `backend/src/middleware/firebaseAuth.ts`

```27:81:backend/src/middleware/firebaseAuth.ts
export const verifyFirebaseToken = async (
  req: FirebaseAuthenticatedRequest,
  res: Response,
  next: NextFunction
): Promise<void> => {
  try {
    console.log('🔐 Authentication middleware called for:', req.method, req.url);
    console.log('🔐 Request headers:', Object.keys(req.headers));

    // Debug Firebase Admin initialization
    console.log('🔐 Firebase apps available:', admin.apps.length);
    console.log('🔐 Firebase app names:', admin.apps.filter(app => app !== null).map(app => app!.name));

    const authHeader = req.headers.authorization;
    console.log('🔐 Auth header present:', !!authHeader);
    console.log('🔐 Auth header starts with Bearer:', authHeader?.startsWith('Bearer '));

    if (!authHeader || !authHeader.startsWith('Bearer ')) {
      console.log('❌ No valid authorization header');
      res.status(401).json({ error: 'No valid authorization header' });
      return;
    }

    const idToken = authHeader.split('Bearer ')[1];
    console.log('🔐 Token extracted, length:', idToken?.length);

    if (!idToken) {
      console.log('❌ No token provided');
      res.status(401).json({ error: 'No token provided' });
      return;
    }

    console.log('🔐 Attempting to verify Firebase ID token...');
    console.log('🔐 Token preview:', idToken.substring(0, 20) + '...');

    // Verify the Firebase ID token
    const decodedToken = await admin.auth().verifyIdToken(idToken, true);
    console.log('✅ Token verified successfully for user:', decodedToken.email);
    console.log('✅ Token UID:', decodedToken.uid);
    console.log('✅ Token issuer:', decodedToken.iss);

    // Check if token is expired
    const now = Math.floor(Date.now() / 1000);
    if (decodedToken.exp && decodedToken.exp < now) {
      logger.warn('Token expired for user:', decodedToken.uid);
      res.status(401).json({ error: 'Token expired' });
      return;
    }

    req.user = decodedToken;

    // Log successful authentication
    logger.info('Authenticated request for user:', decodedToken.email);

    next();
```

**Frontend Auth**: `frontend/src/services/authService.ts`
- Manages Firebase Auth state
- Provides token via `getToken()`
- Axios interceptor adds token to requests

### Route Structure

**Main Routes** (`backend/src/routes/documents.ts`):
- `POST /documents/upload-url` - Get signed upload URL
- `POST /documents/:id/confirm-upload` - Confirm upload and start processing
- `GET /documents` - List user's documents
- `GET /documents/:id` - Get document details
- `GET /documents/:id/download` - Download processed PDF
- `GET /documents/analytics` - Get processing analytics
- `POST /documents/:id/process-optimized-agentic-rag` - Trigger AI processing

**Middleware Applied**:
```22:29:backend/src/routes/documents.ts
// Apply authentication and correlation ID to all routes
router.use(verifyFirebaseToken);
router.use(addCorrelationId);

// Add logging middleware for document routes
router.use((req, res, next) => {
  console.log(`📄 Document route accessed: ${req.method} ${req.path}`);
  next();
});
```

---

## 4. Document Processing Pipeline (Critical Path)

### Complete Flow Diagram

```
┌─────────────────────────────────────────────────────────────────────┐
│                    DOCUMENT PROCESSING PIPELINE                     │
└─────────────────────────────────────────────────────────────────────┘

1. UPLOAD PHASE
   ┌─────────────────────────────────────────────────────────────┐
   │ User selects PDF                                            │
   │   ↓                                                         │
   │ DocumentUpload component                                    │
   │   ↓                                                         │
   │ documentService.uploadDocument()                            │
   │   ↓                                                         │
   │ POST /documents/upload-url                                  │
   │   ↓                                                         │
   │ documentController.getUploadUrl()                           │
   │   ↓                                                         │
   │ DocumentModel.create() → documents table                    │
   │   ↓                                                         │
   │ fileStorageService.generateSignedUploadUrl()                │
   │   ↓                                                         │
   │ Direct upload to GCS via signed URL                         │
   │   ↓                                                         │
   │ POST /documents/:id/confirm-upload                          │
   └─────────────────────────────────────────────────────────────┘

2. JOB CREATION PHASE
   ┌─────────────────────────────────────────────────────────────┐
   │ documentController.confirmUpload()                          │
   │   ↓                                                         │
   │ ProcessingJobModel.create() → processing_jobs table         │
   │   ↓                                                         │
   │ Status: 'pending'                                           │
   │   ↓                                                         │
   │ Returns 202 Accepted (async processing)                     │
   └─────────────────────────────────────────────────────────────┘

3. JOB PROCESSING PHASE (Background)
   ┌─────────────────────────────────────────────────────────────┐
   │ Scheduled Function: processDocumentJobs (every 1 minute)    │
   │   OR                                                        │
   │ Immediate processing via jobProcessorService.processJob()   │
   │   ↓                                                         │
   │ JobProcessorService.processJob()                            │
   │   ↓                                                         │
   │ Download file from GCS                                      │
   │   ↓                                                         │
   │ unifiedDocumentProcessor.processDocument()                  │
   └─────────────────────────────────────────────────────────────┘

4. TEXT EXTRACTION PHASE
   ┌─────────────────────────────────────────────────────────────┐
   │ documentAiProcessor.processDocument()                       │
   │   ↓                                                         │
   │ Google Document AI API                                      │
   │   ↓                                                         │
   │ Extracted text returned                                     │
   └─────────────────────────────────────────────────────────────┘

5. CHUNKING & EMBEDDING PHASE
   ┌─────────────────────────────────────────────────────────────┐
   │ optimizedAgenticRAGProcessor.processLargeDocument()         │
   │   ↓                                                         │
   │ createIntelligentChunks()                                   │
   │   - Semantic boundary detection                             │
   │   - 4000-char chunks with 200-char overlap                  │
   │   ↓                                                         │
   │ processChunksInBatches()                                    │
   │   - Batch size: 10                                          │
   │   - Max concurrent: 5                                       │
   │   ↓                                                         │
   │ storeChunksOptimized()                                      │
   │   ↓                                                         │
   │ vectorDatabaseService.storeEmbedding()                      │
   │   - OpenAI embeddings API                                   │
   │   - Store in document_chunks table                          │
   └─────────────────────────────────────────────────────────────┘

6. LLM ANALYSIS PHASE
   ┌─────────────────────────────────────────────────────────────┐
   │ generateLLMAnalysisHybrid()                                 │
   │   ↓                                                         │
   │ llmService.processCIMDocument()                             │
   │   ↓                                                         │
   │ Vector search for relevant chunks                           │
   │   ↓                                                         │
   │ Claude/OpenAI API call with structured prompt               │
   │   ↓                                                         │
   │ Parse and validate CIMReview JSON                           │
   │   ↓                                                         │
   │ Return structured analysisData                              │
   └─────────────────────────────────────────────────────────────┘

7. PDF GENERATION PHASE
   ┌─────────────────────────────────────────────────────────────┐
   │ pdfGenerationService.generatePDF()                          │
   │   ↓                                                         │
   │ Puppeteer browser instance                                  │
   │   ↓                                                         │
   │ Render HTML template with analysisData                      │
   │   ↓                                                         │
   │ Generate PDF buffer                                         │
   │   ↓                                                         │
   │ Upload PDF to GCS                                           │
   │   ↓                                                         │
   │ Update document record with PDF path                        │
   └─────────────────────────────────────────────────────────────┘

8. STATUS UPDATE PHASE
   ┌─────────────────────────────────────────────────────────────┐
   │ DocumentModel.updateById()                                  │
   │   - status: 'completed'                                     │
   │   - pdf_path: GCS path                                      │
   │   ↓                                                         │
   │ ProcessingJobModel.markAsCompleted()                        │
   │   ↓                                                         │
   │ Frontend polls /documents/:id for status updates            │
   └─────────────────────────────────────────────────────────────┘
```
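The polling loop in phase 8 can be sketched as below. The interval, attempt cap, and function names are assumptions for illustration; the real `documentService` wraps Axios and its exact API may differ:

```typescript
// Illustrative sketch of the frontend status-polling loop: fetch the
// document until it reaches a terminal status or attempts run out.
async function pollStatus(
  fetchDoc: () => Promise<{ status: string }>, // e.g. GET /documents/:id
  { intervalMs = 3000, maxAttempts = 100 } = {}
): Promise<string> {
  for (let i = 0; i < maxAttempts; i++) {
    const { status } = await fetchDoc();
    if (status === 'completed' || status === 'failed') return status;
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error('Polling timed out');
}
```

Because `confirmUpload()` returns 202 immediately, the client never blocks on processing; it only observes status transitions written by the background job.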

### Key Handoff Points

**1. Upload to Job Creation**
```138:202:backend/src/controllers/documentController.ts
async confirmUpload(req: Request, res: Response): Promise<void> {
  // ... validation ...

  // Update status to processing
  await DocumentModel.updateById(documentId, {
    status: 'processing_llm'
  });

  // Acknowledge the request immediately
  res.status(202).json({
    message: 'Upload confirmed, processing has started.',
    document: document,
    status: 'processing'
  });

  // CRITICAL FIX: Use database-backed job queue
  const { ProcessingJobModel } = await import('../models/ProcessingJobModel');
  await ProcessingJobModel.create({
    document_id: documentId,
    user_id: userId,
    options: {
      fileName: document.original_file_name,
      mimeType: 'application/pdf'
    }
  });
}
```

**2. Job Processing to Document Processing**
```109:200:backend/src/services/jobProcessorService.ts
private async processJob(jobId: string): Promise<{ success: boolean; error?: string }> {
  // Get job details
  job = await ProcessingJobModel.findById(jobId);

  // Mark job as processing
  await ProcessingJobModel.markAsProcessing(jobId);

  // Download file from GCS
  const fileBuffer = await fileStorageService.downloadFile(document.file_path);

  // Process document
  const result = await unifiedDocumentProcessor.processDocument(
    job.document_id,
    job.user_id,
    fileBuffer.toString('utf-8'), // This will be re-read as buffer
    {
      fileBuffer,
      fileName: job.options?.fileName || 'document.pdf',
      mimeType: job.options?.mimeType || 'application/pdf'
    }
  );
}
```

**3. Document Processing to Text Extraction**
```50:80:backend/src/services/documentAiProcessor.ts
async processDocument(
  documentId: string,
  userId: string,
  fileBuffer: Buffer,
  fileName: string,
  mimeType: string
): Promise<ProcessingResult> {
  // Step 1: Extract text using Document AI or fallback
  const extractedText = await this.extractTextFromDocument(fileBuffer, fileName, mimeType);

  // Step 2: Process extracted text through Agentic RAG
  const agenticRagResult = await this.processWithAgenticRAG(documentId, extractedText);
}
```

**4. Text to Chunking**
```40:109:backend/src/services/optimizedAgenticRAGProcessor.ts
async processLargeDocument(
  documentId: string,
  text: string,
  options: {
    enableSemanticChunking?: boolean;
    enableMetadataEnrichment?: boolean;
    similarityThreshold?: number;
  } = {}
): Promise<ProcessingResult> {
  // Step 1: Create intelligent chunks with semantic boundaries
  const chunks = await this.createIntelligentChunks(text, documentId, options.enableSemanticChunking);

  // Step 2: Process chunks in batches to manage memory
  const processedChunks = await this.processChunksInBatches(chunks, documentId, options);

  // Step 3: Store chunks with optimized batching
  const embeddingApiCalls = await this.storeChunksOptimized(processedChunks, documentId);

  // Step 4: Generate LLM analysis using HYBRID approach
  const llmResult = await this.generateLLMAnalysisHybrid(documentId, text, processedChunks);
}
```

---

## 5. Core Services Deep Dive

### 5.1 UnifiedDocumentProcessor

**File**: `backend/src/services/unifiedDocumentProcessor.ts`
**Purpose**: Main orchestrator for document processing strategies

**Key Method**:
```123:143:backend/src/services/unifiedDocumentProcessor.ts
async processDocument(
  documentId: string,
  userId: string,
  text: string,
  options: any = {}
): Promise<ProcessingResult> {
  const strategy = options.strategy || 'document_ai_agentic_rag';

  logger.info('Processing document with unified processor', {
    documentId,
    strategy,
    textLength: text.length
  });

  // Only support document_ai_agentic_rag strategy
  if (strategy === 'document_ai_agentic_rag') {
    return await this.processWithDocumentAiAgenticRag(documentId, userId, text, options);
  } else {
    throw new Error(`Unsupported processing strategy: ${strategy}. Only 'document_ai_agentic_rag' is supported.`);
  }
}
```

**Dependencies**:
- `documentAiProcessor` - Text extraction
- `optimizedAgenticRAGProcessor` - AI processing
- `llmService` - LLM interactions
- `pdfGenerationService` - PDF generation

**Error Handling**: Wraps errors with detailed context, validates analysisData presence

### 5.2 OptimizedAgenticRAGProcessor

**File**: `backend/src/services/optimizedAgenticRAGProcessor.ts` (1885 lines)
**Purpose**: Core AI processing engine for chunking, embeddings, and LLM analysis

**Key Configuration**:
```32:35:backend/src/services/optimizedAgenticRAGProcessor.ts
private readonly maxChunkSize = 4000;          // Optimal chunk size for embeddings
private readonly overlapSize = 200;            // Overlap between chunks
private readonly maxConcurrentEmbeddings = 5;  // Limit concurrent API calls
private readonly batchSize = 10;               // Process chunks in batches
```
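A minimal sketch of what these settings imply for plain fixed-size chunking; the real `createIntelligentChunks()` additionally snaps boundaries to paragraphs and sections:

```typescript
// Fixed-size chunking with overlap: each chunk starts
// (maxChunkSize - overlapSize) characters after the previous one, so
// consecutive chunks share an overlapSize-character window of text.
function chunkText(text: string, maxChunkSize = 4000, overlapSize = 200): string[] {
  const chunks: string[] = [];
  const step = maxChunkSize - overlapSize; // 3800 with the defaults above
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + maxChunkSize));
    if (start + maxChunkSize >= text.length) break; // last chunk reached the end
  }
  return chunks;
}
```

The overlap exists so that a sentence falling on a chunk boundary is fully present in at least one chunk, which matters for both embedding quality and retrieval.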

**Key Methods**:
- `processLargeDocument()` - Main entry point
- `createIntelligentChunks()` - Semantic chunking with boundary detection
- `processChunksInBatches()` - Batch processing for memory efficiency
- `storeChunksOptimized()` - Embedding generation and storage
- `generateLLMAnalysisHybrid()` - LLM analysis with vector search

**Performance Optimizations**:
- Semantic boundary detection (paragraphs, sections)
- Batch processing to limit memory usage
- Concurrent embedding generation (max 5)
- Vector search with document_id filtering
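The batching-with-concurrency behavior (`batchSize = 10`, `maxConcurrentEmbeddings = 5`) can be approximated with a generic worker-pool helper. This is an illustrative sketch, not the service's actual implementation:

```typescript
// Map over items with at most `limit` promises in flight at once.
// Results are returned in input order regardless of completion order.
async function mapWithLimit<T, R>(
  items: T[],
  limit: number,
  fn: (item: T) => Promise<R>
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let nextIndex = 0;

  // Each worker repeatedly claims the next unprocessed index.
  async function worker(): Promise<void> {
    while (nextIndex < items.length) {
      const i = nextIndex++;
      results[i] = await fn(items[i]);
    }
  }

  await Promise.all(
    Array.from({ length: Math.min(limit, items.length) }, worker)
  );
  return results;
}
```

Capping concurrency this way keeps memory bounded and avoids tripping the embedding API's rate limits while still overlapping network latency.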
### 5.3 JobProcessorService

**File**: `backend/src/services/jobProcessorService.ts`
**Purpose**: Database-backed job processor (replaces legacy in-memory queue)

**Key Method**:
```15:97:backend/src/services/jobProcessorService.ts
async processJobs(): Promise<{
  processed: number;
  succeeded: number;
  failed: number;
  skipped: number;
}> {
  // Prevent concurrent processing runs
  if (this.isProcessing) {
    logger.info('Job processor already running, skipping this run');
    return { processed: 0, succeeded: 0, failed: 0, skipped: 0 };
  }

  this.isProcessing = true;
  const stats = { processed: 0, succeeded: 0, failed: 0, skipped: 0 };

  try {
    // Reset stuck jobs first
    const resetCount = await ProcessingJobModel.resetStuckJobs(this.JOB_TIMEOUT_MINUTES);

    // Get pending jobs
    const pendingJobs = await ProcessingJobModel.getPendingJobs(this.MAX_CONCURRENT_JOBS);

    // Get retrying jobs
    const retryingJobs = await ProcessingJobModel.getRetryableJobs(
      Math.max(0, this.MAX_CONCURRENT_JOBS - pendingJobs.length)
    );

    const allJobs = [...pendingJobs, ...retryingJobs];

    // Process jobs in parallel (up to MAX_CONCURRENT_JOBS)
    const results = await Promise.allSettled(
      allJobs.map((job) => this.processJob(job.id))
    );
```

**Configuration**:
- `MAX_CONCURRENT_JOBS = 3`
- `JOB_TIMEOUT_MINUTES = 15`

**Features**:
- Stuck job detection and recovery
- Retry logic with exponential backoff
- Parallel processing with concurrency limit
- Database-backed state management
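The exponential backoff can be sketched as a doubling delay with a cap. The base delay and cap below are assumed values for illustration, not constants taken from `ProcessingJobModel`:

```typescript
// Exponential backoff: delay doubles with each attempt, capped so a
// repeatedly failing job is still retried at a bounded interval.
// baseMs and capMs are illustrative defaults, not the real config.
function retryDelayMs(
  attempt: number,           // 1-based retry attempt number
  baseMs = 60_000,           // assumed 1-minute base delay
  capMs = 15 * 60_000        // assumed 15-minute ceiling
): number {
  return Math.min(baseMs * 2 ** (attempt - 1), capMs);
}
```

In the real service the computed `next_retry_at` lives in the `processing_jobs` row, so the schedule survives function restarts; the scheduled `processDocumentJobs` run simply picks up any job whose retry time has passed.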
### 5.4 VectorDatabaseService

**File**: `backend/src/services/vectorDatabaseService.ts`
**Purpose**: Vector embeddings and similarity search

**Key Method - Vector Search**:
```88:150:backend/src/services/vectorDatabaseService.ts
async searchSimilar(
  embedding: number[],
  limit: number = 10,
  threshold: number = 0.7,
  documentId?: string
): Promise<VectorSearchResult[]> {
  try {
    if (this.provider === 'supabase') {
      // Use optimized Supabase vector search function with document_id filtering
      // This prevents timeouts by only searching within a specific document
      const rpcParams: any = {
        query_embedding: embedding,
        match_threshold: threshold,
        match_count: limit
      };

      // Add document_id filter if provided (critical for performance)
      if (documentId) {
        rpcParams.filter_document_id = documentId;
      }

      // Set a timeout for the RPC call (10 seconds)
      const searchPromise = this.supabaseClient
        .rpc('match_document_chunks', rpcParams);

      const timeoutPromise = new Promise<{ data: null; error: { message: string } }>((_, reject) => {
        setTimeout(() => reject(new Error('Vector search timeout after 10s')), 10000);
      });

      let result: any;
      try {
        result = await Promise.race([searchPromise, timeoutPromise]);
      } catch (timeoutError: any) {
        if (timeoutError.message?.includes('timeout')) {
          logger.error('Vector search timed out', { documentId, timeout: '10s' });
          throw new Error('Vector search timeout after 10s');
        }
        throw timeoutError;
      }
```

**Critical Optimization**: Always pass `documentId` to filter search scope and prevent timeouts

**SQL Function**: `backend/sql/fix_vector_search_timeout.sql`
```10:39:backend/sql/fix_vector_search_timeout.sql
CREATE OR REPLACE FUNCTION match_document_chunks (
  query_embedding vector(1536),
  match_threshold float,
  match_count int,
  filter_document_id text DEFAULT NULL
)
RETURNS TABLE (
  id UUID,
  document_id TEXT,
  content text,
  metadata JSONB,
  chunk_index INT,
  similarity float
)
LANGUAGE sql STABLE
AS $$
  SELECT
    document_chunks.id,
    document_chunks.document_id,
    document_chunks.content,
    document_chunks.metadata,
    document_chunks.chunk_index,
    1 - (document_chunks.embedding <=> query_embedding) AS similarity
  FROM document_chunks
  WHERE document_chunks.embedding IS NOT NULL
    AND (filter_document_id IS NULL OR document_chunks.document_id = filter_document_id)
    AND 1 - (document_chunks.embedding <=> query_embedding) > match_threshold
  ORDER BY document_chunks.embedding <=> query_embedding
  LIMIT match_count;
$$;
```
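The `<=>` operator in the function above is pgvector's cosine distance, so the reported `similarity` is `1 - distance`. The same quantity, computed directly:

```typescript
// Cosine similarity: dot(a, b) / (|a| * |b|). Ranges from -1 to 1;
// the SQL threshold (e.g. 0.7) filters on exactly this value.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

Identical directions score 1, orthogonal vectors score 0, which is why `match_threshold: 0.7` reads as "fairly close in embedding space".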

### 5.5 LLMService

**File**: `backend/src/services/llmService.ts`
**Purpose**: LLM interactions (Claude/OpenAI/OpenRouter)

**Provider Selection**:
```43:103:backend/src/services/llmService.ts
constructor() {
  // Read provider from config (supports openrouter, anthropic, openai)
  this.provider = config.llm.provider;

  // CRITICAL: If provider is not set correctly, log and use fallback
  if (!this.provider || (this.provider !== 'openrouter' && this.provider !== 'anthropic' && this.provider !== 'openai')) {
    logger.error('LLM provider is invalid or not set', {
      provider: this.provider,
      configProvider: config.llm.provider,
      processEnvProvider: process.env['LLM_PROVIDER'],
      defaultingTo: 'anthropic'
    });
    this.provider = 'anthropic'; // Fallback
  }

  // Set API key based on provider
  if (this.provider === 'openai') {
    this.apiKey = config.llm.openaiApiKey!;
  } else if (this.provider === 'openrouter') {
    // OpenRouter: Use OpenRouter key if provided, otherwise use Anthropic key for BYOK
    this.apiKey = config.llm.openrouterApiKey || config.llm.anthropicApiKey!;
  } else {
    this.apiKey = config.llm.anthropicApiKey!;
  }

  // Use configured model instead of hardcoded value
  this.defaultModel = config.llm.model;
  this.maxTokens = config.llm.maxTokens;
  this.temperature = config.llm.temperature;
}
```

**Key Method**:
```108:148:backend/src/services/llmService.ts
async processCIMDocument(text: string, template: string, analysis?: Record<string, any>): Promise<CIMAnalysisResult> {
  // Check and truncate text if it exceeds maxInputTokens
  const maxInputTokens = config.llm.maxInputTokens || 200000;
  const systemPromptTokens = this.estimateTokenCount(this.getCIMSystemPrompt());
  const templateTokens = this.estimateTokenCount(template);
  const promptBuffer = config.llm.promptBuffer || 1000;

  // Calculate available tokens for document text
  const reservedTokens = systemPromptTokens + templateTokens + promptBuffer + (config.llm.maxTokens || 16000);
  const availableTokens = maxInputTokens - reservedTokens;

  const textTokens = this.estimateTokenCount(text);
  let processedText = text;
  let wasTruncated = false;

  if (textTokens > availableTokens) {
    logger.warn('Document text exceeds token limit, truncating', {
      textTokens,
      availableTokens,
      maxInputTokens,
      reservedTokens,
      truncationRatio: (availableTokens / textTokens * 100).toFixed(1) + '%'
    });

    processedText = this.truncateText(text, availableTokens);
    wasTruncated = true;
  }
```

**Features**:
- Automatic token counting and truncation
- Model selection based on task complexity
- JSON schema validation with Zod
- Retry logic with exponential backoff
- Cost tracking
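The token-budget arithmetic in `processCIMDocument()` reduces to the sketch below. The ~4 characters/token heuristic is an assumption commonly used for rough estimates; the real `estimateTokenCount()` may use a different rule:

```typescript
// Token budgeting sketch: reserve tokens for system prompt, template,
// buffer, and the model's output, then truncate the document to fit.
// CHARS_PER_TOKEN is an assumed heuristic, not the service's constant.
const CHARS_PER_TOKEN = 4;

function estimateTokenCount(text: string): number {
  return Math.ceil(text.length / CHARS_PER_TOKEN);
}

function truncateToBudget(
  text: string,
  maxInputTokens: number,  // e.g. 200000
  reservedTokens: number   // system prompt + template + buffer + maxTokens
): { text: string; wasTruncated: boolean } {
  const availableTokens = maxInputTokens - reservedTokens;
  if (estimateTokenCount(text) <= availableTokens) {
    return { text, wasTruncated: false };
  }
  // Convert the token budget back to a character budget and cut there.
  return {
    text: text.slice(0, availableTokens * CHARS_PER_TOKEN),
    wasTruncated: true
  };
}
```

Note that output tokens (`maxTokens`, default 16000 above) count against the reservation: a model that is allowed a long answer leaves correspondingly less room for input.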
### 5.6 DocumentAiProcessor

**File**: `backend/src/services/documentAiProcessor.ts`
**Purpose**: Google Document AI integration for text extraction

**Key Method**:
```50:146:backend/src/services/documentAiProcessor.ts
async processDocument(
  documentId: string,
  userId: string,
  fileBuffer: Buffer,
  fileName: string,
  mimeType: string
): Promise<ProcessingResult> {
  const startTime = Date.now();

  try {
    logger.info('Starting Document AI + Agentic RAG processing', {
      documentId,
      userId,
      fileName,
      fileSize: fileBuffer.length,
      mimeType
    });

    // Step 1: Extract text using Document AI or fallback
    const extractedText = await this.extractTextFromDocument(fileBuffer, fileName, mimeType);

    if (!extractedText) {
      throw new Error('Failed to extract text from document');
    }

    logger.info('Text extraction completed', {
      textLength: extractedText.length
    });

    // Step 2: Process extracted text through Agentic RAG
    const agenticRagResult = await this.processWithAgenticRAG(documentId, extractedText);

    const processingTime = Date.now() - startTime;

    return {
      success: true,
      content: agenticRagResult.summary || extractedText,
      metadata: {
        processingStrategy: 'document_ai_agentic_rag',
        processingTime,
        extractedTextLength: extractedText.length,
        agenticRagResult,
        fileSize: fileBuffer.length,
        fileName,
        mimeType
      }
    };
```

**Fallback Strategy**: Uses `pdf-parse` if Document AI fails

### 5.7 PDFGenerationService
|
|
|
|
**File**: `backend/src/services/pdfGenerationService.ts`
|
|
**Purpose**: PDF generation using Puppeteer
|
|
|
|
**Key Features**:
|
|
- Page pooling for performance
|
|
- Caching for repeated requests
|
|
- Browser instance reuse
|
|
- Fallback to PDFKit if Puppeteer fails
|
|
|
|
**Configuration**:
|
|
```65:85:backend/src/services/pdfGenerationService.ts
class PDFGenerationService {
  private browser: any = null;
  private pagePool: PagePool[] = [];
  private readonly maxPoolSize = 5;
  private readonly pageTimeout = 30000; // 30 seconds
  private readonly cache = new Map<string, { buffer: Buffer; timestamp: number }>();
  private readonly cacheTimeout = 300000; // 5 minutes

  private readonly defaultOptions: PDFGenerationOptions = {
    format: 'A4',
    margin: {
      top: '1in',
      right: '1in',
      bottom: '1in',
      left: '1in',
    },
    displayHeaderFooter: true,
    printBackground: true,
    quality: 'high',
    timeout: 30000,
  };
```

### 5.8 FileStorageService

**File**: `backend/src/services/fileStorageService.ts`
**Purpose**: Google Cloud Storage operations

**Key Methods**:
- `generateSignedUploadUrl()` - Generate signed URL for direct upload
- `downloadFile()` - Download file from GCS
- `saveBuffer()` - Save buffer to GCS
- `deleteFile()` - Delete file from GCS

**Credential Handling**:
```40:145:backend/src/services/fileStorageService.ts
constructor() {
  this.bucketName = config.googleCloud.gcsBucketName;

  // Check if we're in Firebase Functions/Cloud Run environment
  const isCloudEnvironment = process.env.FUNCTION_TARGET ||
    process.env.FUNCTION_NAME ||
    process.env.K_SERVICE ||
    process.env.GOOGLE_CLOUD_PROJECT ||
    !!process.env.GCLOUD_PROJECT ||
    process.env.X_GOOGLE_GCLOUD_PROJECT;

  // Initialize Google Cloud Storage
  const storageConfig: any = {
    projectId: config.googleCloud.projectId,
  };

  // Only use keyFilename in local development
  // In Firebase Functions/Cloud Run, use Application Default Credentials
  if (isCloudEnvironment) {
    // In cloud, ALWAYS clear GOOGLE_APPLICATION_CREDENTIALS to force use of ADC
    // Firebase Functions automatically provides credentials via metadata service
    // These credentials have signing capabilities for generating signed URLs
    const originalCreds = process.env.GOOGLE_APPLICATION_CREDENTIALS;
    if (originalCreds) {
      delete process.env.GOOGLE_APPLICATION_CREDENTIALS;
      logger.info('Using Application Default Credentials for GCS (cloud environment)', {
        clearedEnvVar: 'GOOGLE_APPLICATION_CREDENTIALS',
        originalValue: originalCreds,
        projectId: config.googleCloud.projectId
      });
    }
```

---

## 6. Data Models & Database Schema

### Core Models

**DocumentModel** (`backend/src/models/DocumentModel.ts`):
- `create()` - Create document record
- `findById()` - Get document by ID
- `updateById()` - Update document status/metadata
- `findByUserId()` - List user's documents

**ProcessingJobModel** (`backend/src/models/ProcessingJobModel.ts`):
- `create()` - Create processing job (uses direct PostgreSQL to bypass the PostgREST cache)
- `findById()` - Get job by ID
- `getPendingJobs()` - Get pending jobs (limited by concurrency)
- `getRetryableJobs()` - Get jobs ready for retry
- `markAsProcessing()` - Update job status
- `markAsCompleted()` - Mark job complete
- `markAsFailed()` - Mark job failed with error
- `resetStuckJobs()` - Reset jobs stuck in processing

**VectorDatabaseModel** (`backend/src/models/VectorDatabaseModel.ts`):
- Chunk storage and retrieval
- Embedding management

### Database Tables

**documents**:
- `id` (UUID, primary key)
- `user_id` (UUID, foreign key)
- `original_file_name` (text)
- `file_path` (text, GCS path)
- `file_size` (bigint)
- `status` (text: 'uploading', 'uploaded', 'processing_llm', 'completed', 'failed')
- `pdf_path` (text, GCS path for generated PDF)
- `created_at`, `updated_at` (timestamps)

**processing_jobs**:
- `id` (UUID, primary key)
- `document_id` (UUID, foreign key)
- `user_id` (UUID, foreign key)
- `status` (text: 'pending', 'processing', 'completed', 'failed', 'retrying')
- `attempts` (int)
- `max_attempts` (int, default 3)
- `options` (JSONB, processing options)
- `error` (text, error message if failed)
- `result` (JSONB, processing result)
- `created_at`, `started_at`, `completed_at`, `updated_at` (timestamps)

**document_chunks**:
- `id` (UUID, primary key)
- `document_id` (text, foreign key)
- `content` (text)
- `embedding` (vector(1536))
- `metadata` (JSONB)
- `chunk_index` (int)
- `created_at`, `updated_at` (timestamps)

**agentic_rag_sessions**:
- `id` (UUID, primary key)
- `document_id` (UUID, foreign key)
- `user_id` (UUID, foreign key)
- `status` (text)
- `metadata` (JSONB)
- `created_at`, `updated_at` (timestamps)

### Vector Search Optimization

**Critical SQL Function**: `match_document_chunks` with `document_id` filtering
```10:39:backend/sql/fix_vector_search_timeout.sql
CREATE OR REPLACE FUNCTION match_document_chunks (
  query_embedding vector(1536),
  match_threshold float,
  match_count int,
  filter_document_id text DEFAULT NULL
)
RETURNS TABLE (
  id UUID,
  document_id TEXT,
  content text,
  metadata JSONB,
  chunk_index INT,
  similarity float
)
LANGUAGE sql STABLE
AS $$
  SELECT
    document_chunks.id,
    document_chunks.document_id,
    document_chunks.content,
    document_chunks.metadata,
    document_chunks.chunk_index,
    1 - (document_chunks.embedding <=> query_embedding) AS similarity
  FROM document_chunks
  WHERE document_chunks.embedding IS NOT NULL
    AND (filter_document_id IS NULL OR document_chunks.document_id = filter_document_id)
    AND 1 - (document_chunks.embedding <=> query_embedding) > match_threshold
  ORDER BY document_chunks.embedding <=> query_embedding
  LIMIT match_count;
$$;
```

**Always pass `filter_document_id`** to prevent timeouts when searching across all documents.
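A sketch of the calling pattern, assuming a configured Supabase client. `buildMatchParams` is a hypothetical helper; the parameter names mirror the SQL function above.

```typescript
// Hypothetical helper illustrating the pattern of always scoping the
// RPC call to one document; only the RPC and parameter names come from
// the actual codebase.
interface MatchParams {
  query_embedding: number[];
  match_threshold: number;
  match_count: number;
  filter_document_id?: string;
}

function buildMatchParams(
  queryEmbedding: number[],
  documentId?: string,
  threshold = 0.5,
  count = 10
): MatchParams {
  const params: MatchParams = {
    query_embedding: queryEmbedding,
    match_threshold: threshold,
    match_count: count,
  };
  // Critical for performance: scope the search to a single document
  if (documentId) {
    params.filter_document_id = documentId;
  }
  return params;
}

// Usage (assuming a configured Supabase client):
// const { data, error } = await supabase.rpc('match_document_chunks',
//   buildMatchParams(embedding, documentId));
```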

---

## 7. Component Handoffs & Integration Points

### Frontend ↔ Backend

**Axios Interceptor** (`frontend/src/services/documentService.ts`):
```8:54:frontend/src/services/documentService.ts
export const apiClient = axios.create({
  baseURL: API_BASE_URL,
  timeout: 300000, // 5 minutes
});

// Add auth token to requests
apiClient.interceptors.request.use(async (config) => {
  const token = await authService.getToken();
  if (token) {
    config.headers.Authorization = `Bearer ${token}`;
  }
  return config;
});

// Handle auth errors with retry logic
apiClient.interceptors.response.use(
  (response) => response,
  async (error) => {
    const originalRequest = error.config;

    if (error.response?.status === 401 && !originalRequest._retry) {
      originalRequest._retry = true;

      try {
        // Attempt to refresh the token
        const newToken = await authService.getToken();
        if (newToken) {
          // Retry the original request with the new token
          originalRequest.headers.Authorization = `Bearer ${newToken}`;
          return apiClient(originalRequest);
        }
      } catch (refreshError) {
        console.error('Token refresh failed:', refreshError);
      }

      // If token refresh fails, logout the user
      authService.logout();
      window.location.href = '/login';
    }

    return Promise.reject(error);
  }
);
```

### Backend ↔ Database

**Two Connection Methods**:

1. **Supabase Client** (default for most operations):
```typescript
import { getSupabaseServiceClient } from '../config/supabase';
const supabase = getSupabaseServiceClient();
```

2. **Direct PostgreSQL** (for critical operations; bypasses the PostgREST cache):
```47:81:backend/src/models/ProcessingJobModel.ts
static async create(data: CreateProcessingJobData): Promise<ProcessingJob> {
  try {
    // Use direct PostgreSQL connection to bypass PostgREST cache
    // This is critical because PostgREST cache issues can block the entire processing pipeline
    const pool = getPostgresPool();

    const result = await pool.query(
      `INSERT INTO processing_jobs (
        document_id, user_id, status, attempts, max_attempts, options, created_at
      ) VALUES ($1, $2, $3, $4, $5, $6, $7)
      RETURNING *`,
      [
        data.document_id,
        data.user_id,
        'pending',
        0,
        data.max_attempts || 3,
        JSON.stringify(data.options || {}),
        new Date().toISOString()
      ]
    );

    if (result.rows.length === 0) {
      throw new Error('Failed to create processing job: No data returned');
    }

    const job = result.rows[0];

    logger.info('Processing job created via direct PostgreSQL', {
      jobId: job.id,
      documentId: data.document_id,
      userId: data.user_id,
    });

    return job;
```

### Backend ↔ GCS

**Signed URL Generation**:
```typescript
const uploadUrl = await fileStorageService.generateSignedUploadUrl(filePath, contentType);
```

**Direct Upload** (frontend):
```403:410:frontend/src/services/documentService.ts
const fetchPromise = fetch(uploadUrl, {
  method: 'PUT',
  headers: {
    'Content-Type': contentType, // Must match exactly what was used in signed URL generation
  },
  body: file,
  signal: signal,
});
```

**File Download** (for processing):
```typescript
const fileBuffer = await fileStorageService.downloadFile(document.file_path);
```

### Backend ↔ Document AI

**Text Extraction**:
```148:249:backend/src/services/documentAiProcessor.ts
private async extractTextFromDocument(fileBuffer: Buffer, fileName: string, mimeType: string): Promise<string> {
  try {
    // Check document size first
    // ... size validation ...

    // Upload to GCS for Document AI processing
    const gcsFileName = `temp/${Date.now()}_${fileName}`;
    await this.storageClient.bucket(this.gcsBucketName).file(gcsFileName).save(fileBuffer);

    // Process with Document AI
    const request = {
      name: this.processorName,
      rawDocument: {
        gcsSource: {
          uri: `gs://${this.gcsBucketName}/${gcsFileName}`
        },
        mimeType: mimeType
      }
    };

    const [result] = await this.documentAiClient.processDocument(request);

    // Extract text from result
    const text = result.document?.text || '';

    // Clean up temp file
    await this.storageClient.bucket(this.gcsBucketName).file(gcsFileName).delete();

    return text;
  } catch (error) {
    // Fallback to pdf-parse
    logger.warn('Document AI failed, using pdf-parse fallback', { error });
    const data = await pdf(fileBuffer);
    return data.text;
  }
}
```

### Backend ↔ LLM APIs

**Provider Selection** (Claude/OpenAI/OpenRouter):
- Configured via the `LLM_PROVIDER` environment variable
- Automatic API key selection based on the provider
- Model selection based on task complexity

**Request Flow**:
```typescript
// 1. Token counting and truncation
const processedText = this.truncateText(text, availableTokens);

// 2. Model selection
const model = this.selectModel(taskComplexity);

// 3. API call with retry logic
const response = await this.callLLMAPI({
  prompt: processedText,
  systemPrompt: systemPrompt,
  model: model,
  maxTokens: this.maxTokens,
  temperature: this.temperature
});

// 4. JSON parsing and validation
const parsed = JSON.parse(response.content);
const validated = cimReviewSchema.parse(parsed);
```

### Services ↔ Services

**Event-Driven Patterns**:
- `jobQueueService` emits events: `job:added`, `job:started`, `job:completed`, `job:failed`
- `uploadMonitoringService` tracks upload events

**Direct Method Calls**:
- Most service interactions are direct method calls
- Services are exported as singletons for easy access

---

## 8. Error Handling & Resilience

### Error Propagation Path

```
Service Method
  │
  ▼ (throws error)
Controller
  │
  ▼ (catches, logs, re-throws)
Express Error Handler
  │
  ▼ (categorizes, logs, responds)
Client (structured error response)
```

### Error Categories

**File**: `backend/src/middleware/errorHandler.ts`
```17:26:backend/src/middleware/errorHandler.ts
export enum ErrorCategory {
  VALIDATION = 'validation',
  AUTHENTICATION = 'authentication',
  AUTHORIZATION = 'authorization',
  NOT_FOUND = 'not_found',
  EXTERNAL_SERVICE = 'external_service',
  PROCESSING = 'processing',
  SYSTEM = 'system',
  DATABASE = 'database'
}
```

**Error Response Structure**:
```29:39:backend/src/middleware/errorHandler.ts
export interface ErrorResponse {
  success: false;
  error: {
    code: string;
    message: string;
    details?: any;
    correlationId: string;
    timestamp: string;
    retryable: boolean;
  };
}
```
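A minimal sketch of how a handler might map errors into this response shape. `AppError`, the category subset, and the category-to-status mapping are assumptions for illustration, not the actual implementation in `errorHandler.ts`.

```typescript
// Illustrative mapping from a categorized error to the ErrorResponse
// shape; AppError and STATUS_BY_CATEGORY are hypothetical.
type Category = 'validation' | 'authentication' | 'not_found' | 'system';

class AppError extends Error {
  constructor(
    message: string,
    public category: Category = 'system',
    public retryable = false
  ) {
    super(message);
  }
}

const STATUS_BY_CATEGORY: Record<Category, number> = {
  validation: 400,
  authentication: 401,
  not_found: 404,
  system: 500,
};

function toErrorResponse(err: AppError, correlationId: string) {
  return {
    status: STATUS_BY_CATEGORY[err.category],
    body: {
      success: false as const,
      error: {
        code: err.category.toUpperCase(),
        message: err.message,
        correlationId,
        timestamp: new Date().toISOString(),
        retryable: err.retryable,
      },
    },
  };
}
```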

### Retry Mechanisms

**1. Job Retries**:
- Max attempts: 3 (configurable per job)
- Exponential backoff between retries
- Jobs are marked with 'retrying' status

**2. API Retries**:
- LLM API calls: 3 retries with exponential backoff
- Document AI: falls back to pdf-parse
- Vector search: 10-second timeout, then fallback to a direct query

**3. Database Retries**:
```10:46:backend/src/models/DocumentModel.ts
private static async retryOperation<T>(
  operation: () => Promise<T>,
  operationName: string,
  maxRetries: number = 3,
  baseDelay: number = 1000
): Promise<T> {
  let lastError: any;

  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    try {
      return await operation();
    } catch (error: any) {
      lastError = error;
      const isNetworkError = error?.message?.includes('fetch failed') ||
        error?.message?.includes('ENOTFOUND') ||
        error?.message?.includes('ECONNREFUSED') ||
        error?.message?.includes('ETIMEDOUT') ||
        error?.name === 'TypeError';

      if (!isNetworkError || attempt === maxRetries) {
        throw error;
      }

      const delay = baseDelay * Math.pow(2, attempt - 1);
      logger.warn(`${operationName} failed (attempt ${attempt}/${maxRetries}), retrying in ${delay}ms`, {
        error: error?.message || String(error),
        code: error?.code,
        attempt,
        maxRetries
      });

      await new Promise(resolve => setTimeout(resolve, delay));
    }
  }

  throw lastError;
}
```

### Timeout Handling

**Vector Search Timeout**:
```109:126:backend/src/services/vectorDatabaseService.ts
// Set a timeout for the RPC call (10 seconds)
const searchPromise = this.supabaseClient
  .rpc('match_document_chunks', rpcParams);

const timeoutPromise = new Promise<{ data: null; error: { message: string } }>((_, reject) => {
  setTimeout(() => reject(new Error('Vector search timeout after 10s')), 10000);
});

let result: any;
try {
  result = await Promise.race([searchPromise, timeoutPromise]);
} catch (timeoutError: any) {
  if (timeoutError.message?.includes('timeout')) {
    logger.error('Vector search timed out', { documentId, timeout: '10s' });
    throw new Error('Vector search timeout after 10s');
  }
  throw timeoutError;
}
```

**LLM API Timeout**: Handled by the axios timeout configuration

**Job Timeout**: 15 minutes; jobs stuck longer are reset

### Stuck Job Detection and Recovery
```34:37:backend/src/services/jobProcessorService.ts
// Reset stuck jobs first
const resetCount = await ProcessingJobModel.resetStuckJobs(this.JOB_TIMEOUT_MINUTES);
if (resetCount > 0) {
  logger.info('Reset stuck jobs', { count: resetCount });
}
```

**Scheduled Function Monitoring**:
```228:246:backend/src/index.ts
// Check for jobs stuck in processing status
const stuckProcessingJobs = await ProcessingJobModel.getStuckJobs(15); // Jobs stuck > 15 minutes
if (stuckProcessingJobs.length > 0) {
  logger.warn('Found stuck processing jobs', {
    count: stuckProcessingJobs.length,
    jobIds: stuckProcessingJobs.map(j => j.id),
    timestamp: new Date().toISOString(),
  });
}

// Check for jobs stuck in pending status (alert if > 2 minutes)
const stuckPendingJobs = await ProcessingJobModel.getStuckPendingJobs(2); // Jobs pending > 2 minutes
if (stuckPendingJobs.length > 0) {
  logger.warn('Found stuck pending jobs (may indicate processing issues)', {
    count: stuckPendingJobs.length,
    jobIds: stuckPendingJobs.map(j => j.id),
    oldestJobAge: stuckPendingJobs[0] ? Math.round((Date.now() - new Date(stuckPendingJobs[0].created_at).getTime()) / 1000 / 60) : 0,
    timestamp: new Date().toISOString(),
  });
}
```

### Graceful Degradation

**Document AI Failure**: Falls back to the `pdf-parse` library

**Vector Search Failure**: Falls back to a direct database query without similarity calculation

**LLM API Failure**: Returns an error with a retryable flag; the job can be retried

**PDF Generation Failure**: Falls back to PDFKit if Puppeteer fails

---

## 9. Performance Optimization Points

### Vector Search Optimization

**Critical**: Always pass the `document_id` filter to prevent timeouts
```104:107:backend/src/services/vectorDatabaseService.ts
// Add document_id filter if provided (critical for performance)
if (documentId) {
  rpcParams.filter_document_id = documentId;
}
```

**SQL Function Optimization**: `match_document_chunks` filters by `document_id` before computing vector similarity

### Chunking Strategy

**Optimal Configuration**:
```32:35:backend/src/services/optimizedAgenticRAGProcessor.ts
private readonly maxChunkSize = 4000; // Optimal chunk size for embeddings
private readonly overlapSize = 200; // Overlap between chunks
private readonly maxConcurrentEmbeddings = 5; // Limit concurrent API calls
private readonly batchSize = 10; // Process chunks in batches
```

**Semantic Chunking**: Detects paragraph and section boundaries for better chunk quality
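A minimal sketch of overlap-based chunking with paragraph-boundary snapping, using the configured sizes; the actual implementation in `optimizedAgenticRAGProcessor` may differ.

```typescript
// Illustrative chunker: fixed-size windows with overlap, preferring to
// break at a paragraph boundary ('\n\n') in the second half of the window.
function chunkText(
  text: string,
  maxChunkSize = 4000,
  overlapSize = 200
): string[] {
  const chunks: string[] = [];
  let start = 0;
  while (start < text.length) {
    let end = Math.min(start + maxChunkSize, text.length);
    if (end < text.length) {
      // Prefer to break at a paragraph boundary within the window
      const boundary = text.lastIndexOf('\n\n', end);
      if (boundary > start + maxChunkSize / 2) {
        end = boundary;
      }
    }
    chunks.push(text.slice(start, end));
    if (end === text.length) break;
    // Step back by the overlap so adjacent chunks share context
    start = end - overlapSize;
  }
  return chunks;
}
```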

### Batch Processing

**Embedding Generation**:
- Processes chunks in batches of 10
- Max 5 concurrent embedding API calls
- Prevents memory overflow and API rate limiting

**Chunk Storage**:
- Batched database inserts
- Reduces database round trips
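The batching pattern above can be sketched as follows; `embedInBatches` and `embedFn` are illustrative names, not the service's actual API.

```typescript
// Illustrative batched processing with a concurrency cap: items are
// grouped into batches, and within a batch at most maxConcurrent
// embedding calls run at a time.
async function embedInBatches<T, R>(
  items: T[],
  embedFn: (item: T) => Promise<R>,
  batchSize = 10,
  maxConcurrent = 5
): Promise<R[]> {
  const results: R[] = [];
  for (let i = 0; i < items.length; i += batchSize) {
    const batch = items.slice(i, i + batchSize);
    // Within a batch, run at most maxConcurrent calls in parallel
    for (let j = 0; j < batch.length; j += maxConcurrent) {
      const window = batch.slice(j, j + maxConcurrent);
      const settled = await Promise.all(window.map(embedFn));
      results.push(...settled);
    }
  }
  return results;
}
```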
### Memory Management

**Chunk Processing**:
- Processes chunks in batches to limit memory usage
- Cleans up processed chunks from memory after storage

**PDF Generation**:
- Page pooling (max 5 pages)
- Page timeout (30 seconds)
- Cache with 5-minute TTL

### Database Optimization

**Direct PostgreSQL for Critical Operations**:
- Job creation uses direct PostgreSQL to bypass PostgREST cache issues
- Ensures reliable job creation even when the PostgREST schema cache is stale

**Connection Pooling**:
- The Supabase client uses connection pooling
- Direct PostgreSQL uses a pg pool

### API Call Optimization

**LLM Token Management**:
- Automatic token counting
- Text truncation if input exceeds limits
- Model selection based on complexity (smaller models for simpler tasks)

**Embedding Caching**:
```31:32:backend/src/services/vectorDatabaseService.ts
private semanticCache: Map<string, { embedding: number[]; timestamp: number }> = new Map();
private readonly CACHE_TTL = 3600000; // 1 hour cache TTL
```
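A minimal sketch of a TTL lookup over such a cache; `getCachedEmbedding` is an illustrative name, and lazy eviction on read is an assumption about the implementation.

```typescript
// Illustrative TTL cache lookup: return the cached embedding if fresh,
// otherwise evict the stale entry and report a miss.
const semanticCache = new Map<string, { embedding: number[]; timestamp: number }>();
const CACHE_TTL = 3600000; // 1 hour

function getCachedEmbedding(key: string, now = Date.now()): number[] | null {
  const entry = semanticCache.get(key);
  if (!entry) return null;
  if (now - entry.timestamp > CACHE_TTL) {
    semanticCache.delete(key); // Evict stale entries lazily
    return null;
  }
  return entry.embedding;
}
```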

---

## 10. Background Processing Architecture

### Legacy vs Current System

**Legacy: In-Memory Queue** (`jobQueueService`)
- EventEmitter-based
- In-memory job storage
- Still initialized, but being phased out
- Location: `backend/src/services/jobQueueService.ts`

**Current: Database-Backed Queue** (`jobProcessorService`)
- Database-backed job storage
- Scheduled processing via Firebase Cloud Scheduler
- Location: `backend/src/services/jobProcessorService.ts`

### Job Processing Flow

```
Job Creation
  │
  ▼
ProcessingJobModel.create()
  │
  ▼
Status: 'pending' in database
  │
  ▼
Scheduled Function (every 1 minute)
  OR
Immediate processing via API
  │
  ▼
JobProcessorService.processJobs()
  │
  ▼
Get pending/retrying jobs (max 3 concurrent)
  │
  ▼
Process jobs in parallel
  │
  ▼
For each job:
  - Mark as 'processing'
  - Download file from GCS
  - Call unifiedDocumentProcessor
  - Update document status
  - Mark job as 'completed' or 'failed'
```

### Scheduled Function

**File**: `backend/src/index.ts`
```210:267:backend/src/index.ts
export const processDocumentJobs = onSchedule({
  schedule: 'every 1 minutes', // Minimum interval for Firebase Cloud Scheduler
  timeoutSeconds: 900, // 15 minutes (max for Gen2 scheduled functions)
  memory: '1GiB',
  retryCount: 2, // Retry up to 2 times on failure
}, async (event) => {
  logger.info('Processing document jobs scheduled function triggered', {
    timestamp: new Date().toISOString(),
    scheduleTime: event.scheduleTime,
  });

  try {
    const { jobProcessorService } = await import('./services/jobProcessorService');

    // Check for stuck jobs before processing (monitoring)
    const { ProcessingJobModel } = await import('./models/ProcessingJobModel');

    // Check for jobs stuck in processing status
    const stuckProcessingJobs = await ProcessingJobModel.getStuckJobs(15); // Jobs stuck > 15 minutes
    if (stuckProcessingJobs.length > 0) {
      logger.warn('Found stuck processing jobs', {
        count: stuckProcessingJobs.length,
        jobIds: stuckProcessingJobs.map(j => j.id),
        timestamp: new Date().toISOString(),
      });
    }

    // Check for jobs stuck in pending status (alert if > 2 minutes)
    const stuckPendingJobs = await ProcessingJobModel.getStuckPendingJobs(2); // Jobs pending > 2 minutes
    if (stuckPendingJobs.length > 0) {
      logger.warn('Found stuck pending jobs (may indicate processing issues)', {
        count: stuckPendingJobs.length,
        jobIds: stuckPendingJobs.map(j => j.id),
        oldestJobAge: stuckPendingJobs[0] ? Math.round((Date.now() - new Date(stuckPendingJobs[0].created_at).getTime()) / 1000 / 60) : 0,
        timestamp: new Date().toISOString(),
      });
    }

    const result = await jobProcessorService.processJobs();

    logger.info('Document jobs processing completed', {
      ...result,
      timestamp: new Date().toISOString(),
    });
  } catch (error) {
    const errorMessage = error instanceof Error ? error.message : String(error);
    const errorStack = error instanceof Error ? error.stack : undefined;

    logger.error('Error processing document jobs', {
      error: errorMessage,
      stack: errorStack,
      timestamp: new Date().toISOString(),
    });

    // Re-throw to trigger retry mechanism (up to retryCount times)
    throw error;
  }
});
```

### Job States

```
pending → processing → completed
   │           │
   │           ▼
   │         failed
   │           │
   └───────────┘
         │
         ▼
      retrying
         │
         ▼
 (back to pending)
```
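The transitions in the diagram above can be expressed as a guard table. This is purely an illustration of the state machine; the actual enforcement happens inside the `ProcessingJobModel` methods, and the `canTransition` helper is hypothetical.

```typescript
// Illustrative transition guard for the job state machine; the table
// mirrors the diagram above.
const TRANSITIONS: Record<string, string[]> = {
  pending: ['processing'],
  processing: ['completed', 'failed'],
  failed: ['retrying'],
  retrying: ['pending'],
  completed: [],
};

function canTransition(from: string, to: string): boolean {
  return (TRANSITIONS[from] ?? []).includes(to);
}
```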

### Concurrency Control

**Max Concurrent Jobs**: 3

```9:10:backend/src/services/jobProcessorService.ts
private readonly MAX_CONCURRENT_JOBS = 3;
private readonly JOB_TIMEOUT_MINUTES = 15;
```

**Processing Logic**:
```40:63:backend/src/services/jobProcessorService.ts
// Get pending jobs
const pendingJobs = await ProcessingJobModel.getPendingJobs(this.MAX_CONCURRENT_JOBS);

// Get retrying jobs (enabled - schema is updated)
const retryingJobs = await ProcessingJobModel.getRetryableJobs(
  Math.max(0, this.MAX_CONCURRENT_JOBS - pendingJobs.length)
);

const allJobs = [...pendingJobs, ...retryingJobs];

if (allJobs.length === 0) {
  logger.debug('No jobs to process');
  return stats;
}

logger.info('Processing jobs', {
  totalJobs: allJobs.length,
  pendingJobs: pendingJobs.length,
  retryingJobs: retryingJobs.length,
});

// Process jobs in parallel (up to MAX_CONCURRENT_JOBS)
const results = await Promise.allSettled(
  allJobs.map((job) => this.processJob(job.id))
);
```

---

## 11. Frontend Architecture

### Component Structure

**Main Components**:
- `DocumentUpload` - File upload with drag-and-drop
- `DocumentList` - List of the user's documents with status
- `DocumentViewer` - View the processed document and PDF
- `Analytics` - Processing statistics dashboard
- `UploadMonitoringDashboard` - Real-time upload monitoring

### State Management

**AuthContext** (`frontend/src/contexts/AuthContext.tsx`):
```11:46:frontend/src/contexts/AuthContext.tsx
export const AuthProvider: React.FC<AuthProviderProps> = ({ children }) => {
  const [user, setUser] = useState<User | null>(null);
  const [token, setToken] = useState<string | null>(null);
  const [isLoading, setIsLoading] = useState(true);
  const [error, setError] = useState<string | null>(null);
  const [isInitialized, setIsInitialized] = useState(false);

  useEffect(() => {
    setIsLoading(true);

    // Listen for Firebase auth state changes
    const unsubscribe = authService.onAuthStateChanged(async (firebaseUser) => {
      try {
        if (firebaseUser) {
          const user = authService.getCurrentUser();
          const token = await authService.getToken();
          setUser(user);
          setToken(token);
        } else {
          setUser(null);
          setToken(null);
        }
      } catch (error) {
        console.error('Auth state change error:', error);
        setError('Authentication error occurred');
        setUser(null);
        setToken(null);
      } finally {
        setIsLoading(false);
        setIsInitialized(true);
      }
    });

    // Cleanup subscription on unmount
    return () => unsubscribe();
  }, []);
```

### API Communication

**Document Service** (`frontend/src/services/documentService.ts`):
- Axios client with an auth interceptor
- Automatic token refresh on 401 errors
- Progress tracking for uploads
- Error handling with user-friendly messages

**Upload Flow**:
```224:361:frontend/src/services/documentService.ts
async uploadDocument(
  file: File,
  onProgress?: (progress: number) => void,
  signal?: AbortSignal
): Promise<Document> {
  try {
    // Check authentication before upload
    const token = await authService.getToken();
    if (!token) {
      throw new Error('Authentication required. Please log in to upload documents.');
    }

    // Step 1: Get signed upload URL
    onProgress?.(5); // 5% - Getting upload URL

    const uploadUrlResponse = await apiClient.post('/documents/upload-url', {
      fileName: file.name,
      fileSize: file.size,
      contentType: contentTypeForSigning
    }, { signal });

    const { documentId, uploadUrl } = uploadUrlResponse.data;

    // Step 2: Upload directly to Firebase Storage
    onProgress?.(10); // 10% - Starting direct upload

    await this.uploadToFirebaseStorage(
      file,
      uploadUrl,
      contentTypeForSigning,
      (uploadProgress) => {
        // Map upload progress (10-90%)
        const mappedProgress = 10 + (uploadProgress * 0.8);
        onProgress?.(mappedProgress);
      },
      signal
    );

    // Step 3: Confirm upload
    onProgress?.(90); // 90% - Confirming upload

    const confirmResponse = await apiClient.post(
      `/documents/${documentId}/confirm-upload`,
      {},
      { signal }
    );

    onProgress?.(100); // 100% - Complete

    return confirmResponse.data.document;
  } catch (error) {
    // ... error handling ...
  }
}
```

### Real-Time Updates

**Polling for Processing Status**:
- The frontend polls the `/documents/:id` endpoint
- Updates the UI when the status changes from 'processing' to 'completed'
- Shows error messages if the status is 'failed'

**Upload Progress**:
- Real-time progress tracking via the `onProgress` callback
- Visual progress bar in the `DocumentUpload` component
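The status polling loop can be sketched as follows; `fetchStatus`, the 2-second interval, and the attempt cap are assumptions standing in for the real `GET /documents/:id` call.

```typescript
// Illustrative polling loop: keep fetching the document status until it
// reaches a terminal state or the attempt budget is exhausted.
type DocStatus = 'uploading' | 'uploaded' | 'processing_llm' | 'completed' | 'failed';

async function pollUntilDone(
  fetchStatus: () => Promise<DocStatus>,
  intervalMs = 2000,
  maxAttempts = 150
): Promise<DocStatus> {
  for (let i = 0; i < maxAttempts; i++) {
    const status = await fetchStatus();
    if (status === 'completed' || status === 'failed') {
      return status;
    }
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error('Polling timed out');
}
```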
---

## 12. Configuration & Environment

### Environment Variables

**File**: `backend/src/config/env.ts`

**Key Configuration Categories**:

1. **LLM Provider**:
   - `LLM_PROVIDER` - 'anthropic', 'openai', or 'openrouter'
   - `ANTHROPIC_API_KEY` - Claude API key
   - `OPENAI_API_KEY` - OpenAI API key
   - `OPENROUTER_API_KEY` - OpenRouter API key
   - `LLM_MODEL` - Model name (e.g., 'claude-sonnet-4-5-20250929')
   - `LLM_MAX_TOKENS` - Max output tokens
   - `LLM_MAX_INPUT_TOKENS` - Max input tokens (default 200000)

2. **Database**:
   - `SUPABASE_URL` - Supabase project URL
   - `SUPABASE_SERVICE_KEY` - Service role key
   - `SUPABASE_ANON_KEY` - Anonymous key

3. **Google Cloud**:
   - `GCLOUD_PROJECT_ID` - GCP project ID
   - `GCS_BUCKET_NAME` - Storage bucket name
   - `DOCUMENT_AI_PROCESSOR_ID` - Document AI processor ID
   - `DOCUMENT_AI_LOCATION` - Processor location (default 'us')

4. **Feature Flags**:
   - `AGENTIC_RAG_ENABLED` - Enable/disable agentic RAG processing

### Configuration Loading

**Priority Order**:
1. `process.env` (Firebase Functions v2)
2. `functions.config()` (Firebase Functions v1 fallback)
3. `.env` file (local development)

**Validation**: A Joi schema validates all required environment variables
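To illustrate the kind of checks that validation performs, here is a plain-TypeScript sketch; the project uses a Joi schema for this, and the required-variable list below is partial and assumed.

```typescript
// Illustration of environment validation: every required variable must
// be present, and LLM_PROVIDER must be one of the supported values.
const REQUIRED_VARS = ['LLM_PROVIDER', 'SUPABASE_URL', 'SUPABASE_SERVICE_KEY', 'GCS_BUCKET_NAME'];
const VALID_PROVIDERS = ['anthropic', 'openai', 'openrouter'];

function validateEnv(env: Record<string, string | undefined>): string[] {
  const errors: string[] = [];
  for (const name of REQUIRED_VARS) {
    if (!env[name]) errors.push(`Missing required variable: ${name}`);
  }
  if (env.LLM_PROVIDER && !VALID_PROVIDERS.includes(env.LLM_PROVIDER)) {
    errors.push(`LLM_PROVIDER must be one of: ${VALID_PROVIDERS.join(', ')}`);
  }
  return errors;
}
```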
---

## 13. Debugging Guide

### Key Log Points

**Correlation IDs**: Every request has a correlation ID for tracing

**Structured Logging**: Winston logger with structured data

**Key Log Locations**:
1. **Request Entry**: `backend/src/index.ts` - All incoming requests
2. **Authentication**: `backend/src/middleware/firebaseAuth.ts` - Auth success/failure
3. **Job Processing**: `backend/src/services/jobProcessorService.ts` - Job lifecycle
4. **Document Processing**: `backend/src/services/unifiedDocumentProcessor.ts` - Processing steps
5. **LLM Calls**: `backend/src/services/llmService.ts` - API calls and responses
6. **Vector Search**: `backend/src/services/vectorDatabaseService.ts` - Search operations
7. **Error Handling**: `backend/src/middleware/errorHandler.ts` - All errors with categorization
### Common Failure Points

**1. Vector Search Timeouts**
- **Symptom**: "Vector search timeout after 10s"
- **Cause**: Searching across all documents without a `document_id` filter
- **Fix**: Always pass `documentId` to `vectorDatabaseService.searchSimilar()`

**2. LLM API Failures**
- **Symptom**: "LLM API call failed" or "Invalid JSON response"
- **Cause**: API rate limits, network issues, or invalid response format
- **Fix**: Check API keys, retry logic, and response validation

**3. GCS Upload Failures**
- **Symptom**: "Failed to upload to GCS" or "Signed URL expired"
- **Cause**: Credential issues, bucket permissions, or URL expiration
- **Fix**: Check GCS credentials and bucket configuration

**4. Job Stuck in Processing**
- **Symptom**: Job status remains 'processing' for more than 15 minutes
- **Cause**: Process crash, timeout, or an uncaught error
- **Fix**: Check logs, reset stuck jobs, investigate the underlying error

**5. Document AI Failures**
- **Symptom**: "Failed to extract text from document"
- **Cause**: Document AI API error or invalid file format
- **Fix**: Check the Document AI processor configuration; fall back to pdf-parse
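The 15-minute threshold for stuck jobs can be checked in application code as well as in SQL. A small sketch, where the job row shape and field names are assumptions mirroring the guideline above:

```typescript
// Sketch: flag a job as stuck when it has been 'processing' longer than
// the 15-minute guideline. The JobRow shape is an assumption.
interface JobRow {
  status: string;
  startedAt: Date;
}

const STUCK_THRESHOLD_MS = 15 * 60 * 1000;

function isStuck(job: JobRow, now: Date = new Date()): boolean {
  return (
    job.status === "processing" &&
    now.getTime() - job.startedAt.getTime() > STUCK_THRESHOLD_MS
  );
}
```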
### Diagnostic Tools

**Health Check Endpoints**:
- `GET /health` - Basic health check
- `GET /health/config` - Configuration health
- `GET /health/agentic-rag` - Agentic RAG health status

**Monitoring Endpoints**:
- `GET /monitoring/upload-metrics` - Upload statistics
- `GET /monitoring/upload-health` - Upload health
- `GET /monitoring/real-time-stats` - Real-time statistics
**Database Debugging**:

```sql
-- Check pending jobs
SELECT * FROM processing_jobs WHERE status = 'pending' ORDER BY created_at DESC;

-- Check stuck jobs
SELECT * FROM processing_jobs
WHERE status = 'processing'
  AND started_at < NOW() - INTERVAL '15 minutes';

-- Check document status
SELECT id, original_file_name, status, created_at
FROM documents
WHERE user_id = '<user_id>'
ORDER BY created_at DESC;
```
**Job Inspection**:

```typescript
// Get job details (may be null if the job ID is unknown)
const job = await ProcessingJobModel.findById(jobId);

// Check job error
console.log('Job error:', job?.error);

// Check job result
console.log('Job result:', job?.result);
```
### Debugging Workflow

1. **Identify the Issue**: Check error logs with the correlation ID
2. **Trace the Request**: Follow the correlation ID through the logs
3. **Check Job Status**: Query the `processing_jobs` table for job state
4. **Check Document Status**: Query the `documents` table for document state
5. **Review Service Logs**: Check specific service logs for detailed errors
6. **Test Components**: Test individual services in isolation
7. **Check External Services**: Verify GCS, Document AI, and LLM APIs are accessible

---
## 14. Optimization Opportunities

### Identified Bottlenecks

**1. Vector Search Performance**
- **Current**: 10-second timeout; can be slow for large document sets
- **Optimization**: Ensure the `document_id` filter is always used
- **Future**: Consider indexing optimizations and batch search

**2. LLM API Calls**
- **Current**: Sequential processing, no caching of similar requests
- **Optimization**: Implement response caching for similar documents
- **Future**: Batch API calls; use smaller models for simpler tasks

**3. PDF Generation**
- **Current**: Puppeteer can be memory-intensive
- **Optimization**: Page pooling already implemented
- **Future**: Consider a serverless PDF generation service

**4. Database Queries**
- **Current**: Some queries don't use indexes effectively
- **Optimization**: Add indexes on frequently queried columns
- **Future**: Query optimization, connection pooling tuning
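The effect of the `document_id` filter can be illustrated with a toy in-memory version of similarity search: filtering to one document first shrinks the candidate set before any ranking happens. The chunk shape and function names below are illustrative, not the real `vectorDatabaseService` API:

```typescript
// Toy similarity search that scopes to one document before ranking,
// mirroring the document_id filter recommendation. Shapes are assumptions.
interface Chunk {
  documentId: string;
  embedding: number[];
  text: string;
}

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] ** 2;
    nb += b[i] ** 2;
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

function searchScoped(chunks: Chunk[], query: number[], documentId: string, topK = 3): Chunk[] {
  return chunks
    .filter((c) => c.documentId === documentId) // filter FIRST, then rank
    .map((c) => ({ c, score: cosine(c.embedding, query) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, topK)
    .map((x) => x.c);
}
```

In the real system the filter is pushed down into the `match_document_chunks` query rather than applied in memory, but the ordering principle is the same.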
### Memory Usage Patterns

**Chunk Processing**:
- Processes chunks in batches to limit memory
- Cleans up processed chunks after storage
- **Optimization**: Consider streaming for very large documents

**PDF Generation**:
- Page pooling limits memory usage
- Browser instance reuse reduces overhead
- **Optimization**: Consider headless browser optimization
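The batch-then-release pattern for chunk processing can be sketched as follows (batch size and handler are hypothetical):

```typescript
// Process chunks one batch at a time so only a single batch's worth of
// derived data is alive at once; earlier batches become garbage-collectable
// after their handler resolves.
async function processInBatches<T>(
  chunks: T[],
  batchSize: number,
  handle: (batch: T[]) => Promise<void>,
): Promise<void> {
  for (let i = 0; i < chunks.length; i += batchSize) {
    await handle(chunks.slice(i, i + batchSize));
  }
}
```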
### API Call Optimization

**Embedding Generation**:
- **Current**: Max 5 concurrent calls
- **Optimization**: Tune based on API rate limits
- **Future**: Batch embedding API if available

**LLM Calls**:
- **Current**: Single call per document
- **Optimization**: Use smaller models for simpler tasks
- **Future**: Implement response caching
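A cap on concurrent calls (the "max 5" above) can be enforced with a small worker-pool helper; this is a generic sketch, not the project's actual implementation:

```typescript
// Run fn over items with at most `limit` calls in flight at once.
// Result order matches input order.
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  fn: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;
  const worker = async (): Promise<void> => {
    while (next < items.length) {
      const i = next++; // index claimed synchronously, so workers never collide
      results[i] = await fn(items[i]);
    }
  };
  await Promise.all(
    Array.from({ length: Math.min(limit, items.length) }, worker),
  );
  return results;
}
```

Raising or lowering `limit` is then a one-line change when tuning against the embedding API's rate limits.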
### Database Query Optimization

**Frequently Queried Tables**:
- `documents` - Add index on `user_id`, `status`
- `processing_jobs` - Add index on `status`, `created_at`
- `document_chunks` - Add index on `document_id`, `chunk_index`

**Vector Search**:
- **Current**: Uses the `match_document_chunks` function
- **Optimization**: Ensure the `document_id` filter is always used
- **Future**: Consider an HNSW index for faster similarity search

---
## Appendix: Key File Locations

### Backend Services
- `backend/src/services/unifiedDocumentProcessor.ts` - Main orchestrator
- `backend/src/services/optimizedAgenticRAGProcessor.ts` - AI processing engine
- `backend/src/services/jobProcessorService.ts` - Job processor
- `backend/src/services/vectorDatabaseService.ts` - Vector operations
- `backend/src/services/llmService.ts` - LLM interactions
- `backend/src/services/documentAiProcessor.ts` - Document AI integration
- `backend/src/services/pdfGenerationService.ts` - PDF generation
- `backend/src/services/fileStorageService.ts` - GCS operations

### Backend Models
- `backend/src/models/DocumentModel.ts` - Document data model
- `backend/src/models/ProcessingJobModel.ts` - Job data model
- `backend/src/models/VectorDatabaseModel.ts` - Vector data model

### Backend Routes
- `backend/src/routes/documents.ts` - Document endpoints
- `backend/src/routes/vector.ts` - Vector endpoints
- `backend/src/routes/monitoring.ts` - Monitoring endpoints

### Backend Controllers
- `backend/src/controllers/documentController.ts` - Document controller

### Frontend Services
- `frontend/src/services/documentService.ts` - Document API client
- `frontend/src/services/authService.ts` - Authentication service

### Frontend Components
- `frontend/src/components/DocumentUpload.tsx` - Upload component
- `frontend/src/components/DocumentList.tsx` - Document list
- `frontend/src/components/DocumentViewer.tsx` - Document viewer

### Configuration
- `backend/src/config/env.ts` - Environment configuration
- `backend/src/config/supabase.ts` - Supabase configuration
- `backend/src/config/firebase.ts` - Firebase configuration

### SQL
- `backend/sql/fix_vector_search_timeout.sql` - Vector search optimization

---

**End of Architecture Summary**