Design Document
Overview
This design covers the cleanup and stabilization of the CIM Document Processor backend. It focuses on repairing the processing pipeline and making Firebase Storage the primary file store. It catalogs the critical issues in the current codebase and specifies the changes needed for reliable document processing from upload through final PDF generation.
Architecture
Current Issues Identified
Based on error analysis and code review, the following critical issues have been identified:
- Database Query Issues: UUID validation errors when non-UUID strings are passed to document queries
- Service Dependencies: Circular dependencies and missing service imports
- Firebase Storage Integration: Incomplete migration from Google Cloud Storage to Firebase Storage
- Error Handling: Insufficient error handling and logging throughout the pipeline
- Configuration Management: Environment variable validation issues in serverless environments
- Processing Pipeline: Broken service orchestration in the document processing flow
Target Architecture
┌─────────────────────────────────────────────────────────────────────────────┐
│                              FRONTEND (React)                               │
├─────────────────────────────────────────────────────────────────────────────┤
│             Firebase Auth + Document Upload → Firebase Storage              │
└─────────────────────────────────────────────────────────────────────────────┘
                                       │
                                       ▼ HTTPS API
┌─────────────────────────────────────────────────────────────────────────────┐
│                              BACKEND (Node.js)                              │
├─────────────────────────────────────────────────────────────────────────────┤
│    ┌─────────────────┐      ┌─────────────────┐      ┌─────────────────┐    │
│    │ API Routes      │      │ Middleware      │      │ Error Handler   │    │
│    │ - Documents     │      │ - Auth          │      │ - Global        │    │
│    │ - Monitoring    │      │ - Validation    │      │ - Correlation   │    │
│    │ - Vector        │      │ - CORS          │      │ - Logging       │    │
│    └─────────────────┘      └─────────────────┘      └─────────────────┘    │
│                                                                             │
│    ┌─────────────────┐      ┌─────────────────┐      ┌─────────────────┐    │
│    │ Core Services   │      │ Processing      │      │ External APIs   │    │
│    │ - Document      │      │ - Agentic RAG   │      │ - Document AI   │    │
│    │ - Upload        │      │ - LLM Service   │      │ - Claude AI     │    │
│    │ - Session       │      │ - PDF Gen       │      │ - Firebase      │    │
│    └─────────────────┘      └─────────────────┘      └─────────────────┘    │
└─────────────────────────────────────────────────────────────────────────────┘
                                       │
                                       ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                             STORAGE & DATABASE                              │
├─────────────────────────────────────────────────────────────────────────────┤
│    ┌─────────────────┐      ┌─────────────────┐      ┌─────────────────┐    │
│    │ Firebase        │      │ Supabase        │      │ Vector          │    │
│    │ Storage         │      │ Database        │      │ Database        │    │
│    │ - File Upload   │      │ - Documents     │      │ - Embeddings    │    │
│    │ - Security      │      │ - Sessions      │      │ - Chunks        │    │
│    └─────────────────┘      └─────────────────┘      └─────────────────┘    │
└─────────────────────────────────────────────────────────────────────────────┘
Components and Interfaces
1. Enhanced Configuration Management
Purpose: Robust environment variable validation and configuration management for both development and production environments.
Key Features:
- Graceful handling of missing environment variables in serverless environments
- Runtime configuration validation
- Fallback values for non-critical settings
- Clear error messages for missing critical configuration
Interface:
interface Config {
  // Core settings
  env: string;
  port: number;
  // Firebase configuration
  firebase: {
    projectId: string;
    storageBucket: string;
    apiKey: string;
    authDomain: string;
  };
  // Database configuration
  supabase: {
    url: string;
    anonKey: string;
    serviceKey: string;
  };
  // External services
  googleCloud: {
    projectId: string;
    documentAiLocation: string;
    documentAiProcessorId: string;
    applicationCredentials: string;
  };
  // LLM configuration
  llm: {
    provider: 'anthropic' | 'openai';
    anthropicApiKey?: string;
    openaiApiKey?: string;
    model: string;
    maxTokens: number;
    temperature: number;
  };
}
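The validation behaviour listed under Key Features could look roughly like the sketch below. It is a minimal illustration rather than the project's actual loader: it covers only a subset of `Config`, and the function name `loadCoreConfig`, the `CoreConfig` type, and the environment variable names are all assumptions.

```typescript
// Sketch only: validates a subset of the Config shape.
// The environment variable names below are assumptions, not taken from the codebase.
type Env = Record<string, string | undefined>;

interface CoreConfig {
  env: string;
  port: number;
  firebase: { projectId: string; storageBucket: string };
}

const REQUIRED_KEYS = ["FIREBASE_PROJECT_ID", "FIREBASE_STORAGE_BUCKET"];

function loadCoreConfig(env: Env): CoreConfig {
  // Report every missing critical key at once rather than failing on the first.
  const missing = REQUIRED_KEYS.filter((key) => !env[key]);
  if (missing.length > 0) {
    throw new Error(`Missing required environment variables: ${missing.join(", ")}`);
  }

  // Non-critical settings fall back to defaults.
  const rawPort = env["PORT"] ?? "3000";
  const port = Number(rawPort);
  if (!Number.isInteger(port) || port <= 0) {
    throw new Error(`PORT must be a positive integer, got "${rawPort}"`);
  }

  return {
    env: env["NODE_ENV"] ?? "development",
    port,
    firebase: {
      projectId: env["FIREBASE_PROJECT_ID"] as string,
      storageBucket: env["FIREBASE_STORAGE_BUCKET"] as string,
    },
  };
}
```

In a Node.js backend this would typically be called once at startup as `loadCoreConfig(process.env)`. Taking the environment as a parameter keeps the function easy to unit test.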
2. Firebase Storage Service
Purpose: Complete Firebase Storage integration replacing Google Cloud Storage for file operations.
Key Features:
- Secure file upload with Firebase Authentication
- Proper file organization and naming conventions
- File metadata management
- Download URL generation
- File cleanup and lifecycle management
Interface:
interface FirebaseStorageService {
  uploadFile(file: Buffer, fileName: string, userId: string): Promise<string>;
  getDownloadUrl(filePath: string): Promise<string>;
  deleteFile(filePath: string): Promise<void>;
  getFileMetadata(filePath: string): Promise<FileMetadata>;
  generateUploadUrl(fileName: string, userId: string): Promise<string>;
}
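The design does not fix a naming convention, so the following is only one possible shape for the "file organization and naming conventions" feature. The `uploads/<userId>/<timestamp>-<name>` layout and the helper name `buildStoragePath` are assumptions made for illustration. The Firebase SDK calls themselves are left out because the design does not specify them.

```typescript
// Hypothetical naming convention: uploads/<userId>/<timestamp>-<sanitized file name>.
// The layout is an assumption for illustration, not something the design specifies.
function buildStoragePath(userId: string, fileName: string, now: Date = new Date()): string {
  if (!/^[A-Za-z0-9_-]+$/.test(userId)) {
    throw new Error("userId contains characters that are unsafe in a storage path");
  }
  // Drop any directory components supplied by the client.
  const baseName = fileName.split(/[\\/]/).pop() ?? "";
  // Replace runs of characters outside a conservative safe set, then strip leading dots.
  const safeName = baseName.replace(/[^A-Za-z0-9._-]+/g, "_").replace(/^\.+/, "");
  if (safeName.length === 0) {
    throw new Error("fileName is empty after sanitization");
  }
  return `uploads/${userId}/${now.getTime()}-${safeName}`;
}
```

Scoping every object under the user's ID also makes user-based access rules simpler to express, since a rule can match on the path prefix.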
3. Enhanced Document Service
Purpose: Centralized document management with proper error handling and validation.
Key Features:
- UUID validation for all document operations
- Proper error handling for database operations
- Document lifecycle management
- Status tracking and updates
- Metadata management
Interface:
interface DocumentService {
  createDocument(data: CreateDocumentData): Promise<Document>;
  getDocument(id: string): Promise<Document | null>;
  updateDocument(id: string, updates: Partial<Document>): Promise<Document>;
  deleteDocument(id: string): Promise<void>;
  listDocuments(userId: string, filters?: DocumentFilters): Promise<Document[]>;
  validateDocumentId(id: string): boolean;
}
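A minimal sketch of `validateDocumentId` follows. It assumes document IDs use the canonical 8-4-4-4-12 hexadecimal UUID form and deliberately does not check the version or variant bits. The helper `assertDocumentId` is hypothetical and shows how a non-UUID string could be rejected before it reaches a database query.

```typescript
// Canonical textual UUID: 8-4-4-4-12 hexadecimal digits, case-insensitive.
// Assumption: version/variant bits are not enforced here.
const UUID_PATTERN = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;

function validateDocumentId(id: string): boolean {
  return UUID_PATTERN.test(id);
}

// Hypothetical guard: fail fast with a clear message instead of letting the
// database raise a type error for values such as "undefined" or a route name.
function assertDocumentId(id: string): void {
  if (!validateDocumentId(id)) {
    throw new Error(`Invalid document ID format: "${id}"`);
  }
}
```

Calling the guard at the top of each document operation turns a database-level UUID error into a validation error that can be reported as 400 Bad Request.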
4. Improved Processing Pipeline
Purpose: Reliable document processing pipeline with proper error handling and recovery.
Key Features:
- Step-by-step processing with checkpoints
- Error recovery and retry mechanisms
- Progress tracking and status updates
- Partial result preservation
- Processing timeout handling
Interface:
interface ProcessingPipeline {
  processDocument(documentId: string, options: ProcessingOptions): Promise<ProcessingResult>;
  getProcessingStatus(documentId: string): Promise<ProcessingStatus>;
  retryProcessing(documentId: string, fromStep?: string): Promise<ProcessingResult>;
  cancelProcessing(documentId: string): Promise<void>;
}
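To illustrate how checkpoints, partial result preservation, and `retryProcessing(documentId, fromStep)` could fit together, here is a small in-memory sketch. The names `PipelineStep`, `RunResult`, and `runWithCheckpoints` are hypothetical, and persistence of the checkpoint (for example in the document record) is left out.

```typescript
// Sketch under assumptions: steps run in order, each step merges its output
// into a shared state object, and that state doubles as the checkpoint.
type StepState = Record<string, unknown>;

interface PipelineStep {
  name: string;
  run: (state: StepState) => Promise<StepState>;
}

interface RunResult {
  completed: string[];
  failedStep?: string;
  error?: string;
  partial: StepState; // results preserved up to the failure point
}

async function runWithCheckpoints(
  steps: PipelineStep[],
  checkpoint: StepState = {},
  fromStep?: string,
): Promise<RunResult> {
  const startIndex = fromStep ? steps.findIndex((s) => s.name === fromStep) : 0;
  if (startIndex < 0) {
    throw new Error(`Unknown step: ${fromStep}`);
  }

  let state: StepState = { ...checkpoint };
  const completed: string[] = [];

  for (const step of steps.slice(startIndex)) {
    try {
      state = { ...state, ...(await step.run(state)) };
      completed.push(step.name);
    } catch (err) {
      // Stop at the first failure but keep everything produced so far.
      const message = err instanceof Error ? err.message : String(err);
      return { completed, failedStep: step.name, error: message, partial: state };
    }
  }
  return { completed, partial: state };
}
```

A retry then becomes `runWithCheckpoints(steps, previous.partial, previous.failedStep)`, which skips the steps that already succeeded.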
5. Robust Error Handling System
Purpose: Comprehensive error handling with correlation tracking and proper logging.
Key Features:
- Correlation ID generation for request tracking
- Structured error logging
- Error categorization and handling strategies
- User-friendly error messages
- Error recovery mechanisms
Interface:
interface ErrorHandler {
  handleError(error: Error, context: ErrorContext): ErrorResponse;
  logError(error: Error, correlationId: string, context: any): void;
  createCorrelationId(): string;
  categorizeError(error: Error): ErrorCategory;
}
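One way `categorizeError` might work is sketched below. `ErrorCategory` is not defined in this document, so the union here is an assumption derived from the categories in the Error Handling section. The `AppError` class is hypothetical, and the 502 and 500 status codes are assumed defaults for categories where the design names no status.

```typescript
// Assumed category names, mirroring the Error Handling section.
type ErrorCategory =
  | "validation"
  | "authentication"
  | "authorization"
  | "not_found"
  | "external_service"
  | "processing"
  | "system";

// Hypothetical error class that carries its category explicitly.
class AppError extends Error {
  readonly category: ErrorCategory;

  constructor(message: string, category: ErrorCategory) {
    super(message);
    this.name = "AppError";
    this.category = category;
    // Keeps `instanceof` working when compiled to an ES5 target.
    Object.setPrototypeOf(this, AppError.prototype);
  }
}

function categorizeError(error: Error): ErrorCategory {
  // Anything that was not raised deliberately is treated as a system error.
  return error instanceof AppError ? error.category : "system";
}

// 400/401/403/404 come from the design; 502 and 500 are assumptions.
const STATUS_BY_CATEGORY: Record<ErrorCategory, number> = {
  validation: 400,
  authentication: 401,
  authorization: 403,
  not_found: 404,
  external_service: 502,
  processing: 500,
  system: 500,
};
```

Raising `AppError` at the point where the cause is known avoids guessing a category later from message text.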
Data Models
Enhanced Document Model
interface Document {
  id: string; // UUID
  userId: string;
  originalFileName: string;
  filePath: string; // Firebase Storage path
  fileSize: number;
  mimeType: string;
  status: DocumentStatus;
  extractedText?: string;
  generatedSummary?: string;
  summaryPdfPath?: string;
  analysisData?: CIMReview;
  processingSteps: ProcessingStep[];
  errorLog?: ErrorEntry[];
  createdAt: Date;
  updatedAt: Date;
}
enum DocumentStatus {
  UPLOADED = 'uploaded',
  PROCESSING = 'processing',
  COMPLETED = 'completed',
  FAILED = 'failed',
  CANCELLED = 'cancelled'
}
interface ProcessingStep {
  step: string;
  status: 'pending' | 'in_progress' | 'completed' | 'failed';
  startedAt?: Date;
  completedAt?: Date;
  error?: string;
  metadata?: Record<string, any>;
}
Processing Session Model
interface ProcessingSession {
  id: string;
  documentId: string;
  strategy: string;
  status: SessionStatus;
  steps: ProcessingStep[];
  totalSteps: number;
  completedSteps: number;
  failedSteps: number;
  processingTimeMs?: number;
  apiCallsCount: number;
  totalCost: number;
  errorLog: ErrorEntry[];
  createdAt: Date;
  completedAt?: Date;
}
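Because `totalSteps`, `completedSteps`, and `failedSteps` can all be derived from `steps`, one option is to compute them in a single place so the counters never drift from the step list. The helper below is a sketch; `summarizeSteps` and the `percentComplete` field are hypothetical additions, not part of the model above.

```typescript
type StepStatus = "pending" | "in_progress" | "completed" | "failed";

// Only the fields needed for counting are required here.
interface StepLike {
  step: string;
  status: StepStatus;
}

interface StepSummary {
  totalSteps: number;
  completedSteps: number;
  failedSteps: number;
  percentComplete: number; // hypothetical convenience field for progress display
}

function summarizeSteps(steps: StepLike[]): StepSummary {
  const totalSteps = steps.length;
  const completedSteps = steps.filter((s) => s.status === "completed").length;
  const failedSteps = steps.filter((s) => s.status === "failed").length;
  // Guard against division by zero for a session with no steps yet.
  const percentComplete =
    totalSteps === 0 ? 0 : Math.round((completedSteps / totalSteps) * 100);
  return { totalSteps, completedSteps, failedSteps, percentComplete };
}
```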
Error Handling
Error Categories and Strategies
- Validation Errors
  - UUID format validation
  - File type and size validation
  - Required field validation
  - Strategy: Return 400 Bad Request with detailed error message
- Authentication Errors
  - Invalid or expired tokens
  - Missing authentication
  - Strategy: Return 401 Unauthorized, trigger token refresh
- Authorization Errors
  - Insufficient permissions
  - Resource access denied
  - Strategy: Return 403 Forbidden with clear message
- Resource Not Found
  - Document not found
  - File not found
  - Strategy: Return 404 Not Found
- External Service Errors
  - Firebase Storage errors
  - Document AI failures
  - LLM API errors
  - Strategy: Retry with exponential backoff, fallback options
- Processing Errors
  - Text extraction failures
  - PDF generation errors
  - Database operation failures
  - Strategy: Preserve partial results, enable retry from checkpoint
- System Errors
  - Memory issues
  - Timeout errors
  - Network failures
  - Strategy: Graceful degradation, error logging, monitoring alerts
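The "retry with exponential backoff" strategy for external service errors can be sketched as below. The names `withRetry`, `backoffDelay`, and `RetryOptions` are hypothetical, and the attempt count and delays are left to the caller because the design does not specify them. Production code would usually add random jitter and retry only errors known to be transient.

```typescript
interface RetryOptions {
  maxAttempts: number;
  baseDelayMs: number;
  maxDelayMs: number;
  // Injectable so tests do not have to wait for real timers.
  sleep?: (ms: number) => Promise<void>;
}

const defaultSleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// attempt is 1-based: base, 2 * base, 4 * base, ... capped at maxDelayMs.
function backoffDelay(attempt: number, baseDelayMs: number, maxDelayMs: number): number {
  return Math.min(maxDelayMs, baseDelayMs * Math.pow(2, attempt - 1));
}

async function withRetry<T>(operation: () => Promise<T>, options: RetryOptions): Promise<T> {
  if (options.maxAttempts < 1) {
    throw new Error("maxAttempts must be at least 1");
  }
  const sleep = options.sleep ?? defaultSleep;
  let lastError: unknown;

  for (let attempt = 1; attempt <= options.maxAttempts; attempt++) {
    try {
      return await operation();
    } catch (err) {
      lastError = err;
      // Wait only if another attempt will follow.
      if (attempt < options.maxAttempts) {
        await sleep(backoffDelay(attempt, options.baseDelayMs, options.maxDelayMs));
      }
    }
  }
  throw lastError;
}
```

A call to Document AI or the LLM API would be wrapped as `withRetry(() => callService(), { maxAttempts: 3, baseDelayMs: 500, maxDelayMs: 5000 })`, where `callService` stands in for the real client call.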
Error Response Format
interface ErrorResponse {
  success: false;
  error: {
    code: string;
    message: string;
    details?: any;
    correlationId: string;
    timestamp: string;
    retryable: boolean;
  };
}
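A small builder keeps every error payload in this shape. The sketch below restates the interface so it is self-contained (with `details` typed as `unknown`) and assumes an ISO 8601 timestamp, which the design does not state explicitly. `buildErrorResponse` and `ErrorInput` are hypothetical names.

```typescript
interface ErrorResponse {
  success: false;
  error: {
    code: string;
    message: string;
    details?: unknown;
    correlationId: string;
    timestamp: string;
    retryable: boolean;
  };
}

interface ErrorInput {
  code: string;
  message: string;
  correlationId: string;
  retryable: boolean;
  details?: unknown;
}

function buildErrorResponse(input: ErrorInput, now: Date = new Date()): ErrorResponse {
  const response: ErrorResponse = {
    success: false,
    error: {
      code: input.code,
      message: input.message,
      correlationId: input.correlationId,
      timestamp: now.toISOString(), // assumption: ISO 8601 in UTC
      retryable: input.retryable,
    },
  };
  // Attach details only when the caller supplies them, so internal data
  // is never included by default.
  if (input.details !== undefined) {
    response.error.details = input.details;
  }
  return response;
}
```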
Testing Strategy
Unit Testing
- Service layer testing with mocked dependencies
- Utility function testing
- Configuration validation testing
- Error handling testing
Integration Testing
- Firebase Storage integration
- Database operations
- External API integrations
- End-to-end processing pipeline
Error Scenario Testing
- Network failure simulation
- API rate limit testing
- Invalid input handling
- Timeout scenario testing
Performance Testing
- Large file upload testing
- Concurrent processing testing
- Memory usage monitoring
- API response time testing
Implementation Phases
Phase 1: Core Infrastructure Cleanup
- Fix configuration management and environment variable handling
- Implement proper UUID validation for database queries
- Set up comprehensive error handling and logging
- Fix service dependency issues
Phase 2: Firebase Storage Integration
- Implement Firebase Storage service
- Update file upload endpoints
- Migrate existing file operations
- Update frontend integration
Phase 3: Processing Pipeline Stabilization
- Fix service orchestration issues
- Implement proper error recovery
- Add processing checkpoints
- Enhance monitoring and logging
Phase 4: Testing and Optimization
- Comprehensive testing suite
- Performance optimization
- Error scenario testing
- Documentation updates
Security Considerations
Firebase Storage Security
- Proper Firebase Security Rules
- User-based file access control
- File type and size validation
- Secure download URL generation
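File type and size validation on the server side might be sketched as follows. The design does not state the permitted types or the size limit, so both are passed in as a policy, and the names `UploadPolicy`, `UploadCandidate`, and `validateUpload` are hypothetical.

```typescript
interface UploadPolicy {
  allowedMimeTypes: string[];
  maxSizeBytes: number;
}

interface UploadCandidate {
  fileName: string;
  mimeType: string;
  sizeBytes: number;
}

// Returns a list of problems; an empty list means the upload is acceptable.
function validateUpload(file: UploadCandidate, policy: UploadPolicy): string[] {
  const problems: string[] = [];
  if (policy.allowedMimeTypes.indexOf(file.mimeType) === -1) {
    problems.push(`Unsupported file type: ${file.mimeType}`);
  }
  if (file.sizeBytes <= 0) {
    problems.push("File is empty");
  }
  if (file.sizeBytes > policy.maxSizeBytes) {
    problems.push(`File exceeds the ${policy.maxSizeBytes}-byte limit`);
  }
  return problems;
}

// Example values only; the real limits are a configuration decision.
const examplePolicy: UploadPolicy = {
  allowedMimeTypes: ["application/pdf"],
  maxSizeBytes: 50 * 1024 * 1024,
};
```

The client-declared MIME type is easy to spoof, so a check like this complements storage-level security rules and does not replace them.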
API Security
- Request validation and sanitization
- Rate limiting and abuse prevention
- Correlation ID tracking
- Secure error messages (no sensitive data exposure)
Data Protection
- User data isolation
- Secure file deletion
- Audit logging
- GDPR compliance considerations
Monitoring and Observability
Key Metrics
- Document processing success rate
- Average processing time per document
- API response times
- Error rates by category
- Firebase Storage usage
- Database query performance
Logging Strategy
- Structured logging with correlation IDs
- Error categorization and tracking
- Performance metrics logging
- User activity logging
- External service interaction logging
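As an illustration of structured logging with correlation IDs, the sketch below formats each entry as one JSON line. The field names and the `formatLogEntry` helper are assumptions. The existing logger, if any, may use a different shape.

```typescript
type LogLevel = "debug" | "info" | "warn" | "error";

interface LogEntry {
  timestamp: string;
  level: LogLevel;
  message: string;
  correlationId: string;
  context: Record<string, unknown>;
}

function formatLogEntry(
  level: LogLevel,
  message: string,
  correlationId: string,
  context: Record<string, unknown> = {},
  now: Date = new Date(),
): string {
  const entry: LogEntry = {
    timestamp: now.toISOString(),
    level,
    message,
    correlationId,
    // Nested under its own key so caller data cannot overwrite the fixed fields.
    context,
  };
  return JSON.stringify(entry);
}
```

Emitting one JSON object per line lets log tooling filter by `correlationId` to reconstruct everything that happened for a single request.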
Health Checks
- Service availability checks
- Database connectivity
- External service status
- File storage accessibility
- Processing pipeline health
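The checks above could be aggregated into a single report along these lines. This is a sketch: `runHealthChecks`, `withTimeout`, and the report shape are hypothetical, and the individual checks (database ping, storage access, and so on) are supplied by the caller.

```typescript
interface HealthCheck {
  name: string;
  check: () => Promise<void>; // resolves when healthy, rejects otherwise
}

interface HealthReport {
  status: "healthy" | "degraded";
  checks: Array<{ name: string; ok: boolean; error?: string }>;
}

// Rejects if the promise does not settle within the given time.
function withTimeout<T>(promise: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error(`timed out after ${ms} ms`)), ms);
    promise.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (err) => { clearTimeout(timer); reject(err); },
    );
  });
}

async function runHealthChecks(checks: HealthCheck[], timeoutMs = 2000): Promise<HealthReport> {
  // Run all checks in parallel; one slow dependency must not block the others.
  const results = await Promise.all(
    checks.map(async ({ name, check }) => {
      try {
        await withTimeout(check(), timeoutMs);
        return { name, ok: true };
      } catch (err) {
        return { name, ok: false, error: err instanceof Error ? err.message : String(err) };
      }
    }),
  );
  const status = results.every((r) => r.ok) ? "healthy" : "degraded";
  return { status, checks: results };
}
```

The per-check timeout matters in serverless environments, where a hung dependency would otherwise hold the health endpoint open until the platform kills the request.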