# Design Document

## Overview

This design addresses the systematic cleanup and stabilization of the CIM Document Processor backend, with a focus on fixing the processing pipeline and integrating Firebase Storage as the primary file storage solution. The design identifies critical issues in the current codebase and provides a comprehensive solution to ensure reliable document processing from upload through final PDF generation.

## Architecture

### Current Issues Identified

Based on error analysis and code review, the following critical issues have been identified:

1. **Database Query Issues**: UUID validation errors when non-UUID strings are passed to document queries
2. **Service Dependencies**: Circular dependencies and missing service imports
3. **Firebase Storage Integration**: Incomplete migration from Google Cloud Storage to Firebase Storage
4. **Error Handling**: Insufficient error handling and logging throughout the pipeline
5. **Configuration Management**: Environment variable validation issues in serverless environments
6. **Processing Pipeline**: Broken service orchestration in the document processing flow

### Target Architecture

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                              FRONTEND (React)                               │
├─────────────────────────────────────────────────────────────────────────────┤
│  Firebase Auth + Document Upload → Firebase Storage                         │
└─────────────────────────────────────────────────────────────────────────────┘
                                       │
                                       ▼ HTTPS API
┌─────────────────────────────────────────────────────────────────────────────┐
│                              BACKEND (Node.js)                              │
├─────────────────────────────────────────────────────────────────────────────┤
│  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐              │
│  │ API Routes      │  │ Middleware      │  │ Error Handler   │              │
│  │ - Documents     │  │ - Auth          │  │ - Global        │              │
│  │ - Monitoring    │  │ - Validation    │  │ - Correlation   │              │
│  │ - Vector        │  │ - CORS          │  │ - Logging       │              │
│  └─────────────────┘  └─────────────────┘  └─────────────────┘              │
│                                                                             │
│  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐              │
│  │ Core Services   │  │ Processing      │  │ External APIs   │              │
│  │ - Document      │  │ - Agentic RAG   │  │ - Document AI   │              │
│  │ - Upload        │  │ - LLM Service   │  │ - Claude AI     │              │
│  │ - Session       │  │ - PDF Gen       │  │ - Firebase      │              │
│  └─────────────────┘  └─────────────────┘  └─────────────────┘              │
└─────────────────────────────────────────────────────────────────────────────┘
                                       │
                                       ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                             STORAGE & DATABASE                              │
├─────────────────────────────────────────────────────────────────────────────┤
│  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐              │
│  │ Firebase        │  │ Supabase        │  │ Vector          │              │
│  │ Storage         │  │ Database        │  │ Database        │              │
│  │ - File Upload   │  │ - Documents     │  │ - Embeddings    │              │
│  │ - Security      │  │ - Sessions      │  │ - Chunks        │              │
│  └─────────────────┘  └─────────────────┘  └─────────────────┘              │
└─────────────────────────────────────────────────────────────────────────────┘
```

## Components and Interfaces

### 1. Enhanced Configuration Management

**Purpose**: Robust environment variable validation and configuration management for both development and production environments.

**Key Features**:
- Graceful handling of missing environment variables in serverless environments
- Runtime configuration validation
- Fallback values for non-critical settings
- Clear error messages for missing critical configuration

**Interface**:
```typescript
interface Config {
  // Core settings
  env: string;
  port: number;

  // Firebase configuration
  firebase: {
    projectId: string;
    storageBucket: string;
    apiKey: string;
    authDomain: string;
  };

  // Database configuration
  supabase: {
    url: string;
    anonKey: string;
    serviceKey: string;
  };

  // External services
  googleCloud: {
    projectId: string;
    documentAiLocation: string;
    documentAiProcessorId: string;
    applicationCredentials: string;
  };

  // LLM configuration
  llm: {
    provider: 'anthropic' | 'openai';
    anthropicApiKey?: string;
    openaiApiKey?: string;
    model: string;
    maxTokens: number;
    temperature: number;
  };
}
```
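
The validation behaviour listed under Key Features could look roughly like the sketch below, which covers only a subset of `Config` and assumes the environment arrives as a plain key/value map. The names `loadCoreConfig`, `ConfigError`, `FB_PROJECT_ID`, `FB_STORAGE_BUCKET`, and the default port are hypothetical and not taken from the codebase.

```typescript
// Hypothetical sketch: variable names and defaults are assumptions.
type Env = Record<string, string | undefined>;

class ConfigError extends Error {
  constructor(public readonly missing: string[]) {
    super(`Missing required configuration: ${missing.join(', ')}`);
    this.name = 'ConfigError';
  }
}

interface CoreConfig {
  env: string;
  port: number;
  firebase: { projectId: string; storageBucket: string };
}

function loadCoreConfig(env: Env): CoreConfig {
  const missing: string[] = [];

  // Critical settings: record every missing name so one error lists them all.
  const required = (name: string): string => {
    const value = env[name]?.trim();
    if (!value) missing.push(name);
    return value ?? '';
  };

  // Non-critical settings: fall back to a default instead of failing.
  const optional = (name: string, fallback: string): string =>
    env[name]?.trim() || fallback;

  const config: CoreConfig = {
    env: optional('NODE_ENV', 'development'),
    port: Number.parseInt(optional('PORT', '5000'), 10),
    firebase: {
      projectId: required('FB_PROJECT_ID'),
      storageBucket: required('FB_STORAGE_BUCKET'),
    },
  };

  if (Number.isNaN(config.port)) missing.push('PORT (must be a number)');
  if (missing.length > 0) throw new ConfigError(missing);
  return config;
}
```

Passing the map in as an argument, for example `loadCoreConfig(process.env)` at startup, keeps the function testable and lets a serverless entry point decide when validation runs.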

### 2. Firebase Storage Service

**Purpose**: Complete Firebase Storage integration replacing Google Cloud Storage for file operations.

**Key Features**:
- Secure file upload with Firebase Authentication
- Proper file organization and naming conventions
- File metadata management
- Download URL generation
- File cleanup and lifecycle management

**Interface**:
```typescript
interface FirebaseStorageService {
  uploadFile(file: Buffer, fileName: string, userId: string): Promise<string>;
  getDownloadUrl(filePath: string): Promise<string>;
  deleteFile(filePath: string): Promise<void>;
  getFileMetadata(filePath: string): Promise<FileMetadata>;
  generateUploadUrl(fileName: string, userId: string): Promise<string>;
}
```
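
The design does not fix a naming convention, so the following is only an illustrative sketch of how `uploadFile` might derive a per-user storage path. The `uploads/<userId>/<timestamp>-<name>` layout and the helpers `sanitizeFileName` and `buildStoragePath` are assumptions, and no Firebase SDK call is shown.

```typescript
// Hypothetical layout: uploads/<userId>/<timestamp>-<sanitized file name>

function sanitizeFileName(fileName: string): string {
  // Keep only the final path segment so "../" cannot escape the user's folder.
  const base = fileName.split(/[\\/]/).pop() ?? '';
  // Replace anything outside a conservative allow-list, then drop leading dots.
  const cleaned = base.replace(/[^A-Za-z0-9._-]/g, '_').replace(/^\.+/, '');
  return cleaned.length > 0 ? cleaned : 'unnamed';
}

function buildStoragePath(
  userId: string,
  fileName: string,
  now: Date = new Date(),
): string {
  if (!/^[A-Za-z0-9_-]+$/.test(userId)) {
    throw new Error('userId contains characters that are unsafe in a storage path');
  }
  // The timestamp prefix keeps repeated uploads of the same file name distinct.
  return `uploads/${userId}/${now.getTime()}-${sanitizeFileName(fileName)}`;
}
```

Scoping every object under the uploader's ID is one way to make user-based access rules straightforward, since a rule can then compare the path segment with the authenticated user.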

### 3. Enhanced Document Service

**Purpose**: Centralized document management with proper error handling and validation.

**Key Features**:
- UUID validation for all document operations
- Proper error handling for database operations
- Document lifecycle management
- Status tracking and updates
- Metadata management

**Interface**:
```typescript
interface DocumentService {
  createDocument(data: CreateDocumentData): Promise<Document>;
  getDocument(id: string): Promise<Document | null>;
  updateDocument(id: string, updates: Partial<Document>): Promise<Document>;
  deleteDocument(id: string): Promise<void>;
  listDocuments(userId: string, filters?: DocumentFilters): Promise<Document[]>;
  validateDocumentId(id: string): boolean;
}
```
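
As an illustration of the UUID check, `validateDocumentId` could be a simple pattern test run before any query is issued, so that a non-UUID string never reaches the database. This is a sketch under that assumption; `InvalidDocumentIdError` and `assertDocumentId` are hypothetical names.

```typescript
// Matches the canonical 8-4-4-4-12 hexadecimal form, case-insensitively.
const UUID_PATTERN =
  /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;

function validateDocumentId(id: string): boolean {
  return UUID_PATTERN.test(id);
}

// Hypothetical error type so callers can map this failure to a 400 response.
class InvalidDocumentIdError extends Error {
  constructor(public readonly documentId: string) {
    super(`Invalid document ID format: ${documentId}`);
    this.name = 'InvalidDocumentIdError';
  }
}

// Hypothetical guard for the start of getDocument, updateDocument, and so on.
function assertDocumentId(id: string): string {
  if (!validateDocumentId(id)) throw new InvalidDocumentIdError(id);
  return id;
}
```

With this guard in place, a request such as `GET /documents/analytics` fails validation up front instead of surfacing as a database-level UUID error.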

### 4. Improved Processing Pipeline

**Purpose**: Reliable document processing pipeline with proper error handling and recovery.

**Key Features**:
- Step-by-step processing with checkpoints
- Error recovery and retry mechanisms
- Progress tracking and status updates
- Partial result preservation
- Processing timeout handling

**Interface**:
```typescript
interface ProcessingPipeline {
  processDocument(documentId: string, options: ProcessingOptions): Promise<ProcessingResult>;
  getProcessingStatus(documentId: string): Promise<ProcessingStatus>;
  retryProcessing(documentId: string, fromStep?: string): Promise<ProcessingResult>;
  cancelProcessing(documentId: string): Promise<void>;
}
```
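
To make the checkpoint idea concrete, here is a minimal sketch of a step runner that stops at the first failure and can later resume from a named step, in the spirit of `retryProcessing(documentId, fromStep)`. The `PipelineStep` and `StepRecord` shapes and the `runPipeline` function are assumptions for illustration, and persistence of the records is left out.

```typescript
type StepStatus = 'pending' | 'in_progress' | 'completed' | 'failed';

interface StepRecord {
  step: string;
  status: StepStatus;
  error?: string;
}

// Hypothetical step shape: a name plus the work to perform.
interface PipelineStep {
  name: string;
  run: () => Promise<void>;
}

async function runPipeline(
  steps: PipelineStep[],
  fromStep?: string,
): Promise<StepRecord[]> {
  const startIndex = fromStep ? steps.findIndex((s) => s.name === fromStep) : 0;
  if (startIndex < 0) throw new Error(`Unknown step: ${fromStep}`);

  // Steps before the checkpoint are treated as already completed.
  const records = steps.map((s, i): StepRecord => ({
    step: s.name,
    status: i < startIndex ? 'completed' : 'pending',
  }));

  for (let i = startIndex; i < steps.length; i++) {
    records[i].status = 'in_progress';
    try {
      await steps[i].run();
      records[i].status = 'completed';
    } catch (err) {
      records[i].status = 'failed';
      records[i].error = err instanceof Error ? err.message : String(err);
      break; // Stop here; later steps stay 'pending' for a retry.
    }
  }
  return records;
}
```

If a hypothetical `summarize` step fails, the returned records show `extract` as completed and the later steps as pending, and calling `runPipeline(steps, 'summarize')` resumes without repeating the extraction.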

### 5. Robust Error Handling System

**Purpose**: Comprehensive error handling with correlation tracking and proper logging.

**Key Features**:
- Correlation ID generation for request tracking
- Structured error logging
- Error categorization and handling strategies
- User-friendly error messages
- Error recovery mechanisms

**Interface**:
```typescript
interface ErrorHandler {
  handleError(error: Error, context: ErrorContext): ErrorResponse;
  logError(error: Error, correlationId: string, context: any): void;
  createCorrelationId(): string;
  categorizeError(error: Error): ErrorCategory;
}
```
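
One possible shape for `categorizeError` is sketched below, assuming services throw errors tagged with a category. The `AppError` class, the category names, the `describeError` helper, and the choice of 502 for external service failures are all assumptions rather than part of the design.

```typescript
type ErrorCategory =
  | 'validation'
  | 'authentication'
  | 'authorization'
  | 'not_found'
  | 'external_service'
  | 'processing'
  | 'system';

// Hypothetical base class: services throw errors tagged with a category.
class AppError extends Error {
  constructor(message: string, public readonly category: ErrorCategory) {
    super(message);
    this.name = 'AppError';
  }
}

// Assumed mapping from category to HTTP status code.
const HTTP_STATUS: Record<ErrorCategory, number> = {
  validation: 400,
  authentication: 401,
  authorization: 403,
  not_found: 404,
  external_service: 502,
  processing: 500,
  system: 500,
};

// Assumed retry policy: only transient failures are worth retrying.
const RETRYABLE: Record<ErrorCategory, boolean> = {
  validation: false,
  authentication: false,
  authorization: false,
  not_found: false,
  external_service: true,
  processing: true,
  system: false,
};

function categorizeError(error: Error): ErrorCategory {
  // Untagged errors fall into the most conservative bucket.
  return error instanceof AppError ? error.category : 'system';
}

function describeError(error: Error) {
  const category = categorizeError(error);
  return { category, status: HTTP_STATUS[category], retryable: RETRYABLE[category] };
}
```

Keeping the status and retry policy in lookup tables means a new category forces a compile-time update in both places, because `Record<ErrorCategory, ...>` requires every key.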

## Data Models

### Enhanced Document Model

```typescript
interface Document {
  id: string; // UUID
  userId: string;
  originalFileName: string;
  filePath: string; // Firebase Storage path
  fileSize: number;
  mimeType: string;
  status: DocumentStatus;
  extractedText?: string;
  generatedSummary?: string;
  summaryPdfPath?: string;
  analysisData?: CIMReview;
  processingSteps: ProcessingStep[];
  errorLog?: ErrorEntry[];
  createdAt: Date;
  updatedAt: Date;
}

enum DocumentStatus {
  UPLOADED = 'uploaded',
  PROCESSING = 'processing',
  COMPLETED = 'completed',
  FAILED = 'failed',
  CANCELLED = 'cancelled'
}

interface ProcessingStep {
  step: string;
  status: 'pending' | 'in_progress' | 'completed' | 'failed';
  startedAt?: Date;
  completedAt?: Date;
  error?: string;
  metadata?: Record<string, any>;
}
```
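
The model does not say which status changes are legal. As a hedged illustration, status tracking could be guarded by a transition table like the one below; the specific transitions (for example, allowing `FAILED` to return to `PROCESSING` on retry) and the `canTransition` helper are assumptions.

```typescript
enum DocumentStatus {
  UPLOADED = 'uploaded',
  PROCESSING = 'processing',
  COMPLETED = 'completed',
  FAILED = 'failed',
  CANCELLED = 'cancelled',
}

// Assumed lifecycle: which statuses may follow each status.
const ALLOWED_TRANSITIONS: Record<DocumentStatus, DocumentStatus[]> = {
  [DocumentStatus.UPLOADED]: [DocumentStatus.PROCESSING, DocumentStatus.CANCELLED],
  [DocumentStatus.PROCESSING]: [
    DocumentStatus.COMPLETED,
    DocumentStatus.FAILED,
    DocumentStatus.CANCELLED,
  ],
  [DocumentStatus.FAILED]: [DocumentStatus.PROCESSING], // retry
  [DocumentStatus.COMPLETED]: [],
  [DocumentStatus.CANCELLED]: [],
};

function canTransition(from: DocumentStatus, to: DocumentStatus): boolean {
  return ALLOWED_TRANSITIONS[from].includes(to);
}
```

A check like `canTransition(doc.status, DocumentStatus.PROCESSING)` before an update would prevent, for instance, a completed document from being silently reprocessed.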

### Processing Session Model

```typescript
interface ProcessingSession {
  id: string;
  documentId: string;
  strategy: string;
  status: SessionStatus;
  steps: ProcessingStep[];
  totalSteps: number;
  completedSteps: number;
  failedSteps: number;
  processingTimeMs?: number;
  apiCallsCount: number;
  totalCost: number;
  errorLog: ErrorEntry[];
  createdAt: Date;
  completedAt?: Date;
}
```
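
Because `totalSteps`, `completedSteps`, and `failedSteps` duplicate information already present in `steps`, one option is to derive them rather than update them by hand. The sketch below assumes that approach; `summarizeSteps`, `StepSummary`, and `percentComplete` are hypothetical additions.

```typescript
interface ProcessingStep {
  step: string;
  status: 'pending' | 'in_progress' | 'completed' | 'failed';
}

interface StepSummary {
  totalSteps: number;
  completedSteps: number;
  failedSteps: number;
  percentComplete: number; // hypothetical convenience field for progress display
}

function summarizeSteps(steps: ProcessingStep[]): StepSummary {
  const totalSteps = steps.length;
  const completedSteps = steps.filter((s) => s.status === 'completed').length;
  const failedSteps = steps.filter((s) => s.status === 'failed').length;
  // Guard against division by zero for a session with no steps yet.
  const percentComplete =
    totalSteps === 0 ? 0 : Math.round((completedSteps / totalSteps) * 100);
  return { totalSteps, completedSteps, failedSteps, percentComplete };
}
```

Deriving the counters from a single source keeps them from drifting out of sync with the step list after a retry.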

## Error Handling

### Error Categories and Strategies

1. **Validation Errors**
   - UUID format validation
   - File type and size validation
   - Required field validation
   - Strategy: Return 400 Bad Request with detailed error message

2. **Authentication Errors**
   - Invalid or expired tokens
   - Missing authentication
   - Strategy: Return 401 Unauthorized, trigger token refresh

3. **Authorization Errors**
   - Insufficient permissions
   - Resource access denied
   - Strategy: Return 403 Forbidden with clear message

4. **Resource Not Found**
   - Document not found
   - File not found
   - Strategy: Return 404 Not Found

5. **External Service Errors**
   - Firebase Storage errors
   - Document AI failures
   - LLM API errors
   - Strategy: Retry with exponential backoff, fallback options

6. **Processing Errors**
   - Text extraction failures
   - PDF generation errors
   - Database operation failures
   - Strategy: Preserve partial results, enable retry from checkpoint

7. **System Errors**
   - Memory issues
   - Timeout errors
   - Network failures
   - Strategy: Graceful degradation, error logging, monitoring alerts

### Error Response Format

```typescript
interface ErrorResponse {
  success: false;
  error: {
    code: string;
    message: string;
    details?: any;
    correlationId: string;
    timestamp: string;
    retryable: boolean;
  };
}
```
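
The "retry with exponential backoff" strategy for external service errors, together with the `retryable` flag, could be sketched as follows. The function names, the 500 ms base delay, and the 30 s cap are assumptions, jitter is omitted for brevity, and `sleep` is injected so the sketch does not depend on a particular timer API.

```typescript
// Delay before the retry that follows attempt number `attempt` (0-based):
// base * 2^attempt, capped at capMs.
function backoffDelayMs(attempt: number, baseMs = 500, capMs = 30_000): number {
  return Math.min(capMs, baseMs * 2 ** attempt);
}

interface RetryOptions {
  maxAttempts: number; // must be at least 1
  isRetryable: (error: unknown) => boolean;
  sleep: (ms: number) => Promise<void>; // injected; e.g. a setTimeout wrapper
}

async function retryWithBackoff<T>(
  operation: () => Promise<T>,
  options: RetryOptions,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < options.maxAttempts; attempt++) {
    try {
      return await operation();
    } catch (error) {
      lastError = error;
      const isLastAttempt = attempt === options.maxAttempts - 1;
      // Give up immediately on non-retryable errors or when out of attempts.
      if (isLastAttempt || !options.isRetryable(error)) break;
      await options.sleep(backoffDelayMs(attempt));
    }
  }
  throw lastError;
}
```

With the defaults, successive waits are 500 ms, 1 s, 2 s, and so on up to the cap, so a short outage in Document AI or the LLM API is absorbed without hammering the service.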

## Testing Strategy

### Unit Testing
- Service layer testing with mocked dependencies
- Utility function testing
- Configuration validation testing
- Error handling testing

### Integration Testing
- Firebase Storage integration
- Database operations
- External API integrations
- End-to-end processing pipeline

### Error Scenario Testing
- Network failure simulation
- API rate limit testing
- Invalid input handling
- Timeout scenario testing

### Performance Testing
- Large file upload testing
- Concurrent processing testing
- Memory usage monitoring
- API response time testing

## Implementation Phases

### Phase 1: Core Infrastructure Cleanup
1. Fix configuration management and environment variable handling
2. Implement proper UUID validation for database queries
3. Set up comprehensive error handling and logging
4. Fix service dependency issues

### Phase 2: Firebase Storage Integration
1. Implement Firebase Storage service
2. Update file upload endpoints
3. Migrate existing file operations
4. Update frontend integration

### Phase 3: Processing Pipeline Stabilization
1. Fix service orchestration issues
2. Implement proper error recovery
3. Add processing checkpoints
4. Enhance monitoring and logging

### Phase 4: Testing and Optimization
1. Comprehensive testing suite
2. Performance optimization
3. Error scenario testing
4. Documentation updates

## Security Considerations

### Firebase Storage Security
- Proper Firebase Security Rules
- User-based file access control
- File type and size validation
- Secure download URL generation

### API Security
- Request validation and sanitization
- Rate limiting and abuse prevention
- Correlation ID tracking
- Secure error messages (no sensitive data exposure)

### Data Protection
- User data isolation
- Secure file deletion
- Audit logging
- GDPR compliance considerations

## Monitoring and Observability

### Key Metrics
- Document processing success rate
- Average processing time per document
- API response times
- Error rates by category
- Firebase Storage usage
- Database query performance

### Logging Strategy
- Structured logging with correlation IDs
- Error categorization and tracking
- Performance metrics logging
- User activity logging
- External service interaction logging

### Health Checks
- Service availability checks
- Database connectivity
- External service status
- File storage accessibility
- Processing pipeline health

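
As an illustration of how the individual health checks listed above might roll up into one status, the sketch below assumes each check reports a boolean result and is marked critical or not. The `CheckResult` shape, the three status levels, and the decision about which dependencies count as critical are assumptions.

```typescript
type HealthStatus = 'healthy' | 'degraded' | 'unhealthy';

// Hypothetical result of a single check, e.g. "database" or "file-storage".
interface CheckResult {
  name: string;
  ok: boolean;
  critical: boolean;
}

function overallStatus(results: CheckResult[]): HealthStatus {
  // Any failed critical dependency makes the whole service unhealthy.
  if (results.some((r) => !r.ok && r.critical)) return 'unhealthy';
  // Failed non-critical dependencies only degrade the service.
  if (results.some((r) => !r.ok)) return 'degraded';
  return 'healthy';
}
```

Separating critical from non-critical checks lets a health endpoint keep reporting a usable service when, for example, only an optional external API is down.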