cim_summary/.kiro/specs/codebase-cleanup-and-upload-fix/design.md
2025-08-01 15:46:43 -04:00


# Design Document
## Overview
This design covers the cleanup and stabilization of the CIM Document Processor backend, focusing on repairing the processing pipeline and making Firebase Storage the primary file store. It identifies the critical issues in the current codebase and specifies the changes needed for reliable document processing from upload through final PDF generation.
## Architecture
### Current Issues Identified
Based on error analysis and code review, the following critical issues have been identified:
1. **Database Query Issues**: UUID validation errors when non-UUID strings are passed to document queries
2. **Service Dependencies**: Circular dependencies and missing service imports
3. **Firebase Storage Integration**: Incomplete migration from Google Cloud Storage to Firebase Storage
4. **Error Handling**: Insufficient error handling and logging throughout the pipeline
5. **Configuration Management**: Environment variable validation issues in serverless environments
6. **Processing Pipeline**: Broken service orchestration in the document processing flow
### Target Architecture
```
┌─────────────────────────────────────────────────────────────────────────────┐
│                              FRONTEND (React)                               │
├─────────────────────────────────────────────────────────────────────────────┤
│  Firebase Auth + Document Upload → Firebase Storage                         │
└─────────────────────────────────────────────────────────────────────────────┘
                                       ▼ HTTPS API
┌─────────────────────────────────────────────────────────────────────────────┐
│                              BACKEND (Node.js)                              │
├─────────────────────────────────────────────────────────────────────────────┤
│  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐              │
│  │  API Routes     │  │  Middleware     │  │  Error Handler  │              │
│  │  - Documents    │  │  - Auth         │  │  - Global       │              │
│  │  - Monitoring   │  │  - Validation   │  │  - Correlation  │              │
│  │  - Vector       │  │  - CORS         │  │  - Logging      │              │
│  └─────────────────┘  └─────────────────┘  └─────────────────┘              │
│                                                                             │
│  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐              │
│  │  Core Services  │  │  Processing     │  │  External APIs  │              │
│  │  - Document     │  │  - Agentic RAG  │  │  - Document AI  │              │
│  │  - Upload       │  │  - LLM Service  │  │  - Claude AI    │              │
│  │  - Session      │  │  - PDF Gen      │  │  - Firebase     │              │
│  └─────────────────┘  └─────────────────┘  └─────────────────┘              │
└─────────────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────────────┐
│                             STORAGE & DATABASE                              │
├─────────────────────────────────────────────────────────────────────────────┤
│  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐              │
│  │  Firebase       │  │  Supabase       │  │  Vector         │              │
│  │  Storage        │  │  Database       │  │  Database       │              │
│  │  - File Upload  │  │  - Documents    │  │  - Embeddings   │              │
│  │  - Security     │  │  - Sessions     │  │  - Chunks       │              │
│  └─────────────────┘  └─────────────────┘  └─────────────────┘              │
└─────────────────────────────────────────────────────────────────────────────┘
```
## Components and Interfaces
### 1. Enhanced Configuration Management
**Purpose**: Robust environment variable validation and configuration management for both development and production environments.
**Key Features**:
- Graceful handling of missing environment variables in serverless environments
- Runtime configuration validation
- Fallback values for non-critical settings
- Clear error messages for missing critical configuration
**Interface**:
```typescript
interface Config {
  // Core settings
  env: string;
  port: number;
  // Firebase configuration
  firebase: {
    projectId: string;
    storageBucket: string;
    apiKey: string;
    authDomain: string;
  };
  // Database configuration
  supabase: {
    url: string;
    anonKey: string;
    serviceKey: string;
  };
  // External services
  googleCloud: {
    projectId: string;
    documentAiLocation: string;
    documentAiProcessorId: string;
    applicationCredentials: string;
  };
  // LLM configuration
  llm: {
    provider: 'anthropic' | 'openai';
    anthropicApiKey?: string;
    openaiApiKey?: string;
    model: string;
    maxTokens: number;
    temperature: number;
  };
}
```
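
The validation behaviour described above could look roughly like the following. This is a minimal sketch covering only part of `Config`: the environment variable names (`FB_PROJECT_ID`, `FB_STORAGE_BUCKET`, `LLM_MAX_TOKENS`, and so on), the fallback values, and the `ConfigError` class are assumptions for illustration, not names taken from the codebase.

```typescript
// Sketch only: variable names and fallback values below are assumptions.
type Env = Record<string, string | undefined>;

interface CoreConfig {
  env: string;
  port: number;
  firebase: { projectId: string; storageBucket: string };
  llm: { provider: 'anthropic' | 'openai'; maxTokens: number; temperature: number };
}

class ConfigError extends Error {
  readonly missing: string[];
  constructor(missing: string[]) {
    super(`Missing required configuration: ${missing.join(', ')}`);
    this.name = 'ConfigError';
    this.missing = missing;
  }
}

function loadConfig(env: Env): CoreConfig {
  const missing: string[] = [];

  // Critical settings: record every absent name so all gaps are reported at once.
  const required = (name: string): string => {
    const value = env[name];
    if (value === undefined || value.trim() === '') {
      missing.push(name);
      return '';
    }
    return value;
  };

  // Non-critical numeric settings: fall back when unset or unparsable.
  const numberOr = (name: string, fallback: number): number => {
    const raw = env[name];
    if (raw === undefined || raw.trim() === '') return fallback;
    const parsed = Number(raw);
    return Number.isFinite(parsed) ? parsed : fallback;
  };

  const config: CoreConfig = {
    env: env['NODE_ENV'] ?? 'development',
    port: numberOr('PORT', 8080),
    firebase: {
      projectId: required('FB_PROJECT_ID'),
      storageBucket: required('FB_STORAGE_BUCKET'),
    },
    llm: {
      provider: env['LLM_PROVIDER'] === 'openai' ? 'openai' : 'anthropic',
      maxTokens: numberOr('LLM_MAX_TOKENS', 4000),
      temperature: numberOr('LLM_TEMPERATURE', 0.1),
    },
  };

  if (missing.length > 0) throw new ConfigError(missing);
  return config;
}
```

Taking the environment as a parameter keeps the loader testable; in a Node.js process it would typically be called once at startup as `loadConfig(process.env)`.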
### 2. Firebase Storage Service
**Purpose**: Complete Firebase Storage integration replacing Google Cloud Storage for file operations.
**Key Features**:
- Secure file upload with Firebase Authentication
- Proper file organization and naming conventions
- File metadata management
- Download URL generation
- File cleanup and lifecycle management
**Interface**:
```typescript
interface FirebaseStorageService {
  uploadFile(file: Buffer, fileName: string, userId: string): Promise<string>;
  getDownloadUrl(filePath: string): Promise<string>;
  deleteFile(filePath: string): Promise<void>;
  getFileMetadata(filePath: string): Promise<FileMetadata>;
  generateUploadUrl(fileName: string, userId: string): Promise<string>;
}
```
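
For the file organization and naming conventions, one possible approach is a helper that derives a per-user storage path from the uploaded file name. The `uploads/{userId}/{timestamp}-{name}` layout and the `userId` character check are assumed conventions chosen for this sketch; the design does not prescribe them.

```typescript
// Sketch only: the path layout and the allowed userId characters are assumptions.
function sanitizeFileName(fileName: string): string {
  // Drop any directory components, then replace characters that are awkward in paths.
  const base = fileName.split(/[\\/]/).pop() ?? '';
  const cleaned = base.replace(/[^A-Za-z0-9._-]+/g, '_').replace(/^\.+/, '');
  return cleaned.length > 0 ? cleaned : 'unnamed';
}

function buildStoragePath(userId: string, fileName: string, now: Date = new Date()): string {
  if (!/^[A-Za-z0-9_-]+$/.test(userId)) {
    throw new Error('userId contains characters that are unsafe in a storage path');
  }
  // Prefixing with a timestamp keeps repeated uploads of the same file distinct.
  return `uploads/${userId}/${now.getTime()}-${sanitizeFileName(fileName)}`;
}
```

Scoping every object under the owner's `userId` also makes user-based access rules straightforward to express, since the path alone identifies the owner.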
### 3. Enhanced Document Service
**Purpose**: Centralized document management with proper error handling and validation.
**Key Features**:
- UUID validation for all document operations
- Proper error handling for database operations
- Document lifecycle management
- Status tracking and updates
- Metadata management
**Interface**:
```typescript
interface DocumentService {
  createDocument(data: CreateDocumentData): Promise<Document>;
  getDocument(id: string): Promise<Document | null>;
  updateDocument(id: string, updates: Partial<Document>): Promise<Document>;
  deleteDocument(id: string): Promise<void>;
  listDocuments(userId: string, filters?: DocumentFilters): Promise<Document[]>;
  validateDocumentId(id: string): boolean;
}
```
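
A minimal sketch of `validateDocumentId`, assuming document IDs use the canonical 8-4-4-4-12 hexadecimal UUID text form. The `assertDocumentId` guard and `InvalidDocumentIdError` are hypothetical helpers showing how a route such as `/documents/analytics` could be rejected before its path segment reaches a UUID-typed database column.

```typescript
// Accepts the canonical 8-4-4-4-12 hexadecimal form, any UUID version.
const UUID_PATTERN = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;

function validateDocumentId(id: string): boolean {
  return UUID_PATTERN.test(id);
}

// Hypothetical error type; a real service would map it to a 400 response.
class InvalidDocumentIdError extends Error {
  constructor(id: string) {
    super(`Invalid document ID format: ${JSON.stringify(id)}`);
    this.name = 'InvalidDocumentIdError';
  }
}

// Guard to call before any database query that filters on a document ID.
function assertDocumentId(id: string): string {
  if (!validateDocumentId(id)) throw new InvalidDocumentIdError(id);
  return id.toLowerCase();
}
```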
### 4. Improved Processing Pipeline
**Purpose**: Reliable document processing pipeline with proper error handling and recovery.
**Key Features**:
- Step-by-step processing with checkpoints
- Error recovery and retry mechanisms
- Progress tracking and status updates
- Partial result preservation
- Processing timeout handling
**Interface**:
```typescript
interface ProcessingPipeline {
  processDocument(documentId: string, options: ProcessingOptions): Promise<ProcessingResult>;
  getProcessingStatus(documentId: string): Promise<ProcessingStatus>;
  retryProcessing(documentId: string, fromStep?: string): Promise<ProcessingResult>;
  cancelProcessing(documentId: string): Promise<void>;
}
```
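
Checkpointing and retry-from-step could be sketched as below. This is an illustration under simplifying assumptions: each step maps a string to a string, checkpoints live in an in-memory `Map` rather than the database, and the class and type names (`CheckpointedPipeline`, `PipelineStep`, `RunResult`) are hypothetical.

```typescript
// Sketch only: in-memory checkpoints and string payloads are simplifying assumptions.
interface PipelineStep {
  name: string;
  run: (input: string) => Promise<string>;
}

interface RunResult {
  success: boolean;
  completedSteps: string[];
  failedStep?: string;
  error?: string;
  output?: string;
}

class CheckpointedPipeline {
  private readonly steps: PipelineStep[];
  // documentId -> (step name -> output of that step)
  private readonly checkpoints = new Map<string, Map<string, string>>();

  constructor(steps: PipelineStep[]) {
    this.steps = steps;
  }

  async process(documentId: string, input: string, fromStep?: string): Promise<RunResult> {
    const saved = this.checkpoints.get(documentId) ?? new Map<string, string>();
    this.checkpoints.set(documentId, saved);

    let startIndex = 0;
    if (fromStep !== undefined) {
      startIndex = this.steps.findIndex((s) => s.name === fromStep);
      if (startIndex < 0) throw new Error(`Unknown step: ${fromStep}`);
    }

    // When resuming, start from the preserved output of the preceding step.
    let current = input;
    if (startIndex > 0) {
      const previous = saved.get(this.steps[startIndex - 1].name);
      if (previous === undefined) throw new Error(`No checkpoint before step: ${fromStep}`);
      current = previous;
    }

    const completed = this.steps.slice(0, startIndex).map((s) => s.name);
    for (const step of this.steps.slice(startIndex)) {
      try {
        current = await step.run(current);
        saved.set(step.name, current); // checkpoint: partial result survives later failures
        completed.push(step.name);
      } catch (err) {
        const message = err instanceof Error ? err.message : String(err);
        return { success: false, completedSteps: completed, failedStep: step.name, error: message };
      }
    }
    return { success: true, completedSteps: completed, output: current };
  }
}
```

A failed run reports `failedStep`, which the caller can pass back as `fromStep` so that earlier, already-completed steps are not repeated.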
### 5. Robust Error Handling System
**Purpose**: Comprehensive error handling with correlation tracking and proper logging.
**Key Features**:
- Correlation ID generation for request tracking
- Structured error logging
- Error categorization and handling strategies
- User-friendly error messages
- Error recovery mechanisms
**Interface**:
```typescript
interface ErrorHandler {
  handleError(error: Error, context: ErrorContext): ErrorResponse;
  logError(error: Error, correlationId: string, context: any): void;
  createCorrelationId(): string;
  categorizeError(error: Error): ErrorCategory;
}
```
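
A possible shape for correlation IDs and structured logging is sketched below. It assumes UUIDs from Node's built-in `crypto.randomUUID` serve as correlation IDs and that errors carry an explicit category tag; the `AppError` class, the category names, and the `formatLogEntry` helper are hypothetical.

```typescript
import { randomUUID } from 'node:crypto';

// Sketch only: category names and AppError are assumptions for illustration.
const ERROR_CATEGORIES = [
  'validation', 'authentication', 'authorization', 'not_found',
  'external_service', 'processing', 'system',
] as const;
type ErrorCategory = typeof ERROR_CATEGORIES[number];

class AppError extends Error {
  readonly category: ErrorCategory;
  constructor(message: string, category: ErrorCategory) {
    super(message);
    this.name = 'AppError';
    this.category = category;
  }
}

function createCorrelationId(): string {
  return randomUUID();
}

function isErrorCategory(value: unknown): value is ErrorCategory {
  return typeof value === 'string' && (ERROR_CATEGORIES as readonly string[]).indexOf(value) !== -1;
}

// Untagged errors are treated as system errors.
function categorizeError(error: Error): ErrorCategory {
  const tag: unknown = (error as Error & { category?: unknown }).category;
  return isErrorCategory(tag) ? tag : 'system';
}

// One JSON object per log line, so entries can be filtered by correlationId.
function formatLogEntry(error: Error, correlationId: string, context: Record<string, unknown>): string {
  return JSON.stringify({
    level: 'error',
    correlationId,
    category: categorizeError(error),
    message: error.message,
    context,
    timestamp: new Date().toISOString(),
  });
}

function logError(error: Error, correlationId: string, context: Record<string, unknown>): void {
  console.error(formatLogEntry(error, correlationId, context));
}
```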
## Data Models
### Enhanced Document Model
```typescript
interface Document {
  id: string; // UUID
  userId: string;
  originalFileName: string;
  filePath: string; // Firebase Storage path
  fileSize: number;
  mimeType: string;
  status: DocumentStatus;
  extractedText?: string;
  generatedSummary?: string;
  summaryPdfPath?: string;
  analysisData?: CIMReview;
  processingSteps: ProcessingStep[];
  errorLog?: ErrorEntry[];
  createdAt: Date;
  updatedAt: Date;
}

enum DocumentStatus {
  UPLOADED = 'uploaded',
  PROCESSING = 'processing',
  COMPLETED = 'completed',
  FAILED = 'failed',
  CANCELLED = 'cancelled'
}

interface ProcessingStep {
  step: string;
  status: 'pending' | 'in_progress' | 'completed' | 'failed';
  startedAt?: Date;
  completedAt?: Date;
  error?: string;
  metadata?: Record<string, any>;
}
```
### Processing Session Model
```typescript
interface ProcessingSession {
  id: string;
  documentId: string;
  strategy: string;
  status: SessionStatus;
  steps: ProcessingStep[];
  totalSteps: number;
  completedSteps: number;
  failedSteps: number;
  processingTimeMs?: number;
  apiCallsCount: number;
  totalCost: number;
  errorLog: ErrorEntry[];
  createdAt: Date;
  completedAt?: Date;
}
```
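
Because `totalSteps`, `completedSteps`, and `failedSteps` duplicate information already present in `steps`, one way to keep them consistent is to derive them rather than update them by hand. The helper below is a sketch; `countSteps` and the `percentComplete` field are hypothetical additions, not part of the model above.

```typescript
type StepStatus = 'pending' | 'in_progress' | 'completed' | 'failed';

interface StepSnapshot {
  step: string;
  status: StepStatus;
}

interface StepCounters {
  totalSteps: number;
  completedSteps: number;
  failedSteps: number;
  percentComplete: number; // hypothetical convenience field for progress display
}

// Derives the session counters from the step list so they cannot drift apart.
function countSteps(steps: StepSnapshot[]): StepCounters {
  const count = (status: StepStatus): number => steps.filter((s) => s.status === status).length;
  const totalSteps = steps.length;
  const completedSteps = count('completed');
  return {
    totalSteps,
    completedSteps,
    failedSteps: count('failed'),
    percentComplete: totalSteps === 0 ? 0 : Math.round((completedSteps / totalSteps) * 100),
  };
}
```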
## Error Handling
### Error Categories and Strategies
1. **Validation Errors**
   - UUID format validation
   - File type and size validation
   - Required field validation
   - Strategy: Return 400 Bad Request with detailed error message
2. **Authentication Errors**
   - Invalid or expired tokens
   - Missing authentication
   - Strategy: Return 401 Unauthorized, trigger token refresh
3. **Authorization Errors**
   - Insufficient permissions
   - Resource access denied
   - Strategy: Return 403 Forbidden with clear message
4. **Resource Not Found**
   - Document not found
   - File not found
   - Strategy: Return 404 Not Found
5. **External Service Errors**
   - Firebase Storage errors
   - Document AI failures
   - LLM API errors
   - Strategy: Retry with exponential backoff, fallback options
6. **Processing Errors**
   - Text extraction failures
   - PDF generation errors
   - Database operation failures
   - Strategy: Preserve partial results, enable retry from checkpoint
7. **System Errors**
   - Memory issues
   - Timeout errors
   - Network failures
   - Strategy: Graceful degradation, error logging, monitoring alerts
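
The exponential backoff strategy for external service errors can be sketched as follows. The option names, the injectable `sleep` function, and the `isRetryable` predicate are assumptions made for this example; suitable attempt counts and delays would depend on each service's rate limits.

```typescript
interface RetryOptions {
  maxAttempts: number;
  baseDelayMs: number;
  maxDelayMs: number;
  sleep?: (ms: number) => Promise<void>;      // injectable so tests need not wait
  isRetryable?: (error: unknown) => boolean;  // defaults to retrying every error
}

const defaultSleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

async function retryWithBackoff<T>(operation: () => Promise<T>, options: RetryOptions): Promise<T> {
  if (options.maxAttempts < 1) throw new RangeError('maxAttempts must be at least 1');
  const sleep = options.sleep ?? defaultSleep;
  const isRetryable = options.isRetryable ?? (() => true);

  let lastError: unknown;
  for (let attempt = 1; attempt <= options.maxAttempts; attempt++) {
    try {
      return await operation();
    } catch (error) {
      lastError = error;
      if (attempt === options.maxAttempts || !isRetryable(error)) break;
      // Delay doubles after each failure: base, 2*base, 4*base, ..., capped at maxDelayMs.
      const delay = Math.min(options.baseDelayMs * 2 ** (attempt - 1), options.maxDelayMs);
      await sleep(delay);
    }
  }
  throw lastError;
}
```

Passing an `isRetryable` predicate lets validation or authorization failures surface immediately while transient network or rate-limit errors are retried.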
### Error Response Format
```typescript
interface ErrorResponse {
  success: false;
  error: {
    code: string;
    message: string;
    details?: any;
    correlationId: string;
    timestamp: string;
    retryable: boolean;
  };
}
```
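
A response in this format could be assembled from an error category as sketched below. The 400, 401, 403, and 404 mappings follow the strategies listed earlier; the 502 and 500 statuses, the upper-cased `code` values, and the choice of which categories are `retryable` are assumptions of this example.

```typescript
type ErrorCategory =
  | 'validation' | 'authentication' | 'authorization' | 'not_found'
  | 'external_service' | 'processing' | 'system';

interface ErrorResponse {
  success: false;
  error: {
    code: string;
    message: string;
    details?: unknown;
    correlationId: string;
    timestamp: string;
    retryable: boolean;
  };
}

// Statuses for the last three categories are assumed; the design does not fix them.
const STATUS_BY_CATEGORY: Record<ErrorCategory, number> = {
  validation: 400,
  authentication: 401,
  authorization: 403,
  not_found: 404,
  external_service: 502,
  processing: 500,
  system: 500,
};

// Assumption: only categories with a documented retry strategy are flagged retryable.
const RETRYABLE: Record<ErrorCategory, boolean> = {
  validation: false,
  authentication: false,
  authorization: false,
  not_found: false,
  external_service: true,
  processing: true,
  system: false,
};

function toErrorResponse(
  category: ErrorCategory,
  message: string,
  correlationId: string,
  now: Date = new Date(),
): { status: number; body: ErrorResponse } {
  return {
    status: STATUS_BY_CATEGORY[category],
    body: {
      success: false,
      error: {
        code: category.toUpperCase(),
        message,
        correlationId,
        timestamp: now.toISOString(),
        retryable: RETRYABLE[category],
      },
    },
  };
}
```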
## Testing Strategy
### Unit Testing
- Service layer testing with mocked dependencies
- Utility function testing
- Configuration validation testing
- Error handling testing
### Integration Testing
- Firebase Storage integration
- Database operations
- External API integrations
- End-to-end processing pipeline
### Error Scenario Testing
- Network failure simulation
- API rate limit testing
- Invalid input handling
- Timeout scenario testing
### Performance Testing
- Large file upload testing
- Concurrent processing testing
- Memory usage monitoring
- API response time testing
## Implementation Phases
### Phase 1: Core Infrastructure Cleanup
1. Fix configuration management and environment variable handling
2. Implement proper UUID validation for database queries
3. Set up comprehensive error handling and logging
4. Fix service dependency issues
### Phase 2: Firebase Storage Integration
1. Implement Firebase Storage service
2. Update file upload endpoints
3. Migrate existing file operations
4. Update frontend integration
### Phase 3: Processing Pipeline Stabilization
1. Fix service orchestration issues
2. Implement proper error recovery
3. Add processing checkpoints
4. Enhance monitoring and logging
### Phase 4: Testing and Optimization
1. Build a comprehensive test suite
2. Optimize performance
3. Test error scenarios
4. Update documentation
## Security Considerations
### Firebase Storage Security
- Proper Firebase Security Rules
- User-based file access control
- File type and size validation
- Secure download URL generation
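
File type and size validation on the server side might look like the sketch below. The 50 MiB limit and the PDF-only allow-list are assumed values for illustration; the design does not state the actual limits. The content check relies on the fact that PDF files begin with the ASCII bytes `%PDF-`.

```typescript
// Sketch only: the size limit and allowed types are assumptions.
const MAX_UPLOAD_BYTES = 50 * 1024 * 1024;
const ALLOWED_MIME_TYPES = new Set<string>(['application/pdf']);
const PDF_MAGIC = [0x25, 0x50, 0x44, 0x46, 0x2d]; // "%PDF-"

// Returns a list of problems; an empty list means the upload is acceptable.
function validateUpload(bytes: Uint8Array, mimeType: string): string[] {
  const problems: string[] = [];

  if (!ALLOWED_MIME_TYPES.has(mimeType)) {
    problems.push(`unsupported file type: ${mimeType}`);
  }
  if (bytes.length === 0) {
    problems.push('file is empty');
  } else if (bytes.length > MAX_UPLOAD_BYTES) {
    problems.push(`file exceeds ${MAX_UPLOAD_BYTES} bytes`);
  }
  // Do not trust the declared type alone: check the leading bytes as well.
  if (mimeType === 'application/pdf' && !PDF_MAGIC.every((b, i) => bytes[i] === b)) {
    problems.push('content does not start with a PDF header');
  }
  return problems;
}
```

Returning every problem at once, rather than failing on the first, fits the "400 Bad Request with detailed error message" strategy for validation errors.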
### API Security
- Request validation and sanitization
- Rate limiting and abuse prevention
- Correlation ID tracking
- Secure error messages (no sensitive data exposure)
### Data Protection
- User data isolation
- Secure file deletion
- Audit logging
- GDPR compliance considerations
## Monitoring and Observability
### Key Metrics
- Document processing success rate
- Average processing time per document
- API response times
- Error rates by category
- Firebase Storage usage
- Database query performance
### Logging Strategy
- Structured logging with correlation IDs
- Error categorization and tracking
- Performance metrics logging
- User activity logging
- External service interaction logging
### Health Checks
- Service availability checks
- Database connectivity
- External service status
- File storage accessibility
- Processing pipeline health
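
The health checks listed above could be aggregated by a small runner like the one sketched here. Each check is modelled as an async function that resolves when the dependency is reachable and rejects otherwise; the check names, the 2000 ms default timeout, and the result shape are assumptions of this example.

```typescript
type HealthStatus = 'healthy' | 'unhealthy';
type HealthCheck = () => Promise<void>;

interface CheckResult {
  name: string;
  status: HealthStatus;
  latencyMs: number;
  error?: string;
}

// Rejects if the promise does not settle within timeoutMs; always clears its timer.
function withTimeout<T>(promise: Promise<T>, timeoutMs: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error(`timed out after ${timeoutMs} ms`)), timeoutMs);
    promise.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (error) => { clearTimeout(timer); reject(error); },
    );
  });
}

async function runOne(name: string, check: HealthCheck, timeoutMs: number): Promise<CheckResult> {
  const started = Date.now();
  try {
    // Promise.resolve().then(check) also converts a synchronous throw into a rejection.
    await withTimeout(Promise.resolve().then(check), timeoutMs);
    return { name, status: 'healthy', latencyMs: Date.now() - started };
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    return { name, status: 'unhealthy', latencyMs: Date.now() - started, error: message };
  }
}

// Runs all checks in parallel; the overall status is healthy only if every check passes.
async function runHealthChecks(
  checks: Record<string, HealthCheck>,
  timeoutMs: number = 2000,
): Promise<{ status: HealthStatus; checks: CheckResult[] }> {
  const names = Object.keys(checks);
  const results = await Promise.all(names.map((name) => runOne(name, checks[name], timeoutMs)));
  const status: HealthStatus = results.every((r) => r.status === 'healthy') ? 'healthy' : 'unhealthy';
  return { status, checks: results };
}
```

Because a failing or slow dependency produces an `unhealthy` entry instead of an exception, a health endpoint built on this can always answer, which is useful when the failure being diagnosed is in one of the dependencies.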