# Design Document

## Overview

The CIM Document Processor is a web-based application that enables authenticated team members to upload large PDF documents (CIMs), have them analyzed by an LLM using a structured template, and download the results in both Markdown and PDF formats. The system follows a modern web architecture with secure authentication, robust file processing, and comprehensive admin oversight.

## Architecture

### High-Level Architecture

```mermaid
graph TB
    subgraph "Frontend Layer"
        UI[React Web Application]
        Auth[Authentication UI]
        Upload[File Upload Interface]
        Dashboard[User Dashboard]
        Admin[Admin Panel]
    end

    subgraph "Backend Layer"
        API[Express.js API Server]
        AuthM[Authentication Middleware]
        FileH[File Handler Service]
        LLMS[LLM Processing Service]
        PDF[PDF Generation Service]
    end

    subgraph "Data Layer"
        DB[(PostgreSQL Database)]
        FileStore["File Storage (AWS S3/Local)"]
        Cache[Redis Cache]
    end

    subgraph "External Services"
        LLM["LLM API (OpenAI/Anthropic)"]
        PDFLib[PDF Processing Library]
    end

    UI --> API
    Auth --> AuthM
    Upload --> FileH
    Dashboard --> API
    Admin --> API

    API --> DB
    API --> FileStore
    API --> Cache

    FileH --> FileStore
    LLMS --> LLM
    PDF --> PDFLib

    API --> LLMS
    API --> PDF
```

### Technology Stack

**Frontend:**
- React 18 with TypeScript
- Tailwind CSS for styling
- React Router for navigation
- Axios for API communication
- React Query for state management and caching

**Backend:**
- Node.js with Express.js
- TypeScript for type safety
- JWT for authentication
- Multer for file uploads
- Bull Queue for background job processing

**Database:**
- PostgreSQL for primary data storage
- Redis for session management and job queues

**File Processing:**
- pdf-parse for text extraction
- Puppeteer for PDF generation from Markdown
- AWS S3 or local file system for file storage

**LLM Integration:**
- OpenAI API or Anthropic Claude API
- Configurable model selection
- Token management and rate limiting

## Components and Interfaces

### Frontend Components

#### Authentication Components
- `LoginForm`: Handles user login with validation
- `AuthGuard`: Protects routes requiring authentication
- `SessionManager`: Manages user session state

#### Upload Components
- `FileUploader`: Drag-and-drop PDF upload with progress
- `UploadValidator`: Client-side file validation
- `UploadProgress`: Real-time upload status display

#### Dashboard Components
- `DocumentList`: Displays user's uploaded documents
- `DocumentCard`: Individual document status and actions
- `ProcessingStatus`: Real-time processing updates
- `DownloadButtons`: Markdown and PDF download options

#### Admin Components
- `AdminDashboard`: Overview of all system documents
- `UserManagement`: User account management
- `DocumentArchive`: System-wide document access
- `SystemMetrics`: Storage and processing statistics

### Backend Services

#### Authentication Service
```typescript
interface AuthService {
  login(credentials: LoginCredentials): Promise
  validateToken(token: string): Promise
  logout(userId: string): Promise
  refreshToken(refreshToken: string): Promise
}
```

#### Document Service
```typescript
interface DocumentService {
  uploadDocument(file: File, userId: string): Promise
  getDocuments(userId: string): Promise
  getDocument(documentId: string): Promise
  deleteDocument(documentId: string): Promise
  updateDocumentStatus(documentId: string, status: ProcessingStatus): Promise
}
```

#### LLM Processing Service
```typescript
interface LLMService {
  processDocument(documentId: string, extractedText: string): Promise
  regenerateWithFeedback(documentId: string, feedback: string): Promise
  validateOutput(output: string): Promise
}
```

#### PDF Service
```typescript
interface PDFService {
  extractText(filePath: string): Promise
  generatePDF(markdown: string): Promise
  validatePDF(filePath: string): Promise
}
```

## Data Models

### User Model
```typescript
interface User {
  id: string
  email: string
  name: string
  role: 'user' | 'admin'
  createdAt: Date
  updatedAt: Date
}
```

### Document Model
```typescript
interface Document {
  id: string
  userId: string
  originalFileName: string
  filePath: string
  fileSize: number
  uploadedAt: Date
  status: ProcessingStatus
  extractedText?: string
  generatedSummary?: string
  summaryMarkdownPath?: string
  summaryPdfPath?: string
  processingStartedAt?: Date
  processingCompletedAt?: Date
  errorMessage?: string
  feedback?: DocumentFeedback[]
  versions: DocumentVersion[]
}

type ProcessingStatus =
  | 'uploaded'
  | 'extracting_text'
  | 'processing_llm'
  | 'generating_pdf'
  | 'completed'
  | 'failed'
```

### Document Feedback Model
```typescript
interface DocumentFeedback {
  id: string
  documentId: string
  userId: string
  feedback: string
  regenerationInstructions?: string
  createdAt: Date
}
```

### Document Version Model
```typescript
interface DocumentVersion {
  id: string
  documentId: string
  versionNumber: number
  summaryMarkdown: string
  summaryPdfPath: string
  createdAt: Date
  feedback?: string
}
```

### Processing Job Model
```typescript
interface ProcessingJob {
  id: string
  documentId: string
  type: 'text_extraction' | 'llm_processing' | 'pdf_generation'
  status: 'pending' | 'processing' | 'completed' | 'failed'
  progress: number
  errorMessage?: string
  createdAt: Date
  startedAt?: Date
  completedAt?: Date
}
```

## Error Handling

### Frontend Error Handling
- Global error boundary for React components
- Toast notifications for user-facing errors
- Retry mechanisms for failed API calls
- Graceful degradation for offline scenarios

### Backend Error Handling
- Centralized error middleware
- Structured error logging with Winston
- Error categorization (validation, processing, system)
- Automatic retry for transient failures

### File Processing Error Handling
- PDF validation before processing
- Text extraction fallback mechanisms
- LLM API timeout and retry logic
- Cleanup of failed uploads and partial processing

### Error Types
```typescript
enum ErrorType {
  VALIDATION_ERROR = 'validation_error',
  AUTHENTICATION_ERROR = 'authentication_error',
  FILE_PROCESSING_ERROR = 'file_processing_error',
  LLM_PROCESSING_ERROR = 'llm_processing_error',
  STORAGE_ERROR = 'storage_error',
  SYSTEM_ERROR = 'system_error'
}
```

## Testing Strategy

### Unit Testing
- Jest for JavaScript/TypeScript testing
- React Testing Library for component testing
- Supertest for API endpoint testing
- Mock LLM API responses for consistent testing

### Integration Testing
- Database integration tests with test containers
- File upload and processing workflow tests
- Authentication flow testing
- PDF generation and download testing

### End-to-End Testing
- Playwright for browser automation
- Complete user workflows (upload → process → download)
- Admin functionality testing
- Error scenario testing

### Performance Testing
- Load testing for file uploads
- LLM processing performance benchmarks
- Database query optimization testing
- Memory usage monitoring during PDF processing

### Security Testing
- Authentication and authorization testing
- File upload security validation
- SQL injection prevention testing
- XSS and CSRF protection verification

## LLM Integration Design

### Prompt Engineering

The system will use a two-part prompt structure:

**Part 1: CIM Data Extraction**
- Provide the BPCP CIM Review Template
- Instruct LLM to populate only from CIM content
- Use "Not specified in CIM" for missing information
- Maintain strict markdown formatting

**Part 2: Investment Analysis**
- Add "Key Investment Considerations & Diligence Areas" section
- Allow use of general industry knowledge
- Focus on investment-specific insights and risks

### Token Management
- Document chunking for large PDFs (>100 pages)
- Token counting and optimization
- Fallback to smaller context windows if needed
- Cost tracking and monitoring

### Output Validation
- Markdown syntax validation
- Template structure verification
- Content completeness checking
- Retry mechanism for malformed outputs

## Security Considerations

### Authentication & Authorization
- JWT tokens with short expiration times
- Refresh token rotation
- Role-based access control (user/admin)
- Session management with Redis

### File Security
- File type validation (PDF only)
- File size limits (100MB max)
- Virus scanning integration
- Secure file storage with access controls

### Data Protection
- Encryption at rest for sensitive documents
- HTTPS enforcement for all communications
- Input sanitization and validation
- Audit logging for admin actions

### API Security
- Rate limiting on all endpoints
- CORS configuration
- Request size limits
- API key management for LLM services

## Performance Optimization

### File Processing
- Asynchronous processing with job queues
- Progress tracking and status updates
- Parallel processing for multiple documents
- Efficient PDF text extraction

### Database Optimization
- Proper indexing on frequently queried fields
- Connection pooling
- Query optimization
- Database migrations management

### Caching Strategy
- Redis caching for user sessions
- Document metadata caching
- LLM response caching for similar content
- Static asset caching

### Scalability Considerations
- Horizontal scaling capability
- Load balancing for multiple instances
- Database read replicas
- CDN for static assets and downloads
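
The status tracking that the asynchronous pipeline relies on can be illustrated with a minimal sketch; it is not the system's implementation. It assumes the `ProcessingStatus` values advance in the order they are listed in the Document model and that any in-flight stage may move to `failed`. The names `PIPELINE`, `nextStatus`, and `canTransition` are hypothetical and do not appear elsewhere in this design.

```typescript
// Status values copied from the Document model in this design.
type ProcessingStatus =
  | 'uploaded'
  | 'extracting_text'
  | 'processing_llm'
  | 'generating_pdf'
  | 'completed'
  | 'failed';

// Assumption: the happy path follows the order in which the statuses are listed.
const PIPELINE: readonly ProcessingStatus[] = [
  'uploaded',
  'extracting_text',
  'processing_llm',
  'generating_pdf',
  'completed',
];

// Hypothetical helper: the next happy-path status, or null when the
// current status is terminal ('completed' or 'failed').
function nextStatus(current: ProcessingStatus): ProcessingStatus | null {
  const i = PIPELINE.indexOf(current);
  if (i === -1 || i === PIPELINE.length - 1) return null;
  return PIPELINE[i + 1] ?? null;
}

// Hypothetical helper: a transition is allowed if it is the next
// happy-path step, or a move to 'failed' from any non-terminal status.
function canTransition(from: ProcessingStatus, to: ProcessingStatus): boolean {
  if (from === 'completed' || from === 'failed') return false;
  return to === 'failed' || nextStatus(from) === to;
}
```

A guard like this could sit in front of `updateDocumentStatus` so that a late or duplicated job cannot move a document backwards, for example from `completed` to `processing_llm`.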