cim_summary/.kiro/specs/cim-document-processor/design.md
Jon 5a3c961bfc feat: Complete implementation of Tasks 1-5 - CIM Document Processor
2025-07-27 13:29:26 -04:00


Design Document

Overview

The CIM Document Processor is a web application that lets authenticated team members upload large PDF documents (Confidential Information Memoranda, or CIMs), have an LLM analyze them against a structured template, and download the results as Markdown and PDF. It consists of a React front end, an Express.js API with JWT authentication, background processing of uploaded files through a job queue, and an admin panel for oversight of all documents and users.

Architecture

High-Level Architecture

graph TB
    subgraph "Frontend Layer"
        UI[React Web Application]
        Auth[Authentication UI]
        Upload[File Upload Interface]
        Dashboard[User Dashboard]
        Admin[Admin Panel]
    end
    
    subgraph "Backend Layer"
        API[Express.js API Server]
        AuthM[Authentication Middleware]
        FileH[File Handler Service]
        LLMS[LLM Processing Service]
        PDF[PDF Generation Service]
    end
    
    subgraph "Data Layer"
        DB[(PostgreSQL Database)]
        FileStore["File Storage (AWS S3/Local)"]
        Cache[Redis Cache]
    end
    
    subgraph "External Services"
        LLM["LLM API (OpenAI/Anthropic)"]
        PDFLib[PDF Processing Library]
    end
    
    UI --> API
    Auth --> AuthM
    Upload --> FileH
    Dashboard --> API
    Admin --> API
    
    API --> DB
    API --> FileStore
    API --> Cache
    
    FileH --> FileStore
    LLMS --> LLM
    PDF --> PDFLib
    
    API --> LLMS
    API --> PDF

Technology Stack

Frontend:

  • React 18 with TypeScript
  • Tailwind CSS for styling
  • React Router for navigation
  • Axios for API communication
  • React Query for server-state management and caching

Backend:

  • Node.js with Express.js
  • TypeScript for type safety
  • JWT for authentication
  • Multer for file uploads
  • Bull (Redis-backed queue) for background job processing

Database:

  • PostgreSQL for primary data storage
  • Redis for session management and job queues

File Processing:

  • pdf-parse for text extraction
  • Puppeteer for PDF generation from Markdown
  • AWS S3 or local file system for file storage

LLM Integration:

  • OpenAI API or Anthropic Claude API
  • Configurable model selection
  • Token management and rate limiting

Components and Interfaces

Frontend Components

Authentication Components

  • LoginForm: Handles user login with validation
  • AuthGuard: Protects routes requiring authentication
  • SessionManager: Manages user session state

Upload Components

  • FileUploader: Drag-and-drop PDF upload with progress
  • UploadValidator: Client-side file validation
  • UploadProgress: Real-time upload status display

Dashboard Components

  • DocumentList: Displays user's uploaded documents
  • DocumentCard: Individual document status and actions
  • ProcessingStatus: Real-time processing updates
  • DownloadButtons: Markdown and PDF download options

Admin Components

  • AdminDashboard: Overview of all system documents
  • UserManagement: User account management
  • DocumentArchive: System-wide document access
  • SystemMetrics: Storage and processing statistics

Backend Services

Authentication Service

interface AuthService {
  login(credentials: LoginCredentials): Promise<AuthResult>
  validateToken(token: string): Promise<User>
  logout(userId: string): Promise<void>
  refreshToken(refreshToken: string): Promise<AuthResult>
}

Document Service

interface DocumentService {
  // Server-side upload as received from Multer (not the browser File type)
  uploadDocument(file: Express.Multer.File, userId: string): Promise<Document>
  getDocuments(userId: string): Promise<Document[]>
  getDocument(documentId: string): Promise<Document>
  deleteDocument(documentId: string): Promise<void>
  updateDocumentStatus(documentId: string, status: ProcessingStatus): Promise<void>
}

LLM Processing Service

interface LLMService {
  processDocument(documentId: string, extractedText: string): Promise<ProcessingResult>
  regenerateWithFeedback(documentId: string, feedback: string): Promise<ProcessingResult>
  validateOutput(output: string): Promise<ValidationResult>
}

PDF Service

interface PDFService {
  extractText(filePath: string): Promise<string>
  generatePDF(markdown: string): Promise<Buffer>
  validatePDF(filePath: string): Promise<boolean>
}

Data Models

User Model

interface User {
  id: string
  email: string
  name: string
  role: 'user' | 'admin'
  createdAt: Date
  updatedAt: Date
}

Document Model

interface Document {
  id: string
  userId: string
  originalFileName: string
  filePath: string
  fileSize: number
  uploadedAt: Date
  status: ProcessingStatus
  extractedText?: string
  generatedSummary?: string
  summaryMarkdownPath?: string
  summaryPdfPath?: string
  processingStartedAt?: Date
  processingCompletedAt?: Date
  errorMessage?: string
  feedback?: DocumentFeedback[]
  versions: DocumentVersion[]
}

type ProcessingStatus = 
  | 'uploaded' 
  | 'extracting_text' 
  | 'processing_llm' 
  | 'generating_pdf' 
  | 'completed' 
  | 'failed'
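
The design lists the status values but not which moves between them are allowed. The following is a minimal sketch, assuming a linear pipeline in which any in-progress step may fail, a completed document may re-enter the LLM step when feedback triggers regeneration, and a failed document may be retried from text extraction. The `NEXT_STATUS` table and `canTransition` helper are hypothetical names, not part of the existing codebase.

```typescript
type ProcessingStatus =
  | 'uploaded'
  | 'extracting_text'
  | 'processing_llm'
  | 'generating_pdf'
  | 'completed'
  | 'failed'

// Assumed transition table; the design document does not define one.
const NEXT_STATUS: Record<ProcessingStatus, ProcessingStatus[]> = {
  uploaded: ['extracting_text', 'failed'],
  extracting_text: ['processing_llm', 'failed'],
  processing_llm: ['generating_pdf', 'failed'],
  generating_pdf: ['completed', 'failed'],
  // Assumption: feedback-driven regeneration re-enters the LLM step.
  completed: ['processing_llm'],
  // Assumption: a failed document can be retried from the first step.
  failed: ['extracting_text'],
}

// Returns true when moving from `from` to `to` is allowed by the table.
function canTransition(from: ProcessingStatus, to: ProcessingStatus): boolean {
  return NEXT_STATUS[from].includes(to)
}
```

A service method such as `updateDocumentStatus` could call `canTransition` before writing the new status and reject the update otherwise.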

Document Feedback Model

interface DocumentFeedback {
  id: string
  documentId: string
  userId: string
  feedback: string
  regenerationInstructions?: string
  createdAt: Date
}

Document Version Model

interface DocumentVersion {
  id: string
  documentId: string
  versionNumber: number
  summaryMarkdown: string
  summaryPdfPath: string
  createdAt: Date
  feedback?: string
}

Processing Job Model

interface ProcessingJob {
  id: string
  documentId: string
  type: 'text_extraction' | 'llm_processing' | 'pdf_generation'
  status: 'pending' | 'processing' | 'completed' | 'failed'
  progress: number
  errorMessage?: string
  createdAt: Date
  startedAt?: Date
  completedAt?: Date
}

Error Handling

Frontend Error Handling

  • Global error boundary for React components
  • Toast notifications for user-facing errors
  • Retry mechanisms for failed API calls
  • Graceful degradation for offline scenarios

Backend Error Handling

  • Centralized error middleware
  • Structured error logging with Winston
  • Error categorization (validation, processing, system)
  • Automatic retry for transient failures

File Processing Error Handling

  • PDF validation before processing
  • Text extraction fallback mechanisms
  • LLM API timeout and retry logic
  • Cleanup of failed uploads and partial processing
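
The retry behavior mentioned for transient failures and LLM API calls could look roughly like the sketch below. It assumes exponential backoff with a cap; the names `backoffDelayMs`, `withRetry`, and `RetryOptions`, as well as the default delays, are illustrative choices rather than values taken from this design. Per-call timeouts are not covered here.

```typescript
// Exponential backoff: baseMs, 2*baseMs, 4*baseMs, ... capped at maxMs.
function backoffDelayMs(attempt: number, baseMs = 500, maxMs = 8000): number {
  return Math.min(maxMs, baseMs * 2 ** attempt)
}

interface RetryOptions {
  maxAttempts: number
  // Hypothetical predicate: decides which errors count as transient.
  isTransient: (err: unknown) => boolean
  // Injectable sleep so tests can skip real delays.
  sleep?: (ms: number) => Promise<void>
}

// Runs `task`, retrying on transient errors until maxAttempts is reached.
async function withRetry<T>(task: () => Promise<T>, opts: RetryOptions): Promise<T> {
  const sleep =
    opts.sleep ?? ((ms: number) => new Promise<void>(resolve => setTimeout(resolve, ms)))
  let lastError: unknown
  for (let attempt = 0; attempt < opts.maxAttempts; attempt++) {
    try {
      return await task()
    } catch (err) {
      lastError = err
      const isLastAttempt = attempt === opts.maxAttempts - 1
      // Stop immediately on permanent errors or when attempts are used up.
      if (isLastAttempt || !opts.isTransient(err)) break
      await sleep(backoffDelayMs(attempt))
    }
  }
  throw lastError
}
```

Keeping the delay calculation in its own pure function makes the backoff schedule easy to unit test without timers.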

Error Types

enum ErrorType {
  VALIDATION_ERROR = 'validation_error',
  AUTHENTICATION_ERROR = 'authentication_error',
  FILE_PROCESSING_ERROR = 'file_processing_error',
  LLM_PROCESSING_ERROR = 'llm_processing_error',
  STORAGE_ERROR = 'storage_error',
  SYSTEM_ERROR = 'system_error'
}
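
One way the centralized error middleware could use these categories is sketched below. The HTTP status codes are an assumed mapping (the design does not specify any), and `AppError` and `toErrorResponse` are hypothetical names introduced for illustration.

```typescript
enum ErrorType {
  VALIDATION_ERROR = 'validation_error',
  AUTHENTICATION_ERROR = 'authentication_error',
  FILE_PROCESSING_ERROR = 'file_processing_error',
  LLM_PROCESSING_ERROR = 'llm_processing_error',
  STORAGE_ERROR = 'storage_error',
  SYSTEM_ERROR = 'system_error'
}

// Assumed mapping from error category to HTTP status code.
const HTTP_STATUS: Record<ErrorType, number> = {
  [ErrorType.VALIDATION_ERROR]: 400,
  [ErrorType.AUTHENTICATION_ERROR]: 401,
  [ErrorType.FILE_PROCESSING_ERROR]: 422,
  [ErrorType.LLM_PROCESSING_ERROR]: 502,
  [ErrorType.STORAGE_ERROR]: 503,
  [ErrorType.SYSTEM_ERROR]: 500,
}

// Application error that carries its category alongside the message.
class AppError extends Error {
  readonly type: ErrorType

  constructor(type: ErrorType, message: string) {
    super(message)
    this.name = 'AppError'
    this.type = type
  }
}

interface ErrorResponse {
  status: number
  body: { type: ErrorType; message: string }
}

// Converts any thrown value into a response shape the API can send.
function toErrorResponse(err: unknown): ErrorResponse {
  if (err instanceof AppError) {
    return { status: HTTP_STATUS[err.type], body: { type: err.type, message: err.message } }
  }
  // Unknown errors are reported as system errors without leaking details.
  return { status: 500, body: { type: ErrorType.SYSTEM_ERROR, message: 'Internal server error' } }
}
```

An Express error handler could call `toErrorResponse(err)` and pass the result to `res.status(...).json(...)`, logging the original error separately.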

Testing Strategy

Unit Testing

  • Jest for JavaScript/TypeScript testing
  • React Testing Library for component testing
  • Supertest for API endpoint testing
  • Mock LLM API responses for consistent testing

Integration Testing

  • Database integration tests with test containers
  • File upload and processing workflow tests
  • Authentication flow testing
  • PDF generation and download testing

End-to-End Testing

  • Playwright for browser automation
  • Complete user workflows (upload → process → download)
  • Admin functionality testing
  • Error scenario testing

Performance Testing

  • Load testing for file uploads
  • LLM processing performance benchmarks
  • Database query optimization testing
  • Memory usage monitoring during PDF processing

Security Testing

  • Authentication and authorization testing
  • File upload security validation
  • SQL injection prevention testing
  • XSS and CSRF protection verification

LLM Integration Design

Prompt Engineering

The system will use a two-part prompt structure:

Part 1: CIM Data Extraction

  • Provide the BPCP CIM Review Template
  • Instruct LLM to populate only from CIM content
  • Use "Not specified in CIM" for missing information
  • Maintain strict markdown formatting

Part 2: Investment Analysis

  • Add "Key Investment Considerations & Diligence Areas" section
  • Allow use of general industry knowledge
  • Focus on investment-specific insights and risks
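
The two-part structure above can be sketched as a single prompt-building function. This is only an illustration: the instruction wording, the section delimiters, and the `buildCimPrompt` name are assumptions, and the actual template text would come from the BPCP CIM Review Template.

```typescript
// Hypothetical helper: assembles the two-part prompt described above.
function buildCimPrompt(template: string, cimText: string): string {
  return [
    'Part 1: CIM Data Extraction',
    'Populate the template below using only information found in the CIM.',
    'If a field is not covered, write "Not specified in CIM".',
    'Preserve the Markdown formatting of the template exactly.',
    '',
    '--- TEMPLATE ---',
    template.trim(),
    '',
    '--- CIM TEXT ---',
    cimText.trim(),
    '',
    'Part 2: Investment Analysis',
    'Append a section titled "Key Investment Considerations & Diligence Areas".',
    'In this section only, general industry knowledge may be used.',
    'Focus on investment-specific insights and risks.',
  ].join('\n')
}
```

Keeping the extraction rules and the analysis rules in separate, labeled parts makes it explicit to the model where outside knowledge is permitted and where it is not.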

Token Management

  • Document chunking for large PDFs (>100 pages)
  • Token counting and optimization
  • Fallback to smaller context windows if needed
  • Cost tracking and monitoring
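
Document chunking against a token budget might be approached as in the sketch below. It assumes a rough heuristic of about four characters per token, which is only an approximation for English text; a tokenizer matching the selected model would give exact counts. The function names are hypothetical.

```typescript
// Rough heuristic (assumption): about 4 characters per token.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4)
}

// Splits text on blank lines and packs paragraphs into chunks
// whose estimated token count stays within maxTokens.
function chunkByTokenBudget(text: string, maxTokens: number): string[] {
  const paragraphs = text.split(/\n{2,}/).filter(p => p.trim().length > 0)
  const chunks: string[] = []
  let current = ''
  for (const paragraph of paragraphs) {
    const candidate = current ? `${current}\n\n${paragraph}` : paragraph
    if (estimateTokens(candidate) <= maxTokens) {
      current = candidate
      continue
    }
    if (current) chunks.push(current)
    // A single oversized paragraph is kept whole here; a fuller version
    // would split it further (for example by sentence).
    current = paragraph
  }
  if (current) chunks.push(current)
  return chunks
}
```

Splitting on paragraph boundaries keeps related sentences together, which tends to matter more for extraction quality than filling every chunk to the limit.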

Output Validation

  • Markdown syntax validation
  • Template structure verification
  • Content completeness checking
  • Retry mechanism for malformed outputs
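
Template structure verification could start with a check that every required heading is present in the generated Markdown, as sketched here. The `ValidationResult` shape and the list of required sections are assumptions; the design names the type but does not define its fields.

```typescript
// Assumed result shape; the design references ValidationResult without defining it.
interface ValidationResult {
  valid: boolean
  missingSections: string[]
}

// Checks that each required section appears as a Markdown heading line.
function validateSummary(markdown: string, requiredSections: string[]): ValidationResult {
  const headings = markdown
    .split('\n')
    // Match ATX headings such as "## Title" and capture the title text.
    .map(line => /^#{1,6}\s+(.*\S)\s*$/.exec(line))
    .filter((match): match is RegExpExecArray => match !== null)
    .map(match => match[1].toLowerCase())
  const missingSections = requiredSections.filter(
    section => !headings.includes(section.toLowerCase())
  )
  return { valid: missingSections.length === 0, missingSections }
}
```

When `valid` is false, the list in `missingSections` can be fed back into the retry prompt so the model knows exactly what to add.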

Security Considerations

Authentication & Authorization

  • JWT tokens with short expiration times
  • Refresh token rotation
  • Role-based access control (user/admin)
  • Session management with Redis
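
As an illustration of the role-based rule, the check below assumes that admins may access any document while regular users may access only documents they uploaded. The `OwnedDocument` shape and the `canAccessDocument` name are hypothetical.

```typescript
interface User {
  id: string
  role: 'user' | 'admin'
}

// Minimal view of a document: only the fields the check needs.
interface OwnedDocument {
  id: string
  userId: string
}

// Assumed policy: admins see everything, users see only their own uploads.
function canAccessDocument(user: User, doc: OwnedDocument): boolean {
  return user.role === 'admin' || doc.userId === user.id
}
```

Placing this check in one function keeps the policy in a single spot that both route handlers and services can call.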

File Security

  • File type validation (PDF only)
  • File size limits (100 MB max)
  • Virus scanning integration
  • Secure file storage with access controls
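
The type and size checks could be sketched as follows. The code inspects the file signature rather than trusting the extension or MIME type, and it assumes the 100 MB limit means 100 × 1024 × 1024 bytes. `checkPdfUpload` and `UploadCheck` are hypothetical names; virus scanning is out of scope for this sketch.

```typescript
// Assumption: the 100 MB limit is interpreted as 100 * 1024 * 1024 bytes.
const MAX_PDF_BYTES = 100 * 1024 * 1024

type UploadCheck = { ok: true } | { ok: false; reason: string }

// PDF files begin with the ASCII signature "%PDF-".
const PDF_SIGNATURE = [0x25, 0x50, 0x44, 0x46, 0x2d]

// `head` holds the first bytes of the upload; `sizeBytes` is the full file size.
function checkPdfUpload(head: Uint8Array, sizeBytes: number): UploadCheck {
  if (sizeBytes <= 0) return { ok: false, reason: 'File is empty' }
  if (sizeBytes > MAX_PDF_BYTES) return { ok: false, reason: 'File exceeds the size limit' }
  const hasSignature = PDF_SIGNATURE.every((byte, i) => head[i] === byte)
  if (!hasSignature) return { ok: false, reason: 'File is not a PDF' }
  return { ok: true }
}
```

The same function can run on the client for early feedback and again on the server, where it is the authoritative check.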

Data Protection

  • Encryption at rest for sensitive documents
  • HTTPS enforcement for all communications
  • Input sanitization and validation
  • Audit logging for admin actions

API Security

  • Rate limiting on all endpoints
  • CORS configuration
  • Request size limits
  • API key management for LLM services
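
To make the rate-limiting idea concrete, here is a minimal in-memory fixed-window limiter. It is a sketch of the technique only, not the middleware used by the server; the class name and the injected clock are assumptions made so the behavior can be tested deterministically.

```typescript
// Fixed-window limiter: at most `limit` requests per key in each window.
class FixedWindowRateLimiter {
  private readonly windows = new Map<string, { start: number; count: number }>()
  private readonly limit: number
  private readonly windowMs: number
  private readonly now: () => number

  constructor(limit: number, windowMs: number, now: () => number = Date.now) {
    this.limit = limit
    this.windowMs = windowMs
    this.now = now
  }

  // Returns true if the request for `key` (for example a client IP) is allowed.
  allow(key: string): boolean {
    const t = this.now()
    let window = this.windows.get(key)
    // Start a fresh window on first use or once the previous one has elapsed.
    if (!window || t - window.start >= this.windowMs) {
      window = { start: t, count: 0 }
      this.windows.set(key, window)
    }
    if (window.count >= this.limit) return false
    window.count += 1
    return true
  }
}
```

Because the counters live in process memory, this version only works for a single instance; with multiple instances behind a load balancer, the counters would need a shared store such as Redis.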

Performance Optimization

File Processing

  • Asynchronous processing with job queues
  • Progress tracking and status updates
  • Parallel processing for multiple documents
  • Efficient PDF text extraction

Database Optimization

  • Proper indexing on frequently queried fields
  • Connection pooling
  • Query optimization
  • Database migrations management

Caching Strategy

  • Redis caching for user sessions
  • Document metadata caching
  • LLM response caching for similar content
  • Static asset caching
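
The metadata caching idea can be illustrated with a small time-to-live cache. This in-memory version is a stand-in for Redis, used here only to show the expiry logic; `TtlCache` and its injected clock are hypothetical.

```typescript
// In-memory cache whose entries expire after ttlMs milliseconds.
class TtlCache<V> {
  private readonly entries = new Map<string, { value: V; expiresAt: number }>()
  private readonly ttlMs: number
  private readonly now: () => number

  constructor(ttlMs: number, now: () => number = Date.now) {
    this.ttlMs = ttlMs
    this.now = now
  }

  set(key: string, value: V): void {
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs })
  }

  // Returns the cached value, or undefined if absent or expired.
  get(key: string): V | undefined {
    const entry = this.entries.get(key)
    if (!entry) return undefined
    if (this.now() >= entry.expiresAt) {
      // Drop stale entries lazily on read.
      this.entries.delete(key)
      return undefined
    }
    return entry.value
  }
}
```

A document lookup would first try `cache.get(documentId)` and fall back to the database on a miss, storing the result with `cache.set`.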

Scalability Considerations

  • Horizontal scaling capability
  • Load balancing for multiple instances
  • Database read replicas
  • CDN for static assets and downloads