
Jon 5a3c961bfc feat: Complete implementation of Tasks 1-5 - CIM Document Processor

Backend Infrastructure:
- Complete Express server setup with security middleware (helmet, CORS, rate limiting)
- Comprehensive error handling and logging with Winston
- Authentication system with JWT tokens and session management
- Database models and migrations for Users, Documents, Feedback, and Processing Jobs
- API routes structure for authentication and document management
- Integration tests for all server components (86 tests passing)

Frontend Infrastructure:
- React application with TypeScript and Vite
- Authentication UI with login form, protected routes, and logout functionality
- Authentication context with proper async state management
- Component tests with proper async handling (25 tests passing)
- Tailwind CSS styling and responsive design

Key Features:
- User registration, login, and authentication
- Protected routes with role-based access control
- Comprehensive error handling and user feedback
- Database schema with proper relationships
- Security middleware and validation
- Production-ready build configuration

Test Coverage: 111/111 tests passing
Tasks Completed: 1-5 (Project setup, Database, Auth system, Frontend UI, Backend infrastructure)

Ready for Task 6: File upload backend infrastructure

2025-07-27 13:29:26 -04:00

9.6 KiB


Design Document

Overview

The CIM Document Processor is a web-based application that enables authenticated team members to upload large PDF documents (CIMs), have them analyzed by an LLM using a structured template, and download the results in both Markdown and PDF formats. The system follows a modern web architecture with secure authentication, robust file processing, and comprehensive admin oversight.

Architecture

High-Level Architecture

graph TB
    subgraph "Frontend Layer"
        UI[React Web Application]
        Auth[Authentication UI]
        Upload[File Upload Interface]
        Dashboard[User Dashboard]
        Admin[Admin Panel]
    end
    
    subgraph "Backend Layer"
        API[Express.js API Server]
        AuthM[Authentication Middleware]
        FileH[File Handler Service]
        LLMS[LLM Processing Service]
        PDF[PDF Generation Service]
    end
    
    subgraph "Data Layer"
        DB[(PostgreSQL Database)]
        FileStore["File Storage (AWS S3/Local)"]
        Cache[Redis Cache]
    end
    
    subgraph "External Services"
        LLM["LLM API (OpenAI/Anthropic)"]
        PDFLib[PDF Processing Library]
    end
    
    UI --> API
    Auth --> AuthM
    Upload --> FileH
    Dashboard --> API
    Admin --> API
    
    API --> DB
    API --> FileStore
    API --> Cache
    
    FileH --> FileStore
    LLMS --> LLM
    PDF --> PDFLib
    
    API --> LLMS
    API --> PDF

Technology Stack

Frontend:

React 18 with TypeScript
Tailwind CSS for styling
React Router for navigation
Axios for API communication
React Query for server-state management and caching

Backend:

Node.js with Express.js
TypeScript for type safety
JWT for authentication
Multer for file uploads
Bull Queue for background job processing

Database:

PostgreSQL for primary data storage
Redis for session management and job queues

File Processing:

pdf-parse for text extraction
Puppeteer for PDF generation from Markdown
AWS S3 or local file system for file storage

LLM Integration:

OpenAI API or Anthropic Claude API
Configurable model selection
Token management and rate limiting

Components and Interfaces

Frontend Components

Authentication Components

LoginForm: Handles user login with validation
AuthGuard: Protects routes requiring authentication
SessionManager: Manages user session state

Upload Components

FileUploader: Drag-and-drop PDF upload with progress
UploadValidator: Client-side file validation
UploadProgress: Real-time upload status display

Dashboard Components

DocumentList: Displays user's uploaded documents
DocumentCard: Individual document status and actions
ProcessingStatus: Real-time processing updates
DownloadButtons: Markdown and PDF download options

Admin Components

AdminDashboard: Overview of all system documents
UserManagement: User account management
DocumentArchive: System-wide document access
SystemMetrics: Storage and processing statistics

Backend Services

Authentication Service

interface AuthService {
  login(credentials: LoginCredentials): Promise<AuthResult>
  validateToken(token: string): Promise<User>
  logout(userId: string): Promise<void>
  refreshToken(refreshToken: string): Promise<AuthResult>
}

Document Service

interface DocumentService {
  uploadDocument(file: File, userId: string): Promise<Document>
  getDocuments(userId: string): Promise<Document[]>
  getDocument(documentId: string): Promise<Document>
  deleteDocument(documentId: string): Promise<void>
  updateDocumentStatus(documentId: string, status: ProcessingStatus): Promise<void>
}

LLM Processing Service

interface LLMService {
  processDocument(documentId: string, extractedText: string): Promise<ProcessingResult>
  regenerateWithFeedback(documentId: string, feedback: string): Promise<ProcessingResult>
  validateOutput(output: string): Promise<ValidationResult>
}

PDF Service

interface PDFService {
  extractText(filePath: string): Promise<string>
  generatePDF(markdown: string): Promise<Buffer>
  validatePDF(filePath: string): Promise<boolean>
}

Data Models

User Model

interface User {
  id: string
  email: string
  name: string
  role: 'user' | 'admin'
  createdAt: Date
  updatedAt: Date
}
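
The role field is what role-based access control would key on. A minimal sketch of an ownership check, assuming admins may read every document and regular users only their own; the names Actor, OwnedDocument, canAccessDocument, and visibleDocuments are hypothetical and not part of the design:

```typescript
type Role = 'user' | 'admin';

// Hypothetical trimmed-down shapes; only the fields the check needs.
interface Actor {
  id: string;
  role: Role;
}

interface OwnedDocument {
  id: string;
  userId: string;
}

// Assumed policy: admins see everything, users see only what they uploaded.
function canAccessDocument(actor: Actor, doc: OwnedDocument): boolean {
  return actor.role === 'admin' || doc.userId === actor.id;
}

// Filters a document list down to what the actor is allowed to see.
function visibleDocuments<T extends OwnedDocument>(actor: Actor, docs: T[]): T[] {
  return docs.filter((doc) => canAccessDocument(actor, doc));
}
```

Keeping the policy in one pure function would let both the API routes and the admin views share the same rule.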

Document Model

interface Document {
  id: string
  userId: string
  originalFileName: string
  filePath: string
  fileSize: number
  uploadedAt: Date
  status: ProcessingStatus
  extractedText?: string
  generatedSummary?: string
  summaryMarkdownPath?: string
  summaryPdfPath?: string
  processingStartedAt?: Date
  processingCompletedAt?: Date
  errorMessage?: string
  feedback?: DocumentFeedback[]
  versions: DocumentVersion[]
}

type ProcessingStatus = 
  | 'uploaded' 
  | 'extracting_text' 
  | 'processing_llm' 
  | 'generating_pdf' 
  | 'completed' 
  | 'failed'
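
The status values read as a linear pipeline. One way to guard updateDocumentStatus against out-of-order updates is a transition table, sketched below. The table itself is an assumption: the design does not say which transitions are legal, and the regeneration and retry edges in particular are guesses.

```typescript
// Mirrors the ProcessingStatus union from the Document model.
type ProcessingStatus =
  | 'uploaded'
  | 'extracting_text'
  | 'processing_llm'
  | 'generating_pdf'
  | 'completed'
  | 'failed';

// Assumed transition table: each stage moves to the next stage or to 'failed'.
// 'completed' -> 'processing_llm' models regeneration with feedback, and
// 'failed' -> 'uploaded' models a manual retry. Both edges are assumptions.
const ALLOWED_TRANSITIONS: Record<ProcessingStatus, readonly ProcessingStatus[]> = {
  uploaded: ['extracting_text', 'failed'],
  extracting_text: ['processing_llm', 'failed'],
  processing_llm: ['generating_pdf', 'failed'],
  generating_pdf: ['completed', 'failed'],
  completed: ['processing_llm'],
  failed: ['uploaded'],
};

function canTransition(from: ProcessingStatus, to: ProcessingStatus): boolean {
  return ALLOWED_TRANSITIONS[from].indexOf(to) !== -1;
}

// Throws instead of silently accepting an illegal status update.
function assertTransition(from: ProcessingStatus, to: ProcessingStatus): void {
  if (!canTransition(from, to)) {
    throw new Error(`Invalid status transition: ${from} -> ${to}`);
  }
}
```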

Document Feedback Model

interface DocumentFeedback {
  id: string
  documentId: string
  userId: string
  feedback: string
  regenerationInstructions?: string
  createdAt: Date
}

Document Version Model

interface DocumentVersion {
  id: string
  documentId: string
  versionNumber: number
  summaryMarkdown: string
  summaryPdfPath: string
  createdAt: Date
  feedback?: string
}

Processing Job Model

interface ProcessingJob {
  id: string
  documentId: string
  type: 'text_extraction' | 'llm_processing' | 'pdf_generation'
  status: 'pending' | 'processing' | 'completed' | 'failed'
  progress: number
  errorMessage?: string
  createdAt: Date
  startedAt?: Date
  completedAt?: Date
}
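
The three job types suggest a fixed stage order. The sketch below shows how a worker might pick the next job and how a per-stage progress value could be folded into a single figure for the dashboard. It assumes progress is a 0-100 percentage and that the stages carry equal weight; the design specifies neither, and nextJobType and overallProgress are hypothetical helpers.

```typescript
type JobType = 'text_extraction' | 'llm_processing' | 'pdf_generation';

// Assumed stage order, taken from the order of the union members.
const PIPELINE: readonly JobType[] = ['text_extraction', 'llm_processing', 'pdf_generation'];

// Returns the job to enqueue after `completed`, or null when the pipeline is done.
function nextJobType(completed: JobType): JobType | null {
  const index = PIPELINE.indexOf(completed);
  return index < PIPELINE.length - 1 ? PIPELINE[index + 1] : null;
}

// Combines the current stage and its progress (assumed 0-100) into an
// overall percentage, weighting every stage equally.
function overallProgress(current: JobType, stageProgress: number): number {
  const clamped = Math.min(100, Math.max(0, stageProgress));
  const completedStages = PIPELINE.indexOf(current);
  return Math.round(((completedStages + clamped / 100) / PIPELINE.length) * 100);
}
```

In practice LLM processing is likely to dominate wall-clock time, so weighted stages may give a more honest progress bar than equal thirds.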

Error Handling

Frontend Error Handling

Global error boundary for React components
Toast notifications for user-facing errors
Retry mechanisms for failed API calls
Graceful degradation for offline scenarios

Backend Error Handling

Centralized error middleware
Structured error logging with Winston
Error categorization (validation, processing, system)
Automatic retry for transient failures

File Processing Error Handling

PDF validation before processing
Text extraction fallback mechanisms
LLM API timeout and retry logic
Cleanup of failed uploads and partial processing
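
The retry logic for LLM calls and other transient failures could follow a capped exponential backoff. The sketch below is illustrative only: RetryPolicy, backoffDelayMs, and withRetry are hypothetical names, the delay values are placeholders, and the sleep function is injected so the helper stays testable (in a Node.js service it would typically wrap setTimeout).

```typescript
interface RetryPolicy {
  maxAttempts: number; // total attempts, including the first call; must be >= 1
  baseDelayMs: number; // delay before the first retry
  maxDelayMs: number;  // upper bound on any single delay
}

// Delay before retrying after a failed attempt (1-based): base * 2^(attempt - 1), capped.
function backoffDelayMs(attempt: number, policy: RetryPolicy): number {
  const raw = policy.baseDelayMs * Math.pow(2, attempt - 1);
  return Math.min(raw, policy.maxDelayMs);
}

// Runs `operation`, retrying only errors that `isTransient` accepts.
async function withRetry<T>(
  operation: () => Promise<T>,
  isTransient: (err: unknown) => boolean,
  policy: RetryPolicy,
  sleep: (ms: number) => Promise<void>,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= policy.maxAttempts; attempt++) {
    try {
      return await operation();
    } catch (err) {
      lastError = err;
      const isLastAttempt = attempt === policy.maxAttempts;
      if (isLastAttempt || !isTransient(err)) throw err;
      await sleep(backoffDelayMs(attempt, policy));
    }
  }
  // Only reachable if maxAttempts < 1.
  throw lastError;
}
```

Adding random jitter to each delay is a common refinement when many jobs may retry against the same API at once.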

Error Types

enum ErrorType {
  VALIDATION_ERROR = 'validation_error',
  AUTHENTICATION_ERROR = 'authentication_error',
  FILE_PROCESSING_ERROR = 'file_processing_error',
  LLM_PROCESSING_ERROR = 'llm_processing_error',
  STORAGE_ERROR = 'storage_error',
  SYSTEM_ERROR = 'system_error'
}
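
A centralized error middleware needs some mapping from these categories to HTTP responses. The sketch below shows one possible mapping as a pure function that an Express error handler could call. The status codes, the AppError shape, and the function names are assumptions, not part of the design.

```typescript
// Mirrors the ErrorType enum from the design.
enum ErrorType {
  VALIDATION_ERROR = 'validation_error',
  AUTHENTICATION_ERROR = 'authentication_error',
  FILE_PROCESSING_ERROR = 'file_processing_error',
  LLM_PROCESSING_ERROR = 'llm_processing_error',
  STORAGE_ERROR = 'storage_error',
  SYSTEM_ERROR = 'system_error',
}

// Hypothetical shape for errors raised by application code.
interface AppError {
  type: ErrorType;
  message: string;
}

// Assumed status codes; the design does not prescribe them.
const HTTP_STATUS: Record<ErrorType, number> = {
  [ErrorType.VALIDATION_ERROR]: 400,
  [ErrorType.AUTHENTICATION_ERROR]: 401,
  [ErrorType.FILE_PROCESSING_ERROR]: 422,
  [ErrorType.LLM_PROCESSING_ERROR]: 502,
  [ErrorType.STORAGE_ERROR]: 503,
  [ErrorType.SYSTEM_ERROR]: 500,
};

function isAppError(err: unknown): err is AppError {
  if (typeof err !== 'object' || err === null) return false;
  const candidate = err as { type?: unknown; message?: unknown };
  return (
    typeof candidate.type === 'string' &&
    Object.prototype.hasOwnProperty.call(HTTP_STATUS, candidate.type) &&
    typeof candidate.message === 'string'
  );
}

interface ErrorResponse {
  status: number;
  body: { error: ErrorType; message: string };
}

// Unknown errors collapse to a generic 500 so internal details are not leaked.
function toErrorResponse(err: unknown): ErrorResponse {
  if (isAppError(err)) {
    return { status: HTTP_STATUS[err.type], body: { error: err.type, message: err.message } };
  }
  return {
    status: 500,
    body: { error: ErrorType.SYSTEM_ERROR, message: 'Internal server error' },
  };
}
```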

Testing Strategy

Unit Testing

Jest for JavaScript/TypeScript testing
React Testing Library for component testing
Supertest for API endpoint testing
Mock LLM API responses for consistent testing

Integration Testing

Database integration tests with test containers
File upload and processing workflow tests
Authentication flow testing
PDF generation and download testing

End-to-End Testing

Playwright for browser automation
Complete user workflows (upload → process → download)
Admin functionality testing
Error scenario testing

Performance Testing

Load testing for file uploads
LLM processing performance benchmarks
Database query optimization testing
Memory usage monitoring during PDF processing

Security Testing

Authentication and authorization testing
File upload security validation
SQL injection prevention testing
XSS and CSRF protection verification

LLM Integration Design

Prompt Engineering

The system will use a two-part prompt structure:

Part 1: CIM Data Extraction

Provide the BPCP CIM Review Template
Instruct LLM to populate only from CIM content
Use "Not specified in CIM" for missing information
Maintain strict markdown formatting

Part 2: Investment Analysis

Add "Key Investment Considerations & Diligence Areas" section
Allow use of general industry knowledge
Focus on investment-specific insights and risks
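
The two-part structure could be assembled along these lines. This is a sketch only: the instruction wording is a placeholder written for illustration, not the actual prompt, and buildPrompts and PromptPair are hypothetical names. Only the "Not specified in CIM" marker and the section title come from the design.

```typescript
interface PromptPair {
  extractionPrompt: string; // Part 1: populate the template from the CIM only
  analysisPrompt: string;   // Part 2: add the investment analysis section
}

// `template` is the review template text; `cimText` is the extracted PDF text.
function buildPrompts(template: string, cimText: string): PromptPair {
  const extractionPrompt = [
    'Populate the review template below using only information found in the CIM.',
    'If the CIM does not cover a field, write "Not specified in CIM".',
    'Preserve the Markdown structure of the template exactly.',
    '',
    '--- TEMPLATE ---',
    template,
    '',
    '--- CIM ---',
    cimText,
  ].join('\n');

  const analysisPrompt = [
    'Add a section titled "Key Investment Considerations & Diligence Areas".',
    'General industry knowledge may be used in this section.',
    'Focus on investment-specific insights and risks.',
  ].join('\n');

  return { extractionPrompt, analysisPrompt };
}
```

Keeping the two parts separate makes it explicit which output is grounded strictly in the CIM and which may draw on outside knowledge.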

Token Management

Document chunking for large PDFs (>100 pages)
Token counting and optimization
Fallback to smaller context windows if needed
Cost tracking and monitoring
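
Chunking against a token budget might look like the sketch below. It assumes roughly four characters per token, which is only a coarse heuristic for English text; a real implementation would use the tokenizer of the chosen model. The names estimateTokens and chunkByTokenBudget are hypothetical.

```typescript
// Coarse heuristic (assumption): about 4 characters per token.
const CHARS_PER_TOKEN = 4;

function estimateTokens(text: string): number {
  return Math.ceil(text.length / CHARS_PER_TOKEN);
}

// Splits text into chunks whose estimated token count stays within maxTokens,
// preferring paragraph boundaries (blank lines) as split points.
function chunkByTokenBudget(text: string, maxTokens: number): string[] {
  if (maxTokens <= 0) throw new Error('maxTokens must be positive');
  const maxChars = maxTokens * CHARS_PER_TOKEN;
  const chunks: string[] = [];
  let current = '';

  for (const paragraph of text.split(/\n{2,}/)) {
    // A paragraph larger than the budget is hard-split into maxChars slices.
    const pieces: string[] = [];
    for (let i = 0; i < paragraph.length; i += maxChars) {
      pieces.push(paragraph.slice(i, i + maxChars));
    }
    for (const piece of pieces) {
      const candidate = current ? current + '\n\n' + piece : piece;
      if (candidate.length <= maxChars) {
        current = candidate;
      } else {
        if (current) chunks.push(current);
        current = piece;
      }
    }
  }
  if (current) chunks.push(current);
  return chunks;
}
```

Splitting on paragraph boundaries keeps related sentences together, which tends to matter more for extraction quality than filling every chunk to the limit.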

Output Validation

Markdown syntax validation
Template structure verification
Content completeness checking
Retry mechanism for malformed outputs
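
Template structure verification can be approximated by checking that every required heading appears in the generated Markdown. The sketch below assumes ATX-style headings and a ValidationResult shape invented for illustration (the design names the type but does not define it). In the usage example, only "Key Investment Considerations & Diligence Areas" is a section name taken from the design; the others are placeholders.

```typescript
// Assumed shape; the design only names the type.
interface ValidationResult {
  valid: boolean;
  missingSections: string[];
}

// Checks that each required section title appears as a Markdown heading.
// Comparison is case-insensitive. Headings inside fenced code blocks are
// not excluded, which is a known limitation of this sketch.
function validateTemplateStructure(
  markdown: string,
  requiredSections: string[],
): ValidationResult {
  const headings: string[] = [];
  for (const line of markdown.split('\n')) {
    const match = /^#{1,6}\s+(.+?)\s*$/.exec(line);
    if (match) headings.push(match[1].toLowerCase());
  }
  const missingSections = requiredSections.filter(
    (section) => headings.indexOf(section.trim().toLowerCase()) === -1,
  );
  return { valid: missingSections.length === 0, missingSections };
}

// Usage: a malformed output reports exactly which sections to regenerate.
const exampleOutput = [
  '# CIM Review',
  '',
  '## Deal Overview',
  'Not specified in CIM',
  '',
  '## Key Investment Considerations & Diligence Areas',
].join('\n');

const exampleResult = validateTemplateStructure(exampleOutput, [
  'Deal Overview',
  'Financial Summary',
  'Key Investment Considerations & Diligence Areas',
]);
```

A non-empty missingSections list is a natural trigger for the retry mechanism.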

Security Considerations

Authentication & Authorization

JWT tokens with short expiration times
Refresh token rotation
Role-based access control (user/admin)
Session management with Redis
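
Refresh token rotation means each refresh token is single-use: exchanging it returns a new one, and presenting an already-used token is treated as a sign of theft. A minimal in-memory sketch follows. It is illustrative only: RefreshTokenStore is a hypothetical class, a real deployment would keep the records in Redis with expiry and store only token hashes, and generateToken must be backed by a cryptographically secure random source.

```typescript
interface RefreshRecord {
  userId: string;
  used: boolean;
}

class RefreshTokenStore {
  private readonly records = new Map<string, RefreshRecord>();

  // The token generator is injected; it is assumed to return unguessable values.
  constructor(private readonly generateToken: () => string) {}

  issue(userId: string): string {
    const token = this.generateToken();
    this.records.set(token, { userId, used: false });
    return token;
  }

  // Exchanges a valid token for a new one. Returns null for unknown tokens.
  // Reuse of an already-rotated token revokes every token of that user.
  rotate(token: string): string | null {
    const record = this.records.get(token);
    if (!record) return null;
    if (record.used) {
      this.revokeAll(record.userId);
      return null;
    }
    record.used = true;
    return this.issue(record.userId);
  }

  private revokeAll(userId: string): void {
    this.records.forEach((record, token) => {
      if (record.userId === userId) this.records.delete(token);
    });
  }
}
```

Revoking the whole token family on reuse forces both the legitimate user and any attacker to log in again, which limits how long a stolen token stays useful.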

File Security

File type validation (PDF only)
File size limits (100MB max)
Virus scanning integration
Secure file storage with access controls
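
File type and size validation could be done on the server as sketched below. The check looks at the leading bytes of the upload, since a PDF file starts with the signature %PDF-, rather than trusting the extension or the client-supplied MIME type alone. The function name, the result shape, and the reading of 100MB as 100 * 1024 * 1024 bytes are assumptions.

```typescript
// 100MB limit from the design, interpreted here as 100 * 1024 * 1024 bytes.
const MAX_FILE_SIZE_BYTES = 100 * 1024 * 1024;

// ASCII bytes for "%PDF-", the signature at the start of a PDF file.
const PDF_SIGNATURE = [0x25, 0x50, 0x44, 0x46, 0x2d];

type UploadCheck = { ok: true } | { ok: false; reason: string };

// `header` holds the first bytes of the uploaded file.
function checkPdfUpload(fileName: string, sizeBytes: number, header: Uint8Array): UploadCheck {
  if (!/\.pdf$/i.test(fileName)) {
    return { ok: false, reason: 'File extension must be .pdf' };
  }
  if (sizeBytes <= 0) {
    return { ok: false, reason: 'File is empty' };
  }
  if (sizeBytes > MAX_FILE_SIZE_BYTES) {
    return { ok: false, reason: 'File exceeds the 100MB limit' };
  }
  if (header.length < PDF_SIGNATURE.length) {
    return { ok: false, reason: 'File is too short to be a PDF' };
  }
  for (let i = 0; i < PDF_SIGNATURE.length; i++) {
    if (header[i] !== PDF_SIGNATURE[i]) {
      return { ok: false, reason: 'File does not start with a PDF signature' };
    }
  }
  return { ok: true };
}
```

A signature check only rules out obviously mislabeled files; it does not replace virus scanning or a full parse of the document.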

Data Protection

Encryption at rest for sensitive documents
HTTPS enforcement for all communications
Input sanitization and validation
Audit logging for admin actions

API Security

Rate limiting on all endpoints
CORS configuration
Request size limits
API key management for LLM services
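
To illustrate what rate limiting on an endpoint does, here is a fixed-window limiter in plain TypeScript. It is a teaching sketch rather than the middleware the server uses: FixedWindowRateLimiter is a hypothetical class, state lives in process memory, and the clock is injected so the behavior can be tested. With multiple server instances the counters would need to live in a shared store such as Redis.

```typescript
interface WindowState {
  windowStart: number; // timestamp (ms) at which the current window began
  count: number;       // requests seen in the current window
}

class FixedWindowRateLimiter {
  private readonly windows = new Map<string, WindowState>();

  constructor(
    private readonly limit: number,
    private readonly windowMs: number,
    private readonly now: () => number = () => Date.now(),
  ) {
    if (limit < 1) throw new Error('limit must be at least 1');
    if (windowMs <= 0) throw new Error('windowMs must be positive');
  }

  // Returns true if the request identified by `key` (for example a client IP
  // or user id) is within the limit for the current window.
  allow(key: string): boolean {
    const t = this.now();
    const state = this.windows.get(key);
    if (!state || t - state.windowStart >= this.windowMs) {
      this.windows.set(key, { windowStart: t, count: 1 });
      return true;
    }
    if (state.count < this.limit) {
      state.count += 1;
      return true;
    }
    return false;
  }
}
```

Fixed windows allow short bursts at window boundaries; sliding-window or token-bucket variants smooth that out at the cost of slightly more state.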

Performance Optimization

File Processing

Asynchronous processing with job queues
Progress tracking and status updates
Parallel processing for multiple documents
Efficient PDF text extraction

Database Optimization

Proper indexing on frequently queried fields
Connection pooling
Query optimization
Database migration management

Caching Strategy

Redis caching for user sessions
Document metadata caching
LLM response caching for similar content
Static asset caching

Scalability Considerations

Horizontal scaling capability
Load balancing for multiple instances
Database read replicas
CDN for static assets and downloads
