cim_summary/.kiro/specs/cim-document-processor/design.md

# Design Document

## Overview

The CIM Document Processor is a web-based application that enables authenticated team members to upload large PDF documents (CIMs), have them analyzed by an LLM using a structured template, and download the results in both Markdown and PDF formats. The system pairs a React frontend with an Express.js API, authenticates users with JWTs, processes files asynchronously through a job queue, and gives admins system-wide oversight of documents and users.

## Architecture

### High-Level Architecture

```mermaid
graph TB
    subgraph "Frontend Layer"
        UI[React Web Application]
        Auth[Authentication UI]
        Upload[File Upload Interface]
        Dashboard[User Dashboard]
        Admin[Admin Panel]
    end

    subgraph "Backend Layer"
        API[Express.js API Server]
        AuthM[Authentication Middleware]
        FileH[File Handler Service]
        LLMS[LLM Processing Service]
        PDF[PDF Generation Service]
    end

    subgraph "Data Layer"
        DB[(PostgreSQL Database)]
        FileStore["File Storage (AWS S3/Local)"]
        Cache[Redis Cache]
    end

    subgraph "External Services"
        LLM["LLM API (OpenAI/Anthropic)"]
        PDFLib[PDF Processing Library]
    end

    UI --> API
    Auth --> AuthM
    Upload --> FileH
    Dashboard --> API
    Admin --> API

    API --> DB
    API --> FileStore
    API --> Cache

    FileH --> FileStore
    LLMS --> LLM
    PDF --> PDFLib

    API --> LLMS
    API --> PDF
```

### Technology Stack

**Frontend:**
- React 18 with TypeScript
- Tailwind CSS for styling
- React Router for navigation
- Axios for API communication
- React Query for server-state management and caching

**Backend:**
- Node.js with Express.js
- TypeScript for type safety
- JWT for authentication
- Multer for file uploads
- Bull (Redis-backed queue) for background job processing

**Database:**
- PostgreSQL for primary data storage
- Redis for session management and job queues

**File Processing:**
- `pdf-parse` for text extraction
- Puppeteer for PDF generation from Markdown
- AWS S3 or local file system for file storage

**LLM Integration:**
- OpenAI API or Anthropic Claude API
- Configurable model selection
- Token management and rate limiting

## Components and Interfaces

### Frontend Components

#### Authentication Components
- `LoginForm`: Handles user login with validation
- `AuthGuard`: Protects routes requiring authentication
- `SessionManager`: Manages user session state

#### Upload Components
- `FileUploader`: Drag-and-drop PDF upload with progress
- `UploadValidator`: Client-side file validation
- `UploadProgress`: Real-time upload status display

#### Dashboard Components
- `DocumentList`: Displays user's uploaded documents
- `DocumentCard`: Individual document status and actions
- `ProcessingStatus`: Real-time processing updates
- `DownloadButtons`: Markdown and PDF download options

#### Admin Components
- `AdminDashboard`: Overview of all system documents
- `UserManagement`: User account management
- `DocumentArchive`: System-wide document access
- `SystemMetrics`: Storage and processing statistics

### Backend Services

#### Authentication Service
```typescript
interface AuthService {
  login(credentials: LoginCredentials): Promise<AuthResult>
  validateToken(token: string): Promise<User>
  logout(userId: string): Promise<void>
  refreshToken(refreshToken: string): Promise<AuthResult>
}
```

#### Document Service
```typescript
interface DocumentService {
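  // Multer exposes uploads on the server as Express.Multer.File, not the browser File type
  uploadDocument(file: Express.Multer.File, userId: string): Promise<Document>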
  getDocuments(userId: string): Promise<Document[]>
  getDocument(documentId: string): Promise<Document>
  deleteDocument(documentId: string): Promise<void>
  updateDocumentStatus(documentId: string, status: ProcessingStatus): Promise<void>
}
```

#### LLM Processing Service
```typescript
interface LLMService {
  processDocument(documentId: string, extractedText: string): Promise<ProcessingResult>
  regenerateWithFeedback(documentId: string, feedback: string): Promise<ProcessingResult>
  validateOutput(output: string): Promise<ValidationResult>
}
```

#### PDF Service
```typescript
interface PDFService {
  extractText(filePath: string): Promise<string>
  generatePDF(markdown: string): Promise<Buffer>
  validatePDF(filePath: string): Promise<boolean>
}
```

## Data Models

### User Model
```typescript
interface User {
  id: string
  email: string
  name: string
  role: 'user' | 'admin'
  createdAt: Date
  updatedAt: Date
}
```

### Document Model
```typescript
interface Document {
  id: string
  userId: string
  originalFileName: string
  filePath: string
  fileSize: number
  uploadedAt: Date
  status: ProcessingStatus
  extractedText?: string
  generatedSummary?: string
  summaryMarkdownPath?: string
  summaryPdfPath?: string
  processingStartedAt?: Date
  processingCompletedAt?: Date
  errorMessage?: string
  feedback?: DocumentFeedback[]
  versions: DocumentVersion[]
}

type ProcessingStatus =
  | 'uploaded'
  | 'extracting_text'
  | 'processing_llm'
  | 'generating_pdf'
  | 'completed'
  | 'failed'
```
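
The status values above imply an ordered pipeline. A minimal sketch of a transition guard follows, assuming each stage may only advance to the next stage or fail. The `allowedTransitions` table and `canTransition` helper are hypothetical names, and letting `completed` and `failed` documents re-enter the pipeline is an assumption made to accommodate regeneration and retries.

```typescript
// Mirrors the ProcessingStatus union from the Document model.
type ProcessingStatus =
  | 'uploaded'
  | 'extracting_text'
  | 'processing_llm'
  | 'generating_pdf'
  | 'completed'
  | 'failed'

// Assumed transition table: each stage advances to the next stage or fails.
// Re-entry from 'completed' (regeneration) and 'failed' (retry) is an assumption.
const allowedTransitions: Record<ProcessingStatus, ProcessingStatus[]> = {
  uploaded: ['extracting_text', 'failed'],
  extracting_text: ['processing_llm', 'failed'],
  processing_llm: ['generating_pdf', 'failed'],
  generating_pdf: ['completed', 'failed'],
  completed: ['processing_llm'],
  failed: ['extracting_text'],
}

// Returns true when moving from one status to another is permitted.
function canTransition(from: ProcessingStatus, to: ProcessingStatus): boolean {
  return allowedTransitions[from].includes(to)
}
```

A guard like this could be called from `updateDocumentStatus` so that out-of-order updates from background jobs are rejected rather than silently stored.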

### Document Feedback Model
```typescript
interface DocumentFeedback {
  id: string
  documentId: string
  userId: string
  feedback: string
  regenerationInstructions?: string
  createdAt: Date
}
```

### Document Version Model
```typescript
interface DocumentVersion {
  id: string
  documentId: string
  versionNumber: number
  summaryMarkdown: string
  summaryPdfPath: string
  createdAt: Date
  feedback?: string
}
```

### Processing Job Model
```typescript
interface ProcessingJob {
  id: string
  documentId: string
  type: 'text_extraction' | 'llm_processing' | 'pdf_generation'
  status: 'pending' | 'processing' | 'completed' | 'failed'
  progress: number
  errorMessage?: string
  createdAt: Date
  startedAt?: Date
  completedAt?: Date
}
```

## Error Handling

### Frontend Error Handling
- Global error boundary for React components
- Toast notifications for user-facing errors
- Retry mechanisms for failed API calls
- Graceful degradation for offline scenarios

### Backend Error Handling
- Centralized error middleware
- Structured error logging with Winston
- Error categorization (validation, processing, system)
- Automatic retry for transient failures

### File Processing Error Handling
- PDF validation before processing
- Text extraction fallback mechanisms
- LLM API timeout and retry logic
- Cleanup of failed uploads and partial processing
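
The timeout and retry logic for LLM calls could look roughly like the sketch below. All helper names (`backoffDelayMs`, `withTimeout`, `retryWithBackoff`) and the default delays are illustrative assumptions, not part of the design. For brevity the sketch retries on every error; a production version would likely retry only transient failures.

```typescript
// Exponential backoff: baseMs * 2^attempt, capped at maxMs (attempt starts at 0).
function backoffDelayMs(attempt: number, baseMs = 1000, maxMs = 30000): number {
  return Math.min(maxMs, baseMs * 2 ** attempt)
}

// Rejects if the wrapped promise does not settle within timeoutMs.
function withTimeout<T>(work: Promise<T>, timeoutMs: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(
      () => reject(new Error(`Timed out after ${timeoutMs}ms`)),
      timeoutMs,
    )
    work.then(
      (value) => {
        clearTimeout(timer)
        resolve(value)
      },
      (error) => {
        clearTimeout(timer)
        reject(error)
      },
    )
  })
}

// Runs the operation up to maxAttempts times, waiting between failed attempts.
async function retryWithBackoff<T>(
  operation: () => Promise<T>,
  options: { maxAttempts: number; timeoutMs: number; baseMs?: number },
): Promise<T> {
  let lastError: unknown
  for (let attempt = 0; attempt < options.maxAttempts; attempt++) {
    try {
      return await withTimeout(operation(), options.timeoutMs)
    } catch (error) {
      lastError = error
      // Skip the wait after the final attempt.
      if (attempt < options.maxAttempts - 1) {
        const delay = backoffDelayMs(attempt, options.baseMs)
        await new Promise((resolve) => setTimeout(resolve, delay))
      }
    }
  }
  throw lastError
}
```

Note that `withTimeout` stops waiting but does not cancel the underlying request; cancelling would need support from the HTTP client in use.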

### Error Types
```typescript
enum ErrorType {
  VALIDATION_ERROR = 'validation_error',
  AUTHENTICATION_ERROR = 'authentication_error',
  FILE_PROCESSING_ERROR = 'file_processing_error',
  LLM_PROCESSING_ERROR = 'llm_processing_error',
  STORAGE_ERROR = 'storage_error',
  SYSTEM_ERROR = 'system_error'
}
```
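
One way the centralized error middleware might use these categories is sketched below. The `AppError` class, the `toErrorResponse` helper, and the HTTP status chosen for each category are assumptions for illustration; the design does not specify them.

```typescript
enum ErrorType {
  VALIDATION_ERROR = 'validation_error',
  AUTHENTICATION_ERROR = 'authentication_error',
  FILE_PROCESSING_ERROR = 'file_processing_error',
  LLM_PROCESSING_ERROR = 'llm_processing_error',
  STORAGE_ERROR = 'storage_error',
  SYSTEM_ERROR = 'system_error'
}

// Assumed mapping from error category to HTTP status.
const httpStatusByType: Record<ErrorType, number> = {
  [ErrorType.VALIDATION_ERROR]: 400,
  [ErrorType.AUTHENTICATION_ERROR]: 401,
  [ErrorType.FILE_PROCESSING_ERROR]: 422,
  [ErrorType.LLM_PROCESSING_ERROR]: 502,
  [ErrorType.STORAGE_ERROR]: 503,
  [ErrorType.SYSTEM_ERROR]: 500,
}

// Hypothetical application error that carries its category.
class AppError extends Error {
  constructor(public readonly type: ErrorType, message: string) {
    super(message)
    this.name = 'AppError'
  }
}

// Name-based guard, which also works when classes are compiled to ES5.
function isAppError(error: unknown): error is AppError {
  return error instanceof Error && error.name === 'AppError'
}

interface ErrorResponse {
  status: number
  body: { type: ErrorType; message: string }
}

// Converts any thrown value into a response shape the middleware can send.
function toErrorResponse(error: unknown): ErrorResponse {
  if (isAppError(error)) {
    return {
      status: httpStatusByType[error.type],
      body: { type: error.type, message: error.message },
    }
  }
  // Unknown errors are reported generically so internals are not leaked.
  return {
    status: 500,
    body: { type: ErrorType.SYSTEM_ERROR, message: 'Internal server error' },
  }
}
```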

## Testing Strategy

### Unit Testing
- Jest for JavaScript/TypeScript testing
- React Testing Library for component testing
- Supertest for API endpoint testing
- Mock LLM API responses for consistent testing

### Integration Testing
- Database integration tests with test containers
- File upload and processing workflow tests
- Authentication flow testing
- PDF generation and download testing

### End-to-End Testing
- Playwright for browser automation
- Complete user workflows (upload → process → download)
- Admin functionality testing
- Error scenario testing

### Performance Testing
- Load testing for file uploads
- LLM processing performance benchmarks
- Database query optimization testing
- Memory usage monitoring during PDF processing

### Security Testing
- Authentication and authorization testing
- File upload security validation
- SQL injection prevention testing
- XSS and CSRF protection verification

## LLM Integration Design

### Prompt Engineering
The system will use a two-part prompt structure:

**Part 1: CIM Data Extraction**
- Provide the BPCP CIM Review Template
- Instruct LLM to populate only from CIM content
- Use "Not specified in CIM" for missing information
- Maintain strict markdown formatting

**Part 2: Investment Analysis**
- Add "Key Investment Considerations & Diligence Areas" section
- Allow use of general industry knowledge
- Focus on investment-specific insights and risks
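
The two-part structure could be assembled along the lines of the sketch below. The `buildPrompts` function, the `PromptPair` shape, and the exact instruction wording are illustrative assumptions; only the "Not specified in CIM" placeholder and the section title come from the design above.

```typescript
// Hypothetical container for the two prompts.
interface PromptPair {
  extractionPrompt: string
  analysisPrompt: string
}

// Builds Part 1 (extraction from the CIM only) and Part 2 (investment analysis).
function buildPrompts(template: string, cimText: string): PromptPair {
  const extractionPrompt = [
    'Populate the following review template using only information found in the CIM.',
    'If a field is not covered by the CIM, write "Not specified in CIM".',
    'Return strict Markdown that preserves the template headings.',
    '',
    '--- TEMPLATE ---',
    template,
    '--- CIM TEXT ---',
    cimText,
  ].join('\n')

  const analysisPrompt = [
    'Add a section titled "Key Investment Considerations & Diligence Areas".',
    'General industry knowledge may be used in this section.',
    'Focus on investment-specific insights and risks.',
  ].join('\n')

  return { extractionPrompt, analysisPrompt }
}
```

Keeping the two parts as separate strings makes it possible to send them as separate requests or as consecutive turns, depending on what the chosen LLM API handles best.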

### Token Management
- Document chunking for large PDFs (>100 pages)
- Token counting and optimization
- Fallback to smaller context windows if needed
- Cost tracking and monitoring
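
A minimal sketch of chunking by token budget is shown below. It assumes roughly four characters per token, which is only a rough heuristic for English text; a tokenizer matched to the selected model would give exact counts. `estimateTokens` and `chunkByTokens` are hypothetical helper names.

```typescript
// Rough heuristic (assumption): about 4 characters per token.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4)
}

// Groups paragraphs into chunks whose estimated size stays within maxTokens.
// A single paragraph larger than maxTokens becomes its own oversized chunk.
function chunkByTokens(text: string, maxTokens: number): string[] {
  const paragraphs = text.split(/\n{2,}/).filter((p) => p.trim().length > 0)
  const chunks: string[] = []
  let current = ''
  for (const paragraph of paragraphs) {
    const candidate = current ? `${current}\n\n${paragraph}` : paragraph
    if (current && estimateTokens(candidate) > maxTokens) {
      // Adding this paragraph would exceed the budget, so start a new chunk.
      chunks.push(current)
      current = paragraph
    } else {
      current = candidate
    }
  }
  if (current) chunks.push(current)
  return chunks
}
```

Splitting on paragraph boundaries keeps sentences intact, at the cost of chunks that are somewhat uneven in size.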

### Output Validation
- Markdown syntax validation
- Template structure verification
- Content completeness checking
- Retry mechanism for malformed outputs
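
Template structure verification might be approximated as in the sketch below, which checks that every required heading appears in the generated Markdown. The `SummaryValidation` shape and `validateSummary` function are hypothetical, and the list of required sections would come from the actual review template.

```typescript
// Hypothetical result shape for the structure check.
interface SummaryValidation {
  valid: boolean
  missingSections: string[]
}

// Checks that each required section title appears as a Markdown heading.
function validateSummary(markdown: string, requiredSections: string[]): SummaryValidation {
  // Collect heading text from ATX headings ("# Title" through "###### Title").
  const headings = markdown
    .split('\n')
    .map((line) => /^#{1,6}\s+(.*?)\s*$/.exec(line))
    .filter((match): match is RegExpExecArray => match !== null)
    .map((match) => match[1].toLowerCase())

  const missingSections = requiredSections.filter(
    (section) => headings.indexOf(section.toLowerCase()) === -1,
  )

  // Empty output is treated as invalid even if no sections are required.
  const valid = markdown.trim().length > 0 && missingSections.length === 0
  return { valid, missingSections }
}
```

When `valid` is false, the `missingSections` list could be fed back into the retry prompt so the model knows what to fix.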

## Security Considerations

### Authentication & Authorization
- JWT tokens with short expiration times
- Refresh token rotation
- Role-based access control (user/admin)
- Session management with Redis
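
The role-based rules could be expressed as small pure functions that middleware calls after the token has been validated. The sketch below assumes that admins may access any document while regular users may access only their own; the function names are hypothetical.

```typescript
type Role = 'user' | 'admin'

// Minimal view of the authenticated user, taken from the User model.
interface AuthenticatedUser {
  id: string
  role: Role
}

// Assumed policy: admins see every document, users only their own.
function canAccessDocument(user: AuthenticatedUser, documentOwnerId: string): boolean {
  return user.role === 'admin' || user.id === documentOwnerId
}

// Admin panel routes are restricted to the admin role.
function canAccessAdminPanel(user: AuthenticatedUser): boolean {
  return user.role === 'admin'
}
```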

### File Security
- File type validation (PDF only)
- File size limits (100MB max)
- Virus scanning integration
- Secure file storage with access controls
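
File type and size validation can go beyond checking the file extension. The sketch below inspects the first bytes of the upload for the `%PDF-` signature and enforces the 100MB limit, interpreted here as 100 × 1024 × 1024 bytes (an assumption). `checkUpload` and `UploadCheck` are hypothetical names.

```typescript
// 100MB limit, interpreted as binary megabytes (assumption).
const MAX_FILE_SIZE_BYTES = 100 * 1024 * 1024

// PDF files begin with the ASCII signature "%PDF-".
const PDF_SIGNATURE = [0x25, 0x50, 0x44, 0x46, 0x2d]

type UploadCheck = { ok: true } | { ok: false; reason: string }

// header: the first bytes of the uploaded file; sizeBytes: its total size.
function checkUpload(header: Uint8Array, sizeBytes: number): UploadCheck {
  if (sizeBytes <= 0) {
    return { ok: false, reason: 'File is empty' }
  }
  if (sizeBytes > MAX_FILE_SIZE_BYTES) {
    return { ok: false, reason: 'File exceeds the 100MB limit' }
  }
  const hasSignature =
    header.length >= PDF_SIGNATURE.length &&
    PDF_SIGNATURE.every((byte, index) => header[index] === byte)
  if (!hasSignature) {
    return { ok: false, reason: 'File is not a PDF' }
  }
  return { ok: true }
}
```

A signature check only confirms that the file claims to be a PDF; it complements, and does not replace, the virus scanning step listed above.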

### Data Protection
- Encryption at rest for sensitive documents
- HTTPS enforcement for all communications
- Input sanitization and validation
- Audit logging for admin actions

### API Security
- Rate limiting on all endpoints
- CORS configuration
- Request size limits
- API key management for LLM services
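
To illustrate the rate-limiting idea, the sketch below implements a fixed-window counter per key (for example, per user ID or IP address). It keeps state in process memory, which is an assumption made for brevity; with several API instances the counters would need to live in a shared store such as Redis. The class name and the injectable clock are hypothetical.

```typescript
// In-memory fixed-window limiter: at most `limit` requests per key per window.
class FixedWindowRateLimiter {
  private readonly windows = new Map<string, { windowStart: number; count: number }>()

  constructor(
    private readonly limit: number,
    private readonly windowMs: number,
    // The clock is injectable so the limiter can be tested deterministically.
    private readonly now: () => number = () => Date.now(),
  ) {}

  // Returns true if the request identified by key is allowed.
  allow(key: string): boolean {
    const currentTime = this.now()
    const entry = this.windows.get(key)
    if (!entry || currentTime - entry.windowStart >= this.windowMs) {
      // First request for this key, or the previous window has expired.
      this.windows.set(key, { windowStart: currentTime, count: 1 })
      return true
    }
    if (entry.count < this.limit) {
      entry.count += 1
      return true
    }
    return false
  }
}
```

Fixed windows are simple but allow short bursts around window boundaries; a sliding window or token bucket would smooth those out if that matters for the LLM-backed endpoints.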

## Performance Optimization

### File Processing
- Asynchronous processing with job queues
- Progress tracking and status updates
- Parallel processing for multiple documents
- Efficient PDF text extraction

### Database Optimization
- Proper indexing on frequently queried fields
- Connection pooling
- Query optimization
- Database migrations management

### Caching Strategy
- Redis caching for user sessions
- Document metadata caching
- LLM response caching for similar content
- Static asset caching
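
The metadata caching behavior can be pictured with a small time-to-live cache. The sketch below is an in-process stand-in written for illustration; in the design above Redis would play this role, and `TtlCache` is a hypothetical name.

```typescript
// Minimal TTL cache: entries expire ttlMs after they are written.
class TtlCache<V> {
  private readonly entries = new Map<string, { value: V; expiresAt: number }>()

  constructor(
    private readonly ttlMs: number,
    // The clock is injectable so expiry can be tested deterministically.
    private readonly now: () => number = () => Date.now(),
  ) {}

  // Returns the cached value, or undefined if absent or expired.
  get(key: string): V | undefined {
    const entry = this.entries.get(key)
    if (!entry) return undefined
    if (this.now() >= entry.expiresAt) {
      this.entries.delete(key)
      return undefined
    }
    return entry.value
  }

  set(key: string, value: V): void {
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs })
  }

  // Explicit invalidation, e.g. after a document's status changes.
  delete(key: string): void {
    this.entries.delete(key)
  }
}
```

Because document status changes frequently during processing, cached metadata would need either a short TTL or explicit invalidation on every status update.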

### Scalability Considerations
- Horizontal scaling capability
- Load balancing for multiple instances
- Database read replicas
- CDN for static assets and downloads