Files
cim_summary/.kiro/specs/cim-document-processor/design.md
Jon 5a3c961bfc feat: Complete implementation of Tasks 1-5 - CIM Document Processor
Backend Infrastructure:
- Complete Express server setup with security middleware (helmet, CORS, rate limiting)
- Comprehensive error handling and logging with Winston
- Authentication system with JWT tokens and session management
- Database models and migrations for Users, Documents, Feedback, and Processing Jobs
- API routes structure for authentication and document management
- Integration tests for all server components (86 tests passing)

Frontend Infrastructure:
- React application with TypeScript and Vite
- Authentication UI with login form, protected routes, and logout functionality
- Authentication context with proper async state management
- Component tests with proper async handling (25 tests passing)
- Tailwind CSS styling and responsive design

Key Features:
- User registration, login, and authentication
- Protected routes with role-based access control
- Comprehensive error handling and user feedback
- Database schema with proper relationships
- Security middleware and validation
- Production-ready build configuration

Test Coverage: 111/111 tests passing
Tasks Completed: 1-5 (Project setup, Database, Auth system, Frontend UI, Backend infrastructure)

Ready for Task 6: File upload backend infrastructure
2025-07-27 13:29:26 -04:00

381 lines
9.6 KiB
Markdown

# Design Document
## Overview
The CIM Document Processor is a web-based application that enables authenticated team members to upload large PDF documents (CIMs), have them analyzed by an LLM using a structured template, and download the results in both Markdown and PDF formats. The system follows a modern web architecture with secure authentication, robust file processing, and comprehensive admin oversight.
## Architecture
### High-Level Architecture
```mermaid
graph TB
subgraph "Frontend Layer"
UI[React Web Application]
Auth[Authentication UI]
Upload[File Upload Interface]
Dashboard[User Dashboard]
Admin[Admin Panel]
end
subgraph "Backend Layer"
API[Express.js API Server]
AuthM[Authentication Middleware]
FileH[File Handler Service]
LLMS[LLM Processing Service]
PDF[PDF Generation Service]
end
subgraph "Data Layer"
DB[(PostgreSQL Database)]
FileStore[File Storage (AWS S3/Local)]
Cache[Redis Cache]
end
subgraph "External Services"
LLM[LLM API (OpenAI/Anthropic)]
PDFLib[PDF Processing Library]
end
UI --> API
Auth --> AuthM
Upload --> FileH
Dashboard --> API
Admin --> API
API --> DB
API --> FileStore
API --> Cache
FileH --> FileStore
LLMS --> LLM
PDF --> PDFLib
API --> LLMS
API --> PDF
```
### Technology Stack
**Frontend:**
- React 18 with TypeScript
- Tailwind CSS for styling
- React Router for navigation
- Axios for API communication
- React Query for state management and caching
**Backend:**
- Node.js with Express.js
- TypeScript for type safety
- JWT for authentication
- Multer for file uploads
- Bull Queue for background job processing
**Database:**
- PostgreSQL for primary data storage
- Redis for session management and job queues
**File Processing:**
- PDF-parse for text extraction
- Puppeteer for PDF generation from Markdown
- AWS S3 or local file system for file storage
**LLM Integration:**
- OpenAI API or Anthropic Claude API
- Configurable model selection
- Token management and rate limiting
## Components and Interfaces
### Frontend Components
#### Authentication Components
- `LoginForm`: Handles user login with validation
- `AuthGuard`: Protects routes requiring authentication
- `SessionManager`: Manages user session state
#### Upload Components
- `FileUploader`: Drag-and-drop PDF upload with progress
- `UploadValidator`: Client-side file validation
- `UploadProgress`: Real-time upload status display
#### Dashboard Components
- `DocumentList`: Displays user's uploaded documents
- `DocumentCard`: Individual document status and actions
- `ProcessingStatus`: Real-time processing updates
- `DownloadButtons`: Markdown and PDF download options
#### Admin Components
- `AdminDashboard`: Overview of all system documents
- `UserManagement`: User account management
- `DocumentArchive`: System-wide document access
- `SystemMetrics`: Storage and processing statistics
### Backend Services
#### Authentication Service
```typescript
interface AuthService {
login(credentials: LoginCredentials): Promise<AuthResult>
validateToken(token: string): Promise<User>
logout(userId: string): Promise<void>
refreshToken(refreshToken: string): Promise<AuthResult>
}
```
#### Document Service
```typescript
interface DocumentService {
uploadDocument(file: File, userId: string): Promise<Document>
getDocuments(userId: string): Promise<Document[]>
getDocument(documentId: string): Promise<Document>
deleteDocument(documentId: string): Promise<void>
updateDocumentStatus(documentId: string, status: ProcessingStatus): Promise<void>
}
```
#### LLM Processing Service
```typescript
interface LLMService {
processDocument(documentId: string, extractedText: string): Promise<ProcessingResult>
regenerateWithFeedback(documentId: string, feedback: string): Promise<ProcessingResult>
validateOutput(output: string): Promise<ValidationResult>
}
```
#### PDF Service
```typescript
interface PDFService {
extractText(filePath: string): Promise<string>
generatePDF(markdown: string): Promise<Buffer>
validatePDF(filePath: string): Promise<boolean>
}
```
## Data Models
### User Model
```typescript
interface User {
id: string
email: string
name: string
role: 'user' | 'admin'
createdAt: Date
updatedAt: Date
}
```
### Document Model
```typescript
interface Document {
id: string
userId: string
originalFileName: string
filePath: string
fileSize: number
uploadedAt: Date
status: ProcessingStatus
extractedText?: string
generatedSummary?: string
summaryMarkdownPath?: string
summaryPdfPath?: string
processingStartedAt?: Date
processingCompletedAt?: Date
errorMessage?: string
feedback?: DocumentFeedback[]
versions: DocumentVersion[]
}
type ProcessingStatus =
| 'uploaded'
| 'extracting_text'
| 'processing_llm'
| 'generating_pdf'
| 'completed'
| 'failed'
```
### Document Feedback Model
```typescript
interface DocumentFeedback {
id: string
documentId: string
userId: string
feedback: string
regenerationInstructions?: string
createdAt: Date
}
```
### Document Version Model
```typescript
interface DocumentVersion {
id: string
documentId: string
versionNumber: number
summaryMarkdown: string
summaryPdfPath: string
createdAt: Date
feedback?: string
}
```
### Processing Job Model
```typescript
interface ProcessingJob {
id: string
documentId: string
type: 'text_extraction' | 'llm_processing' | 'pdf_generation'
status: 'pending' | 'processing' | 'completed' | 'failed'
progress: number
errorMessage?: string
createdAt: Date
startedAt?: Date
completedAt?: Date
}
```
## Error Handling
### Frontend Error Handling
- Global error boundary for React components
- Toast notifications for user-facing errors
- Retry mechanisms for failed API calls
- Graceful degradation for offline scenarios
### Backend Error Handling
- Centralized error middleware
- Structured error logging with Winston
- Error categorization (validation, processing, system)
- Automatic retry for transient failures
### File Processing Error Handling
- PDF validation before processing
- Text extraction fallback mechanisms
- LLM API timeout and retry logic
- Cleanup of failed uploads and partial processing
### Error Types
```typescript
enum ErrorType {
VALIDATION_ERROR = 'validation_error',
AUTHENTICATION_ERROR = 'authentication_error',
FILE_PROCESSING_ERROR = 'file_processing_error',
LLM_PROCESSING_ERROR = 'llm_processing_error',
STORAGE_ERROR = 'storage_error',
SYSTEM_ERROR = 'system_error'
}
```
## Testing Strategy
### Unit Testing
- Jest for JavaScript/TypeScript testing
- React Testing Library for component testing
- Supertest for API endpoint testing
- Mock LLM API responses for consistent testing
### Integration Testing
- Database integration tests with test containers
- File upload and processing workflow tests
- Authentication flow testing
- PDF generation and download testing
### End-to-End Testing
- Playwright for browser automation
- Complete user workflows (upload → process → download)
- Admin functionality testing
- Error scenario testing
### Performance Testing
- Load testing for file uploads
- LLM processing performance benchmarks
- Database query optimization testing
- Memory usage monitoring during PDF processing
### Security Testing
- Authentication and authorization testing
- File upload security validation
- SQL injection prevention testing
- XSS and CSRF protection verification
## LLM Integration Design
### Prompt Engineering
The system will use a two-part prompt structure:
**Part 1: CIM Data Extraction**
- Provide the BPCP CIM Review Template
- Instruct LLM to populate only from CIM content
- Use "Not specified in CIM" for missing information
- Maintain strict markdown formatting
**Part 2: Investment Analysis**
- Add "Key Investment Considerations & Diligence Areas" section
- Allow use of general industry knowledge
- Focus on investment-specific insights and risks
### Token Management
- Document chunking for large PDFs (>100 pages)
- Token counting and optimization
- Fallback to smaller context windows if needed
- Cost tracking and monitoring
### Output Validation
- Markdown syntax validation
- Template structure verification
- Content completeness checking
- Retry mechanism for malformed outputs
## Security Considerations
### Authentication & Authorization
- JWT tokens with short expiration times
- Refresh token rotation
- Role-based access control (user/admin)
- Session management with Redis
### File Security
- File type validation (PDF only)
- File size limits (100MB max)
- Virus scanning integration
- Secure file storage with access controls
### Data Protection
- Encryption at rest for sensitive documents
- HTTPS enforcement for all communications
- Input sanitization and validation
- Audit logging for admin actions
### API Security
- Rate limiting on all endpoints
- CORS configuration
- Request size limits
- API key management for LLM services
## Performance Optimization
### File Processing
- Asynchronous processing with job queues
- Progress tracking and status updates
- Parallel processing for multiple documents
- Efficient PDF text extraction
### Database Optimization
- Proper indexing on frequently queried fields
- Connection pooling
- Query optimization
- Database migrations management
### Caching Strategy
- Redis caching for user sessions
- Document metadata caching
- LLM response caching for similar content
- Static asset caching
### Scalability Considerations
- Horizontal scaling capability
- Load balancing for multiple instances
- Database read replicas
- CDN for static assets and downloads