# Design Document

## Overview

The CIM Document Processor is a web-based application that enables authenticated team members to upload large PDF documents (CIMs), have them analyzed by an LLM using a structured template, and download the results in both Markdown and PDF formats. The system follows a modern web architecture with secure authentication, robust file processing, and comprehensive admin oversight.

## Architecture

### High-Level Architecture

```mermaid
graph TB
    subgraph "Frontend Layer"
        UI[React Web Application]
        Auth[Authentication UI]
        Upload[File Upload Interface]
        Dashboard[User Dashboard]
        Admin[Admin Panel]
    end

    subgraph "Backend Layer"
        API[Express.js API Server]
        AuthM[Authentication Middleware]
        FileH[File Handler Service]
        LLMS[LLM Processing Service]
        PDF[PDF Generation Service]
    end

    subgraph "Data Layer"
        DB[(PostgreSQL Database)]
        FileStore["File Storage (AWS S3/Local)"]
        Cache[Redis Cache]
    end

    subgraph "External Services"
        LLM["LLM API (OpenAI/Anthropic)"]
        PDFLib[PDF Processing Library]
    end

    UI --> API
    Auth --> AuthM
    Upload --> FileH
    Dashboard --> API
    Admin --> API

    API --> DB
    API --> FileStore
    API --> Cache

    FileH --> FileStore
    LLMS --> LLM
    PDF --> PDFLib

    API --> LLMS
    API --> PDF
```

### Technology Stack

**Frontend:**
- React 18 with TypeScript
- Tailwind CSS for styling
- React Router for navigation
- Axios for API communication
- React Query for state management and caching

**Backend:**
- Node.js with Express.js
- TypeScript for type safety
- JWT for authentication
- Multer for file uploads
- Bull Queue for background job processing

**Database:**
- PostgreSQL for primary data storage
- Redis for session management and job queues

**File Processing:**
- PDF-parse for text extraction
- Puppeteer for PDF generation from Markdown
- AWS S3 or local file system for file storage

**LLM Integration:**
- OpenAI API or Anthropic Claude API
- Configurable model selection
- Token management and rate limiting

## Components and Interfaces

### Frontend Components

#### Authentication Components
- `LoginForm`: Handles user login with validation
- `AuthGuard`: Protects routes requiring authentication
- `SessionManager`: Manages user session state

#### Upload Components
- `FileUploader`: Drag-and-drop PDF upload with progress
- `UploadValidator`: Client-side file validation
- `UploadProgress`: Real-time upload status display

#### Dashboard Components
- `DocumentList`: Displays user's uploaded documents
- `DocumentCard`: Individual document status and actions
- `ProcessingStatus`: Real-time processing updates
- `DownloadButtons`: Markdown and PDF download options

#### Admin Components
- `AdminDashboard`: Overview of all system documents
- `UserManagement`: User account management
- `DocumentArchive`: System-wide document access
- `SystemMetrics`: Storage and processing statistics

### Backend Services

#### Authentication Service
```typescript
interface AuthService {
  login(credentials: LoginCredentials): Promise<AuthResult>
  validateToken(token: string): Promise<User>
  logout(userId: string): Promise<void>
  refreshToken(refreshToken: string): Promise<AuthResult>
}
```

#### Document Service
```typescript
interface DocumentService {
  uploadDocument(file: File, userId: string): Promise<Document>
  getDocuments(userId: string): Promise<Document[]>
  getDocument(documentId: string): Promise<Document>
  deleteDocument(documentId: string): Promise<void>
  updateDocumentStatus(documentId: string, status: ProcessingStatus): Promise<void>
}
```

#### LLM Processing Service
```typescript
interface LLMService {
  processDocument(documentId: string, extractedText: string): Promise<ProcessingResult>
  regenerateWithFeedback(documentId: string, feedback: string): Promise<ProcessingResult>
  validateOutput(output: string): Promise<ValidationResult>
}
```

#### PDF Service
```typescript
interface PDFService {
  extractText(filePath: string): Promise<string>
  generatePDF(markdown: string): Promise<Buffer>
  validatePDF(filePath: string): Promise<boolean>
}
```

## Data Models

### User Model
```typescript
interface User {
  id: string
  email: string
  name: string
  role: 'user' | 'admin'
  createdAt: Date
  updatedAt: Date
}
```

### Document Model
```typescript
interface Document {
  id: string
  userId: string
  originalFileName: string
  filePath: string
  fileSize: number
  uploadedAt: Date
  status: ProcessingStatus
  extractedText?: string
  generatedSummary?: string
  summaryMarkdownPath?: string
  summaryPdfPath?: string
  processingStartedAt?: Date
  processingCompletedAt?: Date
  errorMessage?: string
  feedback?: DocumentFeedback[]
  versions: DocumentVersion[]
}

type ProcessingStatus =
  | 'uploaded'
  | 'extracting_text'
  | 'processing_llm'
  | 'generating_pdf'
  | 'completed'
  | 'failed'
```

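
The order of the `ProcessingStatus` values suggests a linear pipeline. A minimal sketch of a transition guard follows, assuming that order is the intended flow. The `ALLOWED_TRANSITIONS` table and the `canTransition` helper are hypothetical, and the two re-entry rules (regeneration from `completed`, retry from `failed`) are assumptions rather than documented behavior.

```typescript
type ProcessingStatus =
  | 'uploaded'
  | 'extracting_text'
  | 'processing_llm'
  | 'generating_pdf'
  | 'completed'
  | 'failed';

// Hypothetical transition table, inferred from the order of the status values.
const ALLOWED_TRANSITIONS: Record<ProcessingStatus, readonly ProcessingStatus[]> = {
  uploaded: ['extracting_text', 'failed'],
  extracting_text: ['processing_llm', 'failed'],
  processing_llm: ['generating_pdf', 'failed'],
  generating_pdf: ['completed', 'failed'],
  // Assumption: feedback-driven regeneration re-enters the LLM step.
  completed: ['processing_llm'],
  // Assumption: a failed document may be retried from the start.
  failed: ['uploaded'],
};

// Returns true when moving from `from` to `to` is listed in the table above.
function canTransition(from: ProcessingStatus, to: ProcessingStatus): boolean {
  return ALLOWED_TRANSITIONS[from].indexOf(to) !== -1;
}
```

A guard like this could be called inside `updateDocumentStatus` so that a late or duplicated job event cannot move a document backwards.
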
### Document Feedback Model
```typescript
interface DocumentFeedback {
  id: string
  documentId: string
  userId: string
  feedback: string
  regenerationInstructions?: string
  createdAt: Date
}
```

### Document Version Model
```typescript
interface DocumentVersion {
  id: string
  documentId: string
  versionNumber: number
  summaryMarkdown: string
  summaryPdfPath: string
  createdAt: Date
  feedback?: string
}
```

### Processing Job Model
```typescript
interface ProcessingJob {
  id: string
  documentId: string
  type: 'text_extraction' | 'llm_processing' | 'pdf_generation'
  status: 'pending' | 'processing' | 'completed' | 'failed'
  progress: number
  errorMessage?: string
  createdAt: Date
  startedAt?: Date
  completedAt?: Date
}
```

## Error Handling

### Frontend Error Handling
- Global error boundary for React components
- Toast notifications for user-facing errors
- Retry mechanisms for failed API calls
- Graceful degradation for offline scenarios

### Backend Error Handling
- Centralized error middleware
- Structured error logging with Winston
- Error categorization (validation, processing, system)
- Automatic retry for transient failures

### File Processing Error Handling
- PDF validation before processing
- Text extraction fallback mechanisms
- LLM API timeout and retry logic
- Cleanup of failed uploads and partial processing

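
The retry items above (automatic retry for transient failures, LLM API retry logic) could be served by one shared helper. The following is a sketch only: `withRetry`, `backoffDelayMs`, and `RetryOptions` are hypothetical names, the exponential schedule without jitter is an assumption, and deciding what counts as transient is left to the caller.

```typescript
// Hypothetical options; none of these names come from the design itself.
interface RetryOptions {
  maxAttempts: number;
  baseDelayMs: number;
  maxDelayMs: number;
  isTransient: (error: unknown) => boolean;
}

// Exponential backoff: base * 2^attempt, capped at maxMs (no jitter).
function backoffDelayMs(attempt: number, baseMs: number, maxMs: number): number {
  return Math.min(maxMs, baseMs * Math.pow(2, attempt));
}

// Runs `operation`, retrying only when the error is classified as transient.
async function withRetry<T>(
  operation: () => Promise<T>,
  options: RetryOptions,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < options.maxAttempts; attempt++) {
    try {
      return await operation();
    } catch (error) {
      lastError = error;
      const isLastAttempt = attempt === options.maxAttempts - 1;
      if (isLastAttempt || !options.isTransient(error)) break;
      const delay = backoffDelayMs(attempt, options.baseDelayMs, options.maxDelayMs);
      await new Promise<void>((resolve) => setTimeout(resolve, delay));
    }
  }
  // Either the error was permanent or every attempt was used up.
  throw lastError;
}
```

Timeouts are not handled here; a caller would wrap the LLM request itself in a timeout and let the resulting error flow through `isTransient`.
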
### Error Types
```typescript
enum ErrorType {
  VALIDATION_ERROR = 'validation_error',
  AUTHENTICATION_ERROR = 'authentication_error',
  FILE_PROCESSING_ERROR = 'file_processing_error',
  LLM_PROCESSING_ERROR = 'llm_processing_error',
  STORAGE_ERROR = 'storage_error',
  SYSTEM_ERROR = 'system_error'
}
```

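
One way the centralized error middleware could use these categories is to translate each into an HTTP response. The sketch below is illustrative: the status codes chosen for each `ErrorType`, the `AppError` shape, and the `toErrorResponse` helper are all assumptions, not part of the design above.

```typescript
enum ErrorType {
  VALIDATION_ERROR = 'validation_error',
  AUTHENTICATION_ERROR = 'authentication_error',
  FILE_PROCESSING_ERROR = 'file_processing_error',
  LLM_PROCESSING_ERROR = 'llm_processing_error',
  STORAGE_ERROR = 'storage_error',
  SYSTEM_ERROR = 'system_error',
}

// Hypothetical shape for errors raised by application code.
interface AppError {
  type: ErrorType;
  message: string;
}

interface ErrorResponse {
  status: number;
  body: { type: ErrorType; message: string };
}

// Assumed mapping; the design does not prescribe status codes.
const HTTP_STATUS: Record<ErrorType, number> = {
  [ErrorType.VALIDATION_ERROR]: 400,
  [ErrorType.AUTHENTICATION_ERROR]: 401,
  [ErrorType.FILE_PROCESSING_ERROR]: 422,
  [ErrorType.LLM_PROCESSING_ERROR]: 502,
  [ErrorType.STORAGE_ERROR]: 503,
  [ErrorType.SYSTEM_ERROR]: 500,
};

// Type guard: true only for objects carrying a known ErrorType and a message.
function isAppError(value: unknown): value is AppError {
  if (typeof value !== 'object' || value === null) return false;
  const candidate = value as { type?: unknown; message?: unknown };
  return (
    typeof candidate.message === 'string' &&
    typeof candidate.type === 'string' &&
    Object.prototype.hasOwnProperty.call(HTTP_STATUS, candidate.type)
  );
}

function toErrorResponse(error: unknown): ErrorResponse {
  if (isAppError(error)) {
    return {
      status: HTTP_STATUS[error.type],
      body: { type: error.type, message: error.message },
    };
  }
  // Unknown errors are reported generically so internal details do not leak.
  return {
    status: 500,
    body: { type: ErrorType.SYSTEM_ERROR, message: 'Internal server error' },
  };
}
```
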
## Testing Strategy

### Unit Testing
- Jest for JavaScript/TypeScript testing
- React Testing Library for component testing
- Supertest for API endpoint testing
- Mock LLM API responses for consistent testing

### Integration Testing
- Database integration tests with test containers
- File upload and processing workflow tests
- Authentication flow testing
- PDF generation and download testing

### End-to-End Testing
- Playwright for browser automation
- Complete user workflows (upload → process → download)
- Admin functionality testing
- Error scenario testing

### Performance Testing
- Load testing for file uploads
- LLM processing performance benchmarks
- Database query optimization testing
- Memory usage monitoring during PDF processing

### Security Testing
- Authentication and authorization testing
- File upload security validation
- SQL injection prevention testing
- XSS and CSRF protection verification

## LLM Integration Design

### Prompt Engineering
The system will use a two-part prompt structure:

**Part 1: CIM Data Extraction**
- Provide the BPCP CIM Review Template
- Instruct LLM to populate only from CIM content
- Use "Not specified in CIM" for missing information
- Maintain strict markdown formatting

**Part 2: Investment Analysis**
- Add "Key Investment Considerations & Diligence Areas" section
- Allow use of general industry knowledge
- Focus on investment-specific insights and risks

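
The two-part structure above could be assembled along the following lines. This is a sketch under stated assumptions: `buildPrompts` and `PromptParts` are hypothetical names, the instruction wording is illustrative, and the template text is passed in by the caller because its contents are not reproduced in this document.

```typescript
// Hypothetical container for the two prompts described above.
interface PromptParts {
  extraction: string;
  analysis: string;
}

// Builds the Part 1 (extraction) and Part 2 (analysis) prompts.
// `template` is the review template text; `cimText` is the extracted PDF text.
function buildPrompts(template: string, cimText: string): PromptParts {
  const extraction = [
    'Populate the review template below using only information found in the CIM.',
    'If a field is not covered by the CIM, write "Not specified in CIM".',
    "Preserve the template's markdown structure exactly.",
    '',
    '--- TEMPLATE ---',
    template,
    '',
    '--- CIM ---',
    cimText,
  ].join('\n');

  const analysis = [
    'Add a section titled "Key Investment Considerations & Diligence Areas".',
    'General industry knowledge may be used in this section.',
    'Focus on investment-specific insights and risks.',
  ].join('\n');

  return { extraction, analysis };
}
```

Keeping the two parts as separate strings leaves open whether they are sent as one request or two, which the design does not specify.
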
### Token Management
- Document chunking for large PDFs (>100 pages)
- Token counting and optimization
- Fallback to smaller context windows if needed
- Cost tracking and monitoring

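
Document chunking can be sketched as packing paragraphs into chunks that stay under a token budget. The example below assumes a rough estimate of four characters per token, which is only a heuristic; a real implementation would use the tokenizer of the selected model. `estimateTokens` and `chunkByTokens` are hypothetical helpers.

```typescript
// Rough heuristic (assumption): about 4 characters per token.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Greedily packs paragraphs into chunks whose estimated size stays within
// maxTokens. Paragraphs are delimited by blank lines.
function chunkByTokens(text: string, maxTokens: number): string[] {
  const paragraphs = text
    .split(/\n\s*\n/)
    .filter((paragraph) => paragraph.trim().length > 0);

  const chunks: string[] = [];
  let current = '';
  for (const paragraph of paragraphs) {
    const candidate = current ? current + '\n\n' + paragraph : paragraph;
    if (current === '' || estimateTokens(candidate) <= maxTokens) {
      // A single oversized paragraph is kept whole rather than cut mid-sentence.
      current = candidate;
    } else {
      chunks.push(current);
      current = paragraph;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}
```

Because an oversized paragraph is kept whole, a chunk can exceed the budget; callers that need a hard limit would have to split such paragraphs further.
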
### Output Validation
- Markdown syntax validation
- Template structure verification
- Content completeness checking
- Retry mechanism for malformed outputs

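
Template structure verification might be as simple as checking that every required heading appears in the generated Markdown. The sketch below assumes the required headings are supplied by the caller. The shape of `ValidationResult` is an assumption (the type is only named in the `LLMService` interface), and `validateSummary` is a hypothetical helper.

```typescript
// Assumed shape; the design names ValidationResult but does not define it.
interface ValidationResult {
  valid: boolean;
  missingSections: string[];
  unclosedCodeFence: boolean;
}

// Checks that each required heading is present and that code fences are balanced.
function validateSummary(markdown: string, requiredHeadings: string[]): ValidationResult {
  const headings: string[] = [];
  let insideFence = false;

  for (const line of markdown.split('\n')) {
    // Toggle on fence lines so '#' inside code blocks is not read as a heading.
    if (/^\s{0,3}(```|~~~)/.test(line)) {
      insideFence = !insideFence;
      continue;
    }
    if (insideFence) continue;
    const match = /^#{1,6}\s+(.*?)\s*#*\s*$/.exec(line);
    if (match) headings.push(match[1].trim().toLowerCase());
  }

  const missingSections = requiredHeadings.filter(
    (heading) => headings.indexOf(heading.trim().toLowerCase()) === -1,
  );
  return {
    valid: missingSections.length === 0 && !insideFence,
    missingSections,
    unclosedCodeFence: insideFence,
  };
}
```

A result with `valid: false` is the natural trigger for the retry mechanism, with `missingSections` fed back into the follow-up prompt.
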
## Security Considerations

### Authentication & Authorization
- JWT tokens with short expiration times
- Refresh token rotation
- Role-based access control (user/admin)
- Session management with Redis

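
Role-based access control with the two roles above can be reduced to a small policy function. This is a sketch under assumptions: the `Action` values and the rule that regular users may only touch their own documents are illustrative and not specified in this document.

```typescript
type Role = 'user' | 'admin';

interface Actor {
  id: string;
  role: Role;
}

interface OwnedDocument {
  id: string;
  userId: string;
}

// Hypothetical action names used only for this example.
type Action = 'read' | 'delete' | 'manage_users';

// Admins may do everything; users may act only on documents they own.
function isAllowed(actor: Actor, action: Action, document?: OwnedDocument): boolean {
  if (actor.role === 'admin') return true;
  if (action === 'manage_users') return false;
  return document !== undefined && document.userId === actor.id;
}
```
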
### File Security
- File type validation (PDF only)
- File size limits (100MB max)
- Virus scanning integration
- Secure file storage with access controls

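
The first two checks above can be sketched as a pure function over the file name, its size, and its first few bytes. PDF files begin with the marker `%PDF-`; checking it at offset 0 is stricter than some readers, which is a simplification here. `checkUpload` is a hypothetical helper, and reading 100MB as 100 × 1024 × 1024 bytes is an assumption.

```typescript
// Assumption: the 100MB limit is interpreted as 100 MiB.
const MAX_UPLOAD_BYTES = 100 * 1024 * 1024;

// ASCII bytes for "%PDF-", the marker at the start of a PDF file.
const PDF_MAGIC = [0x25, 0x50, 0x44, 0x46, 0x2d];

type UploadCheck = { ok: true } | { ok: false; reason: string };

// `head` holds the first bytes of the uploaded file (at least 5 are needed).
function checkUpload(fileName: string, sizeBytes: number, head: Uint8Array): UploadCheck {
  if (!/\.pdf$/i.test(fileName)) {
    return { ok: false, reason: 'File name must end in .pdf' };
  }
  if (sizeBytes <= 0 || sizeBytes > MAX_UPLOAD_BYTES) {
    return { ok: false, reason: 'File size must be between 1 byte and 100MB' };
  }
  const hasMagic =
    head.length >= PDF_MAGIC.length &&
    PDF_MAGIC.every((byte, index) => head[index] === byte);
  if (!hasMagic) {
    return { ok: false, reason: 'File content does not start with %PDF-' };
  }
  return { ok: true };
}
```

Checking the content marker in addition to the extension catches renamed files; it does not replace virus scanning.
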
### Data Protection
- Encryption at rest for sensitive documents
- HTTPS enforcement for all communications
- Input sanitization and validation
- Audit logging for admin actions

### API Security
- Rate limiting on all endpoints
- CORS configuration
- Request size limits
- API key management for LLM services

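
To make the rate-limiting item concrete, here is a sketch of a fixed-window limiter. It is an in-memory illustration with a hypothetical class name, not the middleware the server uses; a multi-instance deployment would need shared state (for example in Redis). The current time is passed in explicitly so the behavior is easy to test.

```typescript
// Fixed-window limiter: at most `limit` requests per key in each window.
class FixedWindowRateLimiter {
  private readonly windows = new Map<string, { windowStart: number; count: number }>();

  constructor(
    private readonly limit: number,
    private readonly windowMs: number,
  ) {}

  // Returns true if the request identified by `key` may proceed at time `nowMs`.
  allow(key: string, nowMs: number): boolean {
    const entry = this.windows.get(key);
    if (!entry || nowMs - entry.windowStart >= this.windowMs) {
      // First request for this key, or the previous window has elapsed.
      this.windows.set(key, { windowStart: nowMs, count: 1 });
      return true;
    }
    if (entry.count >= this.limit) return false;
    entry.count += 1;
    return true;
  }
}
```

A fixed window permits short bursts around window boundaries; sliding-window or token-bucket schemes smooth that out at the cost of more bookkeeping.
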
## Performance Optimization

### File Processing
- Asynchronous processing with job queues
- Progress tracking and status updates
- Parallel processing for multiple documents
- Efficient PDF text extraction

### Database Optimization
- Proper indexing on frequently queried fields
- Connection pooling
- Query optimization
- Database migrations management

### Caching Strategy
- Redis caching for user sessions
- Document metadata caching
- LLM response caching for similar content
- Static asset caching

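
LLM response caching can be illustrated with a small time-to-live cache keyed by normalized input text. This sketch makes several simplifying assumptions: it is in-memory rather than Redis-backed, "similar content" is reduced to text that is identical after whitespace normalization, and `TtlCache` and `normalizeForCache` are hypothetical names. For long inputs, a cryptographic digest of the normalized text would be a more typical key.

```typescript
// Minimal TTL cache; the clock is passed in so expiry is deterministic in tests.
class TtlCache<V> {
  private readonly entries = new Map<string, { value: V; expiresAt: number }>();

  constructor(private readonly ttlMs: number) {}

  get(key: string, nowMs: number): V | undefined {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (nowMs >= entry.expiresAt) {
      // Expired entries are dropped lazily on read.
      this.entries.delete(key);
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: V, nowMs: number): void {
    this.entries.set(key, { value, expiresAt: nowMs + this.ttlMs });
  }
}

// Collapses whitespace so trivially different extractions share a cache key.
function normalizeForCache(text: string): string {
  return text.trim().replace(/\s+/g, ' ');
}
```

The cache key should also include the model and prompt version in practice, since a change to either invalidates earlier responses.
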
### Scalability Considerations
- Horizontal scaling capability
- Load balancing for multiple instances
- Database read replicas
- CDN for static assets and downloads