Backend Infrastructure:
- Complete Express server setup with security middleware (helmet, CORS, rate limiting)
- Comprehensive error handling and logging with Winston
- Authentication system with JWT tokens and session management
- Database models and migrations for Users, Documents, Feedback, and Processing Jobs
- API routes structure for authentication and document management
- Integration tests for all server components (86 tests passing)
Frontend Infrastructure:
- React application with TypeScript and Vite
- Authentication UI with login form, protected routes, and logout functionality
- Authentication context with proper async state management
- Component tests with proper async handling (25 tests passing)
- Tailwind CSS styling and responsive design
Key Features:
- User registration, login, and authentication
- Protected routes with role-based access control
- Comprehensive error handling and user feedback
- Database schema with proper relationships
- Security middleware and validation
- Production-ready build configuration
Test Coverage: 111/111 tests passing
Tasks Completed: 1-5 (Project setup, Database, Auth system, Frontend UI, Backend infrastructure)
Ready for Task 6: File upload backend infrastructure
Design Document
Overview
The CIM Document Processor is a web application that lets authenticated team members upload large PDF documents (Confidential Information Memorandums, or CIMs), have an LLM analyze them against a structured template, and download the results as Markdown and PDF. The system pairs a React frontend with an Express.js API and provides secure authentication, background file processing, and an admin panel for oversight.
Architecture
High-Level Architecture
graph TB
subgraph "Frontend Layer"
UI[React Web Application]
Auth[Authentication UI]
Upload[File Upload Interface]
Dashboard[User Dashboard]
Admin[Admin Panel]
end
subgraph "Backend Layer"
API[Express.js API Server]
AuthM[Authentication Middleware]
FileH[File Handler Service]
LLMS[LLM Processing Service]
PDF[PDF Generation Service]
end
subgraph "Data Layer"
DB[(PostgreSQL Database)]
FileStore["File Storage (AWS S3/Local)"]
Cache[Redis Cache]
end
subgraph "External Services"
LLM["LLM API (OpenAI/Anthropic)"]
PDFLib[PDF Processing Library]
end
UI --> API
Auth --> AuthM
Upload --> FileH
Dashboard --> API
Admin --> API
API --> DB
API --> FileStore
API --> Cache
FileH --> FileStore
LLMS --> LLM
PDF --> PDFLib
API --> LLMS
API --> PDF
Technology Stack
Frontend:
- React 18 with TypeScript
- Tailwind CSS for styling
- React Router for navigation
- Axios for API communication
- React Query for server-state management and caching
Backend:
- Node.js with Express.js
- TypeScript for type safety
- JWT for authentication
- Multer for file uploads
- Bull (Redis-backed queue) for background job processing
Database:
- PostgreSQL for primary data storage
- Redis for session management and job queues
File Processing:
- pdf-parse for text extraction
- Puppeteer for PDF generation from Markdown
- AWS S3 or local file system for file storage
LLM Integration:
- OpenAI API or Anthropic Claude API
- Configurable model selection
- Token management and rate limiting
Components and Interfaces
Frontend Components
Authentication Components
- LoginForm: Handles user login with validation
- AuthGuard: Protects routes requiring authentication
- SessionManager: Manages user session state
Upload Components
- FileUploader: Drag-and-drop PDF upload with progress
- UploadValidator: Client-side file validation
- UploadProgress: Real-time upload status display
Dashboard Components
- DocumentList: Displays user's uploaded documents
- DocumentCard: Individual document status and actions
- ProcessingStatus: Real-time processing updates
- DownloadButtons: Markdown and PDF download options
Admin Components
- AdminDashboard: Overview of all system documents
- UserManagement: User account management
- DocumentArchive: System-wide document access
- SystemMetrics: Storage and processing statistics
Backend Services
Authentication Service
interface AuthService {
login(credentials: LoginCredentials): Promise<AuthResult>
validateToken(token: string): Promise<User>
logout(userId: string): Promise<void>
refreshToken(refreshToken: string): Promise<AuthResult>
}
Document Service
interface DocumentService {
uploadDocument(file: File, userId: string): Promise<Document>
getDocuments(userId: string): Promise<Document[]>
getDocument(documentId: string): Promise<Document>
deleteDocument(documentId: string): Promise<void>
updateDocumentStatus(documentId: string, status: ProcessingStatus): Promise<void>
}
LLM Processing Service
interface LLMService {
processDocument(documentId: string, extractedText: string): Promise<ProcessingResult>
regenerateWithFeedback(documentId: string, feedback: string): Promise<ProcessingResult>
validateOutput(output: string): Promise<ValidationResult>
}
PDF Service
interface PDFService {
extractText(filePath: string): Promise<string>
generatePDF(markdown: string): Promise<Buffer>
validatePDF(filePath: string): Promise<boolean>
}
Data Models
User Model
interface User {
id: string
email: string
name: string
role: 'user' | 'admin'
createdAt: Date
updatedAt: Date
}
Document Model
interface Document {
id: string
userId: string
originalFileName: string
filePath: string
fileSize: number
uploadedAt: Date
status: ProcessingStatus
extractedText?: string
generatedSummary?: string
summaryMarkdownPath?: string
summaryPdfPath?: string
processingStartedAt?: Date
processingCompletedAt?: Date
errorMessage?: string
feedback?: DocumentFeedback[]
versions: DocumentVersion[]
}
type ProcessingStatus =
| 'uploaded'
| 'extracting_text'
| 'processing_llm'
| 'generating_pdf'
| 'completed'
| 'failed'
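The status values above imply a pipeline order, but the design does not state which moves are legal. The sketch below shows one way a service could guard `updateDocumentStatus` against invalid jumps. The transition table and the `canTransition` helper are assumptions made for illustration, including the choice to let completed documents re-enter LLM processing (for feedback-driven regeneration) and failed documents restart from text extraction.

```typescript
type ProcessingStatus =
  | 'uploaded'
  | 'extracting_text'
  | 'processing_llm'
  | 'generating_pdf'
  | 'completed'
  | 'failed';

// Hypothetical transition table: the design lists the statuses, not the allowed moves.
const ALLOWED_TRANSITIONS: Record<ProcessingStatus, readonly ProcessingStatus[]> = {
  uploaded: ['extracting_text', 'failed'],
  extracting_text: ['processing_llm', 'failed'],
  processing_llm: ['generating_pdf', 'failed'],
  generating_pdf: ['completed', 'failed'],
  // Assumption: regeneration with feedback re-runs the LLM step.
  completed: ['processing_llm'],
  // Assumption: a failed document is retried from the first processing step.
  failed: ['extracting_text'],
};

// Returns true when moving from `from` to `to` is listed in the table.
function canTransition(from: ProcessingStatus, to: ProcessingStatus): boolean {
  return ALLOWED_TRANSITIONS[from].includes(to);
}
```

A caller would check `canTransition(current, next)` before persisting the new status and reject the update otherwise.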
Document Feedback Model
interface DocumentFeedback {
id: string
documentId: string
userId: string
feedback: string
regenerationInstructions?: string
createdAt: Date
}
Document Version Model
interface DocumentVersion {
id: string
documentId: string
versionNumber: number
summaryMarkdown: string
summaryPdfPath: string
createdAt: Date
feedback?: string
}
Processing Job Model
interface ProcessingJob {
id: string
documentId: string
type: 'text_extraction' | 'llm_processing' | 'pdf_generation'
status: 'pending' | 'processing' | 'completed' | 'failed'
progress: number
errorMessage?: string
createdAt: Date
startedAt?: Date
completedAt?: Date
}
Error Handling
Frontend Error Handling
- Global error boundary for React components
- Toast notifications for user-facing errors
- Retry mechanisms for failed API calls
- Graceful degradation for offline scenarios
Backend Error Handling
- Centralized error middleware
- Structured error logging with Winston
- Error categorization (validation, processing, system)
- Automatic retry for transient failures
File Processing Error Handling
- PDF validation before processing
- Text extraction fallback mechanisms
- LLM API timeout and retry logic
- Cleanup of failed uploads and partial processing
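The retry items above (automatic retry for transient failures, LLM API retry logic) could be handled by a small generic wrapper. The following is a minimal sketch using exponential backoff. The names `withRetry` and `backoffDelayMs`, the default delays, and the attempt count are illustrative assumptions, and the caller decides what counts as transient.

```typescript
// Delay before the next try after attempt number `attempt` (1-based):
// baseMs, 2*baseMs, 4*baseMs, ... capped at maxMs.
function backoffDelayMs(attempt: number, baseMs = 500, maxMs = 8000): number {
  return Math.min(maxMs, baseMs * Math.pow(2, attempt - 1));
}

// Runs `operation` up to `maxAttempts` times, sleeping between transient failures.
// `sleep` is injectable so tests do not have to wait on real timers.
async function withRetry<T>(
  operation: () => Promise<T>,
  isTransient: (err: unknown) => boolean,
  maxAttempts = 3,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await operation();
    } catch (err) {
      lastError = err;
      // Stop on permanent errors or once the attempts are used up.
      if (!isTransient(err) || attempt === maxAttempts) break;
      await sleep(backoffDelayMs(attempt));
    }
  }
  throw lastError;
}
```

Keeping the transient check as a parameter lets the same wrapper serve LLM calls (for example, retry on timeouts) and storage calls without hard-coding error shapes.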
Error Types
enum ErrorType {
VALIDATION_ERROR = 'validation_error',
AUTHENTICATION_ERROR = 'authentication_error',
FILE_PROCESSING_ERROR = 'file_processing_error',
LLM_PROCESSING_ERROR = 'llm_processing_error',
STORAGE_ERROR = 'storage_error',
SYSTEM_ERROR = 'system_error'
}
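To show how these error types might feed the centralized error middleware, here is a hedged sketch of an application error class and a function that turns any thrown value into a response payload. The `AppError` class, the `toErrorResponse` helper, and the HTTP status chosen for each type are assumptions for illustration. The error types are written as a string union whose members match the enum's string values.

```typescript
// String values mirror the ErrorType enum in the design.
type ErrorTypeValue =
  | 'validation_error'
  | 'authentication_error'
  | 'file_processing_error'
  | 'llm_processing_error'
  | 'storage_error'
  | 'system_error';

// Hypothetical mapping from error type to HTTP status code.
const HTTP_STATUS: Record<ErrorTypeValue, number> = {
  validation_error: 400,
  authentication_error: 401,
  file_processing_error: 422,
  llm_processing_error: 502,
  storage_error: 503,
  system_error: 500,
};

class AppError extends Error {
  readonly type: ErrorTypeValue;
  readonly statusCode: number;

  constructor(type: ErrorTypeValue, message: string) {
    super(message);
    this.name = 'AppError';
    this.type = type;
    this.statusCode = HTTP_STATUS[type];
    // Keeps `instanceof AppError` working when compiled to older targets.
    Object.setPrototypeOf(this, AppError.prototype);
  }
}

interface ErrorResponse {
  status: number;
  body: { error: ErrorTypeValue; message: string };
}

// Known errors keep their type and message; anything else becomes a generic 500
// so internal details are not leaked to the client.
function toErrorResponse(err: unknown): ErrorResponse {
  if (err instanceof AppError) {
    return { status: err.statusCode, body: { error: err.type, message: err.message } };
  }
  return { status: 500, body: { error: 'system_error', message: 'Internal server error' } };
}
```

An Express error handler could call `toErrorResponse(err)` and send `body` with `status`, after logging the original error.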
Testing Strategy
Unit Testing
- Jest for JavaScript/TypeScript testing
- React Testing Library for component testing
- Supertest for API endpoint testing
- Mock LLM API responses for consistent testing
Integration Testing
- Database integration tests with test containers
- File upload and processing workflow tests
- Authentication flow testing
- PDF generation and download testing
End-to-End Testing
- Playwright for browser automation
- Complete user workflows (upload → process → download)
- Admin functionality testing
- Error scenario testing
Performance Testing
- Load testing for file uploads
- LLM processing performance benchmarks
- Database query optimization testing
- Memory usage monitoring during PDF processing
Security Testing
- Authentication and authorization testing
- File upload security validation
- SQL injection prevention testing
- XSS and CSRF protection verification
LLM Integration Design
Prompt Engineering
The system will use a two-part prompt structure:
Part 1: CIM Data Extraction
- Provide the BPCP CIM Review Template
- Instruct LLM to populate only from CIM content
- Use "Not specified in CIM" for missing information
- Maintain strict markdown formatting
Part 2: Investment Analysis
- Add "Key Investment Considerations & Diligence Areas" section
- Allow use of general industry knowledge
- Focus on investment-specific insights and risks
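The two-part structure above can be sketched as a function that builds both prompts from the template and the extracted CIM text. The instruction wording, the delimiter lines, and the names `buildPrompts` and `PromptPair` are assumptions for illustration. Only the "Not specified in CIM" placeholder and the section title come from the design.

```typescript
interface PromptPair {
  extraction: string;
  analysis: string;
}

function buildPrompts(template: string, cimText: string): PromptPair {
  // Part 1: populate the template strictly from the CIM content.
  const extraction = [
    'Fill in the review template below using only information stated in the CIM.',
    'Where the CIM does not cover a field, write "Not specified in CIM".',
    'Preserve the markdown structure of the template exactly.',
    '',
    '=== TEMPLATE ===',
    template,
    '',
    '=== CIM ===',
    cimText,
  ].join('\n');

  // Part 2: analysis that is allowed to draw on general industry knowledge.
  const analysis = [
    'Append a section titled "Key Investment Considerations & Diligence Areas".',
    'You may use general industry knowledge in this section.',
    'Focus on investment-specific insights and risks.',
  ].join('\n');

  return { extraction, analysis };
}
```

Keeping the two parts as separate strings makes it possible to send them as one combined request or as two sequential calls, whichever the LLM service ends up using.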
Token Management
- Document chunking for large PDFs (>100 pages)
- Token counting and optimization
- Fallback to smaller context windows if needed
- Cost tracking and monitoring
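As an illustration of document chunking, the sketch below splits extracted text on paragraph boundaries so each chunk stays under a token budget. It estimates tokens with a rough four-characters-per-token heuristic, which is an assumption rather than a real tokenizer. A production version would count tokens with the tokenizer of the selected model. The function names are hypothetical.

```typescript
// Rough heuristic for English text; NOT a real tokenizer.
const APPROX_CHARS_PER_TOKEN = 4;

function estimateTokens(text: string): number {
  return Math.ceil(text.length / APPROX_CHARS_PER_TOKEN);
}

// Splits text into chunks of at most `maxTokens` estimated tokens,
// preferring paragraph boundaries (blank lines).
function chunkByTokens(text: string, maxTokens: number): string[] {
  if (maxTokens < 1) throw new Error('maxTokens must be at least 1');
  const paragraphs = text.split(/\n{2,}/).filter((p) => p.trim().length > 0);
  const maxChars = maxTokens * APPROX_CHARS_PER_TOKEN;
  const chunks: string[] = [];
  let current = '';

  for (const paragraph of paragraphs) {
    const candidate = current ? `${current}\n\n${paragraph}` : paragraph;
    if (estimateTokens(candidate) <= maxTokens) {
      current = candidate;
      continue;
    }
    if (current) chunks.push(current);
    // A single paragraph over the budget is hard-split by character count.
    let rest = paragraph;
    while (estimateTokens(rest) > maxTokens) {
      chunks.push(rest.slice(0, maxChars));
      rest = rest.slice(maxChars);
    }
    current = rest;
  }
  if (current) chunks.push(current);
  return chunks;
}
```

Splitting on paragraphs keeps related sentences together, which tends to matter more for template extraction than filling every chunk to the limit.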
Output Validation
- Markdown syntax validation
- Template structure verification
- Content completeness checking
- Retry mechanism for malformed outputs
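Template structure verification could be as simple as checking that every required section heading appears in the LLM output. The sketch below does that for ATX-style markdown headings. The `ValidationResult` shape and the function name are assumptions, since the design names the type without defining it, and the list of required sections would come from the review template.

```typescript
// Hypothetical shape; the design references ValidationResult without defining it.
interface ValidationResult {
  valid: boolean;
  missingSections: string[];
}

// Checks that each required section title appears as a markdown heading (#, ##, ...).
// Comparison ignores case and surrounding whitespace.
function validateTemplateStructure(markdown: string, requiredSections: string[]): ValidationResult {
  const headings = new Set<string>();
  for (const line of markdown.split('\n')) {
    const match = /^#{1,6}\s+(.+?)\s*#*\s*$/.exec(line);
    if (match) headings.add(match[1].trim().toLowerCase());
  }
  const missingSections = requiredSections.filter(
    (section) => !headings.has(section.trim().toLowerCase()),
  );
  return { valid: missingSections.length === 0, missingSections };
}
```

When `valid` is false, the `missingSections` list can be fed back into the retry prompt so the model knows what to add.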
Security Considerations
Authentication & Authorization
- JWT tokens with short expiration times
- Refresh token rotation
- Role-based access control (user/admin)
- Session management with Redis
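Refresh token rotation means each refresh token can be exchanged once, and presenting an already-used token is treated as a sign of theft. The sketch below illustrates the idea with an in-memory `Map` standing in for Redis. The class name, the revoke-all-on-reuse policy, and the injected token generator are assumptions, and a real implementation would generate tokens from a cryptographically secure source and set expirations.

```typescript
interface RefreshRecord {
  userId: string;
  used: boolean;
}

class RefreshTokenRotator {
  // In-memory stand-in for the Redis-backed session store.
  private readonly records = new Map<string, RefreshRecord>();
  private readonly generateToken: () => string;

  // `generateToken` is injected; production code should use a secure random source.
  constructor(generateToken: () => string) {
    this.generateToken = generateToken;
  }

  issue(userId: string): string {
    const token = this.generateToken();
    this.records.set(token, { userId, used: false });
    return token;
  }

  // Exchanges a refresh token for a new one. Each token is valid exactly once.
  rotate(token: string): string {
    const record = this.records.get(token);
    if (!record) throw new Error('Unknown refresh token');
    if (record.used) {
      // Reuse of a rotated token: revoke every token belonging to this user.
      this.revokeAll(record.userId);
      throw new Error('Refresh token reuse detected');
    }
    record.used = true;
    return this.issue(record.userId);
  }

  revokeAll(userId: string): void {
    for (const [token, record] of Array.from(this.records.entries())) {
      if (record.userId === userId) this.records.delete(token);
    }
  }

  isActive(token: string): boolean {
    const record = this.records.get(token);
    return record !== undefined && !record.used;
  }
}
```

Used tokens are kept (marked `used`) rather than deleted so that a later replay can be recognized instead of looking like an unknown token.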
File Security
- File type validation (PDF only)
- File size limits (100 MB max)
- Virus scanning integration
- Secure file storage with access controls
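File type validation is more reliable when it inspects the file content instead of trusting the extension or the client-supplied MIME type. The sketch below checks the size limit from this section and looks for the `%PDF-` signature at the start of the file. The function name and result shape are hypothetical, and requiring the signature at byte 0 is a deliberately strict assumption.

```typescript
const MAX_FILE_SIZE_BYTES = 100 * 1024 * 1024; // 100 MB limit from the design
const PDF_SIGNATURE = [0x25, 0x50, 0x44, 0x46, 0x2d]; // ASCII "%PDF-"

type UploadCheck = { ok: true } | { ok: false; reason: string };

// `header` is the first few bytes of the uploaded file; `sizeBytes` is its total size.
function checkPdfUpload(header: Uint8Array, sizeBytes: number): UploadCheck {
  if (sizeBytes <= 0) {
    return { ok: false, reason: 'File is empty' };
  }
  if (sizeBytes > MAX_FILE_SIZE_BYTES) {
    return { ok: false, reason: 'File exceeds the 100 MB limit' };
  }
  const hasSignature =
    header.length >= PDF_SIGNATURE.length &&
    PDF_SIGNATURE.every((byte, i) => header[i] === byte);
  if (!hasSignature) {
    return { ok: false, reason: 'File does not start with a PDF signature' };
  }
  return { ok: true };
}
```

This check only rejects obviously wrong files. It does not replace virus scanning or a full parse of the PDF during text extraction.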
Data Protection
- Encryption at rest for sensitive documents
- HTTPS enforcement for all communications
- Input sanitization and validation
- Audit logging for admin actions
API Security
- Rate limiting on all endpoints
- CORS configuration
- Request size limits
- API key management for LLM services
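To make the rate-limiting item concrete, here is a minimal fixed-window limiter keyed by client identifier. It is a sketch of the underlying idea, not the middleware the server uses, and the class name, the in-memory storage, and the injectable clock are assumptions. A deployment with several instances would need shared state such as Redis.

```typescript
class FixedWindowRateLimiter {
  private readonly windows = new Map<string, { windowStart: number; count: number }>();
  private readonly limit: number;
  private readonly windowMs: number;

  constructor(limit: number, windowMs: number) {
    if (limit < 1 || windowMs < 1) throw new Error('limit and windowMs must be positive');
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // Returns true when the request is allowed. `now` is injectable for testing.
  allow(key: string, now: number = Date.now()): boolean {
    const entry = this.windows.get(key);
    // First request from this key, or the previous window has elapsed.
    if (!entry || now - entry.windowStart >= this.windowMs) {
      this.windows.set(key, { windowStart: now, count: 1 });
      return true;
    }
    if (entry.count >= this.limit) return false;
    entry.count += 1;
    return true;
  }
}
```

Fixed windows are simple but allow short bursts around window boundaries. A sliding window or token bucket smooths that out at the cost of more bookkeeping.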
Performance Optimization
File Processing
- Asynchronous processing with job queues
- Progress tracking and status updates
- Parallel processing for multiple documents
- Efficient PDF text extraction
Database Optimization
- Proper indexing on frequently queried fields
- Connection pooling
- Query optimization
- Database migrations management
Caching Strategy
- Redis caching for user sessions
- Document metadata caching
- LLM response caching for similar content
- Static asset caching
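The caching items above share one mechanism: store a value with an expiry and treat it as missing once the expiry passes. The sketch below shows that mechanism as an in-memory cache. The `TtlCache` class is hypothetical and stands in for Redis, which provides expiry natively, and the clock is injectable so behavior can be tested without waiting.

```typescript
class TtlCache<V> {
  private readonly entries = new Map<string, { value: V; expiresAt: number }>();
  private readonly ttlMs: number;

  constructor(ttlMs: number) {
    this.ttlMs = ttlMs;
  }

  set(key: string, value: V, now: number = Date.now()): void {
    this.entries.set(key, { value, expiresAt: now + this.ttlMs });
  }

  // Returns the cached value, or undefined when missing or expired.
  get(key: string, now: number = Date.now()): V | undefined {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (now >= entry.expiresAt) {
      // Expired entries are removed lazily on read.
      this.entries.delete(key);
      return undefined;
    }
    return entry.value;
  }
}
```

For LLM response caching, the key could be a hash of the extracted text plus the prompt version, so a changed template never serves a stale summary.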
Scalability Considerations
- Horizontal scaling capability
- Load balancing for multiple instances
- Database read replicas
- CDN for static assets and downloads