# CIM Document Processor - Cursor Rules

## Project Overview

This is an AI-powered document processing system for analyzing Confidential Information Memorandums (CIMs). The system extracts text from PDFs, processes them through LLM services (Claude AI/OpenAI), generates structured analysis, and creates summary PDFs.

**Core Purpose**: Automated processing and analysis of CIM documents using Google Document AI, vector embeddings, and LLM services.

## Tech Stack

### Backend
- **Runtime**: Node.js 18+ with TypeScript
- **Framework**: Express.js
- **Database**: Supabase (PostgreSQL + Vector Database)
- **Storage**: Google Cloud Storage (primary), Firebase Storage (fallback)
- **AI Services**:
  - Google Document AI (text extraction)
  - Anthropic Claude (primary LLM)
  - OpenAI (fallback LLM)
  - OpenRouter (LLM routing)
- **Authentication**: Firebase Auth
- **Deployment**: Firebase Functions v2

### Frontend
- **Framework**: React 18 + TypeScript
- **Build Tool**: Vite
- **HTTP Client**: Axios
- **Routing**: React Router
- **Styling**: Tailwind CSS

## Critical Rules

### TypeScript Standards
- **ALWAYS** use strict TypeScript types - avoid `any` type
- Use proper type definitions from `backend/src/types/` and `frontend/src/types/`
- Enable `noImplicitAny: true` in new code (currently disabled in tsconfig.json for legacy reasons)
- Use interfaces for object shapes, types for unions/primitives
- Prefer `unknown` over `any` when type is truly unknown

### Logging Standards
- **ALWAYS** use Winston logger from `backend/src/utils/logger.ts`
- Use `StructuredLogger` class for operations with correlation IDs
- Log levels:
  - `logger.debug()` - Detailed diagnostic info
  - `logger.info()` - Normal operations
  - `logger.warn()` - Warning conditions
  - `logger.error()` - Error conditions with context
- Include correlation IDs for request tracing
- Log structured data: `logger.error('Message', { key: value, error: error.message })`
- Never use `console.log` in production code - use logger instead
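
The `unknown`-over-`any` rule and the structured-data rule above can be combined in one small helper. The following is a minimal sketch, not code from this repository: `toErrorContext`, `buildLogPayload`, and the `ErrorContext` shape are hypothetical names introduced for illustration.

```typescript
// Hypothetical helper types and functions; not taken from backend/src/utils/logger.ts.
interface ErrorContext {
  message: string;
  stack?: string;
}

// Narrow a caught `unknown` value with a type guard instead of `any` or an `as` assertion.
function toErrorContext(error: unknown): ErrorContext {
  if (error instanceof Error) {
    return { message: error.message, stack: error.stack };
  }
  // Non-Error throwables (strings, numbers, plain objects) are stringified.
  return { message: String(error) };
}

// Merge request identifiers and error details into one structured payload,
// so nothing has to be concatenated into the log message itself.
function buildLogPayload(
  error: unknown,
  context: Record<string, string>
): Record<string, string | undefined> {
  const { message, stack } = toErrorContext(error);
  return { ...context, error: message, stack };
}
```

Assuming the project logger accepts a metadata object as its second argument, as the `logger.error('Message', { ... })` pattern above suggests, a call site could look like `logger.error('Operation failed', buildLogPayload(err, { documentId, correlationId }))`.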
### Error Handling Patterns
- **ALWAYS** use try-catch blocks for async operations
- Include error context: `error instanceof Error ? error.message : String(error)`
- Log errors with structured data before re-throwing
- Use existing error handling middleware: `backend/src/middleware/errorHandler.ts`
- For Firebase/Supabase errors, extract meaningful messages from error objects
- Retry patterns: Use exponential backoff for external API calls (see `llmService.ts` for examples)

### Service Architecture
- Services should be in `backend/src/services/`
- Use dependency injection patterns where possible
- Services should handle their own errors and log appropriately
- Reference existing services before creating new ones:
  - `jobQueueService.ts` - Background job processing
  - `unifiedDocumentProcessor.ts` - Main document processing orchestrator
  - `llmService.ts` - LLM API interactions
  - `fileStorageService.ts` - File storage operations
  - `vectorDatabaseService.ts` - Vector embeddings and search

### Database Patterns
- Use Supabase client from `backend/src/config/supabase.ts`
- Models should be in `backend/src/models/`
- Always handle Row Level Security (RLS) policies
- Use transactions for multi-step operations
- Handle connection errors gracefully with retries

### Testing Standards
- Use Vitest for testing (Jest was removed - see TESTING_STRATEGY_DOCUMENTATION.md)
- Write tests in `backend/src/__tests__/`
- Test critical paths first: document upload, authentication, core API endpoints
- Use TDD approach: write tests first, then implementation
- Mock external services (Firebase, Supabase, LLM APIs)

## Deprecated Patterns (DO NOT USE)

### Removed Services
- ❌ `agenticRAGDatabaseService.ts` - Removed, functionality moved to other services
- ❌ `sessionService.ts` - Removed, use Firebase Auth directly
- ❌ Direct PostgreSQL connections - Use Supabase client instead
- ❌ Redis caching - Not used in current architecture
- ❌ JWT authentication - Use Firebase Auth tokens instead

### Removed Test Patterns
- ❌ Jest - Use Vitest instead
- ❌ Tests for PostgreSQL/Redis architecture - Architecture changed to Supabase/Firebase

### Old API Patterns
- ❌ Direct database queries - Use model methods from `backend/src/models/`
- ❌ Manual error handling without structured logging - Use StructuredLogger

## Common Bugs to Avoid

### 1. Missing Correlation IDs
- **Problem**: Logs without correlation IDs make debugging difficult
- **Solution**: Always use `StructuredLogger` with correlation ID for request-scoped operations
- **Example**: `const logger = new StructuredLogger(correlationId);`

### 2. Unhandled Promise Rejections
- **Problem**: Async operations without try-catch cause unhandled rejections
- **Solution**: Always wrap async operations in try-catch blocks
- **Check**: `backend/src/index.ts` has a global unhandled rejection handler

### 3. Type Assertions Instead of Type Guards
- **Problem**: Using `as` type assertions can hide type errors
- **Solution**: Use proper type guards: `error instanceof Error ? error.message : String(error)`

### 4. Missing Error Context
- **Problem**: Errors logged without sufficient context
- **Solution**: Include documentId, userId, jobId, and operation context in error logs

### 5. Firebase/Supabase Error Handling
- **Problem**: Not extracting meaningful error messages from Firebase/Supabase errors
- **Solution**: Check error.code and error.message, log full error object for debugging

### 6. Vector Search Timeouts
- **Problem**: Vector search operations can time out
- **Solution**: See `backend/sql/fix_vector_search_timeout.sql` for timeout fixes
- **Reference**: `backend/src/services/vectorDatabaseService.ts`

### 7. Job Processing Timeouts
- **Problem**: Jobs can exceed the 14-minute timeout limit
- **Solution**: Check `backend/src/services/jobProcessorService.ts` for timeout handling
- **Pattern**: Jobs should update their status before the timeout and handle it gracefully

### 8. LLM Response Validation
- **Problem**: LLM responses may not match expected JSON schema
- **Solution**: Use Zod validation with retry logic (see `llmService.ts` lines 236-450)
- **Pattern**: 3 retry attempts with improved prompts on validation failure

## Context Management

### Using @ Symbols for Context

**@Files** - Reference specific files:
- `@backend/src/utils/logger.ts` - For logging patterns
- `@backend/src/services/jobQueueService.ts` - For job processing patterns
- `@backend/src/services/llmService.ts` - For LLM API patterns
- `@backend/src/middleware/errorHandler.ts` - For error handling patterns

**@Codebase** - Semantic search (Chat only):
- Use for finding similar implementations
- Example: "How is document processing handled?" → searches entire codebase

**@Folders** - Include entire directories:
- `@backend/src/services/` - All service files
- `@backend/src/scripts/` - All debugging scripts
- `@backend/src/models/` - All database models

**@Lint Errors** - Reference current lint errors (Chat only):
- Use when fixing linting issues

**@Git** - Access git history:
- Use to see recent changes and understand context

### Key File References for Common Tasks

**Logging:**
- `backend/src/utils/logger.ts` - Winston logger and StructuredLogger class

**Job Processing:**
- `backend/src/services/jobQueueService.ts` - Job queue management
- `backend/src/services/jobProcessorService.ts` - Job execution logic

**Document Processing:**
- `backend/src/services/unifiedDocumentProcessor.ts` - Main orchestrator
- `backend/src/services/documentAiProcessor.ts` - Google Document AI integration
- `backend/src/services/optimizedAgenticRAGProcessor.ts` - AI-powered analysis

**LLM Services:**
- `backend/src/services/llmService.ts` - LLM API interactions with retry logic

**File Storage:**
- `backend/src/services/fileStorageService.ts` - GCS and Firebase Storage operations

**Database:**
- `backend/src/models/DocumentModel.ts` - Document database operations
- `backend/src/models/ProcessingJobModel.ts` - Job database operations
- `backend/src/config/supabase.ts` - Supabase client configuration

**Debugging Scripts:**
- `backend/src/scripts/` - Collection of debugging and monitoring scripts

## Debugging Scripts Usage

### When to Use Existing Scripts vs Create New Ones

**Use Existing Scripts For:**
- Monitoring document processing: `monitor-document-processing.ts`
- Checking job status: `check-current-job.ts`, `track-current-job.ts`
- Database failure checks: `check-database-failures.ts`
- System monitoring: `monitor-system.ts`
- Testing LLM pipeline: `test-full-llm-pipeline.ts`

**Create New Scripts When:**
- You need to debug a specific new issue
- Existing scripts don't cover the use case
- You are creating a one-time diagnostic tool

### Script Naming Conventions
- `check-*` - Diagnostic scripts that check status
- `monitor-*` - Continuous monitoring scripts
- `track-*` - Tracking specific operations
- `test-*` - Testing specific functionality
- `setup-*` - Setup and configuration scripts

### Common Debugging Workflows

**Debugging a Stuck Document:**
1. Use `check-new-doc-status.ts` to check document status
2. Use `check-current-job.ts` to check associated job
3. Use `monitor-document.ts` for real-time monitoring
4. Use `manually-process-job.ts` to reprocess if needed

**Debugging LLM Issues:**
1. Use `test-openrouter-simple.ts` for basic LLM connectivity
2. Use `test-full-llm-pipeline.ts` for end-to-end LLM testing
3. Use `test-llm-processing-offline.ts` for offline testing

**Debugging Database Issues:**
1. Use `check-database-failures.ts` to check for failures
2. Check SQL files in `backend/sql/` for schema fixes
3. Review `backend/src/models/` for model issues

## YOLO Mode Configuration

When using Cursor's YOLO mode, these commands are always allowed:
- Test commands: `npm test`, `vitest`, `npm run test:watch`, `npm run test:coverage`
- Build commands: `npm run build`, `tsc`, `npm run lint`
- File operations: `touch`, `mkdir`, file creation/editing
- Running debugging scripts: `ts-node backend/src/scripts/*.ts`
- Database scripts: `npm run db:*` commands

## Logging Patterns

### Winston Logger Usage

**Basic Logging:**
```typescript
import { logger } from './utils/logger';

// Attach identifiers as structured fields rather than concatenating them into the message
logger.info('Operation started', { documentId, userId });
// Narrow the caught value before reading .message
logger.error('Operation failed', { error: error instanceof Error ? error.message : String(error), documentId });
```

**Structured Logger with Correlation ID:**
```typescript
import { StructuredLogger } from './utils/logger';

// One logger instance per request, tagged with that request's correlation ID
const structuredLogger = new StructuredLogger(correlationId);
structuredLogger.processingStart(documentId, userId, options);
structuredLogger.processingError(error, documentId, userId, 'llm_processing');
```

**Service-Specific Logging:**
- Upload operations: Use `structuredLogger.uploadStart()`, `uploadSuccess()`, `uploadError()`
- Processing operations: Use `structuredLogger.processingStart()`, `processingSuccess()`, `processingError()`
- Storage operations: Use `structuredLogger.storageOperation()`
- Job queue operations: Use `structuredLogger.jobQueueOperation()`

**Error Logging Best Practices:**
- Always include error message: `error instanceof Error ? error.message : String(error)`
- Include stack trace: `error instanceof Error ? error.stack : undefined`
- Add context: documentId, userId, jobId, operation name
- Use structured data, not string concatenation

## Firebase/Supabase Error Handling

### Firebase Errors
- Check `error.code` for specific error codes
- Firebase Auth errors: Handle `auth/` prefixed codes
- Firebase Storage errors: Handle `storage/` prefixed codes
- Log full error object for debugging: `logger.error('Firebase error', { error, code: error.code })`

### Supabase Errors
- Check `error.code` and `error.message`
- RLS policy errors: Check `error.code === 'PGRST301'`
- Connection errors: Implement retry logic
- Log with context: `logger.error('Supabase error', { error: error.message, code: error.code, query })`

## Retry Patterns

### LLM API Retries (from llmService.ts)
- 3 retry attempts for API calls
- Exponential backoff between retries
- Improved prompts on validation failure
- Log each attempt with attempt number

### Database Operation Retries
- Use connection pooling (handled by Supabase client)
- Retry on connection errors
- Don't retry on validation errors

## Testing Guidelines

### Test Structure
- Unit tests: `backend/src/__tests__/unit/`
- Integration tests: `backend/src/__tests__/integration/`
- Test utilities: `backend/src/__tests__/utils/`
- Mocks: `backend/src/__tests__/mocks/`

### Critical Paths to Test
1. Document upload workflow
2. Authentication flow
3. Core API endpoints
4. Job processing pipeline
5. LLM service interactions

### Mocking External Services
- Firebase: Mock Firebase Admin SDK
- Supabase: Mock Supabase client
- LLM APIs: Mock HTTP responses
- Google Cloud Storage: Mock GCS client

## Performance Considerations
- Vector search operations can be slow - use timeouts
- LLM API calls are expensive - implement caching where possible
- Job processing has a 14-minute timeout limit
- Large PDFs may cause memory issues - use streaming where possible
- Database queries should use indexes (check Supabase dashboard)

## Security Best Practices
- Never log sensitive data (passwords, API keys, tokens)
- Use environment variables for all secrets (see `backend/src/config/env.ts`)
- Validate all user inputs (see `backend/src/middleware/validation.ts`)
- Use Firebase Auth for authentication - never bypass
- Respect Row Level Security (RLS) policies in Supabase
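
One way to enforce the "never log sensitive data" rule is to mask known-sensitive fields before a payload reaches the logger. The sketch below is an illustration under stated assumptions, not code from this repository: `redactSensitive`, `isPlainObject`, and the `SENSITIVE_KEYS` list are hypothetical and would need to be adapted to the fields the services actually log.

```typescript
// Hypothetical list of key fragments to mask; adjust to the fields your services log.
const SENSITIVE_KEYS = ['password', 'apikey', 'token', 'authorization', 'secret'];

type LogData = Record<string, unknown>;

// True only for object literals, so Error, Date, and array values are not recursed into.
function isPlainObject(value: unknown): value is LogData {
  if (typeof value !== 'object' || value === null) return false;
  const proto = Object.getPrototypeOf(value);
  return proto === Object.prototype || proto === null;
}

// Return a copy of `data` with sensitive values masked, recursing into nested plain objects.
// Arrays and class instances are copied by reference without inspection.
function redactSensitive(data: LogData): LogData {
  const result: LogData = {};
  for (const [key, value] of Object.entries(data)) {
    // Normalize so `api_key`, `apiKey`, and `API-KEY` all compare as `apikey`.
    const normalized = key.toLowerCase().replace(/[^a-z]/g, '');
    if (SENSITIVE_KEYS.some((fragment) => normalized.includes(fragment))) {
      result[key] = '[REDACTED]';
    } else if (isPlainObject(value)) {
      result[key] = redactSensitive(value);
    } else {
      result[key] = value;
    }
  }
  return result;
}
```

A call site might wrap its metadata as `logger.info('Upload started', redactSensitive({ documentId, headers }))`. Matching on key fragments is deliberately broad, so it may mask harmless fields such as `tokenCount`; an explicit allowlist is a reasonable alternative if that becomes a problem.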