cim_summary/.cursorrules
admin 9c916d12f4 feat: Production release v2.0.0 - Simple Document Processor
Major release with significant performance improvements and new processing strategy.

## Core Changes
- Implemented simple_full_document processing strategy (default)
- Full document → LLM approach: 1-2 passes, ~5-6 minutes processing time
- Achieved 100% completeness with 2 API calls (down from 5+)
- Removed redundant Document AI passes for faster processing

## Financial Data Extraction
- Enhanced deterministic financial table parser
- Improved FY3/FY2/FY1/LTM identification from varying CIM formats
- Automatic merging of parser results with LLM extraction

## Code Quality & Infrastructure
- Cleaned up debug logging (removed emoji markers from production code)
- Fixed Firebase Secrets configuration (using modern defineSecret approach)
- Updated OpenAI API key
- Resolved deployment conflicts (secrets vs environment variables)
- Added .env files to Firebase ignore list

## Deployment
- Firebase Functions v2 deployment successful
- All 7 required secrets verified and configured
- Function URL: https://api-y56ccs6wva-uc.a.run.app

## Performance Improvements
- Processing time: ~5-6 minutes (down from 23+ minutes)
- API calls: 1-2 (down from 5+)
- Completeness: 100% achievable
- LLM Model: claude-3-7-sonnet-latest

## Breaking Changes
- Default processing strategy changed to 'simple_full_document'
- RAG processor available as alternative strategy 'document_ai_agentic_rag'

## Files Changed
- 36 files changed, 5642 insertions(+), 4451 deletions(-)
- Removed deprecated documentation files
- Cleaned up unused services and models

This release represents a major refactoring focused on speed, accuracy, and maintainability.
2025-11-09 21:07:22 -05:00


# CIM Document Processor - Cursor Rules
## Project Overview
This is an AI-powered document processing system for analyzing Confidential Information Memorandums (CIMs). The system extracts text from PDFs, processes them through LLM services (Claude AI/OpenAI), generates structured analysis, and creates summary PDFs.
**Core Purpose**: Automated processing and analysis of CIM documents using Google Document AI, vector embeddings, and LLM services.
## Tech Stack
### Backend
- **Runtime**: Node.js 18+ with TypeScript
- **Framework**: Express.js
- **Database**: Supabase (PostgreSQL + Vector Database)
- **Storage**: Google Cloud Storage (primary), Firebase Storage (fallback)
- **AI Services**:
  - Google Document AI (text extraction)
  - Anthropic Claude (primary LLM)
  - OpenAI (fallback LLM)
  - OpenRouter (LLM routing)
- **Authentication**: Firebase Auth
- **Deployment**: Firebase Functions v2
### Frontend
- **Framework**: React 18 + TypeScript
- **Build Tool**: Vite
- **HTTP Client**: Axios
- **Routing**: React Router
- **Styling**: Tailwind CSS
## Critical Rules
### TypeScript Standards
- **ALWAYS** use strict TypeScript types - avoid `any` type
- Use proper type definitions from `backend/src/types/` and `frontend/src/types/`
- Write new code so that it compiles under `noImplicitAny: true` (the flag is currently disabled in tsconfig.json for legacy reasons)
- Use interfaces for object shapes, types for unions/primitives
- Prefer `unknown` over `any` when type is truly unknown
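A minimal sketch of these rules, not taken from the codebase: an interface for an object shape, a type alias for a union, and an `unknown` input that is narrowed before use. The names `ProcessingResult`, `ProcessingStatus`, `isProcessingResult`, and `describeResult` are hypothetical.

```typescript
// Interface for an object shape, type alias for a union.
interface ProcessingResult {
  documentId: string;
  pageCount: number;
}

type ProcessingStatus = 'pending' | 'processing' | 'completed' | 'failed';

// Hypothetical type guard: checks the runtime shape instead of trusting a cast.
function isProcessingResult(value: unknown): value is ProcessingResult {
  if (typeof value !== 'object' || value === null) return false;
  // Safe widening after the object check; every field is still verified below.
  const candidate = value as Record<string, unknown>;
  return (
    typeof candidate.documentId === 'string' &&
    typeof candidate.pageCount === 'number'
  );
}

// `raw` is `unknown`, so the compiler rejects property access until it is narrowed.
function describeResult(raw: unknown, status: ProcessingStatus): string {
  if (!isProcessingResult(raw)) {
    return `Result is malformed (status: ${status})`;
  }
  return `Document ${raw.documentId}: ${raw.pageCount} pages (status: ${status})`;
}
```

With `any`, a malformed payload would pass the compiler and fail at runtime; with `unknown`, the guard has to run first.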
### Logging Standards
- **ALWAYS** use Winston logger from `backend/src/utils/logger.ts`
- Use `StructuredLogger` class for operations with correlation IDs
- Log levels:
  - `logger.debug()` - Detailed diagnostic info
  - `logger.info()` - Normal operations
  - `logger.warn()` - Warning conditions
  - `logger.error()` - Error conditions with context
- Include correlation IDs for request tracing
- Log structured data: `logger.error('Message', { key: value, error: error.message })`
- Never use `console.log` in production code - use logger instead
### Error Handling Patterns
- **ALWAYS** use try-catch blocks for async operations
- Include error context: `error instanceof Error ? error.message : String(error)`
- Log errors with structured data before re-throwing
- Use existing error handling middleware: `backend/src/middleware/errorHandler.ts`
- For Firebase/Supabase errors, extract meaningful messages from error objects
- Retry patterns: Use exponential backoff for external API calls (see `llmService.ts` for examples)
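The log-then-re-throw pattern can be sketched as a small wrapper. This is an illustration under assumptions: `ErrorLogger` is a local stand-in for the Winston logger, and `withErrorLogging` is a hypothetical helper, not an existing project function.

```typescript
// Minimal logger shape so the sketch is self-contained; the project logger
// is assumed to accept the same (message, metadata) call.
interface ErrorLogger {
  error(message: string, meta: Record<string, unknown>): void;
}

// Hypothetical helper: runs an async operation, logs any failure with
// structured context, then re-throws so upstream handlers still see it.
async function withErrorLogging<T>(
  log: ErrorLogger,
  operation: string,
  context: Record<string, unknown>,
  fn: () => Promise<T>,
): Promise<T> {
  try {
    return await fn();
  } catch (error: unknown) {
    log.error(`${operation} failed`, {
      ...context,
      operation,
      error: error instanceof Error ? error.message : String(error),
      stack: error instanceof Error ? error.stack : undefined,
    });
    throw error; // re-throw the original value, not a wrapped copy
  }
}
```

A call might look like `await withErrorLogging(logger, 'upload', { documentId, userId }, () => uploadFile(file))`, where `uploadFile` is a placeholder.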
### Service Architecture
- Services should be in `backend/src/services/`
- Use dependency injection patterns where possible
- Services should handle their own errors and log appropriately
- Reference existing services before creating new ones:
  - `jobQueueService.ts` - Background job processing
  - `unifiedDocumentProcessor.ts` - Main document processing orchestrator
  - `llmService.ts` - LLM API interactions
  - `fileStorageService.ts` - File storage operations
  - `vectorDatabaseService.ts` - Vector embeddings and search
### Database Patterns
- Use Supabase client from `backend/src/config/supabase.ts`
- Models should be in `backend/src/models/`
- Always handle Row Level Security (RLS) policies
- Use transactions for multi-step operations
- Handle connection errors gracefully with retries
### Testing Standards
- Use Vitest for testing (Jest was removed - see TESTING_STRATEGY_DOCUMENTATION.md)
- Write tests in `backend/src/__tests__/`
- Test critical paths first: document upload, authentication, core API endpoints
- Use TDD approach: write tests first, then implementation
- Mock external services (Firebase, Supabase, LLM APIs)
## Deprecated Patterns (DO NOT USE)
### Removed Services
- ❌ `agenticRAGDatabaseService.ts` - Removed, functionality moved to other services
- ❌ `sessionService.ts` - Removed, use Firebase Auth directly
- ❌ Direct PostgreSQL connections - Use Supabase client instead
- ❌ Redis caching - Not used in current architecture
- ❌ JWT authentication - Use Firebase Auth tokens instead
### Removed Test Patterns
- ❌ Jest - Use Vitest instead
- ❌ Tests for PostgreSQL/Redis architecture - Architecture changed to Supabase/Firebase
### Old API Patterns
- ❌ Direct database queries - Use model methods from `backend/src/models/`
- ❌ Manual error handling without structured logging - Use StructuredLogger
## Common Bugs to Avoid
### 1. Missing Correlation IDs
- **Problem**: Logs without correlation IDs make debugging difficult
- **Solution**: Always use `StructuredLogger` with correlation ID for request-scoped operations
- **Example**: `const logger = new StructuredLogger(correlationId);`
### 2. Unhandled Promise Rejections
- **Problem**: Async operations without try-catch cause unhandled rejections
- **Solution**: Always wrap async operations in try-catch blocks
- **Check**: `backend/src/index.ts` has a global unhandled-rejection handler
### 3. Type Assertions Instead of Type Guards
- **Problem**: Using `as` type assertions can hide type errors
- **Solution**: Use proper type guards: `error instanceof Error ? error.message : String(error)`
### 4. Missing Error Context
- **Problem**: Errors logged without sufficient context
- **Solution**: Include documentId, userId, jobId, and operation context in error logs
### 5. Firebase/Supabase Error Handling
- **Problem**: Not extracting meaningful error messages from Firebase/Supabase errors
- **Solution**: Check error.code and error.message, log full error object for debugging
### 6. Vector Search Timeouts
- **Problem**: Vector search operations can time out
- **Solution**: See `backend/sql/fix_vector_search_timeout.sql` for timeout fixes
- **Reference**: `backend/src/services/vectorDatabaseService.ts`
### 7. Job Processing Timeouts
- **Problem**: Jobs can exceed the 14-minute timeout limit
- **Solution**: Check `backend/src/services/jobProcessorService.ts` for timeout handling
- **Pattern**: Jobs should update status before timeout, handle gracefully
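One way to update status before the timeout is to race the job against a timer set below the limit. The sketch below is not the logic in `jobProcessorService.ts`; the one-minute safety margin and the names `runWithDeadline`, `processJob`, and `updateStatus` are assumptions.

```typescript
const JOB_TIMEOUT_MS = 14 * 60 * 1000; // platform limit stated above
const SAFETY_MARGIN_MS = 60 * 1000;    // assumed margin for writing the final status

// Races the job against a timer. The timer is always cleared afterwards.
// Note: losing the race does not cancel the underlying work.
async function runWithDeadline<T>(
  job: () => Promise<T>,
  budgetMs: number = JOB_TIMEOUT_MS - SAFETY_MARGIN_MS,
): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const deadline = new Promise<never>((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`Job exceeded its ${budgetMs} ms budget`)),
      budgetMs,
    );
  });
  try {
    return await Promise.race([job(), deadline]);
  } finally {
    if (timer !== undefined) clearTimeout(timer);
  }
}

// Hypothetical usage: record a terminal status while there is still time to write it.
async function processJob(
  job: () => Promise<void>,
  updateStatus: (status: 'completed' | 'failed', reason?: string) => Promise<void>,
  budgetMs?: number,
): Promise<void> {
  try {
    await runWithDeadline(job, budgetMs);
    await updateStatus('completed');
  } catch (error: unknown) {
    const reason = error instanceof Error ? error.message : String(error);
    await updateStatus('failed', reason);
  }
}
```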
### 8. LLM Response Validation
- **Problem**: LLM responses may not match expected JSON schema
- **Solution**: Use Zod validation with retry logic (see `llmService.ts` lines 236-450)
- **Pattern**: 3 retry attempts with improved prompts on validation failure
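The validate-and-retry loop can be sketched as follows. To stay dependency-free, the sketch uses a hand-written validator where the project uses a Zod schema; `CimSummary`, `validateSummary`, and `extractWithValidation` are hypothetical, and the corrective prompt wording is only an example.

```typescript
// Result shape loosely modeled on a schema validator's safe-parse result.
type Validation<T> =
  | { status: 'valid'; data: T }
  | { status: 'invalid'; issues: string[] };

interface CimSummary {
  companyName: string;
  revenue: number;
}

// Hand-written stand-in for a schema check.
function validateSummary(raw: string): Validation<CimSummary> {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return { status: 'invalid', issues: ['response is not valid JSON'] };
  }
  if (typeof parsed !== 'object' || parsed === null) {
    return { status: 'invalid', issues: ['response is not a JSON object'] };
  }
  const { companyName, revenue } = parsed as Record<string, unknown>;
  const issues: string[] = [];
  if (typeof companyName !== 'string') issues.push('companyName must be a string');
  if (typeof revenue !== 'number') issues.push('revenue must be a number');
  if (typeof companyName !== 'string' || typeof revenue !== 'number') {
    return { status: 'invalid', issues };
  }
  return { status: 'valid', data: { companyName, revenue } };
}

// Calls the model up to `maxAttempts` times, feeding validation issues
// back into the prompt after each rejected response.
async function extractWithValidation(
  callLlm: (prompt: string) => Promise<string>,
  basePrompt: string,
  maxAttempts = 3,
): Promise<CimSummary> {
  let prompt = basePrompt;
  let lastIssues: string[] = [];
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const result = validateSummary(await callLlm(prompt));
    if (result.status === 'valid') return result.data;
    lastIssues = result.issues;
    prompt = `${basePrompt}\n\nYour previous answer was rejected: ${lastIssues.join('; ')}. Return only valid JSON.`;
  }
  throw new Error(
    `LLM response failed validation after ${maxAttempts} attempts: ${lastIssues.join('; ')}`,
  );
}
```

Passing the model call in as a function keeps the loop testable without network access.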
## Context Management
### Using @ Symbols for Context
**@Files** - Reference specific files:
- `@backend/src/utils/logger.ts` - For logging patterns
- `@backend/src/services/jobQueueService.ts` - For job processing patterns
- `@backend/src/services/llmService.ts` - For LLM API patterns
- `@backend/src/middleware/errorHandler.ts` - For error handling patterns
**@Codebase** - Semantic search (Chat only):
- Use for finding similar implementations
- Example: "How is document processing handled?" → searches entire codebase
**@Folders** - Include entire directories:
- `@backend/src/services/` - All service files
- `@backend/src/scripts/` - All debugging scripts
- `@backend/src/models/` - All database models
**@Lint Errors** - Reference current lint errors (Chat only):
- Use when fixing linting issues
**@Git** - Access git history:
- Use to see recent changes and understand context
### Key File References for Common Tasks
**Logging:**
- `backend/src/utils/logger.ts` - Winston logger and StructuredLogger class
**Job Processing:**
- `backend/src/services/jobQueueService.ts` - Job queue management
- `backend/src/services/jobProcessorService.ts` - Job execution logic
**Document Processing:**
- `backend/src/services/unifiedDocumentProcessor.ts` - Main orchestrator
- `backend/src/services/documentAiProcessor.ts` - Google Document AI integration
- `backend/src/services/optimizedAgenticRAGProcessor.ts` - AI-powered analysis
**LLM Services:**
- `backend/src/services/llmService.ts` - LLM API interactions with retry logic
**File Storage:**
- `backend/src/services/fileStorageService.ts` - GCS and Firebase Storage operations
**Database:**
- `backend/src/models/DocumentModel.ts` - Document database operations
- `backend/src/models/ProcessingJobModel.ts` - Job database operations
- `backend/src/config/supabase.ts` - Supabase client configuration
**Debugging Scripts:**
- `backend/src/scripts/` - Collection of debugging and monitoring scripts
## Debugging Scripts Usage
### When to Use Existing Scripts vs Create New Ones
**Use Existing Scripts For:**
- Monitoring document processing: `monitor-document-processing.ts`
- Checking job status: `check-current-job.ts`, `track-current-job.ts`
- Database failure checks: `check-database-failures.ts`
- System monitoring: `monitor-system.ts`
- Testing LLM pipeline: `test-full-llm-pipeline.ts`
**Create New Scripts When:**
- Need to debug a specific new issue
- Existing scripts don't cover the use case
- Creating a one-time diagnostic tool
### Script Naming Conventions
- `check-*` - Diagnostic scripts that check status
- `monitor-*` - Continuous monitoring scripts
- `track-*` - Tracking specific operations
- `test-*` - Testing specific functionality
- `setup-*` - Setup and configuration scripts
### Common Debugging Workflows
**Debugging a Stuck Document:**
1. Use `check-new-doc-status.ts` to check document status
2. Use `check-current-job.ts` to check associated job
3. Use `monitor-document.ts` for real-time monitoring
4. Use `manually-process-job.ts` to reprocess if needed
**Debugging LLM Issues:**
1. Use `test-openrouter-simple.ts` for basic LLM connectivity
2. Use `test-full-llm-pipeline.ts` for end-to-end LLM testing
3. Use `test-llm-processing-offline.ts` for offline testing
**Debugging Database Issues:**
1. Use `check-database-failures.ts` to check for failures
2. Check SQL files in `backend/sql/` for schema fixes
3. Review `backend/src/models/` for model issues
## YOLO Mode Configuration
When using Cursor's YOLO mode, these commands are always allowed:
- Test commands: `npm test`, `vitest`, `npm run test:watch`, `npm run test:coverage`
- Build commands: `npm run build`, `tsc`, `npm run lint`
- File operations: `touch`, `mkdir`, file creation/editing
- Running debugging scripts: `ts-node backend/src/scripts/*.ts`
- Database scripts: `npm run db:*` commands
## Logging Patterns
### Winston Logger Usage
**Basic Logging:**
```typescript
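import { logger } from './utils/logger';

logger.info('Operation started', { documentId, userId });
// A caught value is `unknown`, so narrow it before reading `.message`.
const message = error instanceof Error ? error.message : String(error);
logger.error('Operation failed', { error: message, documentId });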
```
**Structured Logger with Correlation ID:**
```typescript
import { StructuredLogger } from './utils/logger';
const structuredLogger = new StructuredLogger(correlationId);
structuredLogger.processingStart(documentId, userId, options);
structuredLogger.processingError(error, documentId, userId, 'llm_processing');
```
**Service-Specific Logging:**
- Upload operations: Use `structuredLogger.uploadStart()`, `uploadSuccess()`, `uploadError()`
- Processing operations: Use `structuredLogger.processingStart()`, `processingSuccess()`, `processingError()`
- Storage operations: Use `structuredLogger.storageOperation()`
- Job queue operations: Use `structuredLogger.jobQueueOperation()`
**Error Logging Best Practices:**
- Always include error message: `error instanceof Error ? error.message : String(error)`
- Include stack trace: `error instanceof Error ? error.stack : undefined`
- Add context: documentId, userId, jobId, operation name
- Use structured data, not string concatenation
## Firebase/Supabase Error Handling
### Firebase Errors
- Check `error.code` for specific error codes
- Firebase Auth errors: Handle `auth/` prefixed codes
- Firebase Storage errors: Handle `storage/` prefixed codes
- Log full error object for debugging: `logger.error('Firebase error', { error, code: error.code })`
### Supabase Errors
- Check `error.code` and `error.message`
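- Auth and RLS errors: `error.code === 'PGRST301'` indicates an invalid or undecodable JWT; RLS policy violations typically surface as PostgreSQL code `42501`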
- Connection errors: Implement retry logic
- Log with context: `logger.error('Supabase error', { error: error.message, code: error.code, query })`
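A minimal sketch of reading `code` and `message` defensively from any thrown value. `normalizeServiceError` is a hypothetical helper, and the rule that treats `PGRST`-prefixed or five-character codes as Supabase errors is an assumption for illustration.

```typescript
type ErrorSource = 'firebase-auth' | 'firebase-storage' | 'supabase' | 'unknown';

interface NormalizedError {
  source: ErrorSource;
  code: string | undefined;
  message: string;
}

// Extracts `code` and `message` without assuming the thrown value is an Error.
function normalizeServiceError(error: unknown): NormalizedError {
  const record: Record<string, unknown> =
    typeof error === 'object' && error !== null
      ? (error as Record<string, unknown>)
      : {};
  const code = typeof record.code === 'string' ? record.code : undefined;
  const message =
    typeof record.message === 'string' ? record.message : String(error);

  let source: ErrorSource = 'unknown';
  if (code?.startsWith('auth/')) {
    source = 'firebase-auth';
  } else if (code?.startsWith('storage/')) {
    source = 'firebase-storage';
  } else if (code !== undefined && /^(PGRST\d+|[0-9A-Z]{5})$/.test(code)) {
    // Assumption: PostgREST codes or five-character SQLSTATE codes come from Supabase.
    source = 'supabase';
  }
  return { source, code, message };
}
```

The normalized object can be passed directly as log metadata, for example `logger.error('Service error', { ...normalizeServiceError(err), documentId })`.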
## Retry Patterns
### LLM API Retries (from llmService.ts)
- 3 retry attempts for API calls
- Exponential backoff between retries
- Improved prompts on validation failure
- Log each attempt with attempt number
### Database Operation Retries
- Use connection pooling (handled by Supabase client)
- Retry on connection errors
- Don't retry on validation errors
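The retry rules above can be sketched as a generic helper with exponential backoff and a caller-supplied retry predicate. This is not the code in `llmService.ts`; `retryWithBackoff` and the `RetryOptions` fields are assumed names.

```typescript
interface RetryOptions {
  maxAttempts: number; // total attempts, including the first call
  baseDelayMs: number; // delay before the second attempt
  isRetryable: (error: unknown) => boolean; // e.g. true for connection errors, false for validation errors
  onAttemptFailed?: (attempt: number, error: unknown) => void; // hook for logging each attempt
}

const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

async function retryWithBackoff<T>(
  fn: () => Promise<T>,
  options: RetryOptions,
): Promise<T> {
  const { maxAttempts, baseDelayMs, isRetryable, onAttemptFailed } = options;
  for (let attempt = 1; ; attempt++) {
    try {
      return await fn();
    } catch (error: unknown) {
      onAttemptFailed?.(attempt, error);
      // Give up when attempts are exhausted or the error is not worth retrying.
      if (attempt >= maxAttempts || !isRetryable(error)) throw error;
      // Delay doubles each time: base, 2x base, 4x base, ...
      await sleep(baseDelayMs * 2 ** (attempt - 1));
    }
  }
}
```

With `maxAttempts: 3` and `baseDelayMs: 1000`, a persistently failing call waits 1 s and then 2 s before the final error is thrown.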
## Testing Guidelines
### Test Structure
- Unit tests: `backend/src/__tests__/unit/`
- Integration tests: `backend/src/__tests__/integration/`
- Test utilities: `backend/src/__tests__/utils/`
- Mocks: `backend/src/__tests__/mocks/`
### Critical Paths to Test
1. Document upload workflow
2. Authentication flow
3. Core API endpoints
4. Job processing pipeline
5. LLM service interactions
### Mocking External Services
- Firebase: Mock Firebase Admin SDK
- Supabase: Mock Supabase client
- LLM APIs: Mock HTTP responses
- Google Cloud Storage: Mock GCS client
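Mocking is simplest when a service receives its external client through the constructor. The sketch below uses a hand-written fake so it runs without a test framework; `LlmClient`, `SummaryService`, and `FakeLlmClient` are hypothetical names, not existing project classes.

```typescript
// Narrow interface covering only what the service needs from an LLM client.
interface LlmClient {
  complete(prompt: string): Promise<string>;
}

// Hypothetical service; the injected client is what makes it testable.
class SummaryService {
  private readonly llm: LlmClient;

  constructor(llm: LlmClient) {
    this.llm = llm;
  }

  async summarize(documentText: string): Promise<string> {
    const reply = await this.llm.complete(`Summarize this CIM:\n${documentText}`);
    return reply.trim();
  }
}

// Fake that records prompts and returns a canned reply; no network involved.
class FakeLlmClient implements LlmClient {
  readonly prompts: string[] = [];
  private readonly reply: string;

  constructor(reply: string) {
    this.reply = reply;
  }

  async complete(prompt: string): Promise<string> {
    this.prompts.push(prompt);
    return this.reply;
  }
}
```

In a Vitest test the fake could be replaced by an object whose `complete` is a `vi.fn()`, with the same assertions on the recorded calls.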
## Performance Considerations
- Vector search operations can be slow - use timeouts
- LLM API calls are expensive - implement caching where possible
- Job processing has a 14-minute timeout limit
- Large PDFs may cause memory issues - use streaming where possible
- Database queries should use indexes (check Supabase dashboard)
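As one possible form of LLM response caching, the sketch below keeps an in-memory map keyed by model and prompt. It assumes that identical requests may safely share a response; `createCachedLlm` and its `maxEntries` parameter are hypothetical.

```typescript
// Wraps an LLM call so identical (model, prompt) requests share one result.
// Assumption: reusing a response for the same model and prompt is acceptable.
function createCachedLlm(
  callLlm: (model: string, prompt: string) => Promise<string>,
  maxEntries = 100,
): (model: string, prompt: string) => Promise<string> {
  const cache = new Map<string, Promise<string>>();

  return (model, prompt) => {
    const key = JSON.stringify([model, prompt]);
    const hit = cache.get(key);
    if (hit !== undefined) return hit;

    // Store the promise itself so concurrent callers share one in-flight request.
    const pending = callLlm(model, prompt);
    cache.set(key, pending);
    // A failed call is dropped from the cache so a later call can try again.
    pending.catch(() => cache.delete(key));

    // Evict the oldest entry; Map iterates in insertion order.
    if (cache.size > maxEntries) {
      const oldest = cache.keys().next().value;
      if (oldest !== undefined) cache.delete(oldest);
    }
    return pending;
  };
}
```

This cache lives only for the lifetime of one process, so in a serverless deployment it helps within a single invocation or warm instance and nothing beyond that.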
## Security Best Practices
- Never log sensitive data (passwords, API keys, tokens)
- Use environment variables for all secrets (see `backend/src/config/env.ts`)
- Validate all user inputs (see `backend/src/middleware/validation.ts`)
- Use Firebase Auth for authentication - never bypass
- Respect Row Level Security (RLS) policies in Supabase
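To keep secrets out of logs, metadata can be scrubbed before it reaches the logger. The sketch below is a hypothetical `redact` helper; the key pattern is an assumption and would need to match the field names actually used in the project.

```typescript
// Assumed pattern for sensitive-looking keys; extend it to fit real field names.
const SENSITIVE_KEY = /password|secret|token|api[-_]?key|authorization/i;

// Returns a deep copy of `meta` with values under sensitive keys replaced.
// Handles plain objects and arrays; other values are returned unchanged.
function redact(meta: unknown): unknown {
  if (Array.isArray(meta)) return meta.map((item) => redact(item));
  if (typeof meta !== 'object' || meta === null) return meta;

  const result: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(meta)) {
    result[key] = SENSITIVE_KEY.test(key) ? '[REDACTED]' : redact(value);
  }
  return result;
}
```

A call such as `logger.info('Request received', redact({ userId, headers }))` would then log the user ID while masking any `authorization` header.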