Major release with significant performance improvements and new processing strategy.

## Core Changes
- Implemented simple_full_document processing strategy (default)
- Full document → LLM approach: 1-2 passes, ~5-6 minutes processing time
- Achieved 100% completeness with 2 API calls (down from 5+)
- Removed redundant Document AI passes for faster processing

## Financial Data Extraction
- Enhanced deterministic financial table parser
- Improved FY3/FY2/FY1/LTM identification from varying CIM formats
- Automatic merging of parser results with LLM extraction

## Code Quality & Infrastructure
- Cleaned up debug logging (removed emoji markers from production code)
- Fixed Firebase Secrets configuration (using modern defineSecret approach)
- Updated OpenAI API key
- Resolved deployment conflicts (secrets vs environment variables)
- Added .env files to Firebase ignore list

## Deployment
- Firebase Functions v2 deployment successful
- All 7 required secrets verified and configured
- Function URL: https://api-y56ccs6wva-uc.a.run.app

## Performance Improvements
- Processing time: ~5-6 minutes (down from 23+ minutes)
- API calls: 1-2 (down from 5+)
- Completeness: 100% achievable
- LLM Model: claude-3-7-sonnet-latest

## Breaking Changes
- Default processing strategy changed to 'simple_full_document'
- RAG processor available as alternative strategy 'document_ai_agentic_rag'

## Files Changed
- 36 files changed, 5642 insertions(+), 4451 deletions(-)
- Removed deprecated documentation files
- Cleaned up unused services and models

This release represents a major refactoring focused on speed, accuracy, and maintainability.

# CIM Document Processor - Cursor Rules

## Project Overview

This is an AI-powered document processing system for analyzing Confidential Information Memorandums (CIMs). The system extracts text from PDFs, processes them through LLM services (Claude AI/OpenAI), generates structured analysis, and creates summary PDFs.

**Core Purpose**: Automated processing and analysis of CIM documents using Google Document AI, vector embeddings, and LLM services.

## Tech Stack

### Backend
- **Runtime**: Node.js 18+ with TypeScript
- **Framework**: Express.js
- **Database**: Supabase (PostgreSQL + Vector Database)
- **Storage**: Google Cloud Storage (primary), Firebase Storage (fallback)
- **AI Services**:
  - Google Document AI (text extraction)
  - Anthropic Claude (primary LLM)
  - OpenAI (fallback LLM)
  - OpenRouter (LLM routing)
- **Authentication**: Firebase Auth
- **Deployment**: Firebase Functions v2

### Frontend
- **Framework**: React 18 + TypeScript
- **Build Tool**: Vite
- **HTTP Client**: Axios
- **Routing**: React Router
- **Styling**: Tailwind CSS

## Critical Rules

### TypeScript Standards
- **ALWAYS** use strict TypeScript types - avoid the `any` type
- Use proper type definitions from `backend/src/types/` and `frontend/src/types/`
- Write new code so that it compiles under `noImplicitAny: true` (the flag is currently disabled in tsconfig.json for legacy reasons)
- Use interfaces for object shapes, types for unions/primitives
- Prefer `unknown` over `any` when the type is truly unknown

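The rules above can be illustrated with a short sketch. The `ProcessingJob` shape, the `JobStatus` union, and the `parseJob` helper are hypothetical names chosen for this example; the project's real definitions live in `backend/src/types/` and may look different.

```typescript
// A type alias suits a union of string literals.
type JobStatus = 'pending' | 'processing' | 'completed' | 'failed';

// An interface suits an object shape. Hypothetical fields, for illustration only.
interface ProcessingJob {
  id: string;
  documentId: string;
  status: JobStatus;
}

const JOB_STATUSES: readonly string[] = ['pending', 'processing', 'completed', 'failed'];

// Type guard: narrows an `unknown` value to ProcessingJob after checking every field.
function isProcessingJob(value: unknown): value is ProcessingJob {
  if (typeof value !== 'object' || value === null) {
    return false;
  }
  // Widening a checked object to a string-keyed record is safe; each field is still verified below.
  const record = value as Record<string, unknown>;
  return (
    typeof record.id === 'string' &&
    typeof record.documentId === 'string' &&
    typeof record.status === 'string' &&
    JOB_STATUSES.indexOf(record.status) !== -1
  );
}

// JSON.parse returns `any`; assigning it to `unknown` forces callers through the guard.
function parseJob(raw: string): ProcessingJob | null {
  const parsed: unknown = JSON.parse(raw);
  return isProcessingJob(parsed) ? parsed : null;
}
```

Typing the parsed value as `unknown` rather than `any` means the compiler rejects any property access until the guard has run.
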
### Logging Standards
- **ALWAYS** use Winston logger from `backend/src/utils/logger.ts`
- Use `StructuredLogger` class for operations with correlation IDs
- Log levels:
  - `logger.debug()` - Detailed diagnostic info
  - `logger.info()` - Normal operations
  - `logger.warn()` - Warning conditions
  - `logger.error()` - Error conditions with context
- Include correlation IDs for request tracing
- Log structured data: `logger.error('Message', { key: value, error: error.message })`
- Never use `console.log` in production code - use logger instead

### Error Handling Patterns
- **ALWAYS** use try-catch blocks for async operations
- Include error context: `error instanceof Error ? error.message : String(error)`
- Log errors with structured data before re-throwing
- Use existing error handling middleware: `backend/src/middleware/errorHandler.ts`
- For Firebase/Supabase errors, extract meaningful messages from error objects
- Retry patterns: Use exponential backoff for external API calls (see `llmService.ts` for examples)

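As a minimal sketch of the "log with context, then re-throw" pattern, the wrapper below uses an in-memory `logger` object as a stand-in for the Winston logger in `backend/src/utils/logger.ts`. The `withErrorContext` helper and its parameters are assumptions made for this example, not part of the codebase.

```typescript
// Stand-in for the project logger (assumption: it accepts a message plus a metadata object).
interface LogEntry {
  message: string;
  meta: Record<string, unknown>;
}
const errorLog: LogEntry[] = [];
const logger = {
  error(message: string, meta: Record<string, unknown>): void {
    errorLog.push({ message, meta });
  },
};

// Hypothetical wrapper: runs an async operation, logs structured context on failure,
// then re-throws the original error so callers and middleware still see it.
async function withErrorContext<T>(
  operation: string,
  context: Record<string, unknown>,
  fn: () => Promise<T>,
): Promise<T> {
  try {
    return await fn();
  } catch (error) {
    logger.error(`${operation} failed`, {
      ...context,
      operation,
      error: error instanceof Error ? error.message : String(error),
      stack: error instanceof Error ? error.stack : undefined,
    });
    throw error;
  }
}
```

A call such as `withErrorContext('llm_processing', { documentId, userId }, () => runStep())` keeps the identifiers next to the error message in a single log entry (`runStep` is a placeholder).
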
### Service Architecture
- Services should be in `backend/src/services/`
- Use dependency injection patterns where possible
- Services should handle their own errors and log appropriately
- Reference existing services before creating new ones:
  - `jobQueueService.ts` - Background job processing
  - `unifiedDocumentProcessor.ts` - Main document processing orchestrator
  - `llmService.ts` - LLM API interactions
  - `fileStorageService.ts` - File storage operations
  - `vectorDatabaseService.ts` - Vector embeddings and search

### Database Patterns
- Use Supabase client from `backend/src/config/supabase.ts`
- Models should be in `backend/src/models/`
- Always handle Row Level Security (RLS) policies
- Use transactions for multi-step operations
- Handle connection errors gracefully with retries

### Testing Standards
- Use Vitest for testing (Jest was removed - see TESTING_STRATEGY_DOCUMENTATION.md)
- Write tests in `backend/src/__tests__/`
- Test critical paths first: document upload, authentication, core API endpoints
- Use TDD approach: write tests first, then implementation
- Mock external services (Firebase, Supabase, LLM APIs)

## Deprecated Patterns (DO NOT USE)

### Removed Services
- ❌ `agenticRAGDatabaseService.ts` - Removed, functionality moved to other services
- ❌ `sessionService.ts` - Removed, use Firebase Auth directly
- ❌ Direct PostgreSQL connections - Use Supabase client instead
- ❌ Redis caching - Not used in current architecture
- ❌ JWT authentication - Use Firebase Auth tokens instead

### Removed Test Patterns
- ❌ Jest - Use Vitest instead
- ❌ Tests for PostgreSQL/Redis architecture - Architecture changed to Supabase/Firebase

### Old API Patterns
- ❌ Direct database queries - Use model methods from `backend/src/models/`
- ❌ Manual error handling without structured logging - Use StructuredLogger

## Common Bugs to Avoid

### 1. Missing Correlation IDs
- **Problem**: Logs without correlation IDs make debugging difficult
- **Solution**: Always use `StructuredLogger` with correlation ID for request-scoped operations
- **Example**: `const logger = new StructuredLogger(correlationId);`

### 2. Unhandled Promise Rejections
- **Problem**: Async operations without try-catch cause unhandled rejections
- **Solution**: Always wrap async operations in try-catch blocks
- **Check**: `backend/src/index.ts` has a global unhandled rejection handler

### 3. Type Assertions Instead of Type Guards
- **Problem**: Using `as` type assertions can hide type errors
- **Solution**: Use proper type guards: `error instanceof Error ? error.message : String(error)`

### 4. Missing Error Context
- **Problem**: Errors logged without sufficient context
- **Solution**: Include documentId, userId, jobId, and operation context in error logs

### 5. Firebase/Supabase Error Handling
- **Problem**: Not extracting meaningful error messages from Firebase/Supabase errors
- **Solution**: Check error.code and error.message, log full error object for debugging

### 6. Vector Search Timeouts
- **Problem**: Vector search operations can time out
- **Solution**: See `backend/sql/fix_vector_search_timeout.sql` for timeout fixes
- **Reference**: `backend/src/services/vectorDatabaseService.ts`

### 7. Job Processing Timeouts
- **Problem**: Jobs can exceed the 14-minute timeout limit
- **Solution**: Check `backend/src/services/jobProcessorService.ts` for timeout handling
- **Pattern**: Jobs should update their status before the timeout and fail gracefully

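One way to read the "update status before timeout" pattern is a deadline check between processing steps. The sketch below is not taken from `jobProcessorService.ts`; `runWithDeadline`, `JobStatusStore`, and the one-minute safety margin are assumptions made for illustration.

```typescript
// The 14-minute limit comes from the rule above; the safety margin is a hypothetical choice.
const JOB_TIMEOUT_MS = 14 * 60 * 1000;
const SAFETY_MARGIN_MS = 60 * 1000;

type JobOutcome = 'completed' | 'failed';

// Hypothetical persistence hook; the real project would go through its job model.
interface JobStatusStore {
  update(jobId: string, status: JobOutcome, reason?: string): Promise<void>;
}

// Runs steps in order. Before each step it checks the remaining budget and, if less
// than the safety margin is left, records a failure instead of starting the step.
async function runWithDeadline(
  jobId: string,
  steps: Array<() => Promise<void>>,
  store: JobStatusStore,
  now: () => number = Date.now,
  timeoutMs: number = JOB_TIMEOUT_MS,
  marginMs: number = SAFETY_MARGIN_MS,
): Promise<JobOutcome> {
  const deadline = now() + timeoutMs;
  for (let i = 0; i < steps.length; i++) {
    if (deadline - now() < marginMs) {
      await store.update(jobId, 'failed', `deadline reached before step ${i + 1} of ${steps.length}`);
      return 'failed';
    }
    await steps[i]();
  }
  await store.update(jobId, 'completed');
  return 'completed';
}
```

The check only runs between steps, so a single step that runs longer than the remaining budget is not interrupted; keeping steps small is what makes the margin meaningful.
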
### 8. LLM Response Validation
- **Problem**: LLM responses may not match expected JSON schema
- **Solution**: Use Zod validation with retry logic (see `llmService.ts` lines 236-450)
- **Pattern**: 3 retry attempts with improved prompts on validation failure

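The validate-then-retry loop can be sketched as follows. To keep the example dependency-free, a hand-written `validateSummary` function stands in for a Zod schema check; the `CimSummary` shape, `extractWithRetry`, and the wording of the corrective prompt are all assumptions and do not reproduce `llmService.ts`.

```typescript
// Hypothetical target shape, for illustration only.
interface CimSummary {
  companyName: string;
  revenue: number;
}

type ValidationResult =
  | { status: 'valid'; data: CimSummary }
  | { status: 'invalid'; issues: string[] };

// Stand-in for schema validation: parses JSON and checks each expected field.
function validateSummary(raw: string): ValidationResult {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return { status: 'invalid', issues: ['response is not valid JSON'] };
  }
  if (typeof parsed !== 'object' || parsed === null) {
    return { status: 'invalid', issues: ['response is not a JSON object'] };
  }
  const { companyName, revenue } = parsed as Record<string, unknown>;
  const issues: string[] = [];
  if (typeof companyName !== 'string') issues.push('companyName must be a string');
  if (typeof revenue !== 'number') issues.push('revenue must be a number');
  if (typeof companyName === 'string' && typeof revenue === 'number') {
    return { status: 'valid', data: { companyName, revenue } };
  }
  return { status: 'invalid', issues };
}

// Calls the model up to maxAttempts times, feeding validation issues back into the prompt.
async function extractWithRetry(
  callLlm: (prompt: string) => Promise<string>,
  basePrompt: string,
  maxAttempts: number = 3,
): Promise<CimSummary> {
  let prompt = basePrompt;
  let lastIssues: string[] = [];
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const result = validateSummary(await callLlm(prompt));
    if (result.status === 'valid') {
      return result.data;
    }
    lastIssues = result.issues;
    prompt = `${basePrompt}\n\nThe previous response was rejected: ${lastIssues.join('; ')}. Return only valid JSON.`;
  }
  throw new Error(`LLM response failed validation after ${maxAttempts} attempts: ${lastIssues.join('; ')}`);
}
```

Passing `callLlm` in as a parameter keeps the loop testable without network access; in a test it can be replaced by a function that returns canned responses.
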
## Context Management

### Using @ Symbols for Context

**@Files** - Reference specific files:
- `@backend/src/utils/logger.ts` - For logging patterns
- `@backend/src/services/jobQueueService.ts` - For job processing patterns
- `@backend/src/services/llmService.ts` - For LLM API patterns
- `@backend/src/middleware/errorHandler.ts` - For error handling patterns

**@Codebase** - Semantic search (Chat only):
- Use for finding similar implementations
- Example: "How is document processing handled?" → searches entire codebase

**@Folders** - Include entire directories:
- `@backend/src/services/` - All service files
- `@backend/src/scripts/` - All debugging scripts
- `@backend/src/models/` - All database models

**@Lint Errors** - Reference current lint errors (Chat only):
- Use when fixing linting issues

**@Git** - Access git history:
- Use to see recent changes and understand context

### Key File References for Common Tasks

**Logging:**
- `backend/src/utils/logger.ts` - Winston logger and StructuredLogger class

**Job Processing:**
- `backend/src/services/jobQueueService.ts` - Job queue management
- `backend/src/services/jobProcessorService.ts` - Job execution logic

**Document Processing:**
- `backend/src/services/unifiedDocumentProcessor.ts` - Main orchestrator
- `backend/src/services/documentAiProcessor.ts` - Google Document AI integration
- `backend/src/services/optimizedAgenticRAGProcessor.ts` - AI-powered analysis

**LLM Services:**
- `backend/src/services/llmService.ts` - LLM API interactions with retry logic

**File Storage:**
- `backend/src/services/fileStorageService.ts` - GCS and Firebase Storage operations

**Database:**
- `backend/src/models/DocumentModel.ts` - Document database operations
- `backend/src/models/ProcessingJobModel.ts` - Job database operations
- `backend/src/config/supabase.ts` - Supabase client configuration

**Debugging Scripts:**
- `backend/src/scripts/` - Collection of debugging and monitoring scripts

## Debugging Scripts Usage

### When to Use Existing Scripts vs Create New Ones

**Use Existing Scripts For:**
- Monitoring document processing: `monitor-document-processing.ts`
- Checking job status: `check-current-job.ts`, `track-current-job.ts`
- Database failure checks: `check-database-failures.ts`
- System monitoring: `monitor-system.ts`
- Testing LLM pipeline: `test-full-llm-pipeline.ts`

**Create New Scripts When:**
- Need to debug a specific new issue
- Existing scripts don't cover the use case
- Creating a one-time diagnostic tool

### Script Naming Conventions
- `check-*` - Diagnostic scripts that check status
- `monitor-*` - Continuous monitoring scripts
- `track-*` - Tracking specific operations
- `test-*` - Testing specific functionality
- `setup-*` - Setup and configuration scripts

### Common Debugging Workflows

**Debugging a Stuck Document:**
1. Use `check-new-doc-status.ts` to check document status
2. Use `check-current-job.ts` to check associated job
3. Use `monitor-document.ts` for real-time monitoring
4. Use `manually-process-job.ts` to reprocess if needed

**Debugging LLM Issues:**
1. Use `test-openrouter-simple.ts` for basic LLM connectivity
2. Use `test-full-llm-pipeline.ts` for end-to-end LLM testing
3. Use `test-llm-processing-offline.ts` for offline testing

**Debugging Database Issues:**
1. Use `check-database-failures.ts` to check for failures
2. Check SQL files in `backend/sql/` for schema fixes
3. Review `backend/src/models/` for model issues

## YOLO Mode Configuration

When using Cursor's YOLO mode, these commands are always allowed:
- Test commands: `npm test`, `vitest`, `npm run test:watch`, `npm run test:coverage`
- Build commands: `npm run build`, `tsc`, `npm run lint`
- File operations: `touch`, `mkdir`, file creation/editing
- Running debugging scripts: `ts-node backend/src/scripts/*.ts`
- Database scripts: `npm run db:*` commands

## Logging Patterns

### Winston Logger Usage

**Basic Logging:**
```typescript
import { logger } from './utils/logger';

// Pass identifiers as structured fields instead of concatenating them into the message.
logger.info('Operation started', { documentId, userId });
// Narrow the caught value before reading `.message`; it may not be an Error instance.
logger.error('Operation failed', {
  error: error instanceof Error ? error.message : String(error),
  documentId,
});
```

**Structured Logger with Correlation ID:**
```typescript
import { StructuredLogger } from './utils/logger';

// One logger per request or job, so every entry carries the same correlation ID.
const structuredLogger = new StructuredLogger(correlationId);
structuredLogger.processingStart(documentId, userId, options);
structuredLogger.processingError(error, documentId, userId, 'llm_processing');
```

**Service-Specific Logging:**
- Upload operations: Use `structuredLogger.uploadStart()`, `uploadSuccess()`, `uploadError()`
- Processing operations: Use `structuredLogger.processingStart()`, `processingSuccess()`, `processingError()`
- Storage operations: Use `structuredLogger.storageOperation()`
- Job queue operations: Use `structuredLogger.jobQueueOperation()`

**Error Logging Best Practices:**
- Always include error message: `error instanceof Error ? error.message : String(error)`
- Include stack trace: `error instanceof Error ? error.stack : undefined`
- Add context: documentId, userId, jobId, operation name
- Use structured data, not string concatenation

## Firebase/Supabase Error Handling

### Firebase Errors
- Check `error.code` for specific error codes
- Firebase Auth errors: Handle `auth/` prefixed codes
- Firebase Storage errors: Handle `storage/` prefixed codes
- Log full error object for debugging: `logger.error('Firebase error', { error, code: error.code })`

### Supabase Errors
- Check `error.code` and `error.message`
- RLS policy errors: Check `error.code === 'PGRST301'` (PostgREST reports an invalid or undecodable JWT under this code; a policy violation itself may instead surface as PostgreSQL code `42501`, so check for both)
- Connection errors: Implement retry logic
- Log with context: `logger.error('Supabase error', { error: error.message, code: error.code, query })`

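Because a caught value is `unknown`, reading `error.code` directly is not type-safe. The helper below is a minimal sketch of one way to extract both fields defensively and classify Firebase codes by prefix; `normalizeError` and `NormalizedError` are hypothetical names, not existing project utilities.

```typescript
interface NormalizedError {
  source: 'firebase-auth' | 'firebase-storage' | 'other';
  code: string | undefined;
  message: string;
}

// Reads `code` and `message` only after confirming they exist with the expected types.
function normalizeError(error: unknown): NormalizedError {
  let code: string | undefined;
  let message = typeof error === 'string' ? error : 'Unknown error';
  if (typeof error === 'object' && error !== null) {
    const record = error as Record<string, unknown>;
    if (typeof record.code === 'string') code = record.code;
    if (typeof record.message === 'string') message = record.message;
  }
  // Classify by the prefixes named in the rules above.
  let source: NormalizedError['source'] = 'other';
  if (code !== undefined && code.indexOf('auth/') === 0) {
    source = 'firebase-auth';
  } else if (code !== undefined && code.indexOf('storage/') === 0) {
    source = 'firebase-storage';
  }
  return { source, code, message };
}
```

The normalized object can be spread straight into a structured log call, which keeps `code` and `message` as separate, searchable fields.
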
## Retry Patterns

### LLM API Retries (from llmService.ts)
- 3 retry attempts for API calls
- Exponential backoff between retries
- Improved prompts on validation failure
- Log each attempt with attempt number

### Database Operation Retries
- Use connection pooling (handled by Supabase client)
- Retry on connection errors
- Don't retry on validation errors

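A generic helper can cover both cases above: it doubles the delay after each failed attempt and stops immediately when the error is not retryable. This is a sketch under stated assumptions rather than the code in `llmService.ts`; `retryWithBackoff`, `RetryOptions`, and `isConnectionError` are hypothetical names, and the error codes checked are common Node.js network codes used here as examples.

```typescript
interface RetryOptions {
  maxAttempts: number; // total attempts, including the first call
  baseDelayMs: number; // delay before the second attempt
  isRetryable: (error: unknown) => boolean;
  sleep?: (ms: number) => Promise<void>; // injectable so tests do not wait
}

const defaultSleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

async function retryWithBackoff<T>(
  fn: (attempt: number) => Promise<T>,
  options: RetryOptions,
): Promise<T> {
  const sleep = options.sleep ?? defaultSleep;
  let lastError: unknown;
  for (let attempt = 1; attempt <= options.maxAttempts; attempt++) {
    try {
      return await fn(attempt);
    } catch (error) {
      lastError = error;
      // Give up on non-retryable errors (e.g. validation) or when attempts are exhausted.
      if (!options.isRetryable(error) || attempt === options.maxAttempts) {
        break;
      }
      // Delay doubles each time: base, 2 x base, 4 x base, ...
      await sleep(options.baseDelayMs * Math.pow(2, attempt - 1));
    }
  }
  throw lastError;
}

// Example predicate: treat only network-level failures as retryable.
function isConnectionError(error: unknown): boolean {
  if (typeof error !== 'object' || error === null) return false;
  const code = (error as Record<string, unknown>).code;
  return code === 'ECONNRESET' || code === 'ETIMEDOUT';
}
```

The attempt number is passed to `fn` so the caller can log it, as the LLM retry rules ask.
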
## Testing Guidelines

### Test Structure
- Unit tests: `backend/src/__tests__/unit/`
- Integration tests: `backend/src/__tests__/integration/`
- Test utilities: `backend/src/__tests__/utils/`
- Mocks: `backend/src/__tests__/mocks/`

### Critical Paths to Test
1. Document upload workflow
2. Authentication flow
3. Core API endpoints
4. Job processing pipeline
5. LLM service interactions

### Mocking External Services
- Firebase: Mock Firebase Admin SDK
- Supabase: Mock Supabase client
- LLM APIs: Mock HTTP responses
- Google Cloud Storage: Mock GCS client

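One framework-independent way to mock an external service is to inject a hand-written fake through a narrow interface. The sketch below assumes a hypothetical `StorageClient` interface and `DocumentUploader` service; neither exists in the codebase under these names, and the real `fileStorageService.ts` may be structured differently. The same fake can be used inside a Vitest test.

```typescript
// Hypothetical narrow interface over the storage backend.
interface StorageClient {
  upload(path: string, data: Uint8Array): Promise<void>;
  exists(path: string): Promise<boolean>;
}

// In-memory fake: records uploads instead of calling Google Cloud Storage.
class FakeStorageClient implements StorageClient {
  readonly files = new Map<string, Uint8Array>();

  async upload(path: string, data: Uint8Array): Promise<void> {
    this.files.set(path, data);
  }

  async exists(path: string): Promise<boolean> {
    return this.files.has(path);
  }
}

// Hypothetical service that receives its client through the constructor,
// so tests can pass the fake and production code can pass the real client.
class DocumentUploader {
  constructor(private readonly storage: StorageClient) {}

  async uploadOnce(documentId: string, data: Uint8Array): Promise<'uploaded' | 'skipped'> {
    const path = `documents/${documentId}.pdf`;
    if (await this.storage.exists(path)) {
      return 'skipped';
    }
    await this.storage.upload(path, data);
    return 'uploaded';
  }
}
```

Because the fake keeps its state in a `Map`, a test can assert on what was stored as well as on the return value.
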
## Performance Considerations

- Vector search operations can be slow - use timeouts
- LLM API calls are expensive - implement caching where possible
- Job processing has a 14-minute timeout limit
- Large PDFs may cause memory issues - use streaming where possible
- Database queries should use indexes (check Supabase dashboard)

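For the timeout advice, a small wrapper can bound how long a caller waits for any promise. This is a minimal sketch; `withTimeout` is a hypothetical helper, not an existing project utility.

```typescript
// Resolves or rejects with the wrapped promise, unless `ms` milliseconds pass first,
// in which case it rejects with a timeout error.
// Note: this stops waiting, but it does not cancel the underlying work.
function withTimeout<T>(promise: Promise<T>, ms: number, label: string): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => {
      reject(new Error(`${label} timed out after ${ms}ms`));
    }, ms);
    promise.then(
      (value) => {
        clearTimeout(timer); // avoid keeping the process alive once settled
        resolve(value);
      },
      (error) => {
        clearTimeout(timer);
        reject(error);
      },
    );
  });
}
```

A call might look like `await withTimeout(searchVectors(query), 10_000, 'vector search')`, where `searchVectors` is a placeholder for the real search function. If the search itself must be aborted, the underlying client needs its own cancellation mechanism.
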
## Security Best Practices

- Never log sensitive data (passwords, API keys, tokens)
- Use environment variables for all secrets (see `backend/src/config/env.ts`)
- Validate all user inputs (see `backend/src/middleware/validation.ts`)
- Use Firebase Auth for authentication - never bypass
- Respect Row Level Security (RLS) policies in Supabase

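One way to enforce the "never log sensitive data" rule is to mask fields by key name before metadata reaches the logger. The sketch below is an assumption-laden example: `redact` and the key pattern are hypothetical and would need to be adapted to the field names the project actually uses.

```typescript
// Hypothetical pattern of key fragments that mark a field as sensitive.
const SENSITIVE_KEY_PATTERN = /password|secret|token|api[_-]?key|authorization/i;

// Returns a copy of plain JSON-like metadata with sensitive values masked,
// recursing into nested objects and arrays. The input is not mutated.
function redact(value: unknown): unknown {
  if (Array.isArray(value)) {
    return value.map((item) => redact(item));
  }
  if (typeof value === 'object' && value !== null) {
    const record = value as Record<string, unknown>;
    const result: Record<string, unknown> = {};
    for (const key of Object.keys(record)) {
      result[key] = SENSITIVE_KEY_PATTERN.test(key) ? '[REDACTED]' : redact(record[key]);
    }
    return result;
  }
  return value;
}
```

The function matches on key names only, so a secret stored under an innocuous key, or embedded in a message string, is not caught. It is meant for plain metadata objects; class instances such as `Error` should be converted to message and stack fields first.
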