cim_summary/.cursorrules

# CIM Document Processor - Cursor Rules

## Project Overview

This is an AI-powered document processing system for analyzing Confidential Information Memorandums (CIMs). The system extracts text from PDFs, processes the text through LLM services (Anthropic Claude/OpenAI), generates structured analysis, and creates summary PDFs.

**Core Purpose**: Automated processing and analysis of CIM documents using Google Document AI, vector embeddings, and LLM services.

## Tech Stack

### Backend
- **Runtime**: Node.js 18+ with TypeScript
- **Framework**: Express.js
- **Database**: Supabase (PostgreSQL + Vector Database)
- **Storage**: Google Cloud Storage (primary), Firebase Storage (fallback)
- **AI Services**:
  - Google Document AI (text extraction)
  - Anthropic Claude (primary LLM)
  - OpenAI (fallback LLM)
  - OpenRouter (LLM routing)
- **Authentication**: Firebase Auth
- **Deployment**: Firebase Functions v2

### Frontend
- **Framework**: React 18 + TypeScript
- **Build Tool**: Vite
- **HTTP Client**: Axios
- **Routing**: React Router
- **Styling**: Tailwind CSS

## Critical Rules

### TypeScript Standards
- **ALWAYS** use strict TypeScript types - avoid `any` type
- Use proper type definitions from `backend/src/types/` and `frontend/src/types/`
- Write new code so that it compiles under `noImplicitAny: true` (the flag is currently disabled in tsconfig.json for legacy reasons)
- Use interfaces for object shapes, types for unions/primitives
- Prefer `unknown` over `any` when type is truly unknown
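
The rules above can be sketched as follows. This is a minimal illustration, not code from the repository: `DocumentStatus`, `DocumentSummary`, and `isDocumentSummary` are hypothetical names, and the real shapes live in `backend/src/types/`.

```typescript
// Hypothetical names for illustration only.
// A `type` for the union, an `interface` for the object shape.
type DocumentStatus = 'pending' | 'processing' | 'completed' | 'failed';

interface DocumentSummary {
  id: string;
  status: DocumentStatus;
}

const DOCUMENT_STATUSES: readonly string[] = ['pending', 'processing', 'completed', 'failed'];

// Type guard: narrows an `unknown` value to DocumentSummary after runtime checks.
function isDocumentSummary(value: unknown): value is DocumentSummary {
  if (typeof value !== 'object' || value === null) {
    return false;
  }
  // The cast only widens to a generic record so individual fields can be inspected.
  const candidate = value as Record<string, unknown>;
  return (
    typeof candidate.id === 'string' &&
    typeof candidate.status === 'string' &&
    DOCUMENT_STATUSES.includes(candidate.status)
  );
}

function describeDocument(payload: unknown): string {
  if (!isDocumentSummary(payload)) {
    return 'invalid payload';
  }
  // `payload` is typed as DocumentSummary from here on.
  return `${payload.id}: ${payload.status}`;
}
```

Accepting `unknown` forces the caller through the guard before any field is read, which `any` would silently skip.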

### Logging Standards
- **ALWAYS** use Winston logger from `backend/src/utils/logger.ts`
- Use `StructuredLogger` class for operations with correlation IDs
- Log levels:
  - `logger.debug()` - Detailed diagnostic info
  - `logger.info()` - Normal operations
  - `logger.warn()` - Warning conditions
  - `logger.error()` - Error conditions with context
- Include correlation IDs for request tracing
- Log structured data: `logger.error('Message', { key: value, error: error.message })`
- Never use `console.log` in production code - use logger instead

### Error Handling Patterns
- **ALWAYS** use try-catch blocks for async operations
- Include error context: `error instanceof Error ? error.message : String(error)`
- Log errors with structured data before re-throwing
- Use existing error handling middleware: `backend/src/middleware/errorHandler.ts`
- For Firebase/Supabase errors, extract meaningful messages from error objects
- Retry patterns: Use exponential backoff for external API calls (see `llmService.ts` for examples)

### Service Architecture
- Services should be in `backend/src/services/`
- Use dependency injection patterns where possible
- Services should handle their own errors and log appropriately
- Reference existing services before creating new ones:
  - `jobQueueService.ts` - Background job processing
  - `unifiedDocumentProcessor.ts` - Main document processing orchestrator
  - `llmService.ts` - LLM API interactions
  - `fileStorageService.ts` - File storage operations
  - `vectorDatabaseService.ts` - Vector embeddings and search
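
A minimal sketch of the dependency injection guideline, assuming hypothetical `LlmClient`, `Log`, and `SummaryService` definitions (the real `llmService.ts` and logger APIs may differ). The service receives its collaborators through the constructor, logs its own errors, and re-throws.

```typescript
// Hypothetical interfaces for illustration; not the project's actual APIs.
interface LlmClient {
  complete(prompt: string): Promise<string>;
}

interface Log {
  info(message: string, meta?: Record<string, unknown>): void;
  error(message: string, meta?: Record<string, unknown>): void;
}

class SummaryService {
  private readonly llm: LlmClient;
  private readonly log: Log;

  // Dependencies are passed in rather than imported, so tests can supply fakes.
  constructor(llm: LlmClient, log: Log) {
    this.llm = llm;
    this.log = log;
  }

  async summarize(documentId: string, text: string): Promise<string> {
    this.log.info('Summary started', { documentId });
    try {
      return await this.llm.complete(`Summarize the following CIM text:\n${text}`);
    } catch (error) {
      // Log with context, then re-throw so the caller can decide what to do.
      this.log.error('Summary failed', {
        documentId,
        error: error instanceof Error ? error.message : String(error),
      });
      throw error;
    }
  }
}
```

In a test, the same class can be constructed with an in-memory fake for `LlmClient`, so no real LLM API is called.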

### Database Patterns
- Use Supabase client from `backend/src/config/supabase.ts`
- Models should be in `backend/src/models/`
- Always handle Row Level Security (RLS) policies
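- Use transactions for multi-step operations (the Supabase JS client has no client-side transaction API, so this typically means a Postgres function invoked via `rpc()`)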
- Handle connection errors gracefully with retries

### Testing Standards
- Use Vitest for testing (Jest was removed - see TESTING_STRATEGY_DOCUMENTATION.md)
- Write tests in `backend/src/__tests__/`
- Test critical paths first: document upload, authentication, core API endpoints
- Use TDD approach: write tests first, then implementation
- Mock external services (Firebase, Supabase, LLM APIs)

## Deprecated Patterns (DO NOT USE)

### Removed Services
- ❌ `agenticRAGDatabaseService.ts` - Removed, functionality moved to other services
- ❌ `sessionService.ts` - Removed, use Firebase Auth directly
- ❌ Direct PostgreSQL connections - Use Supabase client instead
- ❌ Redis caching - Not used in current architecture
- ❌ JWT authentication - Use Firebase Auth tokens instead

### Removed Test Patterns
- ❌ Jest - Use Vitest instead
- ❌ Tests for PostgreSQL/Redis architecture - Architecture changed to Supabase/Firebase

### Old API Patterns
- ❌ Direct database queries - Use model methods from `backend/src/models/`
- ❌ Manual error handling without structured logging - Use StructuredLogger

## Common Bugs to Avoid

### 1. Missing Correlation IDs
- **Problem**: Logs without correlation IDs make debugging difficult
- **Solution**: Always use `StructuredLogger` with correlation ID for request-scoped operations
- **Example**: `const logger = new StructuredLogger(correlationId);`

### 2. Unhandled Promise Rejections
- **Problem**: Async operations without try-catch cause unhandled rejections
- **Solution**: Always wrap async operations in try-catch blocks
- **Check**: `backend/src/index.ts` has a global unhandled rejection handler

### 3. Type Assertions Instead of Type Guards
- **Problem**: Using `as` type assertions can hide type errors
- **Solution**: Use proper type guards: `error instanceof Error ? error.message : String(error)`

### 4. Missing Error Context
- **Problem**: Errors logged without sufficient context
- **Solution**: Include documentId, userId, jobId, and operation context in error logs

### 5. Firebase/Supabase Error Handling
- **Problem**: Not extracting meaningful error messages from Firebase/Supabase errors
- **Solution**: Check error.code and error.message, log full error object for debugging

### 6. Vector Search Timeouts
- **Problem**: Vector search operations can time out
- **Solution**: See `backend/sql/fix_vector_search_timeout.sql` for timeout fixes
- **Reference**: `backend/src/services/vectorDatabaseService.ts`

### 7. Job Processing Timeouts
- **Problem**: Jobs can exceed the 14-minute timeout limit
- **Solution**: Check `backend/src/services/jobProcessorService.ts` for timeout handling
- **Pattern**: Jobs should update their status before the timeout is reached and fail gracefully
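
One way to apply this pattern is to race the job against an internal deadline set below the platform limit, so there is still time to record the failure. The sketch below is an assumption about how this could look; `runWithDeadline` and `JobOutcome` are hypothetical names, and `jobProcessorService.ts` may handle timeouts differently.

```typescript
// Hypothetical helper for illustration.
type JobOutcome<T> = { status: 'completed'; result: T } | { status: 'timed_out' };

async function runWithDeadline<T>(
  task: () => Promise<T>,
  deadlineMs: number,
  onTimeout: () => Promise<void>,
): Promise<JobOutcome<T>> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const deadline = new Promise<'deadline'>((resolve) => {
    timer = setTimeout(() => resolve('deadline'), deadlineMs);
  });
  try {
    const winner = await Promise.race([
      task().then((result) => ({ result })),
      deadline,
    ]);
    if (winner === 'deadline') {
      // Record the failure while the function still has time left to write it.
      await onTimeout();
      return { status: 'timed_out' };
    }
    return { status: 'completed', result: winner.result };
  } finally {
    // Always clear the timer so it cannot keep the process alive.
    if (timer !== undefined) clearTimeout(timer);
  }
}
```

Note that the race does not cancel the underlying task; it only stops waiting for it. A deadline of roughly 13 minutes would leave a margin under the 14-minute limit, but the exact value is a judgment call.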

### 8. LLM Response Validation
- **Problem**: LLM responses may not match expected JSON schema
- **Solution**: Use Zod validation with retry logic (see `llmService.ts` lines 236-450)
- **Pattern**: 3 retry attempts with improved prompts on validation failure
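
The validate-then-retry loop can be sketched as below. To keep the example dependency-free, a hand-written validator stands in for the Zod schema; `CimAnalysis`, `validateAnalysis`, and `completeWithValidation` are hypothetical names and do not mirror the actual code in `llmService.ts`.

```typescript
// Hypothetical shape for illustration; the real schema is defined with Zod.
interface CimAnalysis {
  companyName: string;
  revenue: number;
}

type ValidationResult =
  | { ok: true; value: CimAnalysis }
  | { ok: false; issues: string[] };

// Stand-in for a schema's safeParse(): returns the parsed value or a list of issues.
function validateAnalysis(raw: string): ValidationResult {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return { ok: false, issues: ['response is not valid JSON'] };
  }
  if (typeof parsed !== 'object' || parsed === null) {
    return { ok: false, issues: ['response is not a JSON object'] };
  }
  const { companyName, revenue } = parsed as Record<string, unknown>;
  const issues: string[] = [];
  if (typeof companyName !== 'string') issues.push('companyName must be a string');
  if (typeof revenue !== 'number') issues.push('revenue must be a number');
  if (typeof companyName !== 'string' || typeof revenue !== 'number') {
    return { ok: false, issues };
  }
  return { ok: true, value: { companyName, revenue } };
}

async function completeWithValidation(
  callLlm: (prompt: string) => Promise<string>,
  prompt: string,
  maxAttempts = 3,
): Promise<CimAnalysis> {
  let currentPrompt = prompt;
  let lastIssues: string[] = [];
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const result = validateAnalysis(await callLlm(currentPrompt));
    if (result.ok) return result.value;
    lastIssues = result.issues;
    // Feed the validation issues back so the next attempt can correct them.
    currentPrompt = `${prompt}\n\nThe previous response was rejected: ${lastIssues.join('; ')}. Return only valid JSON.`;
  }
  throw new Error(`LLM response failed validation after ${maxAttempts} attempts: ${lastIssues.join('; ')}`);
}
```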

## Context Management

### Using @ Symbols for Context

**@Files** - Reference specific files:
- `@backend/src/utils/logger.ts` - For logging patterns
- `@backend/src/services/jobQueueService.ts` - For job processing patterns
- `@backend/src/services/llmService.ts` - For LLM API patterns
- `@backend/src/middleware/errorHandler.ts` - For error handling patterns

**@Codebase** - Semantic search (Chat only):
- Use for finding similar implementations
- Example: "How is document processing handled?" → searches entire codebase

**@Folders** - Include entire directories:
- `@backend/src/services/` - All service files
- `@backend/src/scripts/` - All debugging scripts
- `@backend/src/models/` - All database models

**@Lint Errors** - Reference current lint errors (Chat only):
- Use when fixing linting issues

**@Git** - Access git history:
- Use to see recent changes and understand context

### Key File References for Common Tasks

**Logging:**
- `backend/src/utils/logger.ts` - Winston logger and StructuredLogger class

**Job Processing:**
- `backend/src/services/jobQueueService.ts` - Job queue management
- `backend/src/services/jobProcessorService.ts` - Job execution logic

**Document Processing:**
- `backend/src/services/unifiedDocumentProcessor.ts` - Main orchestrator
- `backend/src/services/documentAiProcessor.ts` - Google Document AI integration
- `backend/src/services/optimizedAgenticRAGProcessor.ts` - AI-powered analysis

**LLM Services:**
- `backend/src/services/llmService.ts` - LLM API interactions with retry logic

**File Storage:**
- `backend/src/services/fileStorageService.ts` - GCS and Firebase Storage operations

**Database:**
- `backend/src/models/DocumentModel.ts` - Document database operations
- `backend/src/models/ProcessingJobModel.ts` - Job database operations
- `backend/src/config/supabase.ts` - Supabase client configuration

**Debugging Scripts:**
- `backend/src/scripts/` - Collection of debugging and monitoring scripts

## Debugging Scripts Usage

### When to Use Existing Scripts vs Create New Ones

**Use Existing Scripts For:**
- Monitoring document processing: `monitor-document-processing.ts`
- Checking job status: `check-current-job.ts`, `track-current-job.ts`
- Database failure checks: `check-database-failures.ts`
- System monitoring: `monitor-system.ts`
- Testing LLM pipeline: `test-full-llm-pipeline.ts`

**Create New Scripts When:**
- You need to debug a specific new issue
- Existing scripts don't cover the use case
- You are creating a one-time diagnostic tool

### Script Naming Conventions
- `check-*` - Diagnostic scripts that check status
- `monitor-*` - Continuous monitoring scripts
- `track-*` - Tracking specific operations
- `test-*` - Testing specific functionality
- `setup-*` - Setup and configuration scripts

### Common Debugging Workflows

**Debugging a Stuck Document:**
1. Use `check-new-doc-status.ts` to check document status
2. Use `check-current-job.ts` to check associated job
3. Use `monitor-document.ts` for real-time monitoring
4. Use `manually-process-job.ts` to reprocess if needed

**Debugging LLM Issues:**
1. Use `test-openrouter-simple.ts` for basic LLM connectivity
2. Use `test-full-llm-pipeline.ts` for end-to-end LLM testing
3. Use `test-llm-processing-offline.ts` for offline testing

**Debugging Database Issues:**
1. Use `check-database-failures.ts` to check for failures
2. Check SQL files in `backend/sql/` for schema fixes
3. Review `backend/src/models/` for model issues

## YOLO Mode Configuration

When using Cursor's YOLO mode, these commands are always allowed:
- Test commands: `npm test`, `vitest`, `npm run test:watch`, `npm run test:coverage`
- Build commands: `npm run build`, `tsc`, `npm run lint`
- File operations: `touch`, `mkdir`, file creation/editing
- Running debugging scripts: `ts-node backend/src/scripts/*.ts`
- Database scripts: `npm run db:*` commands

## Logging Patterns

### Winston Logger Usage

**Basic Logging:**
```typescript
import { logger } from './utils/logger';

logger.info('Operation started', { documentId, userId });
logger.error('Operation failed', { error: error.message, documentId });
```

**Structured Logger with Correlation ID:**
```typescript
import { StructuredLogger } from './utils/logger';

const structuredLogger = new StructuredLogger(correlationId);
structuredLogger.processingStart(documentId, userId, options);
structuredLogger.processingError(error, documentId, userId, 'llm_processing');
```

**Service-Specific Logging:**
- Upload operations: Use `structuredLogger.uploadStart()`, `uploadSuccess()`, `uploadError()`
- Processing operations: Use `structuredLogger.processingStart()`, `processingSuccess()`, `processingError()`
- Storage operations: Use `structuredLogger.storageOperation()`
- Job queue operations: Use `structuredLogger.jobQueueOperation()`

**Error Logging Best Practices:**
- Always include error message: `error instanceof Error ? error.message : String(error)`
- Include stack trace: `error instanceof Error ? error.stack : undefined`
- Add context: documentId, userId, jobId, operation name
- Use structured data, not string concatenation
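
These practices can be bundled into a small helper so every error log carries the same fields. This is a sketch under assumptions: `ErrorContext` and `buildErrorLogPayload` are hypothetical names, not part of `logger.ts`.

```typescript
// Hypothetical context shape; field names follow the identifiers listed above.
interface ErrorContext {
  operation: string;
  documentId?: string;
  userId?: string;
  jobId?: string;
}

interface ErrorLogPayload extends ErrorContext {
  error: string;
  stack?: string;
}

// Builds the structured object intended as the second argument to logger.error().
function buildErrorLogPayload(error: unknown, context: ErrorContext): ErrorLogPayload {
  return {
    ...context,
    error: error instanceof Error ? error.message : String(error),
    stack: error instanceof Error ? error.stack : undefined,
  };
}
```

A call site would then look like `logger.error('Document processing failed', buildErrorLogPayload(error, { operation: 'llm_processing', documentId }))`.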

## Firebase/Supabase Error Handling

### Firebase Errors
- Check `error.code` for specific error codes
- Firebase Auth errors: Handle `auth/` prefixed codes
- Firebase Storage errors: Handle `storage/` prefixed codes
- Log full error object for debugging: `logger.error('Firebase error', { error, code: error.code })`

### Supabase Errors
- Check `error.code` and `error.message`
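- RLS policy errors: Check `error.code === '42501'` (PostgreSQL insufficient privilege, the code RLS violations normally return); `PGRST301` is PostgREST's JWT verification error and points to an invalid or expired token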
- Connection errors: Implement retry logic
- Log with context: `logger.error('Supabase error', { error: error.message, code: error.code, query })`
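
Because caught errors are `unknown` in strict TypeScript, `code` and `message` have to be read defensively. The helper below is a hedged sketch with hypothetical names (`ErrorInfo`, `extractErrorInfo`, `classifyFirebaseError`); it relies only on the `code`/`message` fields and the `auth/` and `storage/` prefixes mentioned above.

```typescript
// Hypothetical helper types for illustration.
interface ErrorInfo {
  code?: string;
  message: string;
}

// Pulls `code` and `message` out of an unknown error value without assuming its class.
function extractErrorInfo(error: unknown): ErrorInfo {
  if (typeof error === 'object' && error !== null) {
    const { code, message } = error as { code?: unknown; message?: unknown };
    return {
      code: typeof code === 'string' ? code : undefined,
      message: typeof message === 'string' ? message : String(error),
    };
  }
  return { message: String(error) };
}

// Classifies Firebase errors by the code prefix.
function classifyFirebaseError(info: ErrorInfo): 'auth' | 'storage' | 'other' {
  if (info.code?.startsWith('auth/')) return 'auth';
  if (info.code?.startsWith('storage/')) return 'storage';
  return 'other';
}
```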

## Retry Patterns

### LLM API Retries (from llmService.ts)
- 3 retry attempts for API calls
- Exponential backoff between retries
- Improved prompts on validation failure
- Log each attempt with attempt number
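
A generic version of this retry loop might look like the sketch below. It is not copied from `llmService.ts`; `retryWithBackoff`, `RetryOptions`, and the `baseDelayMs` parameter are assumptions made for illustration.

```typescript
// Hypothetical options for illustration.
interface RetryOptions {
  maxAttempts: number; // 3 in the list above
  baseDelayMs: number; // assumed parameter; the delay doubles after each failure
  onAttemptFailed?: (attempt: number, error: unknown) => void;
}

const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

async function retryWithBackoff<T>(
  operation: (attempt: number) => Promise<T>,
  options: RetryOptions,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= options.maxAttempts; attempt++) {
    try {
      return await operation(attempt);
    } catch (error) {
      lastError = error;
      // Hook for logging each failed attempt with its attempt number.
      options.onAttemptFailed?.(attempt, error);
      if (attempt < options.maxAttempts) {
        // Delays grow as base, 2x base, 4x base, ...
        await sleep(options.baseDelayMs * 2 ** (attempt - 1));
      }
    }
  }
  throw lastError;
}
```

The `onAttemptFailed` hook is where a `logger.warn()` call with the attempt number would go.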

### Database Operation Retries
- Use connection pooling (handled by Supabase client)
- Retry on connection errors
- Don't retry on validation errors
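
Deciding what to retry comes down to classifying the error first. The predicate below is a sketch with assumed inputs: `isRetryableDbError` is a hypothetical name, the network codes are standard Node.js system error codes, and treating HTTP 5xx as transient is an assumption to adjust against the errors actually observed.

```typescript
// Node.js system error codes that usually indicate a transient connection problem.
const RETRYABLE_NETWORK_CODES = ['ECONNRESET', 'ETIMEDOUT', 'ECONNREFUSED', 'EAI_AGAIN'];

// Hypothetical predicate: returns true only for errors worth retrying.
function isRetryableDbError(error: { code?: string; status?: number }): boolean {
  if (error.code !== undefined && RETRYABLE_NETWORK_CODES.includes(error.code)) {
    return true;
  }
  // 5xx suggests a transient server-side problem; 4xx (validation, auth, RLS) will not
  // succeed on retry, so those fail fast.
  return error.status !== undefined && error.status >= 500;
}
```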

## Testing Guidelines

### Test Structure
- Unit tests: `backend/src/__tests__/unit/`
- Integration tests: `backend/src/__tests__/integration/`
- Test utilities: `backend/src/__tests__/utils/`
- Mocks: `backend/src/__tests__/mocks/`

### Critical Paths to Test
1. Document upload workflow
2. Authentication flow
3. Core API endpoints
4. Job processing pipeline
5. LLM service interactions

### Mocking External Services
- Firebase: Mock Firebase Admin SDK
- Supabase: Mock Supabase client
- LLM APIs: Mock HTTP responses
- Google Cloud Storage: Mock GCS client

## Performance Considerations

- Vector search operations can be slow - use timeouts
- LLM API calls are expensive - implement caching where possible
- Job processing has a 14-minute timeout limit
- Large PDFs may cause memory issues - use streaming where possible
- Database queries should use indexes (check Supabase dashboard)

## Security Best Practices

- Never log sensitive data (passwords, API keys, tokens)
- Use environment variables for all secrets (see `backend/src/config/env.ts`)
- Validate all user inputs (see `backend/src/middleware/validation.ts`)
- Use Firebase Auth for authentication - never bypass
- Respect Row Level Security (RLS) policies in Supabase
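
To reduce the risk of secrets reaching the logs, metadata can be passed through a redaction step before it is handed to the logger. The sketch below is illustrative only: `redactSensitive` and the key pattern are assumptions, and it is meant for plain JSON-like data (class instances such as `Error` or `Date` are not handled specially).

```typescript
// Assumed list of sensitive key fragments; extend it to match the project's payloads.
const SENSITIVE_KEY_PATTERN = /password|secret|token|api[-_]?key|authorization/i;

// Returns a deep copy of plain data with values under sensitive-looking keys replaced.
function redactSensitive(value: unknown): unknown {
  if (Array.isArray(value)) {
    return value.map((item) => redactSensitive(item));
  }
  if (typeof value === 'object' && value !== null) {
    const result: Record<string, unknown> = {};
    for (const [key, nested] of Object.entries(value)) {
      result[key] = SENSITIVE_KEY_PATTERN.test(key) ? '[REDACTED]' : redactSensitive(nested);
    }
    return result;
  }
  return value;
}
```

Matching on key names is a coarse filter: it will not catch a secret stored under an innocuous key, so it complements rather than replaces the rule of not logging such values in the first place.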