cim_summary/.planning/codebase/CONCERNS.md

Codebase Concerns

Analysis Date: 2026-02-24

Tech Debt

Console.log Debug Statements in Controllers:

  • Issue: Excessive console.log() calls with emoji prefixes left throughout documentController.ts instead of using proper structured logging via Winston logger
  • Files: backend/src/controllers/documentController.ts (lines 12-80, multiple scattered instances)
  • Impact: Production logs become noisy and unstructured; debug output leaks to stdout/stderr; makes it harder to parse logs for errors and metrics
  • Fix approach: Replace all console.log() calls with logger.info(), logger.debug(), and logger.error() via the logger imported from utils/logger.ts. Follow the pattern established in other services.
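
The shape of the replacement can be sketched as below. This is a minimal stand-in for the project's utils/logger.ts (Winston in the real codebase, per the fix approach); the `LogEntry` fields and the example `docId` metadata are illustrative, not the actual logger API.

```typescript
// Minimal structured-logger sketch standing in for utils/logger.ts.
// Point: level + message + structured metadata, instead of emoji-prefixed
// console.log strings that are hard to parse downstream.
type Level = "debug" | "info" | "warn" | "error";

interface LogEntry {
  level: Level;
  message: string;
  meta?: Record<string, unknown>;
  timestamp: string;
}

function makeLogger(
  sink: (entry: LogEntry) => void = (e) => console.log(JSON.stringify(e))
) {
  const emit =
    (level: Level) =>
    (message: string, meta?: Record<string, unknown>) =>
      sink({ level, message, meta, timestamp: new Date().toISOString() });
  return { debug: emit("debug"), info: emit("info"), warn: emit("warn"), error: emit("error") };
}

// Before: console.log("📄 Processing document", docId);
// After: structured entry with metadata, captured here for demonstration.
const captured: LogEntry[] = [];
const logger = makeLogger((e) => captured.push(e));
logger.info("Processing document", { docId: "doc-123", stage: "upload" });
```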

Incomplete Job Statistics Tracking:

  • Issue: jobQueueService.ts and jobProcessorService.ts both have TODO markers indicating completed/failed job counts are not tracked (lines 606-607, 635-636)
  • Files: backend/src/services/jobQueueService.ts, backend/src/services/jobProcessorService.ts
  • Impact: Job queue health metrics are incomplete; cannot audit success/failure rates; monitoring dashboards will show incomplete data
  • Fix approach: Implement completedJobs and failedJobs counters in both services using persistent storage or Redis. Update schema if needed.
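
A minimal sketch of the counters, in-memory for illustration only; the real fix would back `recordCompletion`/`recordFailure` with Redis INCR or a database column so counts survive restarts, as the fix approach notes. Names here are assumptions, not the services' actual API.

```typescript
// Sketch of completed/failed job counters for jobQueueService.ts and
// jobProcessorService.ts. In-memory here; persist in Redis or the DB in practice.
interface JobStats {
  completedJobs: number;
  failedJobs: number;
}

class JobStatsTracker {
  private stats: JobStats = { completedJobs: 0, failedJobs: 0 };

  recordCompletion(): void { this.stats.completedJobs += 1; }
  recordFailure(): void { this.stats.failedJobs += 1; }

  // Success rate over all terminal jobs; undefined until at least one finishes.
  successRate(): number | undefined {
    const total = this.stats.completedJobs + this.stats.failedJobs;
    return total === 0 ? undefined : this.stats.completedJobs / total;
  }

  snapshot(): JobStats { return { ...this.stats }; }
}

const tracker = new JobStatsTracker();
tracker.recordCompletion();
tracker.recordCompletion();
tracker.recordFailure();
```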

Config Migration Debug Cruft:

  • Issue: Multiple console.log() debug statements in config/env.ts (lines 23, 46, 51, 292) for Firebase Functions v1→v2 migration are still present
  • Files: backend/src/config/env.ts
  • Impact: Production logs polluted with migration warnings; makes it harder to spot real issues; clutters server startup output
  • Fix approach: Remove all [CONFIG DEBUG] console.log statements once migration to Firebase Functions v2 is confirmed complete. Wrap remaining fallback logic in logger.debug() if diagnostics needed.

Hardcoded Processing Strategy:

  • Issue: A historical commit shows the processing strategy was hardcoded; the subsequent refactoring may be incomplete
  • Files: backend/src/services/, controller logic
  • Impact: May not correctly use configured strategy; processing may default unexpectedly
  • Fix approach: Verify all processing paths read from config.processingStrategy and have proper fallback logic

Type Safety Issues - any Type Usage:

  • Issue: 378 instances of any or unknown types found across backend TypeScript files
  • Files: Widespread including optimizedAgenticRAGProcessor.ts:17, pdfGenerationService.ts, vectorDatabaseService.ts
  • Impact: Loses type safety guarantees; harder to catch errors at compile time; refactoring becomes risky
  • Fix approach: Gradually replace any with proper types. Start with service boundaries and public APIs. Create typed interfaces for common patterns.
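
One way the "typed interfaces at service boundaries" step can look, sketched below. `ChunkEmbedding` and `storeChunk` are hypothetical names for illustration, not types from the codebase; the pattern is an interface plus a runtime guard that narrows `unknown` (preferred over `any` at boundaries).

```typescript
// Sketch of replacing `any` at a service boundary with a typed interface
// and a runtime type guard.
interface ChunkEmbedding {
  documentId: string;
  chunkIndex: number;
  embedding: number[];
}

// Guard narrows `unknown` to the interface; callers validate once at the edge.
function isChunkEmbedding(value: unknown): value is ChunkEmbedding {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return (
    typeof v.documentId === "string" &&
    typeof v.chunkIndex === "number" &&
    Array.isArray(v.embedding) &&
    v.embedding.every((n) => typeof n === "number")
  );
}

// Before: function storeChunk(chunk: any) { ... }
// After: the compiler enforces the shape everywhere inside the service.
function storeChunk(chunk: ChunkEmbedding): string {
  return `${chunk.documentId}:${chunk.chunkIndex}`;
}
```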

Known Bugs

Project Panther CIM KPI Missing After Processing:

  • Symptoms: Document Project Panther - Confidential Information Memorandum_vBluePoint.pdf processed but dashboard shows "Not specified in CIM" for Revenue, EBITDA, Employees, Founded even though numeric tables exist in PDF
  • Files: backend/src/services/optimizedAgenticRAGProcessor.ts (dealOverview mapper), processing pipeline
  • Trigger: Process Project Panther test document through full agentic RAG pipeline
  • Impact: Dashboard KPI cards remain empty; users see incomplete summaries
  • Workaround: Manual data entry in dashboard; skip financial summary display for affected documents
  • Fix approach: Trace through optimizedAgenticRAGProcessor.generateLLMAnalysisMultiPass() and the dealOverview mapper. Add a regression test for this specific document. Check whether structured table extraction is working correctly.

10+ Minute Processing Latency Regression:

  • Symptoms: Document document-55c4a6e2-8c08-4734-87f6-24407cea50ac.pdf (Project Panther) took ~10 minutes end-to-end despite typical processing being 2-3 minutes
  • Files: backend/src/services/unifiedDocumentProcessor.ts, optimizedAgenticRAGProcessor.ts, documentAiProcessor.ts, llmService.ts
  • Trigger: Large or complex CIM documents (30+ pages with tables)
  • Impact: Users experience timeouts; processing approaching or exceeding 14-minute Firebase Functions limit
  • Workaround: None currently; document fails to process if latency exceeds timeout
  • Fix approach: Instrument each pipeline phase (PDF chunking, Document AI extraction, RAG passes, financial parser) with timing logs. Identify bottleneck(s). Profile GCS upload retries, Anthropic fallbacks. Consider parallel multi-pass queries within quota limits.
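
The per-phase instrumentation can be as small as a start/end timer placed around each existing `await`, sketched below. Phase names are illustrative; in the real pipeline the marks would wrap the existing chunking, Document AI, RAG-pass, and parser calls.

```typescript
// Sketch of per-phase timing instrumentation for the processing pipeline.
interface PhaseTiming { phase: string; durationMs: number; }

class PipelineTimer {
  private marks = new Map<string, number>();
  readonly timings: PhaseTiming[] = [];

  start(phase: string): void { this.marks.set(phase, Date.now()); }

  end(phase: string): void {
    const t0 = this.marks.get(phase);
    if (t0 === undefined) throw new Error(`phase not started: ${phase}`);
    this.timings.push({ phase, durationMs: Date.now() - t0 });
    this.marks.delete(phase);
  }

  // One log line per document: "pdf_chunking=812ms document_ai=42013ms ..."
  report(): string {
    return this.timings.map((t) => `${t.phase}=${t.durationMs}ms`).join(" ");
  }
}

const timer = new PipelineTimer();
timer.start("pdf_chunking");
// ... await chunkPdf(doc) in the real pipeline ...
timer.end("pdf_chunking");
```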

Vector Search Timeouts After Index Growth:

  • Symptoms: Supabase vector search RPC calls timeout after 30 seconds; fallback to document-scoped search with limited results
  • Files: backend/src/services/vectorDatabaseService.ts (lines 122-182)
  • Trigger: Large embedded document collections (1000+ chunks); similarity search under load
  • Impact: Retrieval quality degrades as index grows; fallback search returns fewer contextual chunks; RAG quality suffers
  • Workaround: Fallback query uses document-scoped filtering and direct embedding lookup
  • Fix approach: Implement query batching, result caching by content hash, or query optimization. Consider Pinecone migration if Supabase vector performance doesn't improve. Add metrics to track timeout frequency.
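
The "result caching by content hash" option can be sketched as below: identical query text hashes to the same key, so repeated searches skip the Supabase RPC entirely. TTL and eviction are deliberately simplified; the class and its defaults are assumptions, not the service's existing code.

```typescript
import { createHash } from "node:crypto";

// Sketch of caching vector-search results by content hash of the query text.
class SearchResultCache<T> {
  private store = new Map<string, { value: T; expiresAt: number }>();

  constructor(private ttlMs: number = 5 * 60 * 1000) {}

  private key(queryText: string): string {
    return createHash("sha256").update(queryText).digest("hex");
  }

  get(queryText: string): T | undefined {
    const k = this.key(queryText);
    const entry = this.store.get(k);
    if (!entry) return undefined;
    if (Date.now() > entry.expiresAt) {
      this.store.delete(k); // expired: fall through to a fresh RPC search
      return undefined;
    }
    return entry.value;
  }

  set(queryText: string, value: T): void {
    this.store.set(this.key(queryText), { value, expiresAt: Date.now() + this.ttlMs });
  }
}

const cache = new SearchResultCache<string[]>();
cache.set("revenue for FY2024", ["chunk-12", "chunk-40"]);
```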

Security Considerations

Unencrypted Debug Logs in Production:

  • Risk: Sensitive document content, user IDs, and processing details may be exposed in logs if debug mode enabled in production
  • Files: backend/src/middleware/firebaseAuth.ts (AUTH_DEBUG flag), backend/src/config/env.ts, backend/src/controllers/documentController.ts
  • Current mitigation: Debug logging controlled by AUTH_DEBUG environment variable; not enabled by default
  • Recommendations:
    1. Ensure AUTH_DEBUG is never set to true in production
    2. Implement log redaction middleware to strip PII (API keys, document content, user data)
    3. Use correlation IDs instead of logging full request bodies
    4. Add log level enforcement (error/warn only in production)
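
Recommendation 2 (log redaction) can be sketched as a recursive key-based scrubber. The sensitive-key list here is illustrative; a real middleware would use a vetted list and run inside the Winston format chain before any transport writes the entry.

```typescript
// Sketch of a log-redaction helper that strips likely PII / secrets from a
// log entry's metadata before it is written.
const SENSITIVE_KEYS = new Set(["apiKey", "authorization", "documentContent", "email"]);

function redact(value: unknown): unknown {
  if (Array.isArray(value)) return value.map(redact);
  if (typeof value === "object" && value !== null) {
    const out: Record<string, unknown> = {};
    for (const [k, v] of Object.entries(value)) {
      out[k] = SENSITIVE_KEYS.has(k) ? "[REDACTED]" : redact(v);
    }
    return out;
  }
  return value; // primitives pass through unchanged
}

const entry = redact({
  userId: "u-1",
  apiKey: "sk-secret",
  nested: { documentContent: "full CIM text...", pageCount: 42 },
});
```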

Hardcoded Service Account Credentials Path:

  • Risk: If service account key JSON is accidentally committed or exposed, attacker gains full GCS and Document AI access
  • Files: backend/src/config/env.ts, backend/src/utils/googleServiceAccount.ts
  • Current mitigation: .env file in .gitignore; credentials path via env var
  • Recommendations:
    1. Use Firebase Function secrets (defineSecret()) instead of env files
    2. Implement credential rotation policy
    3. Add pre-commit hook to prevent .json key files in commits
    4. Audit GCS bucket permissions quarterly

Concurrent LLM Rate Limiting Insufficient:

  • Risk: Although llmService.ts limits concurrent calls to 1 (line 52), burst requests could still trigger Anthropic 429 rate limit errors during high load
  • Files: backend/src/services/llmService.ts (MAX_CONCURRENT_LLM_CALLS = 1)
  • Current mitigation: Max 1 concurrent call; retry with exponential backoff (3 attempts)
  • Recommendations:
    1. Consider queuing requests with enforced spacing between calls during peak hours, rather than only capping concurrency
    2. Add request batching for multi-pass analysis
    3. Implement circuit breaker pattern for cascading failures
    4. Monitor token spend and throttle proactively

No Request Rate Limiting on Upload Endpoint:

  • Risk: Unauthenticated attackers could flood /upload/url endpoint to exhaust quota or fill storage
  • Files: backend/src/controllers/documentController.ts (getUploadUrl endpoint), backend/src/routes/documents.ts
  • Current mitigation: Firebase Auth check; file size limit enforced
  • Recommendations:
    1. Add rate limiter middleware (e.g., express-rate-limit) with per-user quotas
    2. Implement request signing for upload URLs
    3. Add CORS restrictions to known frontend domains
    4. Monitor upload rate and alert on anomalies
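
Recommendation 1's core logic, sketched as a per-user fixed-window limiter. In practice express-rate-limit (as suggested above) would likely be used instead; the class name and limits below are illustrative.

```typescript
// Sketch of a per-user fixed-window rate limiter for the upload-URL endpoint.
class UploadRateLimiter {
  private windows = new Map<string, { windowStart: number; count: number }>();

  constructor(private maxPerWindow: number, private windowMs: number) {}

  // Returns false when the user has exhausted the window; caller responds 429.
  allow(userId: string, now: number = Date.now()): boolean {
    const w = this.windows.get(userId);
    if (!w || now - w.windowStart >= this.windowMs) {
      this.windows.set(userId, { windowStart: now, count: 1 }); // new window
      return true;
    }
    if (w.count < this.maxPerWindow) {
      w.count += 1;
      return true;
    }
    return false;
  }
}

// 3 upload-URL requests per user per minute (illustrative quota).
const limiter = new UploadRateLimiter(3, 60_000);
const results = [1, 2, 3, 4].map(() => limiter.allow("user-a", 1_000));
```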

Performance Bottlenecks

Large File PDF Chunking Memory Usage:

  • Problem: Documents larger than 50 MB may cause OOM errors during chunking; no memory limit guards
  • Files: backend/src/services/optimizedAgenticRAGProcessor.ts (line 35, 4000-char chunks), backend/src/services/unifiedDocumentProcessor.ts
  • Cause: Entire document text loaded into memory before chunking; large overlap between chunks multiplies footprint
  • Improvement path:
    1. Implement streaming chunk processing from GCS (read chunks, embed, write to DB before next chunk)
    2. Reduce overlap from 200 to 100 characters or make dynamic based on document size
    3. Add memory threshold checks; fail early with user-friendly error if approaching limit
    4. Profile heap usage in tests with 50+ MB documents
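
Step 1's streaming idea can be sketched with a generator: chunks are produced one at a time, so only the current chunk (plus the source text) is live while it is embedded and written. Sizes mirror the 4000-char chunks and 200-char overlap described above; `embedAndStore` in the comment is hypothetical.

```typescript
// Sketch of streaming chunk production: yield one chunk at a time instead of
// materializing the whole chunk array before embedding.
function* chunkText(text: string, chunkSize = 4000, overlap = 200): Generator<string> {
  if (overlap >= chunkSize) throw new Error("overlap must be smaller than chunkSize");
  let start = 0;
  while (start < text.length) {
    yield text.slice(start, start + chunkSize);
    start += chunkSize - overlap; // step forward, keeping `overlap` chars of context
  }
}

// In the real pipeline each yielded chunk would be embedded and persisted
// before the next one is produced:
//   for (const chunk of chunkText(docText)) { await embedAndStore(chunk); }
const chunks = [...chunkText("a".repeat(10_000), 4000, 200)];
```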

Embedding Generation for Large Documents:

  • Problem: Embedding 1000+ chunks sequentially takes 2-3 minutes; no concurrency despite the maxConcurrentEmbeddings = 5 setting
  • Files: backend/src/services/optimizedAgenticRAGProcessor.ts (lines 37, 172-180 region)
  • Cause: Batch size of 10 may be inefficient; OpenAI/Anthropic API concurrency not fully utilized
  • Improvement path:
    1. Increase batch size to 25-50 chunks per concurrent request (test quota limits)
    2. Use Promise.all() instead of sequential embedding calls
    3. Cache embeddings by content hash to skip re-embedding on retries
    4. Add progress callback to track batch completion
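
Steps 1-2 combined, sketched below: split chunks into larger batches and dispatch them with Promise.all(). `embedBatch` is a stand-in for the real embedding call, and a semaphore capping in-flight batches at maxConcurrentEmbeddings is omitted for brevity.

```typescript
// Sketch of batched, concurrent embedding generation.
function toBatches<T>(items: T[], batchSize: number): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize));
  }
  return batches;
}

async function embedAll(
  chunks: string[],
  embedBatch: (batch: string[]) => Promise<number[][]>,
  batchSize = 25
): Promise<number[][]> {
  // All batches dispatched concurrently instead of awaiting each in sequence.
  // A real implementation would cap in-flight batches at maxConcurrentEmbeddings.
  const results = await Promise.all(toBatches(chunks, batchSize).map(embedBatch));
  return results.flat();
}

const batches = toBatches(Array.from({ length: 60 }, (_, i) => `chunk-${i}`), 25);
```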

Multiple LLM Retries on Network Failure:

  • Problem: 3 retry attempts for each LLM call with exponential backoff means up to 30+ seconds per call; multi-pass analysis does 3+ passes
  • Files: backend/src/services/llmService.ts (retry logic, lines 320+), backend/src/services/optimizedAgenticRAGProcessor.ts (line 83 multi-pass)
  • Cause: No circuit breaker; all retries execute even if service degraded
  • Improvement path:
    1. Track consecutive failures; disable retries if failure rate >50% in last minute
    2. Use adaptive retry backoff (double wait time only after first failure)
    3. Implement multi-pass fallback: if Pass 2 fails, use Pass 1 results instead of failing entire document
    4. Add metrics endpoint to show retry frequency and success rates
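
Step 1's failure tracking is essentially a circuit breaker; a minimal sketch follows. Thresholds and cooldown are illustrative, and the half-open probe behavior is simplified compared to a production breaker.

```typescript
// Sketch of a circuit breaker for llmService.ts: after enough consecutive
// failures, reject calls immediately until a cooldown elapses, instead of
// burning 30+ seconds of retries on every call while the provider is degraded.
class CircuitBreaker {
  private consecutiveFailures = 0;
  private openedAt: number | null = null;

  constructor(private failureThreshold = 3, private cooldownMs = 60_000) {}

  canAttempt(now: number = Date.now()): boolean {
    if (this.openedAt === null) return true;
    if (now - this.openedAt >= this.cooldownMs) {
      // Half-open: cooldown elapsed, allow a probe request through.
      this.openedAt = null;
      this.consecutiveFailures = 0;
      return true;
    }
    return false; // circuit open: fail fast, skip retries entirely
  }

  recordSuccess(): void {
    this.consecutiveFailures = 0;
    this.openedAt = null;
  }

  recordFailure(now: number = Date.now()): void {
    this.consecutiveFailures += 1;
    if (this.consecutiveFailures >= this.failureThreshold) this.openedAt = now;
  }
}

const breaker = new CircuitBreaker(3, 60_000);
breaker.recordFailure(0);
breaker.recordFailure(0);
breaker.recordFailure(0); // third consecutive failure opens the circuit
```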

PDF Generation Memory Leak with Puppeteer Page Pool:

  • Problem: Page pool in pdfGenerationService.ts may not properly release browser resources; max pool size 5 but no eviction policy
  • Files: backend/src/services/pdfGenerationService.ts (lines 66-71, page pool)
  • Cause: Pages may not be closed if PDF generation errors mid-stream; no cleanup on timeout
  • Improvement path:
    1. Implement LRU eviction: close oldest page if pool reaches max size
    2. Add page timeout with forced close after 30s
    3. Add memory monitoring; close all pages if heap >500MB
    4. Log page pool stats every 5 minutes to detect leaks

Fragile Areas

Job Queue State Machine:

  • Files: backend/src/services/jobQueueService.ts, backend/src/services/jobProcessorService.ts, backend/src/models/ProcessingJobModel.ts
  • Why fragile:
    1. Job status transitions (pending → processing → completed) not atomic; race condition if two workers pick same job
    2. Stuck job detection relies on timestamp comparison; clock skew or server restart breaks detection
    3. No idempotency tokens; job retry on network error could trigger duplicate processing
  • Safe modification:
    1. Add database-level unique constraint on job ID + processing timestamp
    2. Use database transactions for status updates
    3. Implement idempotency with request deduplication ID
  • Test coverage:
    1. No unit tests found for concurrent job processing scenario
    2. No integration tests with actual database
    3. Add tests for: concurrent workers, stuck job reset, duplicate submissions
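
The atomic-claim idea from "Safe modification" can be sketched as a compare-and-set. In the real services this would be a single SQL statement (an `UPDATE ... WHERE status = 'pending' ... RETURNING` inside a transaction); the in-memory version below models the same semantics with hypothetical names.

```typescript
// Sketch of an atomic job claim so two workers cannot pick the same job.
// Real equivalent (one statement, so the check and the update are atomic):
//   UPDATE processing_jobs SET status = 'processing', worker_id = $1
//   WHERE id = $2 AND status = 'pending' RETURNING id;
type JobStatus = "pending" | "processing" | "completed" | "failed";

interface Job {
  id: string;
  status: JobStatus;
  workerId?: string;
}

function claimJob(jobs: Map<string, Job>, jobId: string, workerId: string): boolean {
  const job = jobs.get(jobId);
  if (!job || job.status !== "pending") return false; // another worker won the race
  job.status = "processing";
  job.workerId = workerId;
  return true;
}

const jobs = new Map<string, Job>([["job-1", { id: "job-1", status: "pending" }]]);
const first = claimJob(jobs, "job-1", "worker-a");
const second = claimJob(jobs, "job-1", "worker-b");
```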

Document Processing Pipeline Error Handling:

  • Files: backend/src/controllers/documentController.ts (lines 200+), backend/src/services/unifiedDocumentProcessor.ts
  • Why fragile:
    1. Hybrid approach tries job queue then fallback to immediate processing; error in job queue doesn't fully propagate
    2. Document status not updated if processing fails mid-pipeline (remains 'processing_llm')
    3. No compensating transaction to roll back partial results
  • Safe modification:
    1. Separate job submission from immediate processing; always update document status atomically
    2. Add processing stage tracking (document_ai → chunking → embedding → llm → pdf)
    3. Implement rollback logic: delete chunks and embeddings if LLM stage fails
  • Test coverage:
    1. Add tests for each pipeline stage failure
    2. Test document status consistency after each failure
    3. Add integration test with network failure injection

Vector Database Search Fallback Chain:

  • Files: backend/src/services/vectorDatabaseService.ts (lines 110-182)
  • Why fragile:
    1. Three-level fallback (RPC search → document-scoped search → direct lookup) masks underlying issues
    2. If Supabase RPC is degraded, system degrades silently instead of alerting
    3. Fallback search may return stale or incorrect results without indication
  • Safe modification:
    1. Add circuit breaker: if timeout happens 3x in 5 minutes, stop trying RPC search
    2. Return metadata flag indicating which fallback was used (for logging/debugging)
    3. Add explicit timeout wrapped in try/catch, not via Promise.race() (cleaner code)
  • Test coverage:
    1. Mock Supabase timeout at each RPC level
    2. Verify correct fallback is triggered
    3. Add performance benchmarks for each search method

Config Initialization Race Condition:

  • Files: backend/src/config/env.ts (lines 15-52)
  • Why fragile:
    1. Firebase Functions v1 fallback (functions.config()) may not be thread-safe
    2. If multiple instances start simultaneously, config merge may be incomplete
    3. No validation that config merge was successful
  • Safe modification:
    1. Remove v1 fallback entirely; require explicit Firebase Functions v2 setup
    2. Validate all critical env vars before allowing service startup
    3. Fail fast with clear error message if required vars missing
  • Test coverage:
    1. Add test for missing required env vars
    2. Test with incomplete config to verify error message clarity
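
Safe-modification steps 2-3 (validate critical vars, fail fast with a clear message) can be sketched as below. The variable names are illustrative assumptions; the real list would come from the actual config schema in config/env.ts.

```typescript
// Sketch of fail-fast env validation for config/env.ts: collect every missing
// required variable so the startup error names all of them at once.
function validateEnv(
  env: Record<string, string | undefined>,
  required: string[]
): string[] {
  return required.filter((name) => {
    const value = env[name];
    return value === undefined || value.trim() === "";
  });
}

const REQUIRED_VARS = ["SUPABASE_URL", "SUPABASE_KEY", "GCS_BUCKET"];
const missing = validateEnv({ SUPABASE_URL: "https://example.supabase.co" }, REQUIRED_VARS);
// At startup:
//   if (missing.length > 0) {
//     throw new Error(`Missing required env vars: ${missing.join(", ")}`);
//   }
```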

Scaling Limits

Supabase Concurrent Vector Search Connections:

  • Current capacity: RPC timeout 30 seconds; Supabase connection pool typically 100 max
  • Limit: With 3 concurrent workers × multiple users, could exhaust connection pool during peak load
  • Scaling path:
    1. Implement connection pooling via PgBouncer (already in Supabase Pro tier)
    2. Reduce timeout from 30s to 10s; fail faster and retry
    3. Migrate to Pinecone if vector search becomes >30% of workload

Firebase Functions Timeout (14 minutes):

  • Current capacity: Serverless function execution hard-capped at 15 minutes; the pipeline budgets 14 minutes (a 1-minute buffer before the hard timeout)
  • Limit: Document processing hitting ~10 minutes; adding new features could exceed limit
  • Scaling path:
    1. Move processing to Cloud Run (1 hour limit) for large documents
    2. Implement processing timeout failover: if processing approaches 12 minutes, checkpoint and requeue
    3. Add background worker pool for long-running jobs (separate from request path)

LLM API Rate Limits (Anthropic/OpenAI):

  • Current capacity: 1 concurrent call; 3 retries per call; no per-minute or per-second throttling beyond single-call serialization
  • Limit: Burst requests from multiple users could trigger 429 rate limit errors
  • Scaling path:
    1. Negotiate higher rate limits with API providers
    2. Implement request queuing with exponential backoff per user
    3. Add cost monitoring and soft-limit alerts (warn at 80% of quota)

PDF Generation Browser Pool:

  • Current capacity: 5 browser pages maximum
  • Limit: With 3+ concurrent document processing jobs, pool contention causes delays (queue wait time)
  • Scaling path:
    1. Increase pool size to 10 (requires more memory)
    2. Move PDF generation to separate worker queue (decouple from request path)
    3. Implement adaptive pool sizing based on available memory

GCS Upload/Download Throughput:

  • Current capacity: Single-threaded upload/download; file transfer waits on GCS API latency
  • Limit: Large documents (50+ MB) may timeout or be slow
  • Scaling path:
    1. Implement resumable uploads with multi-part chunks
    2. Add parallel chunk uploads for files >10 MB
    3. Cache frequently accessed documents in Redis

Dependencies at Risk

Firebase Functions v1 Deprecation (EOL Dec 31, 2025):

  • Risk: Runtime will be decommissioned; Node.js 20 support ending Oct 30, 2026 (warning already surfaced)
  • Impact: Functions will stop working after deprecation date; forced migration required
  • Migration plan:
    1. Migrate to Firebase Functions v2 runtime (already partially done; fallback code still present)
    2. Update firebase-functions package to latest major version
    3. Remove deprecated functions.config() fallback once migration confirmed
    4. Test all functions after upgrade

Puppeteer Version Pinning:

  • Risk: Puppeteer has frequent security updates; pinned version likely outdated
  • Impact: Browser vulnerabilities in PDF generation; potential sandbox bypass
  • Migration plan:
    1. Audit current Puppeteer version in package.json
    2. Test upgrade path (may have breaking API changes)
    3. Implement automated dependency security scanning

Document AI API Versioning:

  • Risk: Google Cloud Document AI API may deprecate current processor version
  • Impact: Processing pipeline breaks if processor ID no longer valid
  • Migration plan:
    1. Document current processor version and creation date
    2. Subscribe to Google Cloud deprecation notices
    3. Add feature flag to switch processor versions
    4. Test new processor version before migration

Missing Critical Features

Job Processing Observability:

  • Problem: No metrics for job success rate, average processing time per stage, or failure breakdown by error type
  • Blocks: Cannot diagnose performance regressions; cannot identify bottlenecks
  • Implementation: Add /health/agentic-rag endpoint exposing per-pass timing, token usage, cost data

Document Version History:

  • Problem: Processing pipeline overwrites analysis_data on each run; no ability to compare old vs. new results
  • Blocks: Cannot detect if new model version improves accuracy; hard to debug regression
  • Implementation: Add document_versions table; keep historical results; implement diff UI

Retry Mechanism for Failed Documents:

  • Problem: Failed documents stay in failed state; no way to retry after infrastructure recovers
  • Blocks: User must re-upload document; processing failures are permanent per upload
  • Implementation: Add "Retry" button to failed document status; re-queue without user re-upload

Test Coverage Gaps

End-to-End Pipeline with Large Documents:

  • What's not tested: The full processing pipeline with 50+ MB documents, covering PDF chunking, Document AI extraction, embeddings, LLM analysis, and PDF generation
  • Files: No integration test covering full flow with large fixture
  • Risk: Cannot detect if scaling to large documents introduces timeouts or memory issues
  • Priority: High (Project Panther regression was not caught by tests)

Concurrent Job Processing:

  • What's not tested: Multiple jobs submitted simultaneously; verify no race conditions in job queue or database
  • Files: backend/src/services/jobQueueService.ts, backend/src/models/ProcessingJobModel.ts
  • Risk: Race condition causes duplicate processing or lost job state in production
  • Priority: High (affects reliability)

Vector Database Fallback Scenarios:

  • What's not tested: Simulate Supabase RPC timeout and verify correct fallback search is executed
  • Files: backend/src/services/vectorDatabaseService.ts (lines 110-182)
  • Risk: Fallback search silent failures or incorrect results not detected
  • Priority: Medium (affects search quality)

LLM API Provider Switching:

  • What's not tested: Switch between Anthropic, OpenAI, OpenRouter; verify each provider works correctly
  • Files: backend/src/services/llmService.ts (provider selection logic)
  • Risk: Provider-specific bugs not caught until production usage
  • Priority: Medium (currently only Anthropic heavily used)

Error Propagation in Hybrid Processing:

  • What's not tested: Job queue failure → immediate processing fallback; verify document status and error reporting
  • Files: backend/src/controllers/documentController.ts (lines 200+)
  • Risk: Silent failures or incorrect status updates if fallback error not properly handled
  • Priority: High (affects user experience)

Concerns audit: 2026-02-24