cim_summary/backend/TROUBLESHOOTING_PLAN.md
admin 053426c88d fix: Correct OpenRouter model IDs and add error handling
Critical fixes for LLM processing failures:
- Updated model mapping to use valid OpenRouter IDs (claude-haiku-4.5, claude-sonnet-4.5)
- Changed default models from dated versions to generic names
- Added HTTP status checking before accessing response data
- Enhanced logging for OpenRouter provider selection

Resolves "invalid model ID" errors that were causing all CIM processing to fail.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-06 20:58:26 -05:00


CIM Summary LLM Processing - Rapid Diagnostic & Fix Plan

🚨 If Processing Fails - Execute This Plan

Phase 1: Immediate Diagnosis (2-5 minutes)

Step 1.1: Check Recent Failures in Database

npx ts-node -e "
import { supabase } from './src/config/supabase';

(async () => {
  const { data } = await supabase
    .from('documents')
    .select('id, filename, status, error_message, created_at, updated_at')
    .eq('status', 'failed')
    .order('updated_at', { ascending: false })
    .limit(5);

  console.log('Recent Failures:');
  data?.forEach(d => {
    console.log(\`- \${d.filename}: \${d.error_message?.substring(0, 200)}\`);
  });
  process.exit(0);
})();
"

What to look for:

  • Repeating error patterns
  • Specific error messages (timeout, API error, invalid model, etc.)
  • Time pattern (all failures at same time = system issue)

Step 1.2: Check Real-Time Error Logs

# Check last 100 errors
tail -100 logs/error.log | grep -E "(error|ERROR|failed|FAILED|timeout|TIMEOUT)" | tail -20

# Or check specific patterns
grep -E "OpenRouter|Anthropic|LLM|model ID" logs/error.log | tail -20

What to look for:

  • "invalid model ID" → Model name issue
  • "timeout" → Timeout configuration issue
  • "rate limit" → API quota exceeded
  • "401" or "403" → Authentication issue
  • "Cannot read properties" → Code bug
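The pattern matching above can be sketched as a small classifier. This is an illustrative helper, not part of the codebase; the error-type names are this plan's own labels.

```typescript
// Hypothetical helper: map a raw error message to one of the
// error types described in Phase 2 below.
type ErrorType =
  | 'A-invalid-model'
  | 'B-timeout'
  | 'C-auth'
  | 'D-rate-limit'
  | 'E-code-bug'
  | 'unknown';

function classifyError(message: string): ErrorType {
  const m = message.toLowerCase();
  if (m.includes('not a valid model id') || m.includes('invalid model')) return 'A-invalid-model';
  if (m.includes('timeout')) return 'B-timeout';
  if (m.includes('401') || m.includes('403') || m.includes('api key')) return 'C-auth';
  if (m.includes('429') || m.includes('rate limit')) return 'D-rate-limit';
  if (m.includes('cannot read properties') || m.includes('typeerror')) return 'E-code-bug';
  return 'unknown';
}

console.log(classifyError('anthropic/claude-haiku-4 is not a valid model ID')); // A-invalid-model
```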

Step 1.3: Test LLM Directly (Fastest Check)

# This takes 30-60 seconds
npx ts-node src/scripts/test-openrouter-simple.ts 2>&1 | grep -E "(SUCCESS|FAILED|error.*model|OpenRouter API)"

Expected output if working:

✅ OpenRouter API call successful
✅ Test Result: SUCCESS

If it fails, note the EXACT error message.


Phase 2: Root Cause Identification (3-10 minutes)

Based on the error from Phase 1, jump to the appropriate section:

Error Type A: Invalid Model ID

Symptoms:

"anthropic/claude-haiku-4 is not a valid model ID"
"anthropic/claude-sonnet-4 is not a valid model ID"

Root Cause: Model name mismatch with OpenRouter API

Fix Location: backend/src/services/llmService.ts lines 526-552

Verification:

# Check what OpenRouter actually supports
curl -s "https://openrouter.ai/api/v1/models" \
  -H "Authorization: Bearer $OPENROUTER_API_KEY" | \
  python3 -m json.tool | \
  grep -A 2 "\"id\": \"anthropic" | \
  head -30

Quick Fix: Update the model mapping in llmService.ts:

// Current valid OpenRouter model IDs (as of Nov 2025):
if (model.includes('sonnet') && model.includes('4')) {
  openRouterModel = 'anthropic/claude-sonnet-4.5';
} else if (model.includes('haiku') && model.includes('4')) {
  openRouterModel = 'anthropic/claude-haiku-4.5';
}
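To guard against future ID drift, the mapped ID can be checked against the `/models` list before use. A minimal sketch, assuming the response body has a `data` array of `{ id }` objects (the endpoint and Bearer auth match the curl command above; the exact response shape is an assumption):

```typescript
// Pure check, easy to unit-test: does the list contain this model ID?
function hasModelId(models: Array<{ id: string }>, id: string): boolean {
  return models.some(m => m.id === id);
}

// Network wrapper: fetch OpenRouter's model list and verify the mapped ID.
async function modelExists(id: string, apiKey: string): Promise<boolean> {
  const res = await fetch('https://openrouter.ai/api/v1/models', {
    headers: { Authorization: `Bearer ${apiKey}` },
  });
  if (!res.ok) throw new Error(`models endpoint returned ${res.status}`);
  const body = (await res.json()) as { data?: Array<{ id: string }> };
  return hasModelId(body.data ?? [], id);
}
```

Calling `modelExists('anthropic/claude-sonnet-4.5', process.env.OPENROUTER_API_KEY!)` at startup would surface Error Type A before any document is processed.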

Error Type B: Timeout Errors

Symptoms:

"LLM call timeout after X minutes"
"Processing timeout: Document stuck"

Root Cause: Operation taking longer than configured timeout

Diagnosis:

# Check current timeout settings
grep -E "timeout|TIMEOUT" backend/src/config/env.ts | grep -v "//"
grep "timeoutMs" backend/src/services/llmService.ts | head -5

Check Locations:

  1. env.ts:319 - LLM_TIMEOUT_MS (default 180000 = 3 min)
  2. llmService.ts:343 - Wrapper timeout
  3. llmService.ts:516 - OpenRouter abort timeout

Quick Fix: Add to .env:

LLM_TIMEOUT_MS=360000  # Increase to 6 minutes

Or edit env.ts:319:

timeoutMs: parseInt(envVars['LLM_TIMEOUT_MS'] || '360000'), // 6 min
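The wrapper timeout at llmService.ts:343 likely looks something like the sketch below (an assumed helper, not the codebase's actual implementation): race the LLM promise against a timer so a hung call fails fast instead of stalling the job.

```typescript
// Sketch: wrap any promise with a timeout. The label only affects
// the error message ("LLM call timeout after ..." as seen in the logs).
function withTimeout<T>(p: Promise<T>, ms: number, label = 'LLM call'): Promise<T> {
  let timer: NodeJS.Timeout | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`${label} timeout after ${ms} ms`)), ms);
  });
  // Clear the timer whichever side wins, so no stray rejection fires later.
  return Promise.race([p, timeout]).finally(() => {
    if (timer) clearTimeout(timer);
  });
}
```

If timeouts persist after raising `LLM_TIMEOUT_MS`, check that all three locations above use the same configured value rather than hardcoded ones.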

Error Type C: Authentication/API Key Issues

Symptoms:

"401 Unauthorized"
"403 Forbidden"
"API key is missing"
"ANTHROPIC_API_KEY is not set"

Root Cause: Missing or invalid API keys

Diagnosis:

# Check which keys are set
echo "ANTHROPIC_API_KEY: ${ANTHROPIC_API_KEY:0:20}..."
echo "OPENROUTER_API_KEY: ${OPENROUTER_API_KEY:0:20}..."
echo "OPENAI_API_KEY: ${OPENAI_API_KEY:0:20}..."

# Check .env file
grep -E "ANTHROPIC|OPENROUTER|OPENAI" backend/.env | grep -v "^#"

Quick Fix: Ensure these are set in backend/.env:

ANTHROPIC_API_KEY=sk-ant-api03-...
OPENROUTER_API_KEY=sk-or-v1-...
OPENROUTER_USE_BYOK=true
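A fail-fast startup check catches this class of error before any upload. A sketch, assuming the `sk-ant-` / `sk-or-` prefixes shown above are stable conventions (they match the examples, but the providers don't guarantee them):

```typescript
// Return a list of problems for one key; empty list means it looks OK.
function checkKey(name: string, value: string | undefined, prefix: string): string[] {
  if (!value) return [`${name} is not set`];
  if (!value.startsWith(prefix)) return [`${name} does not start with "${prefix}"`];
  return [];
}

const problems = [
  ...checkKey('ANTHROPIC_API_KEY', process.env.ANTHROPIC_API_KEY, 'sk-ant-'),
  ...checkKey('OPENROUTER_API_KEY', process.env.OPENROUTER_API_KEY, 'sk-or-'),
];
if (problems.length > 0) {
  console.error('API key problems:\n' + problems.join('\n'));
}
```

Note this only validates format; a key can be well-formed and still revoked, which shows up as 401/403 in Step 1.3.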

Error Type D: Rate Limit Exceeded

Symptoms:

"429 Too Many Requests"
"rate limit exceeded"
"Retry after X seconds"

Root Cause: Too many API calls in short time

Diagnosis:

# Check recent API call frequency
grep "LLM API call" logs/testing.log | tail -20 | \
  awk '{print $1, $2}' | uniq -c

Quick Fix:

  1. Wait for rate limit to reset (check error for retry time)
  2. Add rate limiting in code:
    // In llmService.ts, add delay between retries
    await new Promise(resolve => setTimeout(resolve, 2000)); // 2 sec delay
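The fixed delay above can be extended to exponential backoff so repeated 429s don't hammer the API. The helper name and shape are illustrative, not the codebase's API:

```typescript
// Retry fn() up to maxRetries times, doubling the delay each attempt
// (2s, 4s, 8s, ... with the default base delay).
async function retryWithBackoff<T>(
  fn: () => Promise<T>,
  maxRetries = 3,
  baseDelayMs = 2000
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= maxRetries) throw err; // out of retries, surface the error
      const delay = baseDelayMs * 2 ** attempt;
      await new Promise(resolve => setTimeout(resolve, delay));
    }
  }
}
```

Ideally the delay would honor the `Retry-After` value from the 429 response when present, falling back to the computed backoff otherwise.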
    

Error Type E: Code Bugs (TypeError, Cannot read property)

Symptoms:

"Cannot read properties of undefined (reading '0')"
"TypeError: response.data is undefined"
"Unexpected token in JSON"

Root Cause: Missing null checks or incorrect data access

Diagnosis:

# Find the exact line causing the error
grep -A 5 "Cannot read properties" logs/error.log | tail -10

Quick Fix Pattern: Replace unsafe access:

// Bad:
const content = response.data.choices[0].message.content;

// Good:
const content = response.data?.choices?.[0]?.message?.content || '';

File to check: llmService.ts:696-720
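The optional-chaining fix generalizes to a small extractor that returns `null` instead of throwing when any level of the response is missing. A sketch against the OpenAI-style chat-completion shape that OpenRouter returns (the interface below is a simplification of that shape):

```typescript
// Minimal slice of a chat-completion response; every level is optional
// so a malformed or error response can't cause a TypeError.
interface ChatResponse {
  choices?: Array<{ message?: { content?: string } }>;
}

function extractContent(data: ChatResponse | undefined): string | null {
  const content = data?.choices?.[0]?.message?.content;
  return typeof content === 'string' && content.length > 0 ? content : null;
}
```

Callers then branch on `null` and log the full raw response for diagnosis, rather than crashing mid-pipeline.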


Phase 3: Systematic Testing (5-10 minutes)

After applying a fix, test in this order:

Test 1: Direct LLM Call

npx ts-node src/scripts/test-openrouter-simple.ts

Expected: Success in 30-90 seconds

Test 2: Simple RAG Processing

npx ts-node -e "
import { llmService } from './src/services/llmService';

(async () => {
  const text = 'CIM for Target Corp. Revenue: \$100M. EBITDA: \$20M.';
  const result = await llmService.processCIMDocument(text, 'BPCP Template');
  console.log('Success:', result.success);
  console.log('Has JSON:', !!result.jsonOutput);
  process.exit(result.success ? 0 : 1);
})();
"

Expected: Success with JSON output

Test 3: Full Document Upload

Use the frontend to upload a real CIM and monitor:

# In one terminal, watch logs
tail -f logs/testing.log | grep -E "(error|success|completed)"

# Check processing status
npx ts-node src/scripts/check-current-processing.ts

Phase 4: Emergency Fallback Options

If all else fails, use these fallback strategies:

Option 1: Switch to Direct Anthropic (Bypass OpenRouter)

# In .env
LLM_PROVIDER=anthropic  # Instead of openrouter

Pro: Eliminates OpenRouter as a variable
Con: Different rate limits

Option 2: Use Older Claude Model

# In .env or env.ts
LLM_MODEL=claude-3.5-sonnet
LLM_FAST_MODEL=claude-3.5-haiku

Pro: More stable, widely supported
Con: Slightly older model

Option 3: Reduce Input Size

// In optimizedAgenticRAGProcessor.ts:651
const targetTokenCount = 8000; // Down from 50000

Pro: Faster processing, less likely to timeout
Con: Less context for analysis
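If the processor truncates by characters rather than real token counts, the reduction amounts to something like the sketch below (the ~4 characters per token ratio is a common heuristic, and an assumption about how the actual processor measures size):

```typescript
// Rough truncation: ~4 characters per token, cut at the character limit.
// A real implementation would prefer cutting at a section boundary.
function truncateToTokens(text: string, targetTokenCount: number): string {
  const maxChars = targetTokenCount * 4;
  return text.length <= maxChars ? text : text.slice(0, maxChars);
}
```

Dropping `targetTokenCount` from 50000 to 8000 cuts the prompt to roughly a sixth of its size, which is why timeouts become far less likely.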


Phase 5: Preventive Monitoring

Set up these checks to catch issues early:

Daily Health Check Script

Create backend/scripts/daily-health-check.sh:

#!/bin/bash
echo "=== Daily CIM Processor Health Check ==="
echo ""

# Check for stuck documents
npx ts-node src/scripts/check-database-failures.ts

# Test LLM connectivity
npx ts-node src/scripts/test-openrouter-simple.ts

# Check recent success rate
echo "Recent processing stats (last 24 hours):"
npx ts-node -e "
import { supabase } from './src/config/supabase';
(async () => {
  const yesterday = new Date(Date.now() - 86400000).toISOString();
  const { data } = await supabase
    .from('documents')
    .select('status')
    .gte('created_at', yesterday);

  const stats = data?.reduce((acc, d) => {
    acc[d.status] = (acc[d.status] || 0) + 1;
    return acc;
  }, {} as Record<string, number>);

  console.log(stats);
  process.exit(0);
})();
"

Run daily:

chmod +x backend/scripts/daily-health-check.sh
./backend/scripts/daily-health-check.sh

📋 Quick Reference Checklist

When processing fails, check in this order:

  • Error logs (tail -100 logs/error.log)
  • Recent failures (database query in Step 1.1)
  • Direct LLM test (test-openrouter-simple.ts)
  • Model ID validity (curl OpenRouter API)
  • API keys set (check .env)
  • Timeout values (check env.ts)
  • OpenRouter vs Anthropic (which provider?)
  • Rate limits (check error for 429)
  • Code bugs (look for TypeErrors in logs)
  • Build succeeded (npm run build)

🔧 Common Fix Commands

# Rebuild after code changes
npm run build

# Clear error logs and start fresh
> logs/error.log

# Test with verbose logging
LOG_LEVEL=debug npx ts-node src/scripts/test-openrouter-simple.ts

# Check what's actually in .env
cat .env | grep -v "^#" | grep -E "LLM|ANTHROPIC|OPENROUTER"

# Verify OpenRouter models
curl -s "https://openrouter.ai/api/v1/models" -H "Authorization: Bearer $OPENROUTER_API_KEY" | python3 -m json.tool | grep "claude.*haiku\|claude.*sonnet"

📞 Escalation Path

If issue persists after 30 minutes:

  1. Check OpenRouter Status: https://status.openrouter.ai/
  2. Check Anthropic Status: https://status.anthropic.com/
  3. Review OpenRouter Docs: https://openrouter.ai/docs
  4. Test with curl: Send raw API request to isolate issue
  5. Compare git history: git diff HEAD~10 -- backend/src/services/llmService.ts

🎯 Success Criteria

Processing is "working" when:

  • Direct LLM test completes in < 2 minutes
  • Returns valid JSON matching schema
  • No errors in last 10 log entries
  • Database shows recent "completed" documents
  • Frontend can upload and process test CIM

Last Updated: 2025-11-07
Next Review: After any production deployment