cim_summary/backend/TROUBLESHOOTING_PLAN.md
admin 053426c88d fix: Correct OpenRouter model IDs and add error handling
Critical fixes for LLM processing failures:
- Updated model mapping to use valid OpenRouter IDs (claude-haiku-4.5, claude-sonnet-4.5)
- Changed default models from dated versions to generic names
- Added HTTP status checking before accessing response data
- Enhanced logging for OpenRouter provider selection

Resolves "invalid model ID" errors that were causing all CIM processing to fail.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-06 20:58:26 -05:00


CIM Summary LLM Processing - Rapid Diagnostic & Fix Plan

🚨 If Processing Fails - Execute This Plan

Phase 1: Immediate Diagnosis (2-5 minutes)

Step 1.1: Check Recent Failures in Database

npx ts-node -e "
import { supabase } from './src/config/supabase';

(async () => {
  const { data } = await supabase
    .from('documents')
    .select('id, filename, status, error_message, created_at, updated_at')
    .eq('status', 'failed')
    .order('updated_at', { ascending: false })
    .limit(5);

  console.log('Recent Failures:');
  data?.forEach(d => {
    console.log(\`- \${d.filename}: \${d.error_message?.substring(0, 200)}\`);
  });
  process.exit(0);
})();
"

What to look for:

  • Repeating error patterns
  • Specific error messages (timeout, API error, invalid model, etc.)
  • Time pattern (all failures at same time = system issue)

Step 1.2: Check Real-Time Error Logs

# Check last 100 errors
tail -100 logs/error.log | grep -E "(error|ERROR|failed|FAILED|timeout|TIMEOUT)" | tail -20

# Or check specific patterns
grep -E "OpenRouter|Anthropic|LLM|model ID" logs/error.log | tail -20

What to look for:

  • "invalid model ID" → Model name issue
  • "timeout" → Timeout configuration issue
  • "rate limit" → API quota exceeded
  • "401" or "403" → Authentication issue
  • "Cannot read properties" → Code bug
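The pattern matching above can be sketched as a small classifier. This is an illustrative helper, not part of the codebase; the error-type names are this plan's own labels.

```typescript
// Hypothetical helper: map a raw error message to one of the
// error types described in Phase 2 below.
type ErrorType =
  | 'A-invalid-model'
  | 'B-timeout'
  | 'C-auth'
  | 'D-rate-limit'
  | 'E-code-bug'
  | 'unknown';

function classifyError(message: string): ErrorType {
  const m = message.toLowerCase();
  if (m.includes('not a valid model id') || m.includes('invalid model')) return 'A-invalid-model';
  if (m.includes('timeout')) return 'B-timeout';
  if (m.includes('401') || m.includes('403') || m.includes('api key')) return 'C-auth';
  if (m.includes('429') || m.includes('rate limit')) return 'D-rate-limit';
  if (m.includes('cannot read properties') || m.includes('typeerror')) return 'E-code-bug';
  return 'unknown';
}

console.log(classifyError('anthropic/claude-haiku-4 is not a valid model ID')); // A-invalid-model
```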

Step 1.3: Test LLM Directly (Fastest Check)

# This takes 30-60 seconds
npx ts-node src/scripts/test-openrouter-simple.ts 2>&1 | grep -E "(SUCCESS|FAILED|error.*model|OpenRouter API)"

Expected output if working:

✅ OpenRouter API call successful
✅ Test Result: SUCCESS

If it fails, note the EXACT error message.


Phase 2: Root Cause Identification (3-10 minutes)

Based on the error from Phase 1, jump to the appropriate section:

Error Type A: Invalid Model ID

Symptoms:

"anthropic/claude-haiku-4 is not a valid model ID"
"anthropic/claude-sonnet-4 is not a valid model ID"

Root Cause: Model name mismatch with OpenRouter API

Fix Location: backend/src/services/llmService.ts lines 526-552

Verification:

# Check what OpenRouter actually supports
curl -s "https://openrouter.ai/api/v1/models" \
  -H "Authorization: Bearer $OPENROUTER_API_KEY" | \
  python3 -m json.tool | \
  grep -A 2 "\"id\": \"anthropic" | \
  head -30

Quick Fix: Update the model mapping in llmService.ts:

// Current valid OpenRouter model IDs (as of Nov 2025):
if (model.includes('sonnet') && model.includes('4')) {
  openRouterModel = 'anthropic/claude-sonnet-4.5';
} else if (model.includes('haiku') && model.includes('4')) {
  openRouterModel = 'anthropic/claude-haiku-4.5';
}
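To guard against future ID drift, the mapped ID can be checked against the `/models` list before use. A minimal sketch, assuming the response body has a `data` array of `{ id }` objects (the endpoint and Bearer auth match the curl command above; the exact response shape is an assumption):

```typescript
// Pure check, easy to unit-test: does the list contain this model ID?
function hasModelId(models: Array<{ id: string }>, id: string): boolean {
  return models.some(m => m.id === id);
}

// Network wrapper: fetch OpenRouter's model list and verify the mapped ID.
async function modelExists(id: string, apiKey: string): Promise<boolean> {
  const res = await fetch('https://openrouter.ai/api/v1/models', {
    headers: { Authorization: `Bearer ${apiKey}` },
  });
  if (!res.ok) throw new Error(`models endpoint returned ${res.status}`);
  const body = (await res.json()) as { data?: Array<{ id: string }> };
  return hasModelId(body.data ?? [], id);
}
```

Calling `modelExists('anthropic/claude-sonnet-4.5', process.env.OPENROUTER_API_KEY!)` at startup would surface Error Type A before any document is processed.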

Error Type B: Timeout Errors

Symptoms:

"LLM call timeout after X minutes"
"Processing timeout: Document stuck"

Root Cause: Operation taking longer than configured timeout

Diagnosis:

# Check current timeout settings
grep -E "timeout|TIMEOUT" backend/src/config/env.ts | grep -v "//"
grep "timeoutMs" backend/src/services/llmService.ts | head -5

Check Locations:

  1. env.ts:319 - LLM_TIMEOUT_MS (default 180000 = 3 min)
  2. llmService.ts:343 - Wrapper timeout
  3. llmService.ts:516 - OpenRouter abort timeout

Quick Fix: Add to .env:

LLM_TIMEOUT_MS=360000  # Increase to 6 minutes

Or edit env.ts:319:

timeoutMs: parseInt(envVars['LLM_TIMEOUT_MS'] || '360000'), // 6 min
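The wrapper timeout at llmService.ts:343 likely looks something like the sketch below (an assumed helper, not the codebase's actual implementation): race the LLM promise against a timer so a hung call fails fast instead of stalling the job.

```typescript
// Sketch: wrap any promise with a timeout. The label only affects
// the error message ("LLM call timeout after ..." as seen in the logs).
function withTimeout<T>(p: Promise<T>, ms: number, label = 'LLM call'): Promise<T> {
  let timer: NodeJS.Timeout | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`${label} timeout after ${ms} ms`)), ms);
  });
  // Clear the timer whichever side wins, so no stray rejection fires later.
  return Promise.race([p, timeout]).finally(() => {
    if (timer) clearTimeout(timer);
  });
}
```

If timeouts persist after raising `LLM_TIMEOUT_MS`, check that all three locations above use the same configured value rather than hardcoded ones.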

Error Type C: Authentication/API Key Issues

Symptoms:

"401 Unauthorized"
"403 Forbidden"
"API key is missing"
"ANTHROPIC_API_KEY is not set"

Root Cause: Missing or invalid API keys

Diagnosis:

# Check which keys are set
echo "ANTHROPIC_API_KEY: ${ANTHROPIC_API_KEY:0:20}..."
echo "OPENROUTER_API_KEY: ${OPENROUTER_API_KEY:0:20}..."
echo "OPENAI_API_KEY: ${OPENAI_API_KEY:0:20}..."

# Check .env file
grep -E "ANTHROPIC|OPENROUTER|OPENAI" backend/.env | grep -v "^#"

Quick Fix: Ensure these are set in backend/.env:

ANTHROPIC_API_KEY=sk-ant-api03-...
OPENROUTER_API_KEY=sk-or-v1-...
OPENROUTER_USE_BYOK=true
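A fail-fast startup check catches this class of error before any upload. A sketch, assuming the `sk-ant-` / `sk-or-` prefixes shown above are stable conventions (they match the examples, but the providers don't guarantee them):

```typescript
// Return a list of problems for one key; empty list means it looks OK.
function checkKey(name: string, value: string | undefined, prefix: string): string[] {
  if (!value) return [`${name} is not set`];
  if (!value.startsWith(prefix)) return [`${name} does not start with "${prefix}"`];
  return [];
}

const problems = [
  ...checkKey('ANTHROPIC_API_KEY', process.env.ANTHROPIC_API_KEY, 'sk-ant-'),
  ...checkKey('OPENROUTER_API_KEY', process.env.OPENROUTER_API_KEY, 'sk-or-'),
];
if (problems.length > 0) {
  console.error('API key problems:\n' + problems.join('\n'));
}
```

Note this only validates format; a key can be well-formed and still revoked, which shows up as 401/403 in Step 1.3.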

Error Type D: Rate Limit Exceeded

Symptoms:

"429 Too Many Requests"
"rate limit exceeded"
"Retry after X seconds"

Root Cause: Too many API calls in short time

Diagnosis:

# Check recent API call frequency
grep "LLM API call" logs/testing.log | tail -20 | \
  awk '{print $1, $2}' | uniq -c

Quick Fix:

  1. Wait for rate limit to reset (check error for retry time)
  2. Add rate limiting in code:
    // In llmService.ts, add delay between retries
    await new Promise(resolve => setTimeout(resolve, 2000)); // 2 sec delay
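The fixed delay above can be extended to exponential backoff so repeated 429s don't hammer the API. The helper name and shape are illustrative, not the codebase's API:

```typescript
// Retry fn() up to maxRetries times, doubling the delay each attempt
// (2s, 4s, 8s, ... with the default base delay).
async function retryWithBackoff<T>(
  fn: () => Promise<T>,
  maxRetries = 3,
  baseDelayMs = 2000
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= maxRetries) throw err; // out of retries, surface the error
      const delay = baseDelayMs * 2 ** attempt;
      await new Promise(resolve => setTimeout(resolve, delay));
    }
  }
}
```

Ideally the delay would honor the `Retry-After` value from the 429 response when present, falling back to the computed backoff otherwise.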
    

Error Type E: Code Bugs (TypeError, Cannot read property)

Symptoms:

"Cannot read properties of undefined (reading '0')"
"TypeError: response.data is undefined"
"Unexpected token in JSON"

Root Cause: Missing null checks or incorrect data access

Diagnosis:

# Find the exact line causing the error
grep -A 5 "Cannot read properties" logs/error.log | tail -10

Quick Fix Pattern: Replace unsafe access:

// Bad:
const content = response.data.choices[0].message.content;

// Good:
const content = response.data?.choices?.[0]?.message?.content || '';

File to check: llmService.ts:696-720
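The optional-chaining fix generalizes to a small extractor that returns `null` instead of throwing when any level of the response is missing. A sketch against the OpenAI-style chat-completion shape that OpenRouter returns (the interface below is a simplification of that shape):

```typescript
// Minimal slice of a chat-completion response; every level is optional
// so a malformed or error response can't cause a TypeError.
interface ChatResponse {
  choices?: Array<{ message?: { content?: string } }>;
}

function extractContent(data: ChatResponse | undefined): string | null {
  const content = data?.choices?.[0]?.message?.content;
  return typeof content === 'string' && content.length > 0 ? content : null;
}
```

Callers then branch on `null` and log the full raw response for diagnosis, rather than crashing mid-pipeline.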


Phase 3: Systematic Testing (5-10 minutes)

After applying a fix, test in this order:

Test 1: Direct LLM Call

npx ts-node src/scripts/test-openrouter-simple.ts

Expected: Success in 30-90 seconds

Test 2: Simple RAG Processing

npx ts-node -e "
import { llmService } from './src/services/llmService';

(async () => {
  const text = 'CIM for Target Corp. Revenue: \$100M. EBITDA: \$20M.';
  const result = await llmService.processCIMDocument(text, 'BPCP Template');
  console.log('Success:', result.success);
  console.log('Has JSON:', !!result.jsonOutput);
  process.exit(result.success ? 0 : 1);
})();
"

Expected: Success with JSON output

Test 3: Full Document Upload

Use the frontend to upload a real CIM and monitor:

# In one terminal, watch logs
tail -f logs/testing.log | grep -E "(error|success|completed)"

# Check processing status
npx ts-node src/scripts/check-current-processing.ts

Phase 4: Emergency Fallback Options

If all else fails, use these fallback strategies:

Option 1: Switch to Direct Anthropic (Bypass OpenRouter)

# In .env
LLM_PROVIDER=anthropic  # Instead of openrouter

Pro: Eliminates OpenRouter as a variable
Con: Different rate limits

Option 2: Use Older Claude Model

# In .env or env.ts
LLM_MODEL=claude-3.5-sonnet
LLM_FAST_MODEL=claude-3.5-haiku

Pro: More stable, widely supported
Con: Slightly older model

Option 3: Reduce Input Size

// In optimizedAgenticRAGProcessor.ts:651
const targetTokenCount = 8000; // Down from 50000

Pro: Faster processing, less likely to timeout
Con: Less context for analysis
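If the processor truncates by characters rather than real token counts, the reduction amounts to something like the sketch below (the ~4 characters per token ratio is a common heuristic, and an assumption about how the actual processor measures size):

```typescript
// Rough truncation: ~4 characters per token, cut at the character limit.
// A real implementation would prefer cutting at a section boundary.
function truncateToTokens(text: string, targetTokenCount: number): string {
  const maxChars = targetTokenCount * 4;
  return text.length <= maxChars ? text : text.slice(0, maxChars);
}
```

Dropping `targetTokenCount` from 50000 to 8000 cuts the prompt to roughly a sixth of its size, which is why timeouts become far less likely.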


Phase 5: Preventive Monitoring

Set up these checks to catch issues early:

Daily Health Check Script

Create backend/scripts/daily-health-check.sh:

#!/bin/bash
echo "=== Daily CIM Processor Health Check ==="
echo ""

# Check for stuck documents
npx ts-node src/scripts/check-database-failures.ts

# Test LLM connectivity
npx ts-node src/scripts/test-openrouter-simple.ts

# Check recent success rate
echo "Recent processing stats (last 24 hours):"
npx ts-node -e "
import { supabase } from './src/config/supabase';
(async () => {
  const yesterday = new Date(Date.now() - 86400000).toISOString();
  const { data } = await supabase
    .from('documents')
    .select('status')
    .gte('created_at', yesterday);

  const stats = data?.reduce((acc, d) => {
    acc[d.status] = (acc[d.status] || 0) + 1;
    return acc;
  }, {} as Record<string, number>);

  console.log(stats);
  process.exit(0);
})();
"

Run daily:

chmod +x backend/scripts/daily-health-check.sh
./backend/scripts/daily-health-check.sh

📋 Quick Reference Checklist

When processing fails, check in this order:

  • Error logs (tail -100 logs/error.log)
  • Recent failures (database query in Step 1.1)
  • Direct LLM test (test-openrouter-simple.ts)
  • Model ID validity (curl OpenRouter API)
  • API keys set (check .env)
  • Timeout values (check env.ts)
  • OpenRouter vs Anthropic (which provider?)
  • Rate limits (check error for 429)
  • Code bugs (look for TypeErrors in logs)
  • Build succeeded (npm run build)

🔧 Common Fix Commands

# Rebuild after code changes
npm run build

# Clear error logs and start fresh
> logs/error.log

# Test with verbose logging
LOG_LEVEL=debug npx ts-node src/scripts/test-openrouter-simple.ts

# Check what's actually in .env
cat .env | grep -v "^#" | grep -E "LLM|ANTHROPIC|OPENROUTER"

# Verify OpenRouter models
curl -s "https://openrouter.ai/api/v1/models" -H "Authorization: Bearer $OPENROUTER_API_KEY" | python3 -m json.tool | grep "claude.*haiku\|claude.*sonnet"

📞 Escalation Path

If issue persists after 30 minutes:

  1. Check OpenRouter Status: https://status.openrouter.ai/
  2. Check Anthropic Status: https://status.anthropic.com/
  3. Review OpenRouter Docs: https://openrouter.ai/docs
  4. Test with curl: Send raw API request to isolate issue
  5. Compare git history: git diff HEAD~10 -- backend/src/services/llmService.ts

🎯 Success Criteria

Processing is "working" when:

  • Direct LLM test completes in < 2 minutes
  • Returns valid JSON matching schema
  • No errors in last 10 log entries
  • Database shows recent "completed" documents
  • Frontend can upload and process test CIM

Last Updated: 2025-11-07
Next Review: After any production deployment