CIM Summary LLM Processing - Rapid Diagnostic & Fix Plan
🚨 If Processing Fails - Execute This Plan
Phase 1: Immediate Diagnosis (2-5 minutes)
Step 1.1: Check Recent Failures in Database
npx ts-node -e "
import { supabase } from './src/config/supabase';
(async () => {
const { data } = await supabase
.from('documents')
.select('id, filename, status, error_message, created_at, updated_at')
.eq('status', 'failed')
.order('updated_at', { ascending: false })
.limit(5);
console.log('Recent Failures:');
data?.forEach(d => {
console.log(\`- \${d.filename}: \${d.error_message?.substring(0, 200)}\`);
});
process.exit(0);
})();
"
What to look for:
- Repeating error patterns
- Specific error messages (timeout, API error, invalid model, etc.)
- Time pattern (all failures at same time = system issue)
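To spot repeating patterns programmatically, the rows returned by the query above can be grouped by error-message prefix. A minimal sketch (the FailureRow shape mirrors the columns selected above; the 80-character prefix length is an arbitrary choice):

```typescript
// Group failure rows by a normalized error-message prefix so repeating
// patterns stand out at a glance. Hypothetical helper, not in the codebase.
interface FailureRow {
  filename: string;
  error_message: string | null;
}

function groupByErrorPattern(rows: FailureRow[]): Record<string, number> {
  return rows.reduce<Record<string, number>>((acc, row) => {
    const key = (row.error_message ?? 'unknown').slice(0, 80);
    acc[key] = (acc[key] ?? 0) + 1; // count occurrences of each pattern
    return acc;
  }, {});
}
```

If one key dominates the counts, you have a systemic issue rather than per-document flakiness.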
Step 1.2: Check Real-Time Error Logs
# Check last 100 errors
tail -100 logs/error.log | grep -E "(error|ERROR|failed|FAILED|timeout|TIMEOUT)" | tail -20
# Or check specific patterns
grep -E "OpenRouter|Anthropic|LLM|model ID" logs/error.log | tail -20
What to look for:
- "invalid model ID" → Model name issue
- "timeout" → Timeout configuration issue
- "rate limit" → API quota exceeded
- "401" or "403" → Authentication issue
- "Cannot read properties" → Code bug
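The mapping above can be encoded as a small triage helper. This is illustrative only (the patterns and labels simply mirror the list; it is not part of the codebase):

```typescript
// Map a raw error message to the likely root cause from the triage list.
const triageTable: { pattern: RegExp; cause: string }[] = [
  { pattern: /valid model ID/i, cause: 'Model name issue (Error Type A)' },
  { pattern: /401|403/, cause: 'Authentication issue (Error Type C)' },
  { pattern: /rate limit|429/i, cause: 'API quota exceeded (Error Type D)' },
  { pattern: /timeout/i, cause: 'Timeout configuration issue (Error Type B)' },
  { pattern: /Cannot read properties/i, cause: 'Code bug (Error Type E)' },
];

function triageError(message: string): string {
  const hit = triageTable.find(t => t.pattern.test(message));
  return hit ? hit.cause : 'Unknown - check Phase 2 sections manually';
}
```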
Step 1.3: Test LLM Directly (Fastest Check)
# This takes 30-60 seconds
npx ts-node src/scripts/test-openrouter-simple.ts 2>&1 | grep -E "(SUCCESS|FAILED|error.*model|OpenRouter API)"
Expected output if working:
✅ OpenRouter API call successful
✅ Test Result: SUCCESS
If it fails, note the EXACT error message.
Phase 2: Root Cause Identification (3-10 minutes)
Based on the error from Phase 1, jump to the appropriate section:
Error Type A: Invalid Model ID
Symptoms:
"anthropic/claude-haiku-4 is not a valid model ID"
"anthropic/claude-sonnet-4 is not a valid model ID"
Root Cause: Model name mismatch with OpenRouter API
Fix Location: backend/src/services/llmService.ts lines 526-552
Verification:
# Check what OpenRouter actually supports
curl -s "https://openrouter.ai/api/v1/models" \
-H "Authorization: Bearer $OPENROUTER_API_KEY" | \
python3 -m json.tool | \
grep -A 2 "\"id\": \"anthropic" | \
head -30
Quick Fix:
Update the model mapping in llmService.ts:
// Current valid OpenRouter model IDs (as of Nov 2025):
if (model.includes('sonnet') && model.includes('4')) {
openRouterModel = 'anthropic/claude-sonnet-4.5';
} else if (model.includes('haiku') && model.includes('4')) {
openRouterModel = 'anthropic/claude-haiku-4.5';
}
Error Type B: Timeout Errors
Symptoms:
"LLM call timeout after X minutes"
"Processing timeout: Document stuck"
Root Cause: Operation taking longer than configured timeout
Diagnosis:
# Check current timeout settings
grep -E "timeout|TIMEOUT" backend/src/config/env.ts | grep -v "//"
grep "timeoutMs" backend/src/services/llmService.ts | head -5
Check Locations:
- env.ts:319 - LLM_TIMEOUT_MS (default 180000 = 3 min)
- llmService.ts:343 - Wrapper timeout
- llmService.ts:516 - OpenRouter abort timeout
Quick Fix:
Add to .env:
LLM_TIMEOUT_MS=360000 # Increase to 6 minutes
Or edit env.ts:319:
timeoutMs: parseInt(envVars['LLM_TIMEOUT_MS'] || '360000'), // 6 min
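The wrapper timeout mentioned above can be approximated with a Promise.race pattern. A sketch under that assumption (the actual llmService.ts implementation may differ):

```typescript
// Race the LLM call against a timer; clear the timer on either outcome so
// a pending setTimeout doesn't keep the process alive. timeoutMs would come
// from LLM_TIMEOUT_MS in config.
async function withTimeout<T>(
  promise: Promise<T>,
  timeoutMs: number,
  label: string,
): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`${label} timeout after ${timeoutMs}ms`)),
      timeoutMs,
    );
  });
  try {
    return await Promise.race([promise, timeout]);
  } finally {
    if (timer !== undefined) clearTimeout(timer);
  }
}
```

Note that Promise.race does not cancel the underlying HTTP request; for that, the OpenRouter path also uses an abort timeout (llmService.ts:516).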
Error Type C: Authentication/API Key Issues
Symptoms:
"401 Unauthorized"
"403 Forbidden"
"API key is missing"
"ANTHROPIC_API_KEY is not set"
Root Cause: Missing or invalid API keys
Diagnosis:
# Check which keys are set
echo "ANTHROPIC_API_KEY: ${ANTHROPIC_API_KEY:0:20}..."
echo "OPENROUTER_API_KEY: ${OPENROUTER_API_KEY:0:20}..."
echo "OPENAI_API_KEY: ${OPENAI_API_KEY:0:20}..."
# Check .env file
grep -E "ANTHROPIC|OPENROUTER|OPENAI" backend/.env | grep -v "^#"
Quick Fix:
Ensure these are set in backend/.env:
ANTHROPIC_API_KEY=sk-ant-api03-...
OPENROUTER_API_KEY=sk-or-v1-...
OPENROUTER_USE_BYOK=true
Error Type D: Rate Limit Exceeded
Symptoms:
"429 Too Many Requests"
"rate limit exceeded"
"Retry after X seconds"
Root Cause: Too many API calls in a short time window
Diagnosis:
# Check recent API call frequency
grep "LLM API call" logs/testing.log | tail -20 | \
awk '{print $1, $2}' | uniq -c
Quick Fix:
- Wait for rate limit to reset (check error for retry time)
- Add rate limiting in code:
// In llmService.ts, add delay between retries
await new Promise(resolve => setTimeout(resolve, 2000)); // 2 sec delay
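A fuller version with exponential backoff might look like this. This is a sketch: the is429 detection and the call signature are assumptions about the error shape, not the actual llmService internals:

```typescript
// Retry on 429-style errors with exponential backoff (2s, 4s, 8s, ...).
// Non-rate-limit errors are rethrown immediately.
async function retryWithBackoff<T>(
  fn: () => Promise<T>,
  maxRetries = 3,
  baseDelayMs = 2000,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err: any) {
      const is429 =
        err?.status === 429 || /rate limit/i.test(String(err?.message ?? ''));
      if (!is429 || attempt >= maxRetries) throw err; // give up on other errors
      const delayMs = baseDelayMs * 2 ** attempt;     // exponential backoff
      await new Promise(resolve => setTimeout(resolve, delayMs));
    }
  }
}
```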
Error Type E: Code Bugs (TypeError, Cannot read property)
Symptoms:
"Cannot read properties of undefined (reading '0')"
"TypeError: response.data is undefined"
"Unexpected token in JSON"
Root Cause: Missing null checks or incorrect data access
Diagnosis:
# Find the exact line causing the error
grep -A 5 "Cannot read properties" logs/error.log | tail -10
Quick Fix Pattern: Replace unsafe access:
// Bad:
const content = response.data.choices[0].message.content;
// Good:
const content = response.data?.choices?.[0]?.message?.content || '';
File to check: llmService.ts:696-720
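The same defensive pattern can be pulled into a helper. The response shape below follows OpenRouter's OpenAI-compatible schema; the error-field handling is an assumption about how failures are surfaced:

```typescript
// Defensive extraction of the completion text from an OpenRouter-style
// chat response body. Never throws a TypeError on missing fields.
interface ChatResponseBody {
  choices?: { message?: { content?: string } }[];
  error?: { message?: string };
}

function extractContent(body: ChatResponseBody | undefined): string {
  if (body?.error?.message) {
    throw new Error(`LLM API error: ${body.error.message}`);
  }
  return body?.choices?.[0]?.message?.content ?? '';
}
```

Pair this with an HTTP status check (response.ok) before parsing, so a 4xx/5xx body is never treated as a completion.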
Phase 3: Systematic Testing (5-10 minutes)
After applying a fix, test in this order:
Test 1: Direct LLM Call
npx ts-node src/scripts/test-openrouter-simple.ts
Expected: Success in 30-90 seconds
Test 2: Simple RAG Processing
npx ts-node -e "
import { llmService } from './src/services/llmService';
(async () => {
const text = 'CIM for Target Corp. Revenue: \$100M. EBITDA: \$20M.';
const result = await llmService.processCIMDocument(text, 'BPCP Template');
console.log('Success:', result.success);
console.log('Has JSON:', !!result.jsonOutput);
process.exit(result.success ? 0 : 1);
})();
"
Expected: Success with JSON output
Test 3: Full Document Upload
Use the frontend to upload a real CIM and monitor:
# In one terminal, watch logs
tail -f logs/testing.log | grep -E "(error|success|completed)"
# Check processing status
npx ts-node src/scripts/check-current-processing.ts
Phase 4: Emergency Fallback Options
If all else fails, use these fallback strategies:
Option 1: Switch to Direct Anthropic (Bypass OpenRouter)
# In .env
LLM_PROVIDER=anthropic # Instead of openrouter
Pro: Eliminates OpenRouter as a variable
Con: Different rate limits
Option 2: Use Older Claude Model
# In .env or env.ts
LLM_MODEL=claude-3.5-sonnet
LLM_FAST_MODEL=claude-3.5-haiku
Pro: More stable, widely supported
Con: Slightly older model
Option 3: Reduce Input Size
// In optimizedAgenticRAGProcessor.ts:651
const targetTokenCount = 8000; // Down from 50000
Pro: Faster processing, less likely to timeout
Con: Less context for analysis
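A minimal sketch of the size reduction itself, using the rough ~4 characters/token heuristic (an approximation, not a real tokenizer; the actual processor may trim differently):

```typescript
// Trim input text to an approximate token budget. targetTokenCount mirrors
// the constant mentioned above; 4 chars/token is a rule of thumb.
function trimToTokenBudget(text: string, targetTokenCount: number): string {
  const maxChars = targetTokenCount * 4;
  return text.length <= maxChars ? text : text.slice(0, maxChars);
}
```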
Phase 5: Preventive Monitoring
Set up these checks to catch issues early:
Daily Health Check Script
Create backend/scripts/daily-health-check.sh:
#!/bin/bash
echo "=== Daily CIM Processor Health Check ==="
echo ""
# Check for stuck documents
npx ts-node src/scripts/check-database-failures.ts
# Test LLM connectivity
npx ts-node src/scripts/test-openrouter-simple.ts
# Check recent success rate
echo "Recent processing stats (last 24 hours):"
npx ts-node -e "
import { supabase } from './src/config/supabase';
(async () => {
const yesterday = new Date(Date.now() - 86400000).toISOString();
const { data } = await supabase
.from('documents')
.select('status')
.gte('created_at', yesterday);
const stats = data?.reduce((acc, d) => {
acc[d.status] = (acc[d.status] || 0) + 1;
return acc;
}, {});
console.log(stats);
process.exit(0);
})();
"
Run daily:
chmod +x backend/scripts/daily-health-check.sh
./backend/scripts/daily-health-check.sh
📋 Quick Reference Checklist
When processing fails, check in this order:
- Error logs (tail -100 logs/error.log)
- Recent failures (database query in Step 1.1)
- Direct LLM test (test-openrouter-simple.ts)
- Model ID validity (curl OpenRouter API)
- API keys set (check .env)
- Timeout values (check env.ts)
- OpenRouter vs Anthropic (which provider?)
- Rate limits (check error for 429)
- Code bugs (look for TypeErrors in logs)
- Build succeeded (npm run build)
🔧 Common Fix Commands
# Rebuild after code changes
npm run build
# Clear error logs and start fresh
> logs/error.log
# Test with verbose logging
LOG_LEVEL=debug npx ts-node src/scripts/test-openrouter-simple.ts
# Check what's actually in .env
cat .env | grep -v "^#" | grep -E "LLM|ANTHROPIC|OPENROUTER"
# Verify OpenRouter models
curl -s "https://openrouter.ai/api/v1/models" -H "Authorization: Bearer $OPENROUTER_API_KEY" | python3 -m json.tool | grep "claude.*haiku\|claude.*sonnet"
📞 Escalation Path
If issue persists after 30 minutes:
- Check OpenRouter Status: https://status.openrouter.ai/
- Check Anthropic Status: https://status.anthropic.com/
- Review OpenRouter Docs: https://openrouter.ai/docs
- Test with curl: Send raw API request to isolate issue
- Compare git history:
git diff HEAD~10 -- backend/src/services/llmService.ts
🎯 Success Criteria
Processing is "working" when:
- ✅ Direct LLM test completes in < 2 minutes
- ✅ Returns valid JSON matching schema
- ✅ No errors in last 10 log entries
- ✅ Database shows recent "completed" documents
- ✅ Frontend can upload and process test CIM
Last Updated: 2025-11-07
Next Review: After any production deployment