# CIM Summary LLM Processing - Rapid Diagnostic & Fix Plan

## 🚨 If Processing Fails - Execute This Plan

### Phase 1: Immediate Diagnosis (2-5 minutes)

#### Step 1.1: Check Recent Failures in Database

```bash
npx ts-node -e "
import { supabase } from './src/config/supabase';
(async () => {
  const { data } = await supabase
    .from('documents')
    .select('id, filename, status, error_message, created_at, updated_at')
    .eq('status', 'failed')
    .order('updated_at', { ascending: false })
    .limit(5);
  console.log('Recent Failures:');
  data?.forEach(d => {
    console.log(\`- \${d.filename}: \${d.error_message?.substring(0, 200)}\`);
  });
  process.exit(0);
})();
"
```

**What to look for:**
- Repeating error patterns
- Specific error messages (timeout, API error, invalid model, etc.)
- Time pattern (all failures at the same time = system issue)

---

#### Step 1.2: Check Real-Time Error Logs

```bash
# Check last 100 errors
tail -100 logs/error.log | grep -E "(error|ERROR|failed|FAILED|timeout|TIMEOUT)" | tail -20

# Or check specific patterns
grep -E "OpenRouter|Anthropic|LLM|model ID" logs/error.log | tail -20
```

**What to look for:**
- `"invalid model ID"` → Model name issue
- `"timeout"` → Timeout configuration issue
- `"rate limit"` → API quota exceeded
- `"401"` or `"403"` → Authentication issue
- `"Cannot read properties"` → Code bug

---

#### Step 1.3: Test LLM Directly (Fastest Check)

```bash
# This takes 30-60 seconds
npx ts-node src/scripts/test-openrouter-simple.ts 2>&1 | grep -E "(SUCCESS|FAILED|error.*model|OpenRouter API)"
```

**Expected output if working:**
```
✅ OpenRouter API call successful
✅ Test Result: SUCCESS
```

**If it fails, note the EXACT error message.**

---

### Phase 2: Root Cause Identification (3-10 minutes)

Based on the error from Phase 1, jump to the appropriate section:

#### **Error Type A: Invalid Model ID**

**Symptoms:**
```
"anthropic/claude-haiku-4 is not a valid model ID"
"anthropic/claude-sonnet-4 is not a valid model ID"
```

**Root Cause:** Model name mismatch with OpenRouter API

**Fix Location:** `backend/src/services/llmService.ts` lines 526-552

**Verification:**
```bash
# Check what OpenRouter actually supports
curl -s "https://openrouter.ai/api/v1/models" \
  -H "Authorization: Bearer $OPENROUTER_API_KEY" | \
  python3 -m json.tool | \
  grep -A 2 "\"id\": \"anthropic" | \
  head -30
```

**Quick Fix:** Update the model mapping in `llmService.ts`:
```typescript
// Current valid OpenRouter model IDs (as of Nov 2024):
if (model.includes('sonnet') && model.includes('4')) {
  openRouterModel = 'anthropic/claude-sonnet-4.5';
} else if (model.includes('haiku') && model.includes('4')) {
  openRouterModel = 'anthropic/claude-haiku-4.5';
}
```

---

#### **Error Type B: Timeout Errors**

**Symptoms:**
```
"LLM call timeout after X minutes"
"Processing timeout: Document stuck"
```

**Root Cause:** Operation taking longer than the configured timeout

**Diagnosis:**
```bash
# Check current timeout settings
grep -E "timeout|TIMEOUT" backend/src/config/env.ts | grep -v "//"
grep "timeoutMs" backend/src/services/llmService.ts | head -5
```

**Check Locations:**
1. `env.ts:319` - `LLM_TIMEOUT_MS` (default 180000 = 3 min)
2. `llmService.ts:343` - Wrapper timeout
3. `llmService.ts:516` - OpenRouter abort timeout

**Quick Fix:** Add to `.env` (comment on its own line, since not all dotenv parsers support inline comments):
```bash
# Increase to 6 minutes
LLM_TIMEOUT_MS=360000
```

Or edit `env.ts:319`:
```typescript
timeoutMs: parseInt(envVars['LLM_TIMEOUT_MS'] || '360000'), // 6 min
```

---

#### **Error Type C: Authentication/API Key Issues**

**Symptoms:**
```
"401 Unauthorized"
"403 Forbidden"
"API key is missing"
"ANTHROPIC_API_KEY is not set"
```

**Root Cause:** Missing or invalid API keys

**Diagnosis:**
```bash
# Check which keys are set
echo "ANTHROPIC_API_KEY: ${ANTHROPIC_API_KEY:0:20}..."
echo "OPENROUTER_API_KEY: ${OPENROUTER_API_KEY:0:20}..."
echo "OPENAI_API_KEY: ${OPENAI_API_KEY:0:20}..."
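# Hypothetical extra check (not in the repo's scripts): verify the OpenRouter
# key actually authenticates rather than merely being set.
# Expect HTTP 200 for a valid key, 401 for a missing or invalid one.
curl -s -o /dev/null -w "OpenRouter auth check: HTTP %{http_code}\n" \
  "https://openrouter.ai/api/v1/models" \
  -H "Authorization: Bearer $OPENROUTER_API_KEY"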
# Check .env file
grep -E "ANTHROPIC|OPENROUTER|OPENAI" backend/.env | grep -v "^#"
```

**Quick Fix:** Ensure these are set in `backend/.env`:
```bash
ANTHROPIC_API_KEY=sk-ant-api03-...
OPENROUTER_API_KEY=sk-or-v1-...
OPENROUTER_USE_BYOK=true
```

---

#### **Error Type D: Rate Limit Exceeded**

**Symptoms:**
```
"429 Too Many Requests"
"rate limit exceeded"
"Retry after X seconds"
```

**Root Cause:** Too many API calls in a short time

**Diagnosis:**
```bash
# Check recent API call frequency
grep "LLM API call" logs/testing.log | tail -20 | \
  awk '{print $1, $2}' | uniq -c
```

**Quick Fix:**
1. Wait for the rate limit to reset (check the error for a retry time)
2. Add rate limiting in code:
```typescript
// In llmService.ts, add delay between retries
await new Promise(resolve => setTimeout(resolve, 2000)); // 2 sec delay
```

---

#### **Error Type E: Code Bugs (TypeError, Cannot read property)**

**Symptoms:**
```
"Cannot read properties of undefined (reading '0')"
"TypeError: response.data is undefined"
"Unexpected token in JSON"
```

**Root Cause:** Missing null checks or incorrect data access

**Diagnosis:**
```bash
# Find the exact line causing the error
grep -A 5 "Cannot read properties" logs/error.log | tail -10
```

**Quick Fix Pattern:** Replace unsafe access:
```typescript
// Bad:
const content = response.data.choices[0].message.content;

// Good:
const content = response.data?.choices?.[0]?.message?.content || '';
```

**File to check:** `llmService.ts:696-720`

---

### Phase 3: Systematic Testing (5-10 minutes)

After applying a fix, test in this order:

#### Test 1: Direct LLM Call

```bash
npx ts-node src/scripts/test-openrouter-simple.ts
```

**Expected:** Success in 30-90 seconds

#### Test 2: Simple RAG Processing

```bash
npx ts-node -e "
import { llmService } from './src/services/llmService';
(async () => {
  const text = 'CIM for Target Corp. Revenue: \$100M. EBITDA: \$20M.';
  const result = await llmService.processCIMDocument(text, 'BPCP Template');
  console.log('Success:', result.success);
  console.log('Has JSON:', !!result.jsonOutput);
  process.exit(result.success ? 0 : 1);
})();
"
```

**Expected:** Success with JSON output

#### Test 3: Full Document Upload

Use the frontend to upload a real CIM and monitor:
```bash
# In one terminal, watch logs
tail -f logs/testing.log | grep -E "(error|success|completed)"

# Check processing status
npx ts-node src/scripts/check-current-processing.ts
```

---

### Phase 4: Emergency Fallback Options

If all else fails, use these fallback strategies:

#### Option 1: Switch to Direct Anthropic (Bypass OpenRouter)

```bash
# In .env: use anthropic instead of openrouter
LLM_PROVIDER=anthropic
```

- **Pro:** Eliminates OpenRouter as a variable
- **Con:** Different rate limits

#### Option 2: Use Older Claude Model

```bash
# In .env or env.ts
LLM_MODEL=claude-3.5-sonnet
LLM_FAST_MODEL=claude-3.5-haiku
```

- **Pro:** More stable, widely supported
- **Con:** Slightly older model

#### Option 3: Reduce Input Size

```typescript
// In optimizedAgenticRAGProcessor.ts:651
const targetTokenCount = 8000; // Down from 50000
```

- **Pro:** Faster processing, less likely to time out
- **Con:** Less context for analysis

---

### Phase 5: Preventive Monitoring

Set up these checks to catch issues early:

#### Daily Health Check Script

Create `backend/scripts/daily-health-check.sh`:
```bash
#!/bin/bash
echo "=== Daily CIM Processor Health Check ==="
echo ""

# Check for stuck documents
npx ts-node src/scripts/check-database-failures.ts

# Test LLM connectivity
npx ts-node src/scripts/test-openrouter-simple.ts

# Check recent success rate
echo "Recent processing stats (last 24 hours):"
npx ts-node -e "
import { supabase } from './src/config/supabase';
(async () => {
  const yesterday = new Date(Date.now() - 86400000).toISOString();
  const { data } = await supabase
    .from('documents')
    .select('status')
    .gte('created_at', yesterday);
  const stats =
    data?.reduce((acc, d) => {
      acc[d.status] = (acc[d.status] || 0) + 1;
      return acc;
    }, {});
  console.log(stats);
  process.exit(0);
})();
"
```

Run daily:
```bash
chmod +x backend/scripts/daily-health-check.sh
./backend/scripts/daily-health-check.sh
```

---

## 📋 Quick Reference Checklist

When processing fails, check in this order:

- [ ] **Error logs** (`tail -100 logs/error.log`)
- [ ] **Recent failures** (database query in Step 1.1)
- [ ] **Direct LLM test** (`test-openrouter-simple.ts`)
- [ ] **Model ID validity** (curl OpenRouter API)
- [ ] **API keys set** (check `.env`)
- [ ] **Timeout values** (check `env.ts`)
- [ ] **OpenRouter vs Anthropic** (which provider?)
- [ ] **Rate limits** (check error for 429)
- [ ] **Code bugs** (look for TypeErrors in logs)
- [ ] **Build succeeded** (`npm run build`)

---

## 🔧 Common Fix Commands

```bash
# Rebuild after code changes
npm run build

# Clear error logs and start fresh
> logs/error.log

# Test with verbose logging
LOG_LEVEL=debug npx ts-node src/scripts/test-openrouter-simple.ts

# Check what's actually in .env
cat .env | grep -v "^#" | grep -E "LLM|ANTHROPIC|OPENROUTER"

# Verify OpenRouter models
curl -s "https://openrouter.ai/api/v1/models" -H "Authorization: Bearer $OPENROUTER_API_KEY" | python3 -m json.tool | grep "claude.*haiku\|claude.*sonnet"
```

---

## 📞 Escalation Path

If the issue persists after 30 minutes:

1. **Check OpenRouter Status:** https://status.openrouter.ai/
2. **Check Anthropic Status:** https://status.anthropic.com/
3. **Review OpenRouter Docs:** https://openrouter.ai/docs
4. **Test with curl:** Send a raw API request to isolate the issue
5. **Compare git history:** `git diff HEAD~10 -- backend/src/services/llmService.ts`

---

## 🎯 Success Criteria

Processing is "working" when:

- ✅ Direct LLM test completes in < 2 minutes
- ✅ Returns valid JSON matching schema
- ✅ No errors in last 10 log entries
- ✅ Database shows recent "completed" documents
- ✅ Frontend can upload and process test CIM

---

**Last Updated:** 2025-11-07
**Next Review:** After any production deployment
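---

#### Appendix: Raw curl Request (Escalation Step 4)

Escalation step 4 ("Test with curl") can be sketched as below. This is a hedged example: it assumes OpenRouter's OpenAI-compatible `/chat/completions` endpoint and the `anthropic/claude-haiku-4.5` model ID referenced elsewhere in this plan; adjust both if your setup differs.

```bash
# Raw API request that bypasses all application code.
# A "choices" array in the response = provider is fine, bug is in our code.
# An "error" object = key, model ID, or provider problem.
curl -s https://openrouter.ai/api/v1/chat/completions \
  -H "Authorization: Bearer $OPENROUTER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "anthropic/claude-haiku-4.5",
    "messages": [{"role": "user", "content": "Reply with the single word OK"}],
    "max_tokens": 10
  }'
```

If this raw call succeeds while `test-openrouter-simple.ts` fails, the problem is in our request construction (model mapping, headers, timeouts) rather than on the provider side.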