# CIM Summary LLM Processing - Rapid Diagnostic & Fix Plan

## 🚨 If Processing Fails - Execute This Plan

### Phase 1: Immediate Diagnosis (2-5 minutes)

#### Step 1.1: Check Recent Failures in Database

```bash
npx ts-node -e "
import { supabase } from './src/config/supabase';
(async () => {
  const { data } = await supabase
    .from('documents')
    .select('id, filename, status, error_message, created_at, updated_at')
    .eq('status', 'failed')
    .order('updated_at', { ascending: false })
    .limit(5);
  console.log('Recent Failures:');
  data?.forEach(d => {
    console.log(\`- \${d.filename}: \${d.error_message?.substring(0, 200)}\`);
  });
  process.exit(0);
})();
"
```

**What to look for:**
- Repeating error patterns
- Specific error messages (timeout, API error, invalid model, etc.)
- Time pattern (all failures at the same time = system issue)

---

#### Step 1.2: Check Real-Time Error Logs

```bash
# Check last 100 errors
tail -100 logs/error.log | grep -E "(error|ERROR|failed|FAILED|timeout|TIMEOUT)" | tail -20

# Or check specific patterns
grep -E "OpenRouter|Anthropic|LLM|model ID" logs/error.log | tail -20
```

**What to look for:**
- `"invalid model ID"` → Model name issue
- `"timeout"` → Timeout configuration issue
- `"rate limit"` → API quota exceeded
- `"401"` or `"403"` → Authentication issue
- `"Cannot read properties"` → Code bug

---

#### Step 1.3: Test LLM Directly (Fastest Check)

```bash
# This takes 30-60 seconds
npx ts-node src/scripts/test-openrouter-simple.ts 2>&1 | grep -E "(SUCCESS|FAILED|error.*model|OpenRouter API)"
```

**Expected output if working:**
```
✅ OpenRouter API call successful
✅ Test Result: SUCCESS
```

**If it fails, note the EXACT error message.**

---

### Phase 2: Root Cause Identification (3-10 minutes)

Based on the error from Phase 1, jump to the appropriate section:

#### **Error Type A: Invalid Model ID**

**Symptoms:**
```
"anthropic/claude-haiku-4 is not a valid model ID"
"anthropic/claude-sonnet-4 is not a valid model ID"
```

**Root Cause:** Model name mismatch with OpenRouter API

**Fix Location:** `backend/src/services/llmService.ts` lines 526-552

**Verification:**
```bash
# Check what OpenRouter actually supports
curl -s "https://openrouter.ai/api/v1/models" \
  -H "Authorization: Bearer $OPENROUTER_API_KEY" | \
  python3 -m json.tool | \
  grep -A 2 "\"id\": \"anthropic" | \
  head -30
```

**Quick Fix:** Update the model mapping in `llmService.ts`:
```typescript
// Current valid OpenRouter model IDs (as of Nov 2024):
if (model.includes('sonnet') && model.includes('4')) {
  openRouterModel = 'anthropic/claude-sonnet-4.5';
} else if (model.includes('haiku') && model.includes('4')) {
  openRouterModel = 'anthropic/claude-haiku-4.5';
}
```

---

#### **Error Type B: Timeout Errors**

**Symptoms:**
```
"LLM call timeout after X minutes"
"Processing timeout: Document stuck"
```

**Root Cause:** Operation taking longer than the configured timeout

**Diagnosis:**
```bash
# Check current timeout settings
grep -E "timeout|TIMEOUT" backend/src/config/env.ts | grep -v "//"
grep "timeoutMs" backend/src/services/llmService.ts | head -5
```

**Check Locations:**
1. `env.ts:319` - `LLM_TIMEOUT_MS` (default 180000 = 3 min)
2. `llmService.ts:343` - Wrapper timeout
3. `llmService.ts:516` - OpenRouter abort timeout

**Quick Fix:** Add to `.env` (comment on its own line, since not all dotenv parsers support inline comments):
```bash
# Increase to 6 minutes
LLM_TIMEOUT_MS=360000
```

Or edit `env.ts:319`:
```typescript
timeoutMs: parseInt(envVars['LLM_TIMEOUT_MS'] || '360000'), // 6 min
```

---

#### **Error Type C: Authentication/API Key Issues**

**Symptoms:**
```
"401 Unauthorized"
"403 Forbidden"
"API key is missing"
"ANTHROPIC_API_KEY is not set"
```

**Root Cause:** Missing or invalid API keys

**Diagnosis:**
```bash
# Check which keys are set
echo "ANTHROPIC_API_KEY: ${ANTHROPIC_API_KEY:0:20}..."
echo "OPENROUTER_API_KEY: ${OPENROUTER_API_KEY:0:20}..."
echo "OPENAI_API_KEY: ${OPENAI_API_KEY:0:20}..."
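# Hypothetical extra check (not in the repo's scripts): verify the OpenRouter
# key actually authenticates rather than merely being set.
# Expect HTTP 200 for a valid key, 401 for a missing or invalid one.
curl -s -o /dev/null -w "OpenRouter auth check: HTTP %{http_code}\n" \
  "https://openrouter.ai/api/v1/models" \
  -H "Authorization: Bearer $OPENROUTER_API_KEY"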
# Check .env file
grep -E "ANTHROPIC|OPENROUTER|OPENAI" backend/.env | grep -v "^#"
```

**Quick Fix:** Ensure these are set in `backend/.env`:
```bash
ANTHROPIC_API_KEY=sk-ant-api03-...
OPENROUTER_API_KEY=sk-or-v1-...
OPENROUTER_USE_BYOK=true
```

---

#### **Error Type D: Rate Limit Exceeded**

**Symptoms:**
```
"429 Too Many Requests"
"rate limit exceeded"
"Retry after X seconds"
```

**Root Cause:** Too many API calls in a short time

**Diagnosis:**
```bash
# Check recent API call frequency
grep "LLM API call" logs/testing.log | tail -20 | \
  awk '{print $1, $2}' | uniq -c
```

**Quick Fix:**
1. Wait for the rate limit to reset (check the error for a retry time)
2. Add rate limiting in code:
```typescript
// In llmService.ts, add delay between retries
await new Promise(resolve => setTimeout(resolve, 2000)); // 2 sec delay
```

---

#### **Error Type E: Code Bugs (TypeError, Cannot read property)**

**Symptoms:**
```
"Cannot read properties of undefined (reading '0')"
"TypeError: response.data is undefined"
"Unexpected token in JSON"
```

**Root Cause:** Missing null checks or incorrect data access

**Diagnosis:**
```bash
# Find the exact line causing the error
grep -A 5 "Cannot read properties" logs/error.log | tail -10
```

**Quick Fix Pattern:** Replace unsafe access:
```typescript
// Bad:
const content = response.data.choices[0].message.content;

// Good:
const content = response.data?.choices?.[0]?.message?.content || '';
```

**File to check:** `llmService.ts:696-720`

---

### Phase 3: Systematic Testing (5-10 minutes)

After applying a fix, test in this order:

#### Test 1: Direct LLM Call

```bash
npx ts-node src/scripts/test-openrouter-simple.ts
```

**Expected:** Success in 30-90 seconds

#### Test 2: Simple RAG Processing

```bash
npx ts-node -e "
import { llmService } from './src/services/llmService';
(async () => {
  const text = 'CIM for Target Corp. Revenue: \$100M. EBITDA: \$20M.';
  const result = await llmService.processCIMDocument(text, 'BPCP Template');
  console.log('Success:', result.success);
  console.log('Has JSON:', !!result.jsonOutput);
  process.exit(result.success ? 0 : 1);
})();
"
```

**Expected:** Success with JSON output

#### Test 3: Full Document Upload

Use the frontend to upload a real CIM and monitor:
```bash
# In one terminal, watch logs
tail -f logs/testing.log | grep -E "(error|success|completed)"

# Check processing status
npx ts-node src/scripts/check-current-processing.ts
```

---

### Phase 4: Emergency Fallback Options

If all else fails, use these fallback strategies:

#### Option 1: Switch to Direct Anthropic (Bypass OpenRouter)

```bash
# In .env: use anthropic instead of openrouter
LLM_PROVIDER=anthropic
```

- **Pro:** Eliminates OpenRouter as a variable
- **Con:** Different rate limits

#### Option 2: Use Older Claude Model

```bash
# In .env or env.ts
LLM_MODEL=claude-3.5-sonnet
LLM_FAST_MODEL=claude-3.5-haiku
```

- **Pro:** More stable, widely supported
- **Con:** Slightly older model

#### Option 3: Reduce Input Size

```typescript
// In optimizedAgenticRAGProcessor.ts:651
const targetTokenCount = 8000; // Down from 50000
```

- **Pro:** Faster processing, less likely to time out
- **Con:** Less context for analysis

---

### Phase 5: Preventive Monitoring

Set up these checks to catch issues early:

#### Daily Health Check Script

Create `backend/scripts/daily-health-check.sh`:
```bash
#!/bin/bash
echo "=== Daily CIM Processor Health Check ==="
echo ""

# Check for stuck documents
npx ts-node src/scripts/check-database-failures.ts

# Test LLM connectivity
npx ts-node src/scripts/test-openrouter-simple.ts

# Check recent success rate
echo "Recent processing stats (last 24 hours):"
npx ts-node -e "
import { supabase } from './src/config/supabase';
(async () => {
  const yesterday = new Date(Date.now() - 86400000).toISOString();
  const { data } = await supabase
    .from('documents')
    .select('status')
    .gte('created_at', yesterday);
  const stats =
    data?.reduce((acc, d) => {
      acc[d.status] = (acc[d.status] || 0) + 1;
      return acc;
    }, {});
  console.log(stats);
  process.exit(0);
})();
"
```

Run daily:
```bash
chmod +x backend/scripts/daily-health-check.sh
./backend/scripts/daily-health-check.sh
```

---

## 📋 Quick Reference Checklist

When processing fails, check in this order:

- [ ] **Error logs** (`tail -100 logs/error.log`)
- [ ] **Recent failures** (database query in Step 1.1)
- [ ] **Direct LLM test** (`test-openrouter-simple.ts`)
- [ ] **Model ID validity** (curl OpenRouter API)
- [ ] **API keys set** (check `.env`)
- [ ] **Timeout values** (check `env.ts`)
- [ ] **OpenRouter vs Anthropic** (which provider?)
- [ ] **Rate limits** (check error for 429)
- [ ] **Code bugs** (look for TypeErrors in logs)
- [ ] **Build succeeded** (`npm run build`)

---

## 🔧 Common Fix Commands

```bash
# Rebuild after code changes
npm run build

# Clear error logs and start fresh
> logs/error.log

# Test with verbose logging
LOG_LEVEL=debug npx ts-node src/scripts/test-openrouter-simple.ts

# Check what's actually in .env
cat .env | grep -v "^#" | grep -E "LLM|ANTHROPIC|OPENROUTER"

# Verify OpenRouter models
curl -s "https://openrouter.ai/api/v1/models" -H "Authorization: Bearer $OPENROUTER_API_KEY" | python3 -m json.tool | grep "claude.*haiku\|claude.*sonnet"
```

---

## 📞 Escalation Path

If the issue persists after 30 minutes:

1. **Check OpenRouter Status:** https://status.openrouter.ai/
2. **Check Anthropic Status:** https://status.anthropic.com/
3. **Review OpenRouter Docs:** https://openrouter.ai/docs
4. **Test with curl:** Send a raw API request to isolate the issue
5. **Compare git history:** `git diff HEAD~10 -- backend/src/services/llmService.ts`

---

## 🎯 Success Criteria

Processing is "working" when:

- ✅ Direct LLM test completes in < 2 minutes
- ✅ Returns valid JSON matching schema
- ✅ No errors in last 10 log entries
- ✅ Database shows recent "completed" documents
- ✅ Frontend can upload and process test CIM

---

**Last Updated:** 2025-11-07
**Next Review:** After any production deployment
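---

#### Appendix: Raw curl Request (Escalation Step 4)

Escalation step 4 ("Test with curl") can be sketched as below. This is a hedged example: it assumes OpenRouter's OpenAI-compatible `/chat/completions` endpoint and the `anthropic/claude-haiku-4.5` model ID referenced elsewhere in this plan; adjust both if your setup differs.

```bash
# Raw API request that bypasses all application code.
# A "choices" array in the response = provider is fine, bug is in our code.
# An "error" object = key, model ID, or provider problem.
curl -s https://openrouter.ai/api/v1/chat/completions \
  -H "Authorization: Bearer $OPENROUTER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "anthropic/claude-haiku-4.5",
    "messages": [{"role": "user", "content": "Reply with the single word OK"}],
    "max_tokens": 10
  }'
```

If this raw call succeeds while `test-openrouter-simple.ts` fails, the problem is in our request construction (model mapping, headers, timeouts) rather than on the provider side.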