`cim_summary/HYBRID_SOLUTION.md`

Commit 9c916d12f4 by admin: feat: Production release v2.0.0 - Simple Document Processor
Major release with significant performance improvements and new processing strategy.

## Core Changes
- Implemented simple_full_document processing strategy (default)
- Full document → LLM approach: 1-2 passes, ~5-6 minutes processing time
- Achieved 100% completeness with 2 API calls (down from 5+)
- Removed redundant Document AI passes for faster processing

## Financial Data Extraction
- Enhanced deterministic financial table parser
- Improved FY3/FY2/FY1/LTM identification from varying CIM formats
- Automatic merging of parser results with LLM extraction

## Code Quality & Infrastructure
- Cleaned up debug logging (removed emoji markers from production code)
- Fixed Firebase Secrets configuration (using modern defineSecret approach)
- Updated OpenAI API key
- Resolved deployment conflicts (secrets vs environment variables)
- Added .env files to Firebase ignore list

## Deployment
- Firebase Functions v2 deployment successful
- All 7 required secrets verified and configured
- Function URL: https://api-y56ccs6wva-uc.a.run.app

## Performance Improvements
- Processing time: ~5-6 minutes (down from 23+ minutes)
- API calls: 1-2 (down from 5+)
- Completeness: 100% achievable
- LLM Model: claude-3-7-sonnet-latest

## Breaking Changes
- Default processing strategy changed to 'simple_full_document'
- RAG processor available as alternative strategy 'document_ai_agentic_rag'

## Files Changed
- 36 files changed, 5642 insertions(+), 4451 deletions(-)
- Removed deprecated documentation files
- Cleaned up unused services and models

This release represents a major refactoring focused on speed, accuracy, and maintainability.
2025-11-09 21:07:22 -05:00

# Financial Data Extraction: Hybrid Solution
## Better Regex + Enhanced LLM Approach
## Philosophy
Rather than a major architectural refactor, this solution enhances what's already working:
1. **Smarter regex** to catch more table patterns
2. **Better LLM context** to ensure financial tables are always seen
3. **Hybrid validation** where regex and LLM cross-check each other
---
## Problem Analysis (Refined)
### Current Issues:
1. **Regex is too strict** - Misses valid table formats
2. **LLM gets incomplete context** - Financial tables truncated or missing
3. **No cross-validation** - Regex and LLM don't verify each other
4. **Table structure lost** - But we can preserve it better with preprocessing
### Key Insight:
The LLM is actually VERY good at understanding financial tables, even in messy text. We just need to:
- Give it the RIGHT chunks (always include financial sections)
- Give it MORE context (increase chunk size for financial data)
- Give it BETTER formatting hints (preserve spacing/alignment where possible)
**When to use this hybrid track:** Rely on the telemetry described in `FINANCIAL_EXTRACTION_ANALYSIS.md` / `IMPLEMENTATION_PLAN.md`. If a document finishes Phase 1/2 processing with `tablesFound === 0` or `financialDataPopulated === false`, route it through the hybrid steps below, so we only pay the extra cost when the structured-table path truly fails.
---
## Solution Architecture
### Three-Tier Extraction Strategy
```
Tier 1: Enhanced Regex Parser (Fast, Deterministic)
↓ (if successful)
✓ Use regex results
↓ (if incomplete/failed)
Tier 2: LLM with Enhanced Context (Powerful, Flexible)
↓ (extract from full financial sections)
✓ Fill in gaps from Tier 1
↓ (if still missing data)
Tier 3: LLM Deep Dive (Focused, Exhaustive)
↓ (targeted re-scan of entire document)
✓ Final gap-filling
```
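The cascade above can be sketched as a small dispatcher. This is an illustrative skeleton only: the tier functions are placeholders for the regex parser, the LLM pass, and the deep-dive pass built in the phases below, and all names here are hypothetical.

```typescript
// Sketch of the three-tier cascade; tier functions are stand-ins.
interface Financials { revenue?: string; ebitda?: string }

function isComplete(f: Financials): boolean {
  return Boolean(f.revenue && f.ebitda);
}

// Keep only populated fields so earlier tiers win the merge
function stripEmpty(f: Financials): Financials {
  const out: Financials = {};
  if (f.revenue) out.revenue = f.revenue;
  if (f.ebitda) out.ebitda = f.ebitda;
  return out;
}

function extractFinancials(
  regexTier: (text: string) => Financials,
  llmTier: (text: string) => Financials,
  deepDiveTier: (text: string) => Financials,
  text: string
): Financials {
  const fromRegex = regexTier(text);          // Tier 1: fast, deterministic
  if (isComplete(fromRegex)) return fromRegex;

  // Tier 2: LLM fills gaps; deterministic values take precedence
  const merged = { ...llmTier(text), ...stripEmpty(fromRegex) };
  if (isComplete(merged)) return merged;

  // Tier 3: targeted re-scan of the entire document
  return { ...deepDiveTier(text), ...stripEmpty(merged) };
}
```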
---
## Implementation Plan
## Phase 1: Enhanced Regex Parser (2-3 hours)
### 1.1: Improve Text Preprocessing
**Goal**: Preserve table structure better before regex parsing
**File**: Create `backend/src/utils/textPreprocessor.ts`
```typescript
/**
 * Enhanced text preprocessing to preserve table structures
 * Attempts to maintain column alignment from PDF extraction
 */
export interface PreprocessedText {
  original: string;
  enhanced: string;
  tableRegions: TextRegion[];
  metadata: {
    likelyTableCount: number;
    preservedAlignment: boolean;
  };
}

export interface TextRegion {
  start: number;
  end: number;
  type: 'table' | 'narrative' | 'header';
  confidence: number;
  content: string;
}

/**
 * Identify regions that look like tables based on formatting patterns
 */
export function identifyTableRegions(text: string): TextRegion[] {
  const regions: TextRegion[] = [];
  const lines = text.split('\n');
  let currentRegion: TextRegion | null = null;
  let regionStart = 0;
  let linePosition = 0;

  for (let i = 0; i < lines.length; i++) {
    const line = lines[i];
    const nextLine = lines[i + 1] || '';
    const isTableLike = detectTableLine(line, nextLine);

    if (isTableLike.isTable && !currentRegion) {
      // Start new table region
      currentRegion = {
        start: linePosition,
        end: linePosition + line.length,
        type: 'table',
        confidence: isTableLike.confidence,
        content: line
      };
      regionStart = i;
    } else if (isTableLike.isTable && currentRegion) {
      // Extend current table region
      currentRegion.end = linePosition + line.length;
      currentRegion.content += '\n' + line;
      currentRegion.confidence = Math.max(currentRegion.confidence, isTableLike.confidence);
    } else if (!isTableLike.isTable && currentRegion) {
      // End table region; keep it only if confident and at least 3 lines long
      if (currentRegion.confidence > 0.5 && (i - regionStart) >= 3) {
        regions.push(currentRegion);
      }
      currentRegion = null;
    }

    linePosition += line.length + 1; // +1 for newline
  }

  // Add final region if the text ends inside a table
  if (currentRegion && currentRegion.confidence > 0.5) {
    regions.push(currentRegion);
  }

  return regions;
}

/**
 * Detect if a line looks like part of a table
 */
function detectTableLine(line: string, nextLine: string): { isTable: boolean; confidence: number } {
  let score = 0;

  // Check for multiple aligned numbers
  const numberMatches = line.match(/\$?[\d,]+\.?\d*[KMB%]?/g);
  if (numberMatches && numberMatches.length >= 3) {
    score += 0.4; // Multiple numbers = likely table row
  }

  // Check for consistent spacing (indicates columns)
  const hasConsistentSpacing = /\s{2,}/.test(line); // 2+ spaces = column separator
  if (hasConsistentSpacing && numberMatches) {
    score += 0.3;
  }

  // Check for year/period patterns
  if (/\b(FY[-\s]?\d{1,2}|20\d{2}|LTM|TTM)\b/i.test(line)) {
    score += 0.3;
  }

  // Check for financial keywords
  if (/(revenue|ebitda|sales|profit|margin|growth)/i.test(line)) {
    score += 0.2;
  }

  // Bonus: next line also looks like a table
  if (nextLine && /\$?[\d,]+\.?\d*[KMB%]?/.test(nextLine)) {
    score += 0.2;
  }

  return {
    isTable: score > 0.5,
    confidence: Math.min(score, 1.0)
  };
}

/**
 * Enhance text by preserving spacing in table regions
 */
export function preprocessText(text: string): PreprocessedText {
  const tableRegions = identifyTableRegions(text);

  // For now, return the original text with identified regions.
  // In the future, could normalize spacing, align columns, etc.
  return {
    original: text,
    enhanced: text, // TODO: Apply enhancement algorithms
    tableRegions,
    metadata: {
      likelyTableCount: tableRegions.length,
      preservedAlignment: true
    }
  };
}

/**
 * Extract just the table regions as separate texts
 */
export function extractTableTexts(preprocessed: PreprocessedText): string[] {
  return preprocessed.tableRegions
    .filter(region => region.type === 'table' && region.confidence > 0.6)
    .map(region => region.content);
}
```
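As a quick sanity check, here is a condensed, standalone copy of the same scoring heuristic applied to a table row versus plain narrative:

```typescript
// Condensed version of the detectTableLine scoring above, standalone.
function tableLineScore(line: string, nextLine = ''): number {
  let score = 0;
  const numbers = line.match(/\$?[\d,]+\.?\d*[KMB%]?/g);
  if (numbers && numbers.length >= 3) score += 0.4;                        // many numbers
  if (/\s{2,}/.test(line) && numbers) score += 0.3;                        // column spacing
  if (/\b(FY[-\s]?\d{1,2}|20\d{2}|LTM|TTM)\b/i.test(line)) score += 0.3;  // period tokens
  if (/(revenue|ebitda|sales|profit|margin|growth)/i.test(line)) score += 0.2;
  if (nextLine && /\$?[\d,]+\.?\d*/.test(nextLine)) score += 0.2;
  return Math.min(score, 1.0);
}

const row = 'Revenue        $45.2M   $52.8M   $61.2M';
const prose = 'The company operates nationwide with strong customer loyalty.';
// row scores well above the 0.5 threshold; prose scores 0
```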
### 1.2: Enhance Financial Table Parser
**File**: `backend/src/services/financialTableParser.ts`
**Add new patterns to catch more variations:**
```typescript
// ENHANCED: More flexible period token regex (add around line 21).
// Note: JS/TS regex literals have no free-spacing ("x") flag, so the pattern
// stays on one line; the alternatives are, in order:
//   FY-1 / FY 2 / FY12  |  2021 / FY2022A  |  LTM / TTM  |  CY21 / CY22  |  Q1 FY23
const PERIOD_TOKEN_REGEX =
  /\b(?:FY[-\s]?\d{1,2}|(?:FY[-\s]?)?20\d{2}[A-Z]*|LTM|TTM|CY\d{2}|Q[1-4]\s*(?:FY|CY)?\d{2})\b/gi;

// ENHANCED: Better money regex to catch more formats (update line 22).
// Alternatives: $1,234.5M  |  1,234.5M  |  (1,234.5M) negative  |  plain numbers
const MONEY_REGEX =
  /(?:\$\s*[\d,]+(?:\.\d+)?(?:\s*[KMB])?|[\d,]+(?:\.\d+)?\s*[KMB]|\([\d,]+(?:\.\d+)?(?:\s*[KMB])?\)|[\d,]+(?:\.\d+)?)/g;

// ENHANCED: Better percentage regex (update line 23).
// Alternatives: 12.5% or (12.5%)  |  12.5 pct  |  NM / N/A (not meaningful)
const PERCENT_REGEX = /(?:\(?[\d,]+\.?\d*\s*%\)?|[\d,]+\.?\d*\s*pct|NM|N\/A)/gi;
```
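A quick standalone check of the period-token pattern against a typical CIM header line (the regex is repeated here, written on one line because JavaScript regexes have no free-spacing flag):

```typescript
// Period-token regex, repeated so this snippet runs standalone.
const PERIOD_TOKEN_REGEX =
  /\b(?:FY[-\s]?\d{1,2}|(?:FY[-\s]?)?20\d{2}[A-Z]*|LTM|TTM|CY\d{2}|Q[1-4]\s*(?:FY|CY)?\d{2})\b/gi;

const header = "FY2021A  FY2022A  FY2023E  LTM Sep'23";
const tokens = header.match(PERIOD_TOKEN_REGEX);
// Picks up the three fiscal-year labels plus the LTM marker,
// while ignoring the bare "'23" fragment.
```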
**Add multi-pass header detection:**
```typescript
// ADD after line 278 (after current header detection)

// ENHANCED: Multi-pass header detection if the first pass failed
if (bestHeaderIndex === -1) {
  logger.info('First pass header detection failed, trying relaxed patterns');

  // Second pass: look for ANY line with 3+ numbers and a year pattern
  for (let i = 0; i < lines.length; i++) {
    const line = lines[i];
    const hasYearPattern = /20\d{2}|FY|LTM|TTM/i.test(line);
    const numberCount = (line.match(/[\d,]+/g) || []).length;

    if (hasYearPattern && numberCount >= 3) {
      // Look at the next 10 lines for financial keywords
      const lookAhead = lines.slice(i + 1, i + 11).join(' ');
      const hasFinancialKeywords = /revenue|ebitda|sales|profit/i.test(lookAhead);

      if (hasFinancialKeywords) {
        logger.info('Relaxed header detection found candidate', {
          headerIndex: i,
          headerLine: line.substring(0, 100)
        });

        // Try to parse this line as a header
        const tokens = tokenizePeriodHeaders(line);
        if (tokens.length >= 2) {
          bestHeaderIndex = i;
          bestBuckets = yearTokensToBuckets(tokens);
          bestHeaderScore = 50; // Lower confidence than primary detection
          break;
        }
      }
    }
  }
}
```
**Add fuzzy row matching:**
```typescript
// ENHANCED: Add after line 354 (in the row matching loop)
// If exact match fails, try fuzzy matching
if (!ROW_MATCHERS[field].test(line)) {
  // Try fuzzy matching (partial matches, typos)
  const fuzzyMatch = fuzzyMatchFinancialRow(line, field);
  if (!fuzzyMatch) continue;
}

// ADD this helper function
function fuzzyMatchFinancialRow(line: string, field: string): boolean {
  const lineLower = line.toLowerCase();
  switch (field) {
    case 'revenue':
      return /rev\b|sales|top.?line/.test(lineLower);
    case 'ebitda':
      return /ebit|earnings.*operations|operating.*income/.test(lineLower);
    case 'grossProfit':
      return /gross.*profit|gp\b/.test(lineLower);
    case 'grossMargin':
      return /gross.*margin|gm\b|gross.*%/.test(lineLower);
    case 'ebitdaMargin':
      return /ebitda.*margin|ebitda.*%|margin.*ebitda/.test(lineLower);
    case 'revenueGrowth':
      return /revenue.*growth|growth.*revenue|rev.*growth|yoy|y.y/.test(lineLower);
    default:
      return false;
  }
}
```
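As a concrete example, a standalone copy of the revenue and EBITDA cases shows how a row labeled "Net Sales" still maps to revenue even though the strict matcher would miss it:

```typescript
// Standalone copy of two cases from fuzzyMatchFinancialRow above.
function fuzzyMatchFinancialRow(line: string, field: string): boolean {
  const lineLower = line.toLowerCase();
  switch (field) {
    case 'revenue':
      return /rev\b|sales|top.?line/.test(lineLower);
    case 'ebitda':
      return /ebit|earnings.*operations|operating.*income/.test(lineLower);
    default:
      return false;
  }
}
```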
---
## Phase 2: Enhanced LLM Context Delivery (2-3 hours)
### 2.1: Financial Section Prioritization
**File**: `backend/src/services/optimizedAgenticRAGProcessor.ts`
**Improve the `prioritizeFinancialChunks` method (around line 1265):**
```typescript
// ENHANCED: Much more aggressive financial chunk prioritization
private prioritizeFinancialChunks(chunks: ProcessingChunk[]): ProcessingChunk[] {
  const scoredChunks = chunks.map(chunk => {
    const content = chunk.content.toLowerCase();
    let score = 0;

    // TIER 1: Strong financial indicators (high score)
    const tier1Patterns = [
      /financial\s+summary/i,
      /historical\s+financials/i,
      /financial\s+performance/i,
      /income\s+statement/i,
      /financial\s+highlights/i,
    ];
    tier1Patterns.forEach(pattern => {
      if (pattern.test(content)) score += 100;
    });

    // TIER 2: Contains both periods AND metrics (very likely a financial table)
    const hasPeriods = /\b(20[12]\d|FY[-\s]?\d{1,2}|LTM|TTM)\b/i.test(content);
    const hasMetrics = /(revenue|ebitda|sales|profit|margin)/i.test(content);
    const hasNumbers = /\$[\d,]+|[\d,]+[KMB]/i.test(content);
    if (hasPeriods && hasMetrics && hasNumbers) {
      score += 80; // Very likely a financial table
    } else if (hasPeriods && hasMetrics) {
      score += 50;
    } else if (hasPeriods && hasNumbers) {
      score += 30;
    }

    // TIER 3: Multiple financial keywords
    const financialKeywords = [
      'revenue', 'ebitda', 'gross profit', 'margin', 'sales',
      'operating income', 'net income', 'cash flow', 'growth'
    ];
    const keywordMatches = financialKeywords.filter(kw => content.includes(kw)).length;
    score += keywordMatches * 5;

    // TIER 4: Has year progression (2021, 2022, 2023)
    const years = content.match(/20[12]\d/g);
    if (years && years.length >= 3) {
      score += 25; // Sequential years = likely financial table
    }

    // TIER 5: Multiple currency values
    const currencyMatches = content.match(/\$[\d,]+(?:\.\d+)?[KMB]?/gi);
    if (currencyMatches) {
      score += Math.min(currencyMatches.length * 3, 30);
    }

    // TIER 6: Section type boost
    if (chunk.sectionType && /financial|income|statement/i.test(chunk.sectionType)) {
      score += 40;
    }

    return { chunk, score };
  });

  // Sort by score (descending)
  const sorted = scoredChunks.sort((a, b) => b.score - a.score);

  // Log top financial chunks for debugging
  logger.info('Financial chunk prioritization results', {
    topScores: sorted.slice(0, 5).map(s => ({
      chunkIndex: s.chunk.chunkIndex,
      score: s.score,
      preview: s.chunk.content.substring(0, 100)
    }))
  });

  return sorted.map(s => s.chunk);
}
```
### 2.2: Increase Context for Financial Pass
**File**: `backend/src/services/optimizedAgenticRAGProcessor.ts`
**Update Pass 1 to use more chunks and larger context:**
```typescript
// ENHANCED: Update line 1259 (extractPass1CombinedMetadataFinancial)
// Change from 7 chunks to 12 chunks, and increase character limit
const maxChunks = 12; // Was 7 - give LLM more context for financials
const maxCharsPerChunk = 3000; // Was 1500 - don't truncate tables as aggressively
// And update line 1595 in extractWithTargetedQuery
const maxCharsPerChunk = options?.isFinancialPass ? 3000 : 1500;
```
### 2.3: Enhanced Financial Extraction Prompt
**File**: `backend/src/services/optimizedAgenticRAGProcessor.ts`
**Update the Pass 1 query (around line 1196-1240) to be more explicit:**
```typescript
// ENHANCED: Much more detailed extraction instructions
const query = `Extract deal information, company metadata, and COMPREHENSIVE financial data.
CRITICAL FINANCIAL TABLE EXTRACTION INSTRUCTIONS:
I. LOCATE FINANCIAL TABLES
Look for sections titled: "Financial Summary", "Historical Financials", "Financial Performance",
"Income Statement", "P&L", "Key Metrics", "Financial Highlights", or similar.
Financial tables typically appear in these formats:
FORMAT 1 - Row-based:
                   FY 2021   FY 2022   FY 2023   LTM
Revenue            $45.2M    $52.8M    $61.2M    $58.5M
Revenue Growth     N/A       16.8%     15.9%     (4.4%)
EBITDA             $8.5M     $10.2M    $12.1M    $11.5M
FORMAT 2 - Column-based:
Metric | Value
-------------------|---------
FY21 Revenue | $45.2M
FY22 Revenue | $52.8M
FY23 Revenue | $61.2M
FORMAT 3 - Inline:
Revenue grew from $45.2M in FY2021 to $52.8M in FY2022 (+16.8%) and $61.2M in FY2023 (+15.9%)
II. EXTRACTION RULES
1. PERIOD IDENTIFICATION
- FY-3, FY-2, FY-1 = Three most recent FULL fiscal years (not projections)
- LTM/TTM = Most recent 12-month period
- Map year labels: If you see "FY2021, FY2022, FY2023, LTM Sep'23", then:
* FY2021 → fy3
* FY2022 → fy2
* FY2023 → fy1
* LTM Sep'23 → ltm
2. VALUE EXTRACTION
- Extract EXACT values as shown: "$45.2M", "16.8%", etc.
- Preserve formatting: "$45.2M" not "45.2" or "45200000"
- Include negative indicators: "(4.4%)" or "-4.4%"
- Use "N/A" or "NM" if explicitly stated (not "Not specified")
3. METRIC IDENTIFICATION
- Revenue = "Revenue", "Net Sales", "Total Sales", "Top Line"
- EBITDA = "EBITDA", "Adjusted EBITDA", "Adj. EBITDA"
- Margins = Look for "%" after metric name
- Growth = "Growth %", "YoY", "Y/Y", "Change %"
4. DEAL OVERVIEW
- Extract: company name, industry, geography, transaction type
- Extract: employee count, deal source, reason for sale
- Extract: CIM dates and metadata
III. QUALITY CHECKS
Before submitting your response:
- [ ] Did I find at least 3 distinct fiscal periods?
- [ ] Do I have Revenue AND EBITDA for at least 2 periods?
- [ ] Did I preserve exact number formats from the document?
- [ ] Did I map the periods correctly (newest = fy1, oldest = fy3)?
IV. WHAT TO DO IF TABLE IS UNCLEAR
If the table is hard to parse:
- Include the ENTIRE table section in your analysis
- Extract what you can with confidence
- Mark unclear values as "Not specified in CIM" only if truly absent
- DO NOT guess or interpolate values
V. ADDITIONAL FINANCIAL DATA
Also extract:
- Quality of earnings notes
- EBITDA adjustments and add-backs
- Revenue growth drivers
- Margin trends and analysis
- CapEx requirements
- Working capital needs
- Free cash flow comments`;
```
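The mapping rule in II.1 can be expressed as a small helper. This is a sketch with hypothetical names; it assumes the fiscal-year labels share a format, so a lexicographic sort orders them oldest-first:

```typescript
// Sketch of the FY-label → bucket mapping described above:
// newest full fiscal year → fy1, oldest → fy3, trailing twelve months → ltm.
function mapPeriods(labels: string[]): Record<string, string> {
  const ltm = labels.filter(l => /LTM|TTM/i.test(l));
  const fiscalYears = labels
    .filter(l => !/LTM|TTM/i.test(l))
    .sort(); // assumes uniform labels, e.g. "FY2021" < "FY2022" < "FY2023"

  const buckets: Record<string, string> = {};
  const names = ['fy1', 'fy2', 'fy3'];
  // Walk newest-first so fy1 is always the most recent full year
  fiscalYears.reverse().forEach((label, i) => {
    if (i < names.length) buckets[names[i]] = label;
  });
  if (ltm.length > 0) buckets.ltm = ltm[0];
  return buckets;
}
```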
---
## Phase 3: Hybrid Validation & Cross-Checking (1-2 hours)
### 3.1: Create Validation Layer
**File**: Create `backend/src/services/financialDataValidator.ts`
```typescript
import { logger } from '../utils/logger';
import type { ParsedFinancials } from './financialTableParser';
import type { CIMReview } from './llmSchemas';

export interface ValidationResult {
  isValid: boolean;
  confidence: number;
  issues: string[];
  corrections: ParsedFinancials;
}

/**
 * Cross-validate financial data from multiple sources
 */
export function validateFinancialData(
  regexResult: ParsedFinancials,
  llmResult: Partial<CIMReview>
): ValidationResult {
  const issues: string[] = [];
  const corrections: ParsedFinancials = { ...regexResult };
  let confidence = 1.0;

  // Extract LLM financials
  const llmFinancials = llmResult.financialSummary?.financials;
  if (!llmFinancials) {
    return {
      isValid: true,
      confidence: 0.5,
      issues: ['No LLM financial data to validate against'],
      corrections: regexResult
    };
  }

  // Validate each period
  const periods: Array<keyof ParsedFinancials> = ['fy3', 'fy2', 'fy1', 'ltm'];
  for (const period of periods) {
    const regexPeriod = regexResult[period];
    const llmPeriod = llmFinancials[period];
    if (!llmPeriod) continue;

    // Compare revenue
    if (regexPeriod.revenue && llmPeriod.revenue) {
      const match = compareFinancialValues(regexPeriod.revenue, llmPeriod.revenue);
      if (!match.matches) {
        issues.push(`${period} revenue mismatch: Regex="${regexPeriod.revenue}" vs LLM="${llmPeriod.revenue}"`);
        confidence -= 0.1;
        // Trust the LLM if the regex value looks suspicious
        if (match.llmMoreCredible) {
          corrections[period].revenue = llmPeriod.revenue;
        }
      }
    } else if (!regexPeriod.revenue && llmPeriod.revenue && llmPeriod.revenue !== 'Not specified in CIM') {
      // Regex missed it, LLM found it
      corrections[period].revenue = llmPeriod.revenue;
      issues.push(`${period} revenue: Regex missed, using LLM value: ${llmPeriod.revenue}`);
    }

    // Compare EBITDA
    if (regexPeriod.ebitda && llmPeriod.ebitda) {
      const match = compareFinancialValues(regexPeriod.ebitda, llmPeriod.ebitda);
      if (!match.matches) {
        issues.push(`${period} EBITDA mismatch: Regex="${regexPeriod.ebitda}" vs LLM="${llmPeriod.ebitda}"`);
        confidence -= 0.1;
        if (match.llmMoreCredible) {
          corrections[period].ebitda = llmPeriod.ebitda;
        }
      }
    } else if (!regexPeriod.ebitda && llmPeriod.ebitda && llmPeriod.ebitda !== 'Not specified in CIM') {
      corrections[period].ebitda = llmPeriod.ebitda;
      issues.push(`${period} EBITDA: Regex missed, using LLM value: ${llmPeriod.ebitda}`);
    }

    // Fill in other fields from the LLM if regex didn't get them
    const fields: Array<keyof typeof regexPeriod> = [
      'revenueGrowth', 'grossProfit', 'grossMargin', 'ebitdaMargin'
    ];
    for (const field of fields) {
      if (!regexPeriod[field] && llmPeriod[field] && llmPeriod[field] !== 'Not specified in CIM') {
        corrections[period][field] = llmPeriod[field];
      }
    }
  }

  logger.info('Financial data validation completed', {
    confidence,
    issueCount: issues.length,
    issues: issues.slice(0, 5)
  });

  return {
    isValid: confidence > 0.6,
    confidence,
    issues,
    corrections
  };
}

/**
 * Compare two financial values to see if they match
 */
function compareFinancialValues(
  value1: string,
  value2: string
): { matches: boolean; llmMoreCredible: boolean } {
  const clean1 = value1.replace(/[$,\s]/g, '').toUpperCase();
  const clean2 = value2.replace(/[$,\s]/g, '').toUpperCase();

  // Exact match
  if (clean1 === clean2) {
    return { matches: true, llmMoreCredible: false };
  }

  // Check if numeric values are close (within 5%)
  const num1 = parseFinancialValue(value1);
  const num2 = parseFinancialValue(value2);
  if (num1 && num2) {
    const percentDiff = Math.abs((num1 - num2) / num1);
    if (percentDiff < 0.05) {
      // Values are close enough
      return { matches: true, llmMoreCredible: false };
    }
    // Large difference - trust the value with more precision
    const precision1 = (value1.match(/\./g) || []).length;
    const precision2 = (value2.match(/\./g) || []).length;
    return {
      matches: false,
      llmMoreCredible: precision2 > precision1
    };
  }

  return { matches: false, llmMoreCredible: false };
}

/**
 * Parse a financial value string to a number
 */
function parseFinancialValue(value: string): number | null {
  const clean = value.replace(/[$,\s]/g, '');
  let multiplier = 1;
  if (/M$/i.test(clean)) {
    multiplier = 1000000;
  } else if (/K$/i.test(clean)) {
    multiplier = 1000;
  } else if (/B$/i.test(clean)) {
    multiplier = 1000000000;
  }
  const numStr = clean.replace(/[MKB]/i, '');
  const num = parseFloat(numStr);
  return isNaN(num) ? null : num * multiplier;
}
```
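To sanity-check the unit handling, `parseFinancialValue` is repeated standalone and run against a few representative CIM values:

```typescript
// Repeated from financialDataValidator.ts above so this runs on its own.
function parseFinancialValue(value: string): number | null {
  const clean = value.replace(/[$,\s]/g, '');
  let multiplier = 1;
  if (/M$/i.test(clean)) multiplier = 1_000_000;
  else if (/K$/i.test(clean)) multiplier = 1_000;
  else if (/B$/i.test(clean)) multiplier = 1_000_000_000;
  const num = parseFloat(clean.replace(/[MKB]/i, ''));
  return isNaN(num) ? null : num * multiplier;
}

// '$45.2M' scales to roughly 45.2 million; '1,234' passes through unchanged;
// 'N/A' fails parseFloat and comes back null.
```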
### 3.2: Integrate Validation into Processing
**File**: `backend/src/services/optimizedAgenticRAGProcessor.ts`
**Add after line 1137 (after merging partial results):**
```typescript
// ENHANCED: Cross-validate regex and LLM results
if (deterministicFinancials) {
  logger.info('Validating deterministic financials against LLM results');

  const { validateFinancialData } = await import('./financialDataValidator');
  const validation = validateFinancialData(deterministicFinancials, mergedData);

  logger.info('Validation results', {
    documentId,
    isValid: validation.isValid,
    confidence: validation.confidence,
    issueCount: validation.issues.length
  });

  // Use validated/corrected data
  if (validation.confidence > 0.7) {
    deterministicFinancials = validation.corrections;
    logger.info('Using validated corrections', {
      documentId,
      corrections: validation.corrections
    });
  }

  // Merge validated data
  this.mergeDeterministicFinancialData(mergedData, deterministicFinancials, documentId);
} else {
  logger.info('No deterministic financial data to validate', { documentId });
}
```
---
## Phase 4: Text Preprocessing Integration (1 hour)
### 4.1: Apply Preprocessing to Document AI Text
**File**: `backend/src/services/documentAiProcessor.ts`
**Add preprocessing before passing to RAG:**
```typescript
// ADD import at top
import { preprocessText, extractTableTexts } from '../utils/textPreprocessor';

// UPDATE line 83 (processWithAgenticRAG method)
private async processWithAgenticRAG(documentId: string, extractedText: string): Promise<any> {
  try {
    logger.info('Processing extracted text with Agentic RAG', {
      documentId,
      textLength: extractedText.length
    });

    // ENHANCED: Preprocess text to identify table regions
    const preprocessed = preprocessText(extractedText);
    logger.info('Text preprocessing completed', {
      documentId,
      tableRegionsFound: preprocessed.tableRegions.length,
      likelyTableCount: preprocessed.metadata.likelyTableCount
    });

    // Extract table texts separately for better parsing
    const tableSections = extractTableTexts(preprocessed);

    // Import and use the optimized agentic RAG processor
    const { optimizedAgenticRAGProcessor } = await import('./optimizedAgenticRAGProcessor');
    const result = await optimizedAgenticRAGProcessor.processLargeDocument(
      documentId,
      extractedText,
      {
        preprocessedData: preprocessed, // Pass preprocessing results
        tableSections: tableSections    // Pass isolated table texts
      }
    );
    return result;
  } catch (error) {
    // ... existing error handling
  }
}
```
---
## Expected Results
### Current State (Baseline):
```
Financial data extraction rate: 10-20%
Typical result: "Not specified in CIM" for most fields
```
### After Phase 1 (Enhanced Regex):
```
Financial data extraction rate: 35-45%
Improvement: Better pattern matching catches more tables
```
### After Phase 2 (Enhanced LLM):
```
Financial data extraction rate: 65-75%
Improvement: LLM sees financial tables more reliably
```
### After Phase 3 (Validation):
```
Financial data extraction rate: 75-85%
Improvement: Cross-validation fills gaps and corrects errors
```
### After Phase 4 (Preprocessing):
```
Financial data extraction rate: 80-90%
Improvement: Table structure preservation helps both regex and LLM
```
---
## Implementation Priority
### Start Here (Highest ROI):
1. **Phase 2.1** - Financial Section Prioritization (30 min, +30% accuracy)
2. **Phase 2.2** - Increase LLM Context (15 min, +15% accuracy)
3. **Phase 2.3** - Enhanced Prompt (30 min, +20% accuracy)
**Total: 1.5 hours for ~50-60% improvement**
### Then Do:
4. **Phase 1.2** - Enhanced Parser Patterns (1 hour, +10% accuracy)
5. **Phase 3.1-3.2** - Validation (1.5 hours, +10% accuracy)
**Total: 4 hours for ~70-80% improvement**
### Optional:
6. **Phase 1.1, 4.1** - Text Preprocessing (2 hours, +10% accuracy)
---
## Testing Strategy
### Test 1: Baseline Measurement
```bash
# Process 10 CIMs and record extraction rate
npm run test:pipeline
# Record: How many financial fields are populated?
```
### Test 2: After Each Phase
```bash
# Same 10 CIMs, measure improvement
npm run test:pipeline
# Compare against baseline
```
### Test 3: Edge Cases
- PDFs with rotated pages
- PDFs with merged table cells
- PDFs with multi-line headers
- Narrative-only financials (no tables)
---
## Rollback Plan
Each phase is additive and can be disabled via feature flags:
```typescript
// config/env.ts
export const features = {
  enhancedRegexParsing: process.env.ENHANCED_REGEX === 'true',
  enhancedLLMContext: process.env.ENHANCED_LLM === 'true',
  financialValidation: process.env.VALIDATE_FINANCIALS === 'true',
  textPreprocessing: process.env.PREPROCESS_TEXT === 'true'
};
```
Set the matching variable to anything other than `true` (e.g. `ENHANCED_REGEX=false`) to disable that phase.
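One way the flag check can be exercised (a minimal sketch with hypothetical names; reading the env var at call time keeps the legacy fallback easy to test):

```typescript
// A feature is on only when its env var is exactly "true";
// anything else falls back to the original code path.
function isEnabled(name: string): boolean {
  return process.env[name] === 'true';
}

function chooseParserPath(): 'enhanced' | 'legacy' {
  return isEnabled('ENHANCED_REGEX') ? 'enhanced' : 'legacy';
}
```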
---
## Success Metrics
| Metric | Current | Target | Measurement |
|--------|---------|--------|-------------|
| Financial data extracted | 10-20% | 80-90% | % of fields populated |
| Processing time | 45s | <60s | End-to-end time |
| False positives | Unknown | <5% | Manual validation |
| Column misalignment | ~50% | <10% | Check FY mapping |
---
## Next Steps
1. Implement Phase 2 (Enhanced LLM) first - biggest impact, lowest risk
2. Test with 5-10 real CIM documents
3. Measure improvement
4. If >70% accuracy, stop. If not, add Phase 1 and 3.
5. Keep Phase 4 as optional enhancement
The LLM is actually very good at this - we just need to give it the right context!