# Financial Data Extraction: Hybrid Solution

## Better Regex + Enhanced LLM Approach

## Philosophy

Rather than a major architectural refactor, this solution enhances what's already working:

1. **Smarter regex** to catch more table patterns
2. **Better LLM context** to ensure financial tables are always seen
3. **Hybrid validation** where regex and LLM cross-check each other

---

## Problem Analysis (Refined)

### Current Issues:

1. **Regex is too strict** - Misses valid table formats
2. **LLM gets incomplete context** - Financial tables truncated or missing
3. **No cross-validation** - Regex and LLM don't verify each other
4. **Table structure lost** - But we can preserve it better with preprocessing

### Key Insight:

The LLM is actually VERY good at understanding financial tables, even in messy text. We just need to:

- Give it the RIGHT chunks (always include financial sections)
- Give it MORE context (increase chunk size for financial data)
- Give it BETTER formatting hints (preserve spacing/alignment where possible)

**When to use this hybrid track:** Rely on the telemetry described in `FINANCIAL_EXTRACTION_ANALYSIS.md` / `IMPLEMENTATION_PLAN.md`. If a document finishes Phase 1/2 processing with `tablesFound === 0` or `financialDataPopulated === false`, route it through the hybrid steps below so we only pay the extra cost when the structured-table path truly fails.
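The routing condition above can be captured as a tiny guard. This is a sketch only — the `ProcessingTelemetry` interface and function name are hypothetical stand-ins for whatever shape the telemetry described in the analysis docs actually has:

```typescript
// Hypothetical telemetry shape — field names mirror the flags referenced
// in FINANCIAL_EXTRACTION_ANALYSIS.md; adjust to the real interface.
interface ProcessingTelemetry {
  tablesFound: number;
  financialDataPopulated: boolean;
}

/** Route a document to the hybrid track only when the structured-table path failed. */
function shouldUseHybridTrack(t: ProcessingTelemetry): boolean {
  return t.tablesFound === 0 || t.financialDataPopulated === false;
}

console.log(shouldUseHybridTrack({ tablesFound: 3, financialDataPopulated: true }));  // false
console.log(shouldUseHybridTrack({ tablesFound: 0, financialDataPopulated: true }));  // true
console.log(shouldUseHybridTrack({ tablesFound: 5, financialDataPopulated: false })); // true
```

Keeping the guard this explicit makes the cost model auditable: the expensive tiers below run only for documents the cheap path demonstrably failed on.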
---

## Solution Architecture

### Three-Tier Extraction Strategy

```
Tier 1: Enhanced Regex Parser (Fast, Deterministic)
   ↓ (if successful)
   ✓ Use regex results
   ↓ (if incomplete/failed)
Tier 2: LLM with Enhanced Context (Powerful, Flexible)
   ↓ (extract from full financial sections)
   ✓ Fill in gaps from Tier 1
   ↓ (if still missing data)
Tier 3: LLM Deep Dive (Focused, Exhaustive)
   ↓ (targeted re-scan of entire document)
   ✓ Final gap-filling
```

---

## Implementation Plan

## Phase 1: Enhanced Regex Parser (2-3 hours)

### 1.1: Improve Text Preprocessing

**Goal**: Preserve table structure better before regex parsing

**File**: Create `backend/src/utils/textPreprocessor.ts`

```typescript
/**
 * Enhanced text preprocessing to preserve table structures
 * Attempts to maintain column alignment from PDF extraction
 */
export interface PreprocessedText {
  original: string;
  enhanced: string;
  tableRegions: TextRegion[];
  metadata: {
    likelyTableCount: number;
    preservedAlignment: boolean;
  };
}

export interface TextRegion {
  start: number;
  end: number;
  type: 'table' | 'narrative' | 'header';
  confidence: number;
  content: string;
}

/**
 * Identify regions that look like tables based on formatting patterns
 */
export function identifyTableRegions(text: string): TextRegion[] {
  const regions: TextRegion[] = [];
  const lines = text.split('\n');

  let currentRegion: TextRegion | null = null;
  let regionStart = 0;
  let linePosition = 0;

  for (let i = 0; i < lines.length; i++) {
    const line = lines[i];
    const nextLine = lines[i + 1] || '';

    const isTableLike = detectTableLine(line, nextLine);

    if (isTableLike.isTable && !currentRegion) {
      // Start new table region
      currentRegion = {
        start: linePosition,
        end: linePosition + line.length,
        type: 'table',
        confidence: isTableLike.confidence,
        content: line
      };
      regionStart = i;
    } else if (isTableLike.isTable && currentRegion) {
      // Extend current table region
      currentRegion.end = linePosition + line.length;
      currentRegion.content += '\n' + line;
      currentRegion.confidence = Math.max(currentRegion.confidence, isTableLike.confidence);
    } else if (!isTableLike.isTable && currentRegion) {
      // End table region
      if (currentRegion.confidence > 0.5 && (i - regionStart) >= 3) {
        regions.push(currentRegion);
      }
      currentRegion = null;
    }

    linePosition += line.length + 1; // +1 for newline
  }

  // Add final region if exists
  if (currentRegion && currentRegion.confidence > 0.5) {
    regions.push(currentRegion);
  }

  return regions;
}

/**
 * Detect if a line looks like part of a table
 */
function detectTableLine(line: string, nextLine: string): { isTable: boolean; confidence: number } {
  let score = 0;

  // Check for multiple aligned numbers
  const numberMatches = line.match(/\$?[\d,]+\.?\d*[KMB%]?/g);
  if (numberMatches && numberMatches.length >= 3) {
    score += 0.4; // Multiple numbers = likely table row
  }

  // Check for consistent spacing (indicates columns)
  const hasConsistentSpacing = /\s{2,}/.test(line); // 2+ spaces = column separator
  if (hasConsistentSpacing && numberMatches) {
    score += 0.3;
  }

  // Check for year/period patterns
  if (/\b(FY[-\s]?\d{1,2}|20\d{2}|LTM|TTM)\b/i.test(line)) {
    score += 0.3;
  }

  // Check for financial keywords
  if (/(revenue|ebitda|sales|profit|margin|growth)/i.test(line)) {
    score += 0.2;
  }

  // Bonus: Next line also looks like a table
  if (nextLine && /\$?[\d,]+\.?\d*[KMB%]?/.test(nextLine)) {
    score += 0.2;
  }

  return {
    isTable: score > 0.5,
    confidence: Math.min(score, 1.0)
  };
}

/**
 * Enhance text by preserving spacing in table regions
 */
export function preprocessText(text: string): PreprocessedText {
  const tableRegions = identifyTableRegions(text);

  // For now, return original text with identified regions
  // In the future, could normalize spacing, align columns, etc.
  return {
    original: text,
    enhanced: text, // TODO: Apply enhancement algorithms
    tableRegions,
    metadata: {
      likelyTableCount: tableRegions.length,
      preservedAlignment: true
    }
  };
}

/**
 * Extract just the table regions as separate texts
 */
export function extractTableTexts(preprocessed: PreprocessedText): string[] {
  return preprocessed.tableRegions
    .filter(region => region.type === 'table' && region.confidence > 0.6)
    .map(region => region.content);
}
```

### 1.2: Enhance Financial Table Parser

**File**: `backend/src/services/financialTableParser.ts`

**Add new patterns to catch more variations** (note: JavaScript regexes have no verbose/`x` flag, so each pattern is written on one line with its alternatives documented in comments):

```typescript
// ENHANCED: More flexible period token regex (add around line 21)
// Matches: FY-1, FY 2, FY2022A, 2021, LTM, TTM, CY21, CY22, Q1 FY23, Q4 2022
const PERIOD_TOKEN_REGEX =
  /\b(?:FY[-\s]?\d{1,2}|(?:FY[-\s]?)?20\d{2}[A-Z]*|LTM|TTM|CY\d{2}|Q[1-4]\s*(?:FY|CY)?\d{2})\b/gi;

// ENHANCED: Better money regex to catch more formats (update line 22)
// Matches: $1,234.5M | 1,234.5M | (1,234.5M) as a negative | plain numbers
const MONEY_REGEX =
  /(?:\$\s*[\d,]+(?:\.\d+)?(?:\s*[KMB])?|[\d,]+(?:\.\d+)?\s*[KMB]|\([\d,]+(?:\.\d+)?(?:\s*[KMB])?\)|[\d,]+(?:\.\d+)?)/g;

// ENHANCED: Better percentage regex (update line 23)
// Matches: 12.5% | (12.5%) | 12.5 pct | NM | N/A ("not meaningful" / not available)
const PERCENT_REGEX = /(?:\(?[\d,]+\.?\d*\s*%\)?|[\d,]+\.?\d*\s*pct|NM|N\/A)/gi;
```

**Add multi-pass header detection:**

```typescript
// ADD after line 278 (after current header detection)
// ENHANCED: Multi-pass header detection if first pass failed
if (bestHeaderIndex === -1) {
  logger.info('First pass header detection failed, trying relaxed patterns');

  // Second pass: Look for ANY line with 3+ numbers and a year pattern
  for (let i = 0; i < lines.length; i++) {
    const line = lines[i];
    const hasYearPattern = /20\d{2}|FY|LTM|TTM/i.test(line);
    const numberCount = (line.match(/[\d,]+/g) || []).length;

    if (hasYearPattern && numberCount >= 3) {
      // Look at next 10 lines for financial keywords
      const lookAhead = lines.slice(i + 1, i + 11).join(' ');
      const hasFinancialKeywords = /revenue|ebitda|sales|profit/i.test(lookAhead);

      if (hasFinancialKeywords) {
        logger.info('Relaxed header detection found candidate', {
          headerIndex: i,
          headerLine: line.substring(0, 100)
        });

        // Try to parse this as header
        const tokens = tokenizePeriodHeaders(line);
        if (tokens.length >= 2) {
          bestHeaderIndex = i;
          bestBuckets = yearTokensToBuckets(tokens);
          bestHeaderScore = 50; // Lower confidence than primary detection
          break;
        }
      }
    }
  }
}
```

**Add fuzzy row matching:**

```typescript
// ENHANCED: Add after line 354 (in the row matching loop)
// If exact match fails, try fuzzy matching
if (!ROW_MATCHERS[field].test(line)) {
  // Try fuzzy matching (partial matches, typos)
  const fuzzyMatch = fuzzyMatchFinancialRow(line, field);
  if (!fuzzyMatch) continue;
}

// ADD this helper function
function fuzzyMatchFinancialRow(line: string, field: string): boolean {
  const lineLower = line.toLowerCase();

  switch (field) {
    case 'revenue':
      return /rev\b|sales|top.?line/.test(lineLower);
    case 'ebitda':
      return /ebit|earnings.*operations|operating.*income/.test(lineLower);
    case 'grossProfit':
      return /gross.*profit|gp\b/.test(lineLower);
    case 'grossMargin':
      return /gross.*margin|gm\b|gross.*%/.test(lineLower);
    case 'ebitdaMargin':
      return /ebitda.*margin|ebitda.*%|margin.*ebitda/.test(lineLower);
    case 'revenueGrowth':
      return /revenue.*growth|growth.*revenue|rev.*growth|yoy|y.y/.test(lineLower);
    default:
      return false;
  }
}
```

---

## Phase 2: Enhanced LLM Context Delivery (2-3 hours)

### 2.1: Financial Section Prioritization

**File**: `backend/src/services/optimizedAgenticRAGProcessor.ts`

**Improve the `prioritizeFinancialChunks` method (around line 1265):**

```typescript
// ENHANCED: Much more aggressive financial chunk prioritization
private prioritizeFinancialChunks(chunks: ProcessingChunk[]): ProcessingChunk[] {
  const scoredChunks = chunks.map(chunk => {
    const content = chunk.content.toLowerCase();
    let score = 0;

    // TIER 1: Strong financial indicators (high score)
    const tier1Patterns = [
      /financial\s+summary/i,
      /historical\s+financials/i,
      /financial\s+performance/i,
      /income\s+statement/i,
      /financial\s+highlights/i,
    ];
    tier1Patterns.forEach(pattern => {
      if (pattern.test(content)) score += 100;
    });

    // TIER 2: Contains both periods AND metrics (very likely financial table)
    const hasPeriods = /\b(20[12]\d|FY[-\s]?\d{1,2}|LTM|TTM)\b/i.test(content);
    const hasMetrics = /(revenue|ebitda|sales|profit|margin)/i.test(content);
    const hasNumbers = /\$[\d,]+|[\d,]+[KMB]/i.test(content);

    if (hasPeriods && hasMetrics && hasNumbers) {
      score += 80; // Very likely financial table
    } else if (hasPeriods && hasMetrics) {
      score += 50;
    } else if (hasPeriods && hasNumbers) {
      score += 30;
    }

    // TIER 3: Multiple financial keywords
    const financialKeywords = [
      'revenue', 'ebitda', 'gross profit', 'margin', 'sales',
      'operating income', 'net income', 'cash flow', 'growth'
    ];
    const keywordMatches = financialKeywords.filter(kw => content.includes(kw)).length;
    score += keywordMatches * 5;

    // TIER 4: Has year progression (2021, 2022, 2023)
    const years = content.match(/20[12]\d/g);
    if (years && years.length >= 3) {
      score += 25; // Sequential years = likely financial table
    }

    // TIER 5: Multiple currency values
    const currencyMatches = content.match(/\$[\d,]+(?:\.\d+)?[KMB]?/gi);
    if (currencyMatches) {
      score += Math.min(currencyMatches.length * 3, 30);
    }

    // TIER 6: Section type boost
    if (chunk.sectionType && /financial|income|statement/i.test(chunk.sectionType)) {
      score += 40;
    }

    return { chunk, score };
  });

  // Sort by score and return
  const sorted = scoredChunks.sort((a, b) => b.score - a.score);

  // Log top financial chunks for debugging
  logger.info('Financial chunk prioritization results', {
    topScores: sorted.slice(0, 5).map(s => ({
      chunkIndex: s.chunk.chunkIndex,
      score: s.score,
      preview: s.chunk.content.substring(0, 100)
    }))
  });

  return sorted.map(s => s.chunk);
}
```

### 2.2: Increase Context for Financial Pass

**File**: `backend/src/services/optimizedAgenticRAGProcessor.ts`

**Update Pass 1 to use more chunks and larger context:**

```typescript
// ENHANCED: Update line 1259 (extractPass1CombinedMetadataFinancial)
// Change from 7 chunks to 12 chunks, and increase character limit
const maxChunks = 12;          // Was 7 - give LLM more context for financials
const maxCharsPerChunk = 3000; // Was 1500 - don't truncate tables as aggressively

// And update line 1595 in extractWithTargetedQuery
const maxCharsPerChunk = options?.isFinancialPass ? 3000 : 1500;
```

### 2.3: Enhanced Financial Extraction Prompt

**File**: `backend/src/services/optimizedAgenticRAGProcessor.ts`

**Update the Pass 1 query (around lines 1196-1240) to be more explicit:**

```typescript
// ENHANCED: Much more detailed extraction instructions
const query = `Extract deal information, company metadata, and COMPREHENSIVE financial data.

CRITICAL FINANCIAL TABLE EXTRACTION INSTRUCTIONS:

I. LOCATE FINANCIAL TABLES
Look for sections titled: "Financial Summary", "Historical Financials",
"Financial Performance", "Income Statement", "P&L", "Key Metrics",
"Financial Highlights", or similar.
Financial tables typically appear in these formats:

FORMAT 1 - Row-based:
                 FY 2021   FY 2022   FY 2023   LTM
Revenue          $45.2M    $52.8M    $61.2M    $58.5M
Revenue Growth   N/A       16.8%     15.9%     (4.4%)
EBITDA           $8.5M     $10.2M    $12.1M    $11.5M

FORMAT 2 - Column-based:
Metric             | Value
-------------------|---------
FY21 Revenue       | $45.2M
FY22 Revenue       | $52.8M
FY23 Revenue       | $61.2M

FORMAT 3 - Inline:
Revenue grew from $45.2M in FY2021 to $52.8M in FY2022 (+16.8%) and $61.2M in FY2023 (+15.9%)

II. EXTRACTION RULES

1. PERIOD IDENTIFICATION
   - FY-3, FY-2, FY-1 = Three most recent FULL fiscal years (not projections)
   - LTM/TTM = Most recent 12-month period
   - Map year labels: If you see "FY2021, FY2022, FY2023, LTM Sep'23", then:
     * FY2021 → fy3
     * FY2022 → fy2
     * FY2023 → fy1
     * LTM Sep'23 → ltm

2. VALUE EXTRACTION
   - Extract EXACT values as shown: "$45.2M", "16.8%", etc.
   - Preserve formatting: "$45.2M" not "45.2" or "45200000"
   - Include negative indicators: "(4.4%)" or "-4.4%"
   - Use "N/A" or "NM" if explicitly stated (not "Not specified")

3. METRIC IDENTIFICATION
   - Revenue = "Revenue", "Net Sales", "Total Sales", "Top Line"
   - EBITDA = "EBITDA", "Adjusted EBITDA", "Adj. EBITDA"
   - Margins = Look for "%" after metric name
   - Growth = "Growth %", "YoY", "Y/Y", "Change %"

4. DEAL OVERVIEW
   - Extract: company name, industry, geography, transaction type
   - Extract: employee count, deal source, reason for sale
   - Extract: CIM dates and metadata

III. QUALITY CHECKS
Before submitting your response:
- [ ] Did I find at least 3 distinct fiscal periods?
- [ ] Do I have Revenue AND EBITDA for at least 2 periods?
- [ ] Did I preserve exact number formats from the document?
- [ ] Did I map the periods correctly (newest = fy1, oldest = fy3)?

IV. WHAT TO DO IF TABLE IS UNCLEAR
If the table is hard to parse:
- Include the ENTIRE table section in your analysis
- Extract what you can with confidence
- Mark unclear values as "Not specified in CIM" only if truly absent
- DO NOT guess or interpolate values

V. ADDITIONAL FINANCIAL DATA
Also extract:
- Quality of earnings notes
- EBITDA adjustments and add-backs
- Revenue growth drivers
- Margin trends and analysis
- CapEx requirements
- Working capital needs
- Free cash flow comments`;
```

---

## Phase 3: Hybrid Validation & Cross-Checking (1-2 hours)

### 3.1: Create Validation Layer

**File**: Create `backend/src/services/financialDataValidator.ts`

```typescript
import { logger } from '../utils/logger';
import type { ParsedFinancials } from './financialTableParser';
import type { CIMReview } from './llmSchemas';

export interface ValidationResult {
  isValid: boolean;
  confidence: number;
  issues: string[];
  corrections: ParsedFinancials;
}

/**
 * Cross-validate financial data from multiple sources
 */
export function validateFinancialData(
  regexResult: ParsedFinancials,
  llmResult: Partial<CIMReview>
): ValidationResult {
  const issues: string[] = [];
  const corrections: ParsedFinancials = { ...regexResult };
  let confidence = 1.0;

  // Extract LLM financials
  const llmFinancials = llmResult.financialSummary?.financials;
  if (!llmFinancials) {
    return {
      isValid: true,
      confidence: 0.5,
      issues: ['No LLM financial data to validate against'],
      corrections: regexResult
    };
  }

  // Validate each period
  const periods: Array<'fy3' | 'fy2' | 'fy1' | 'ltm'> = ['fy3', 'fy2', 'fy1', 'ltm'];

  for (const period of periods) {
    const regexPeriod = regexResult[period];
    const llmPeriod = llmFinancials[period];

    if (!llmPeriod) continue;

    // Compare revenue
    if (regexPeriod.revenue && llmPeriod.revenue) {
      const match = compareFinancialValues(regexPeriod.revenue, llmPeriod.revenue);
      if (!match.matches) {
        issues.push(`${period} revenue mismatch: Regex="${regexPeriod.revenue}" vs LLM="${llmPeriod.revenue}"`);
        confidence -= 0.1;

        // Trust LLM if regex value looks suspicious
        if (match.llmMoreCredible) {
          corrections[period].revenue = llmPeriod.revenue;
        }
      }
    } else if (!regexPeriod.revenue && llmPeriod.revenue && llmPeriod.revenue !== 'Not specified in CIM') {
      // Regex missed it, LLM found it
      corrections[period].revenue = llmPeriod.revenue;
      issues.push(`${period} revenue: Regex missed, using LLM value: ${llmPeriod.revenue}`);
    }

    // Compare EBITDA
    if (regexPeriod.ebitda && llmPeriod.ebitda) {
      const match = compareFinancialValues(regexPeriod.ebitda, llmPeriod.ebitda);
      if (!match.matches) {
        issues.push(`${period} EBITDA mismatch: Regex="${regexPeriod.ebitda}" vs LLM="${llmPeriod.ebitda}"`);
        confidence -= 0.1;

        if (match.llmMoreCredible) {
          corrections[period].ebitda = llmPeriod.ebitda;
        }
      }
    } else if (!regexPeriod.ebitda && llmPeriod.ebitda && llmPeriod.ebitda !== 'Not specified in CIM') {
      corrections[period].ebitda = llmPeriod.ebitda;
      issues.push(`${period} EBITDA: Regex missed, using LLM value: ${llmPeriod.ebitda}`);
    }

    // Fill in other fields from LLM if regex didn't get them
    const fields: Array<'revenueGrowth' | 'grossProfit' | 'grossMargin' | 'ebitdaMargin'> = [
      'revenueGrowth', 'grossProfit', 'grossMargin', 'ebitdaMargin'
    ];

    for (const field of fields) {
      if (!regexPeriod[field] && llmPeriod[field] && llmPeriod[field] !== 'Not specified in CIM') {
        corrections[period][field] = llmPeriod[field];
      }
    }
  }

  logger.info('Financial data validation completed', {
    confidence,
    issueCount: issues.length,
    issues: issues.slice(0, 5)
  });

  return {
    isValid: confidence > 0.6,
    confidence,
    issues,
    corrections
  };
}

/**
 * Compare two financial values to see if they match
 */
function compareFinancialValues(
  value1: string,
  value2: string
): { matches: boolean; llmMoreCredible: boolean } {
  const clean1 = value1.replace(/[$,\s]/g, '').toUpperCase();
  const clean2 = value2.replace(/[$,\s]/g, '').toUpperCase();

  // Exact match
  if (clean1 === clean2) {
    return { matches: true, llmMoreCredible: false };
  }

  // Check if numeric values are close (within 5%)
  const num1 = parseFinancialValue(value1);
  const num2 = parseFinancialValue(value2);

  if (num1 && num2) {
    const percentDiff = Math.abs((num1 - num2) / num1);
    if (percentDiff < 0.05) {
      // Values are close enough
      return { matches: true, llmMoreCredible: false };
    }

    // Large difference - trust the value with more precision
    const precision1 = (value1.match(/\./g) || []).length;
    const precision2 = (value2.match(/\./g) || []).length;
    return { matches: false, llmMoreCredible: precision2 > precision1 };
  }

  return { matches: false, llmMoreCredible: false };
}

/**
 * Parse a financial value string to number
 */
function parseFinancialValue(value: string): number | null {
  const clean = value.replace(/[$,\s]/g, '');

  let multiplier = 1;
  if (/M$/i.test(clean)) {
    multiplier = 1000000;
  } else if (/K$/i.test(clean)) {
    multiplier = 1000;
  } else if (/B$/i.test(clean)) {
    multiplier = 1000000000;
  }

  const numStr = clean.replace(/[MKB]$/i, '');
  const num = parseFloat(numStr);

  return isNaN(num) ? null : num * multiplier;
}
```

### 3.2: Integrate Validation into Processing

**File**: `backend/src/services/optimizedAgenticRAGProcessor.ts`

**Add after line 1137 (after merging partial results):**

```typescript
// ENHANCED: Cross-validate regex and LLM results
if (deterministicFinancials) {
  logger.info('Validating deterministic financials against LLM results');

  const { validateFinancialData } = await import('./financialDataValidator');
  const validation = validateFinancialData(deterministicFinancials, mergedData);

  logger.info('Validation results', {
    documentId,
    isValid: validation.isValid,
    confidence: validation.confidence,
    issueCount: validation.issues.length
  });

  // Use validated/corrected data
  if (validation.confidence > 0.7) {
    deterministicFinancials = validation.corrections;
    logger.info('Using validated corrections', {
      documentId,
      corrections: validation.corrections
    });
  }

  // Merge validated data
  this.mergeDeterministicFinancialData(mergedData, deterministicFinancials, documentId);
} else {
  logger.info('No deterministic financial data to validate', { documentId });
}
```

---

## Phase 4: Text Preprocessing Integration (1 hour)

### 4.1: Apply Preprocessing to Document AI Text

**File**: `backend/src/services/documentAiProcessor.ts`

**Add preprocessing before passing to RAG:**

```typescript
// ADD import at top
import { preprocessText, extractTableTexts } from '../utils/textPreprocessor';

// UPDATE line 83 (processWithAgenticRAG method)
private async processWithAgenticRAG(documentId: string, extractedText: string): Promise<unknown> {
  try {
    logger.info('Processing extracted text with Agentic RAG', {
      documentId,
      textLength: extractedText.length
    });

    // ENHANCED: Preprocess text to identify table regions
    const preprocessed = preprocessText(extractedText);

    logger.info('Text preprocessing completed', {
      documentId,
      tableRegionsFound: preprocessed.tableRegions.length,
      likelyTableCount: preprocessed.metadata.likelyTableCount
    });

    // Extract table texts separately for better parsing
    const tableSections = extractTableTexts(preprocessed);

    // Import and use the optimized agentic RAG processor
    const { optimizedAgenticRAGProcessor } = await import('./optimizedAgenticRAGProcessor');
    const result = await optimizedAgenticRAGProcessor.processLargeDocument(
      documentId,
      extractedText,
      {
        preprocessedData: preprocessed, // Pass preprocessing results
        tableSections: tableSections    // Pass isolated table texts
      }
    );

    return result;
  } catch (error) {
    // ... existing error handling
  }
}
```

---

## Expected Results

### Current State (Baseline):
```
Financial data extraction rate: 10-20%
Typical result: "Not specified in CIM" for most fields
```

### After Phase 1 (Enhanced Regex):
```
Financial data extraction rate: 35-45%
Improvement: Better pattern matching catches more tables
```

### After Phase 2 (Enhanced LLM):
```
Financial data extraction rate: 65-75%
Improvement: LLM sees financial tables more reliably
```

### After Phase 3 (Validation):
```
Financial data extraction rate: 75-85%
Improvement: Cross-validation fills gaps and corrects errors
```

### After Phase 4 (Preprocessing):
```
Financial data extraction rate: 80-90%
Improvement: Table structure preservation helps both regex and LLM
```

---

## Implementation Priority

### Start Here (Highest ROI):
1. **Phase 2.1** - Financial Section Prioritization (30 min, +30% accuracy)
2. **Phase 2.2** - Increase LLM Context (15 min, +15% accuracy)
3. **Phase 2.3** - Enhanced Prompt (30 min, +20% accuracy)

**Total: 1.5 hours for ~50-60% improvement**

### Then Do:
4. **Phase 1.2** - Enhanced Parser Patterns (1 hour, +10% accuracy)
5. **Phase 3.1-3.2** - Validation (1.5 hours, +10% accuracy)

**Total: 4 hours for ~70-80% improvement**

### Optional:
6. **Phase 1.1, 4.1** - Text Preprocessing (2 hours, +10% accuracy)

---

## Testing Strategy

### Test 1: Baseline Measurement
```bash
# Process 10 CIMs and record extraction rate
npm run test:pipeline
# Record: How many financial fields are populated?
```

### Test 2: After Each Phase
```bash
# Same 10 CIMs, measure improvement
npm run test:pipeline
# Compare against baseline
```

### Test 3: Edge Cases
- PDFs with rotated pages
- PDFs with merged table cells
- PDFs with multi-line headers
- Narrative-only financials (no tables)

---

## Rollback Plan

Each phase is additive and can be disabled via feature flags:

```typescript
// config/env.ts
export const features = {
  enhancedRegexParsing: process.env.ENHANCED_REGEX === 'true',
  enhancedLLMContext: process.env.ENHANCED_LLM === 'true',
  financialValidation: process.env.VALIDATE_FINANCIALS === 'true',
  textPreprocessing: process.env.PREPROCESS_TEXT === 'true'
};
```

Set the corresponding variable to anything other than `'true'` (e.g. `ENHANCED_REGEX=false`) to disable that phase.

---

## Success Metrics

| Metric | Current | Target | Measurement |
|--------|---------|--------|-------------|
| Financial data extracted | 10-20% | 80-90% | % of fields populated |
| Processing time | 45s | <60s | End-to-end time |
| False positives | Unknown | <5% | Manual validation |
| Column misalignment | ~50% | <10% | Check FY mapping |

---

## Next Steps

1. Implement Phase 2 (Enhanced LLM) first - biggest impact, lowest risk
2. Test with 5-10 real CIM documents
3. Measure improvement
4. If accuracy exceeds 70%, stop. If not, add Phases 1 and 3.
5. Keep Phase 4 as optional enhancement

The LLM is actually very good at this - we just need to give it the right context!
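The three-tier strategy from the Solution Architecture section can be sketched as a simple gap-filling fallback chain. This is a sketch only — `Extractor`, `extractWithFallback`, and the stub tiers are hypothetical names, not the real service APIs; the key design choice it illustrates is that earlier (cheaper, deterministic) tiers win on conflicts and later tiers only fill gaps:

```typescript
// Hypothetical extractor signature: given the document text and the list of
// still-missing field keys, return whatever fields this tier could extract.
type Extractor = (text: string, missing: string[]) => Record<string, string>;

const REQUIRED = ['fy1.revenue', 'fy1.ebitda', 'fy2.revenue', 'fy2.ebitda'];

function extractWithFallback(text: string, tiers: Extractor[]): Record<string, string> {
  const merged: Record<string, string> = {};
  for (const tier of tiers) {
    const missing = REQUIRED.filter(f => !(f in merged));
    if (missing.length === 0) break; // earlier tiers already filled everything
    const result = tier(text, missing);
    // Gap-fill only: a later tier never overwrites an earlier tier's value.
    for (const [key, value] of Object.entries(result)) {
      if (!(key in merged)) merged[key] = value;
    }
  }
  return merged;
}

// Stub tiers: regex finds revenue only, the LLM pass fills in EBITDA.
const regexTier: Extractor = () =>
  ({ 'fy1.revenue': '$61.2M', 'fy2.revenue': '$52.8M' });
const llmTier: Extractor = () =>
  ({ 'fy1.revenue': '$61M', 'fy1.ebitda': '$12.1M', 'fy2.ebitda': '$10.2M' });

const out = extractWithFallback('document text…', [regexTier, llmTier]);
console.log(out['fy1.revenue']); // "$61.2M" — the deterministic regex value is kept
console.log(out['fy1.ebitda']);  // "$12.1M" — gap filled by the LLM tier
```

Tier 3 (the deep dive) slots in as just another `Extractor` at the end of the array, and it runs only when the first two tiers leave required fields empty.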