Major release with significant performance improvements and a new processing strategy.

## Core Changes
- Implemented simple_full_document processing strategy (default)
- Full document → LLM approach: 1-2 passes, ~5-6 minutes processing time
- Achieved 100% completeness with 2 API calls (down from 5+)
- Removed redundant Document AI passes for faster processing

## Financial Data Extraction
- Enhanced deterministic financial table parser
- Improved FY3/FY2/FY1/LTM identification from varying CIM formats
- Automatic merging of parser results with LLM extraction

## Code Quality & Infrastructure
- Cleaned up debug logging (removed emoji markers from production code)
- Fixed Firebase Secrets configuration (using modern defineSecret approach)
- Updated OpenAI API key
- Resolved deployment conflicts (secrets vs environment variables)
- Added .env files to Firebase ignore list

## Deployment
- Firebase Functions v2 deployment successful
- All 7 required secrets verified and configured
- Function URL: https://api-y56ccs6wva-uc.a.run.app

## Performance Improvements
- Processing time: ~5-6 minutes (down from 23+ minutes)
- API calls: 1-2 (down from 5+)
- Completeness: 100% achievable
- LLM model: claude-3-7-sonnet-latest

## Breaking Changes
- Default processing strategy changed to 'simple_full_document'
- RAG processor available as alternative strategy 'document_ai_agentic_rag'

## Files Changed
- 36 files changed, 5642 insertions(+), 4451 deletions(-)
- Removed deprecated documentation files
- Cleaned up unused services and models

This release represents a major refactoring focused on speed, accuracy, and maintainability.
# Financial Data Extraction: Hybrid Solution

**Better Regex + Enhanced LLM Approach**

## Philosophy
Rather than a major architectural refactor, this solution enhances what's already working:
- Smarter regex to catch more table patterns
- Better LLM context to ensure financial tables are always seen
- Hybrid validation where regex and LLM cross-check each other
## Problem Analysis (Refined)

**Current Issues:**
- **Regex is too strict** - misses valid table formats
- **LLM gets incomplete context** - financial tables truncated or missing
- **No cross-validation** - regex and LLM don't verify each other
- **Table structure lost** - but preprocessing can preserve it better
**Key Insight:**
The LLM is actually VERY good at understanding financial tables, even in messy text. We just need to:
- Give it the RIGHT chunks (always include financial sections)
- Give it MORE context (increase chunk size for financial data)
- Give it BETTER formatting hints (preserve spacing/alignment where possible)
**When to use this hybrid track:** Rely on the telemetry described in FINANCIAL_EXTRACTION_ANALYSIS.md / IMPLEMENTATION_PLAN.md. If a document finishes Phase 1/2 processing with `tablesFound === 0` or `financialDataPopulated === false`, route it through the hybrid steps below so we only pay the extra cost when the structured-table path truly fails.
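The gating condition above can be sketched as a single predicate. The telemetry interface here is a hypothetical stand-in for whatever Phase 1/2 actually emits; only the two field names come from this document:

```typescript
// Hypothetical telemetry shape; the real interface lives in the analysis docs.
interface PipelineTelemetry {
  tablesFound: number;
  financialDataPopulated: boolean;
}

// Route a document to the hybrid track only when the structured-table
// path produced nothing usable.
function shouldUseHybridTrack(t: PipelineTelemetry): boolean {
  return t.tablesFound === 0 || t.financialDataPopulated === false;
}
```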
## Solution Architecture

### Three-Tier Extraction Strategy
```
Tier 1: Enhanced Regex Parser (Fast, Deterministic)
  ↓ (if successful)
  ✓ Use regex results
  ↓ (if incomplete/failed)
Tier 2: LLM with Enhanced Context (Powerful, Flexible)
  ↓ (extract from full financial sections)
  ✓ Fill in gaps from Tier 1
  ↓ (if still missing data)
Tier 3: LLM Deep Dive (Focused, Exhaustive)
  ↓ (targeted re-scan of entire document)
  ✓ Final gap-filling
```
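A minimal sketch of how this cascade could be orchestrated. The tier functions, field names, and completeness test are illustrative placeholders, not the real service API:

```typescript
// Each tier is a cheap-to-expensive extractor; earlier results win and
// later tiers only fill fields still missing.
type Extractor = () => Record<string, string>;

const REQUIRED_FIELDS = ['revenue', 'ebitda']; // placeholder gap test

function isComplete(data: Record<string, string>): boolean {
  return REQUIRED_FIELDS.every(
    f => Boolean(data[f]) && data[f] !== 'Not specified in CIM'
  );
}

function extractWithFallback(tiers: Extractor[]): Record<string, string> {
  let merged: Record<string, string> = {};
  for (const tier of tiers) {
    // Spread order keeps values from earlier (more deterministic) tiers.
    merged = { ...tier(), ...merged };
    if (isComplete(merged)) break; // stop as soon as the gaps are closed
  }
  return merged;
}
```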
## Implementation Plan

### Phase 1: Enhanced Regex Parser (2-3 hours)

#### 1.1: Improve Text Preprocessing

**Goal:** Preserve table structure better before regex parsing

**File:** Create `backend/src/utils/textPreprocessor.ts`
```typescript
/**
 * Enhanced text preprocessing to preserve table structures
 * Attempts to maintain column alignment from PDF extraction
 */
export interface PreprocessedText {
  original: string;
  enhanced: string;
  tableRegions: TextRegion[];
  metadata: {
    likelyTableCount: number;
    preservedAlignment: boolean;
  };
}

export interface TextRegion {
  start: number;
  end: number;
  type: 'table' | 'narrative' | 'header';
  confidence: number;
  content: string;
}

/**
 * Identify regions that look like tables based on formatting patterns
 */
export function identifyTableRegions(text: string): TextRegion[] {
  const regions: TextRegion[] = [];
  const lines = text.split('\n');
  let currentRegion: TextRegion | null = null;
  let regionStart = 0;
  let linePosition = 0;

  for (let i = 0; i < lines.length; i++) {
    const line = lines[i];
    const nextLine = lines[i + 1] || '';
    const isTableLike = detectTableLine(line, nextLine);

    if (isTableLike.isTable && !currentRegion) {
      // Start new table region
      currentRegion = {
        start: linePosition,
        end: linePosition + line.length,
        type: 'table',
        confidence: isTableLike.confidence,
        content: line
      };
      regionStart = i;
    } else if (isTableLike.isTable && currentRegion) {
      // Extend current table region
      currentRegion.end = linePosition + line.length;
      currentRegion.content += '\n' + line;
      currentRegion.confidence = Math.max(currentRegion.confidence, isTableLike.confidence);
    } else if (!isTableLike.isTable && currentRegion) {
      // End table region
      if (currentRegion.confidence > 0.5 && (i - regionStart) >= 3) {
        regions.push(currentRegion);
      }
      currentRegion = null;
    }
    linePosition += line.length + 1; // +1 for newline
  }

  // Add final region if exists
  if (currentRegion && currentRegion.confidence > 0.5) {
    regions.push(currentRegion);
  }
  return regions;
}

/**
 * Detect if a line looks like part of a table
 */
function detectTableLine(line: string, nextLine: string): { isTable: boolean; confidence: number } {
  let score = 0;

  // Check for multiple aligned numbers
  const numberMatches = line.match(/\$?[\d,]+\.?\d*[KMB%]?/g);
  if (numberMatches && numberMatches.length >= 3) {
    score += 0.4; // Multiple numbers = likely table row
  }

  // Check for consistent spacing (indicates columns)
  const hasConsistentSpacing = /\s{2,}/.test(line); // 2+ spaces = column separator
  if (hasConsistentSpacing && numberMatches) {
    score += 0.3;
  }

  // Check for year/period patterns
  if (/\b(FY[-\s]?\d{1,2}|20\d{2}|LTM|TTM)\b/i.test(line)) {
    score += 0.3;
  }

  // Check for financial keywords
  if (/(revenue|ebitda|sales|profit|margin|growth)/i.test(line)) {
    score += 0.2;
  }

  // Bonus: Next line also looks like a table
  if (nextLine && /\$?[\d,]+\.?\d*[KMB%]?/.test(nextLine)) {
    score += 0.2;
  }

  return {
    isTable: score > 0.5,
    confidence: Math.min(score, 1.0)
  };
}

/**
 * Enhance text by preserving spacing in table regions
 */
export function preprocessText(text: string): PreprocessedText {
  const tableRegions = identifyTableRegions(text);

  // For now, return original text with identified regions
  // In the future, could normalize spacing, align columns, etc.
  return {
    original: text,
    enhanced: text, // TODO: Apply enhancement algorithms
    tableRegions,
    metadata: {
      likelyTableCount: tableRegions.length,
      preservedAlignment: true
    }
  };
}

/**
 * Extract just the table regions as separate texts
 */
export function extractTableTexts(preprocessed: PreprocessedText): string[] {
  return preprocessed.tableRegions
    .filter(region => region.type === 'table' && region.confidence > 0.6)
    .map(region => region.content);
}
```
#### 1.2: Enhance Financial Table Parser

**File:** `backend/src/services/financialTableParser.ts`

Add new patterns to catch more variations:
```typescript
// ENHANCED: More flexible period token regex (add around line 21).
// JS RegExp has no `x` (extended) flag, so the pattern is single-line;
// it matches FY-1, FY 2, 2021, FY2022A, LTM, TTM, CY21, Q1 FY23, Q4 2022.
const PERIOD_TOKEN_REGEX =
  /\b(?:FY[-\s]?\d{1,2}|(?:FY[-\s]?)?20\d{2}[A-Z]*|LTM|TTM|CY\d{2}|Q[1-4]\s*(?:FY|CY)?\d{2})\b/gi;

// ENHANCED: Better money regex to catch more formats (update line 22).
// Matches $1,234.5M, 1,234.5M, (1,234.5M) negatives, and plain numbers.
const MONEY_REGEX =
  /\$\s*[\d,]+(?:\.\d+)?(?:\s*[KMB])?|\([\d,]+(?:\.\d+)?(?:\s*[KMB])?\)|[\d,]+(?:\.\d+)?\s*[KMB]|[\d,]+(?:\.\d+)?/g;

// ENHANCED: Better percentage regex (update line 23).
// Matches 12.5%, (12.5%), 12.5 pct, and NM / N/A markers.
const PERCENT_REGEX = /\(?[\d,]+\.?\d*\s*%\)?|[\d,]+\.?\d*\s*pct|NM|N\/A/gi;
```
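As a standalone spot-check, the period and money patterns can be exercised on sample header and row lines. The patterns are written single-line here because JavaScript's RegExp has no `x` flag, and the plain-number alternative is left out of the money regex so the demo row only yields currency tokens:

```typescript
const PERIOD_TOKEN_REGEX =
  /\b(?:FY[-\s]?\d{1,2}|(?:FY[-\s]?)?20\d{2}[A-Z]*|LTM|TTM|CY\d{2}|Q[1-4]\s*(?:FY|CY)?\d{2})\b/gi;
const MONEY_REGEX =
  /\$\s*[\d,]+(?:\.\d+)?(?:\s*[KMB])?|\([\d,]+(?:\.\d+)?(?:\s*[KMB])?\)|[\d,]+(?:\.\d+)?\s*[KMB]/g;

const header = "FY 2021   FY2022A   LTM";
const row = "Revenue   $45.2M   $52.8M   (1,234.5M)";

// periods → ["FY 2021", "FY2022A", "LTM"]
const periods = header.match(PERIOD_TOKEN_REGEX) ?? [];
// money → ["$45.2M", "$52.8M", "(1,234.5M)"]
const money = row.match(MONEY_REGEX) ?? [];
```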
Add multi-pass header detection:
```typescript
// ADD after line 278 (after current header detection)

// ENHANCED: Multi-pass header detection if first pass failed
if (bestHeaderIndex === -1) {
  logger.info('First pass header detection failed, trying relaxed patterns');

  // Second pass: Look for ANY line with 3+ numbers and a year pattern
  for (let i = 0; i < lines.length; i++) {
    const line = lines[i];
    const hasYearPattern = /20\d{2}|FY|LTM|TTM/i.test(line);
    const numberCount = (line.match(/[\d,]+/g) || []).length;

    if (hasYearPattern && numberCount >= 3) {
      // Look at next 10 lines for financial keywords
      const lookAhead = lines.slice(i + 1, i + 11).join(' ');
      const hasFinancialKeywords = /revenue|ebitda|sales|profit/i.test(lookAhead);

      if (hasFinancialKeywords) {
        logger.info('Relaxed header detection found candidate', {
          headerIndex: i,
          headerLine: line.substring(0, 100)
        });

        // Try to parse this as header
        const tokens = tokenizePeriodHeaders(line);
        if (tokens.length >= 2) {
          bestHeaderIndex = i;
          bestBuckets = yearTokensToBuckets(tokens);
          bestHeaderScore = 50; // Lower confidence than primary detection
          break;
        }
      }
    }
  }
}
```
Add fuzzy row matching:
```typescript
// ENHANCED: Add after line 354 (in the row matching loop)

// If exact match fails, try fuzzy matching
if (!ROW_MATCHERS[field].test(line)) {
  // Try fuzzy matching (partial matches, typos)
  const fuzzyMatch = fuzzyMatchFinancialRow(line, field);
  if (!fuzzyMatch) continue;
}

// ADD this helper function
function fuzzyMatchFinancialRow(line: string, field: string): boolean {
  const lineLower = line.toLowerCase();
  switch (field) {
    case 'revenue':
      return /rev\b|sales|top.?line/.test(lineLower);
    case 'ebitda':
      return /ebit|earnings.*operations|operating.*income/.test(lineLower);
    case 'grossProfit':
      return /gross.*profit|gp\b/.test(lineLower);
    case 'grossMargin':
      return /gross.*margin|gm\b|gross.*%/.test(lineLower);
    case 'ebitdaMargin':
      return /ebitda.*margin|ebitda.*%|margin.*ebitda/.test(lineLower);
    case 'revenueGrowth':
      return /revenue.*growth|growth.*revenue|rev.*growth|yoy|y.y/.test(lineLower);
    default:
      return false;
  }
}
```
### Phase 2: Enhanced LLM Context Delivery (2-3 hours)

#### 2.1: Financial Section Prioritization

**File:** `backend/src/services/optimizedAgenticRAGProcessor.ts`

Improve the `prioritizeFinancialChunks` method (around line 1265):
```typescript
// ENHANCED: Much more aggressive financial chunk prioritization
private prioritizeFinancialChunks(chunks: ProcessingChunk[]): ProcessingChunk[] {
  const scoredChunks = chunks.map(chunk => {
    const content = chunk.content.toLowerCase();
    let score = 0;

    // TIER 1: Strong financial indicators (high score)
    const tier1Patterns = [
      /financial\s+summary/i,
      /historical\s+financials/i,
      /financial\s+performance/i,
      /income\s+statement/i,
      /financial\s+highlights/i,
    ];
    tier1Patterns.forEach(pattern => {
      if (pattern.test(content)) score += 100;
    });

    // TIER 2: Contains both periods AND metrics (very likely financial table)
    const hasPeriods = /\b(20[12]\d|FY[-\s]?\d{1,2}|LTM|TTM)\b/i.test(content);
    const hasMetrics = /(revenue|ebitda|sales|profit|margin)/i.test(content);
    const hasNumbers = /\$[\d,]+|[\d,]+[KMB]/i.test(content);
    if (hasPeriods && hasMetrics && hasNumbers) {
      score += 80; // Very likely financial table
    } else if (hasPeriods && hasMetrics) {
      score += 50;
    } else if (hasPeriods && hasNumbers) {
      score += 30;
    }

    // TIER 3: Multiple financial keywords
    const financialKeywords = [
      'revenue', 'ebitda', 'gross profit', 'margin', 'sales',
      'operating income', 'net income', 'cash flow', 'growth'
    ];
    const keywordMatches = financialKeywords.filter(kw => content.includes(kw)).length;
    score += keywordMatches * 5;

    // TIER 4: Has year progression (2021, 2022, 2023)
    const years = content.match(/20[12]\d/g);
    if (years && years.length >= 3) {
      score += 25; // Sequential years = likely financial table
    }

    // TIER 5: Multiple currency values
    const currencyMatches = content.match(/\$[\d,]+(?:\.\d+)?[KMB]?/gi);
    if (currencyMatches) {
      score += Math.min(currencyMatches.length * 3, 30);
    }

    // TIER 6: Section type boost
    if (chunk.sectionType && /financial|income|statement/i.test(chunk.sectionType)) {
      score += 40;
    }

    return { chunk, score };
  });

  // Sort by score and return
  const sorted = scoredChunks.sort((a, b) => b.score - a.score);

  // Log top financial chunks for debugging
  logger.info('Financial chunk prioritization results', {
    topScores: sorted.slice(0, 5).map(s => ({
      chunkIndex: s.chunk.chunkIndex,
      score: s.score,
      preview: s.chunk.content.substring(0, 100)
    }))
  });

  return sorted.map(s => s.chunk);
}
```
#### 2.2: Increase Context for Financial Pass

**File:** `backend/src/services/optimizedAgenticRAGProcessor.ts`

Update Pass 1 to use more chunks and larger context:
```typescript
// ENHANCED: Update line 1259 (extractPass1CombinedMetadataFinancial)
// Change from 7 chunks to 12 chunks, and increase the character limit
const maxChunks = 12;           // Was 7 - give LLM more context for financials
const maxCharsPerChunk = 3000;  // Was 1500 - don't truncate tables as aggressively

// And update line 1595 in extractWithTargetedQuery
const maxCharsPerChunk = options?.isFinancialPass ? 3000 : 1500;
```
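The per-pass limit can also be expressed as one helper; this is a sketch (the real processor presumably slices inline), with the numbers taken from the constants above:

```typescript
// Financial passes keep more characters so table rows are not cut
// mid-column; other passes keep the tighter default.
function truncateChunk(content: string, isFinancialPass: boolean): string {
  const maxCharsPerChunk = isFinancialPass ? 3000 : 1500;
  return content.slice(0, maxCharsPerChunk);
}
```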
#### 2.3: Enhanced Financial Extraction Prompt

**File:** `backend/src/services/optimizedAgenticRAGProcessor.ts`

Update the Pass 1 query (around lines 1196-1240) to be more explicit:
```typescript
// ENHANCED: Much more detailed extraction instructions
const query = `Extract deal information, company metadata, and COMPREHENSIVE financial data.

CRITICAL FINANCIAL TABLE EXTRACTION INSTRUCTIONS:

I. LOCATE FINANCIAL TABLES

Look for sections titled: "Financial Summary", "Historical Financials", "Financial Performance",
"Income Statement", "P&L", "Key Metrics", "Financial Highlights", or similar.

Financial tables typically appear in these formats:

FORMAT 1 - Row-based:
                 FY 2021    FY 2022    FY 2023    LTM
Revenue          $45.2M     $52.8M     $61.2M     $58.5M
Revenue Growth   N/A        16.8%      15.9%      (4.4%)
EBITDA           $8.5M      $10.2M     $12.1M     $11.5M

FORMAT 2 - Column-based:
Metric             | Value
-------------------|---------
FY21 Revenue       | $45.2M
FY22 Revenue       | $52.8M
FY23 Revenue       | $61.2M

FORMAT 3 - Inline:
Revenue grew from $45.2M in FY2021 to $52.8M in FY2022 (+16.8%) and $61.2M in FY2023 (+15.9%)

II. EXTRACTION RULES

1. PERIOD IDENTIFICATION
   - FY-3, FY-2, FY-1 = Three most recent FULL fiscal years (not projections)
   - LTM/TTM = Most recent 12-month period
   - Map year labels: If you see "FY2021, FY2022, FY2023, LTM Sep'23", then:
     * FY2021 → fy3
     * FY2022 → fy2
     * FY2023 → fy1
     * LTM Sep'23 → ltm

2. VALUE EXTRACTION
   - Extract EXACT values as shown: "$45.2M", "16.8%", etc.
   - Preserve formatting: "$45.2M" not "45.2" or "45200000"
   - Include negative indicators: "(4.4%)" or "-4.4%"
   - Use "N/A" or "NM" if explicitly stated (not "Not specified")

3. METRIC IDENTIFICATION
   - Revenue = "Revenue", "Net Sales", "Total Sales", "Top Line"
   - EBITDA = "EBITDA", "Adjusted EBITDA", "Adj. EBITDA"
   - Margins = Look for "%" after metric name
   - Growth = "Growth %", "YoY", "Y/Y", "Change %"

4. DEAL OVERVIEW
   - Extract: company name, industry, geography, transaction type
   - Extract: employee count, deal source, reason for sale
   - Extract: CIM dates and metadata

III. QUALITY CHECKS

Before submitting your response:
- [ ] Did I find at least 3 distinct fiscal periods?
- [ ] Do I have Revenue AND EBITDA for at least 2 periods?
- [ ] Did I preserve exact number formats from the document?
- [ ] Did I map the periods correctly (newest = fy1, oldest = fy3)?

IV. WHAT TO DO IF TABLE IS UNCLEAR

If the table is hard to parse:
- Include the ENTIRE table section in your analysis
- Extract what you can with confidence
- Mark unclear values as "Not specified in CIM" only if truly absent
- DO NOT guess or interpolate values

V. ADDITIONAL FINANCIAL DATA

Also extract:
- Quality of earnings notes
- EBITDA adjustments and add-backs
- Revenue growth drivers
- Margin trends and analysis
- CapEx requirements
- Working capital needs
- Free cash flow comments`;
```
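The period-mapping rule in section II.1 can also be made mechanical on our side. This sketch is a hypothetical helper (not part of the prompt) showing the intended oldest→fy3, newest→fy1 assignment:

```typescript
// Map raw header labels to fy3/fy2/fy1/ltm buckets: full fiscal years are
// sorted ascending, the newest lands in fy1, and LTM/TTM goes to ltm.
function mapPeriods(labels: string[]): Record<string, string> {
  const out: Record<string, string> = {};
  const years = labels
    .filter(l => /20\d{2}/.test(l) && !/LTM|TTM/i.test(l))
    .sort((a, b) => Number(a.match(/20\d{2}/)![0]) - Number(b.match(/20\d{2}/)![0]))
    .slice(-3); // keep the three most recent full years
  const slots = ['fy3', 'fy2', 'fy1'];
  const offset = slots.length - years.length; // newest must end up in fy1
  years.forEach((label, i) => { out[slots[offset + i]] = label; });
  const ltm = labels.find(l => /LTM|TTM/i.test(l));
  if (ltm) out.ltm = ltm;
  return out;
}
```

Note the offset: with only two historical years, they fill fy2 and fy1 rather than fy3 and fy2, so "newest = fy1" always holds.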
### Phase 3: Hybrid Validation & Cross-Checking (1-2 hours)

#### 3.1: Create Validation Layer

**File:** Create `backend/src/services/financialDataValidator.ts`
```typescript
import { logger } from '../utils/logger';
import type { ParsedFinancials } from './financialTableParser';
import type { CIMReview } from './llmSchemas';

export interface ValidationResult {
  isValid: boolean;
  confidence: number;
  issues: string[];
  corrections: ParsedFinancials;
}

/**
 * Cross-validate financial data from multiple sources
 */
export function validateFinancialData(
  regexResult: ParsedFinancials,
  llmResult: Partial<CIMReview>
): ValidationResult {
  const issues: string[] = [];
  const corrections: ParsedFinancials = { ...regexResult };
  let confidence = 1.0;

  // Extract LLM financials
  const llmFinancials = llmResult.financialSummary?.financials;
  if (!llmFinancials) {
    return {
      isValid: true,
      confidence: 0.5,
      issues: ['No LLM financial data to validate against'],
      corrections: regexResult
    };
  }

  // Validate each period
  const periods: Array<keyof ParsedFinancials> = ['fy3', 'fy2', 'fy1', 'ltm'];
  for (const period of periods) {
    const regexPeriod = regexResult[period];
    const llmPeriod = llmFinancials[period];
    if (!regexPeriod || !llmPeriod) continue; // skip periods missing from either source

    // Compare revenue
    if (regexPeriod.revenue && llmPeriod.revenue) {
      const match = compareFinancialValues(regexPeriod.revenue, llmPeriod.revenue);
      if (!match.matches) {
        issues.push(`${period} revenue mismatch: Regex="${regexPeriod.revenue}" vs LLM="${llmPeriod.revenue}"`);
        confidence -= 0.1;
        // Trust LLM if regex value looks suspicious
        if (match.llmMoreCredible) {
          corrections[period].revenue = llmPeriod.revenue;
        }
      }
    } else if (!regexPeriod.revenue && llmPeriod.revenue && llmPeriod.revenue !== 'Not specified in CIM') {
      // Regex missed it, LLM found it
      corrections[period].revenue = llmPeriod.revenue;
      issues.push(`${period} revenue: Regex missed, using LLM value: ${llmPeriod.revenue}`);
    }

    // Compare EBITDA
    if (regexPeriod.ebitda && llmPeriod.ebitda) {
      const match = compareFinancialValues(regexPeriod.ebitda, llmPeriod.ebitda);
      if (!match.matches) {
        issues.push(`${period} EBITDA mismatch: Regex="${regexPeriod.ebitda}" vs LLM="${llmPeriod.ebitda}"`);
        confidence -= 0.1;
        if (match.llmMoreCredible) {
          corrections[period].ebitda = llmPeriod.ebitda;
        }
      }
    } else if (!regexPeriod.ebitda && llmPeriod.ebitda && llmPeriod.ebitda !== 'Not specified in CIM') {
      corrections[period].ebitda = llmPeriod.ebitda;
      issues.push(`${period} EBITDA: Regex missed, using LLM value: ${llmPeriod.ebitda}`);
    }

    // Fill in other fields from LLM if regex didn't get them
    const fields: Array<keyof typeof regexPeriod> = [
      'revenueGrowth', 'grossProfit', 'grossMargin', 'ebitdaMargin'
    ];
    for (const field of fields) {
      if (!regexPeriod[field] && llmPeriod[field] && llmPeriod[field] !== 'Not specified in CIM') {
        corrections[period][field] = llmPeriod[field];
      }
    }
  }

  logger.info('Financial data validation completed', {
    confidence,
    issueCount: issues.length,
    issues: issues.slice(0, 5)
  });

  return {
    isValid: confidence > 0.6,
    confidence,
    issues,
    corrections
  };
}

/**
 * Compare two financial values to see if they match
 */
function compareFinancialValues(
  value1: string,
  value2: string
): { matches: boolean; llmMoreCredible: boolean } {
  const clean1 = value1.replace(/[$,\s]/g, '').toUpperCase();
  const clean2 = value2.replace(/[$,\s]/g, '').toUpperCase();

  // Exact match
  if (clean1 === clean2) {
    return { matches: true, llmMoreCredible: false };
  }

  // Check if numeric values are close (within 5%)
  const num1 = parseFinancialValue(value1);
  const num2 = parseFinancialValue(value2);
  if (num1 && num2) {
    const percentDiff = Math.abs((num1 - num2) / num1);
    if (percentDiff < 0.05) {
      // Values are close enough
      return { matches: true, llmMoreCredible: false };
    }
    // Large difference - trust the value with more decimal precision
    const precision1 = (value1.match(/\./g) || []).length;
    const precision2 = (value2.match(/\./g) || []).length;
    return {
      matches: false,
      llmMoreCredible: precision2 > precision1
    };
  }

  return { matches: false, llmMoreCredible: false };
}

/**
 * Parse a financial value string to a number
 */
function parseFinancialValue(value: string): number | null {
  const clean = value.replace(/[$,\s]/g, '');
  let multiplier = 1;
  if (/M$/i.test(clean)) {
    multiplier = 1000000;
  } else if (/K$/i.test(clean)) {
    multiplier = 1000;
  } else if (/B$/i.test(clean)) {
    multiplier = 1000000000;
  }
  const numStr = clean.replace(/[MKB]/i, '');
  const num = parseFloat(numStr);
  return isNaN(num) ? null : num * multiplier;
}
```
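A runnable sanity check of the value normaliser: the function is restated here in condensed form so the check is self-contained, with one small hardening (the suffix match is anchored with `$` so a letter inside the string is never stripped):

```typescript
// Condensed restatement of parseFinancialValue for a standalone check.
function parseFinancialValue(value: string): number | null {
  const clean = value.replace(/[$,\s]/g, '');
  let multiplier = 1;
  if (/M$/i.test(clean)) multiplier = 1_000_000;
  else if (/K$/i.test(clean)) multiplier = 1_000;
  else if (/B$/i.test(clean)) multiplier = 1_000_000_000;
  // Anchor to end of string so only a trailing suffix is removed
  const num = parseFloat(clean.replace(/[MKB]$/i, ''));
  return isNaN(num) ? null : num * multiplier;
}
```

Note that `"$45.2M"` normalises to roughly 45,200,000 (floating-point multiplication, so compare with a tolerance rather than `===` in tests), and non-numeric markers like `"N/A"` come back as `null`.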
#### 3.2: Integrate Validation into Processing

**File:** `backend/src/services/optimizedAgenticRAGProcessor.ts`

Add after line 1137 (after merging partial results):
```typescript
// ENHANCED: Cross-validate regex and LLM results
if (deterministicFinancials) {
  logger.info('Validating deterministic financials against LLM results');

  const { validateFinancialData } = await import('./financialDataValidator');
  const validation = validateFinancialData(deterministicFinancials, mergedData);

  logger.info('Validation results', {
    documentId,
    isValid: validation.isValid,
    confidence: validation.confidence,
    issueCount: validation.issues.length
  });

  // Use validated/corrected data
  if (validation.confidence > 0.7) {
    deterministicFinancials = validation.corrections;
    logger.info('Using validated corrections', {
      documentId,
      corrections: validation.corrections
    });
  }

  // Merge validated data
  this.mergeDeterministicFinancialData(mergedData, deterministicFinancials, documentId);
} else {
  logger.info('No deterministic financial data to validate', { documentId });
}
```
### Phase 4: Text Preprocessing Integration (1 hour)

#### 4.1: Apply Preprocessing to Document AI Text

**File:** `backend/src/services/documentAiProcessor.ts`

Add preprocessing before passing to RAG:
```typescript
// ADD import at top
import { preprocessText, extractTableTexts } from '../utils/textPreprocessor';

// UPDATE line 83 (processWithAgenticRAG method)
private async processWithAgenticRAG(documentId: string, extractedText: string): Promise<any> {
  try {
    logger.info('Processing extracted text with Agentic RAG', {
      documentId,
      textLength: extractedText.length
    });

    // ENHANCED: Preprocess text to identify table regions
    const preprocessed = preprocessText(extractedText);
    logger.info('Text preprocessing completed', {
      documentId,
      tableRegionsFound: preprocessed.tableRegions.length,
      likelyTableCount: preprocessed.metadata.likelyTableCount
    });

    // Extract table texts separately for better parsing
    const tableSections = extractTableTexts(preprocessed);

    // Import and use the optimized agentic RAG processor
    const { optimizedAgenticRAGProcessor } = await import('./optimizedAgenticRAGProcessor');
    const result = await optimizedAgenticRAGProcessor.processLargeDocument(
      documentId,
      extractedText,
      {
        preprocessedData: preprocessed,  // Pass preprocessing results
        tableSections: tableSections     // Pass isolated table texts
      }
    );
    return result;
  } catch (error) {
    // ... existing error handling
  }
}
```
## Expected Results

**Current State (Baseline):**
- Financial data extraction rate: 10-20%
- Typical result: "Not specified in CIM" for most fields

**After Phase 1 (Enhanced Regex):**
- Financial data extraction rate: 35-45%
- Improvement: better pattern matching catches more tables

**After Phase 2 (Enhanced LLM):**
- Financial data extraction rate: 65-75%
- Improvement: LLM sees financial tables more reliably

**After Phase 3 (Validation):**
- Financial data extraction rate: 75-85%
- Improvement: cross-validation fills gaps and corrects errors

**After Phase 4 (Preprocessing):**
- Financial data extraction rate: 80-90%
- Improvement: table structure preservation helps both regex and LLM
## Implementation Priority

**Start Here (Highest ROI):**
- Phase 2.1 - Financial Section Prioritization (30 min, +30% accuracy)
- Phase 2.2 - Increase LLM Context (15 min, +15% accuracy)
- Phase 2.3 - Enhanced Prompt (30 min, +20% accuracy)

Total: 1.5 hours for ~50-60% improvement

**Then Do:**
- Phase 1.2 - Enhanced Parser Patterns (1 hour, +10% accuracy)
- Phase 3.1-3.2 - Validation (1.5 hours, +10% accuracy)

Total: 4 hours for ~70-80% improvement

**Optional:**
- Phase 1.1, 4.1 - Text Preprocessing (2 hours, +10% accuracy)
## Testing Strategy

**Test 1: Baseline Measurement**

```bash
# Process 10 CIMs and record extraction rate
npm run test:pipeline
# Record: how many financial fields are populated?
```

**Test 2: After Each Phase**

```bash
# Same 10 CIMs, measure improvement
npm run test:pipeline
# Compare against baseline
```

**Test 3: Edge Cases**
- PDFs with rotated pages
- PDFs with merged table cells
- PDFs with multi-line headers
- Narrative-only financials (no tables)
## Rollback Plan

Each phase is additive and can be disabled via a feature flag:

```typescript
// config/env.ts
export const features = {
  enhancedRegexParsing: process.env.ENHANCED_REGEX === 'true',
  enhancedLLMContext: process.env.ENHANCED_LLM === 'true',
  financialValidation: process.env.VALIDATE_FINANCIALS === 'true',
  textPreprocessing: process.env.PREPROCESS_TEXT === 'true'
};
```

Set the matching variable to anything other than `true` (e.g. `ENHANCED_REGEX=false`, or leave it unset) to disable that phase.
## Success Metrics
| Metric | Current | Target | Measurement |
|---|---|---|---|
| Financial data extracted | 10-20% | 80-90% | % of fields populated |
| Processing time | 45s | <60s | End-to-end time |
| False positives | Unknown | <5% | Manual validation |
| Column misalignment | ~50% | <10% | Check FY mapping |
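For the "% of fields populated" measurement, one possible implementation (a sketch; the field names and the placeholder string follow this document, the function itself is hypothetical):

```typescript
// Completeness = filled cells / total cells over the period × field grid,
// where the "Not specified in CIM" placeholder counts as unfilled.
type PeriodData = Record<string, string | undefined>;

function completeness(periods: PeriodData[], fields: string[]): number {
  const total = periods.length * fields.length;
  if (total === 0) return 0;
  let filled = 0;
  for (const p of periods) {
    for (const f of fields) {
      const v = p[f];
      if (v && v !== 'Not specified in CIM') filled++;
    }
  }
  return filled / total;
}
```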
## Next Steps

1. Implement Phase 2 (Enhanced LLM) first - biggest impact, lowest risk
2. Test with 5-10 real CIM documents
3. Measure improvement
4. If >70% accuracy, stop. If not, add Phase 1 and 3.
5. Keep Phase 4 as optional enhancement
The LLM is actually very good at this - we just need to give it the right context!