Major release with significant performance improvements and a new processing strategy.

## Core Changes
- Implemented `simple_full_document` processing strategy (default)
- Full document → LLM approach: 1-2 passes, ~5-6 minutes processing time
- Achieved 100% completeness with 2 API calls (down from 5+)
- Removed redundant Document AI passes for faster processing

## Financial Data Extraction
- Enhanced deterministic financial table parser
- Improved FY3/FY2/FY1/LTM identification from varying CIM formats
- Automatic merging of parser results with LLM extraction

## Code Quality & Infrastructure
- Cleaned up debug logging (removed emoji markers from production code)
- Fixed Firebase Secrets configuration (using the modern `defineSecret` approach)
- Updated OpenAI API key
- Resolved deployment conflicts (secrets vs. environment variables)
- Added `.env` files to the Firebase ignore list

## Deployment
- Firebase Functions v2 deployment successful
- All 7 required secrets verified and configured
- Function URL: https://api-y56ccs6wva-uc.a.run.app

## Performance Improvements
- Processing time: ~5-6 minutes (down from 23+ minutes)
- API calls: 1-2 (down from 5+)
- Completeness: 100% achievable
- LLM model: claude-3-7-sonnet-latest

## Breaking Changes
- Default processing strategy changed to 'simple_full_document'
- RAG processor available as alternative strategy 'document_ai_agentic_rag'

## Files Changed
- 36 files changed, 5642 insertions(+), 4451 deletions(-)
- Removed deprecated documentation files
- Cleaned up unused services and models

This release represents a major refactoring focused on speed, accuracy, and maintainability.

---
# Financial Data Extraction: Hybrid Solution
## Better Regex + Enhanced LLM Approach

## Philosophy

Rather than a major architectural refactor, this solution enhances what's already working:

1. **Smarter regex** to catch more table patterns
2. **Better LLM context** to ensure financial tables are always seen
3. **Hybrid validation** where regex and LLM cross-check each other

---

## Problem Analysis (Refined)

### Current Issues:
1. **Regex is too strict** - Misses valid table formats
2. **LLM gets incomplete context** - Financial tables truncated or missing
3. **No cross-validation** - Regex and LLM don't verify each other
4. **Table structure lost** - But we can preserve it better with preprocessing

### Key Insight:
The LLM is actually VERY good at understanding financial tables, even in messy text. We just need to:
- Give it the RIGHT chunks (always include financial sections)
- Give it MORE context (increase chunk size for financial data)
- Give it BETTER formatting hints (preserve spacing/alignment where possible)

**When to use this hybrid track:** Rely on the telemetry described in `FINANCIAL_EXTRACTION_ANALYSIS.md` / `IMPLEMENTATION_PLAN.md`. If a document finishes Phase 1/2 processing with `tablesFound === 0` or `financialDataPopulated === false`, route it through the hybrid steps below so we only pay the extra cost when the structured-table path truly fails.
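That routing rule can be sketched as a small guard. The field names follow the telemetry flags above; the interface itself is a hypothetical shape, not an existing type:

```typescript
// Hypothetical telemetry shape emitted at the end of Phase 1/2 processing.
interface ExtractionTelemetry {
  tablesFound: number;
  financialDataPopulated: boolean;
}

// Route a document to the hybrid track only when the structured-table
// path produced nothing usable, so the extra LLM cost is paid rarely.
function shouldUseHybridTrack(t: ExtractionTelemetry): boolean {
  return t.tablesFound === 0 || t.financialDataPopulated === false;
}
```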
---

## Solution Architecture

### Three-Tier Extraction Strategy
```
Tier 1: Enhanced Regex Parser (Fast, Deterministic)
    ↓ (if successful)
    ✓ Use regex results
    ↓ (if incomplete/failed)

Tier 2: LLM with Enhanced Context (Powerful, Flexible)
    ↓ (extract from full financial sections)
    ✓ Fill in gaps from Tier 1
    ↓ (if still missing data)

Tier 3: LLM Deep Dive (Focused, Exhaustive)
    ↓ (targeted re-scan of entire document)
    ✓ Final gap-filling
```
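The fallthrough this diagram describes can be sketched as a tiny orchestrator. The `TierResult` shape and the function parameters are illustrative assumptions, not existing APIs:

```typescript
// Hypothetical result shape; "complete" means all required periods/metrics are present.
interface TierResult { complete: boolean; data: Record<string, string>; }

// Each tier runs only when the previous one left gaps; later tiers
// merge into earlier results, with deterministic values winning ties.
async function extractFinancials(
  tier1Regex: () => TierResult,
  tier2Llm: () => Promise<TierResult>,
  tier3DeepDive: () => Promise<TierResult>
): Promise<Record<string, string>> {
  const t1 = tier1Regex();
  if (t1.complete) return t1.data;           // Tier 1: fast, deterministic

  const t2 = await tier2Llm();               // Tier 2: LLM with enhanced context
  const merged = { ...t2.data, ...t1.data }; // regex values win where both exist
  if (t2.complete) return merged;

  const t3 = await tier3DeepDive();          // Tier 3: targeted full-document re-scan
  return { ...t3.data, ...merged };
}
```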
---

## Implementation Plan

## Phase 1: Enhanced Regex Parser (2-3 hours)

### 1.1: Improve Text Preprocessing

**Goal**: Preserve table structure better before regex parsing

**File**: Create `backend/src/utils/textPreprocessor.ts`
```typescript
/**
 * Enhanced text preprocessing to preserve table structures
 * Attempts to maintain column alignment from PDF extraction
 */

export interface PreprocessedText {
  original: string;
  enhanced: string;
  tableRegions: TextRegion[];
  metadata: {
    likelyTableCount: number;
    preservedAlignment: boolean;
  };
}

export interface TextRegion {
  start: number;
  end: number;
  type: 'table' | 'narrative' | 'header';
  confidence: number;
  content: string;
}

/**
 * Identify regions that look like tables based on formatting patterns
 */
export function identifyTableRegions(text: string): TextRegion[] {
  const regions: TextRegion[] = [];
  const lines = text.split('\n');

  let currentRegion: TextRegion | null = null;
  let regionStart = 0;
  let linePosition = 0;

  for (let i = 0; i < lines.length; i++) {
    const line = lines[i];
    const nextLine = lines[i + 1] || '';

    const isTableLike = detectTableLine(line, nextLine);

    if (isTableLike.isTable && !currentRegion) {
      // Start new table region
      currentRegion = {
        start: linePosition,
        end: linePosition + line.length,
        type: 'table',
        confidence: isTableLike.confidence,
        content: line
      };
      regionStart = i;
    } else if (isTableLike.isTable && currentRegion) {
      // Extend current table region
      currentRegion.end = linePosition + line.length;
      currentRegion.content += '\n' + line;
      currentRegion.confidence = Math.max(currentRegion.confidence, isTableLike.confidence);
    } else if (!isTableLike.isTable && currentRegion) {
      // End table region
      if (currentRegion.confidence > 0.5 && (i - regionStart) >= 3) {
        regions.push(currentRegion);
      }
      currentRegion = null;
    }

    linePosition += line.length + 1; // +1 for newline
  }

  // Add final region if it exists
  if (currentRegion && currentRegion.confidence > 0.5) {
    regions.push(currentRegion);
  }

  return regions;
}

/**
 * Detect if a line looks like part of a table
 */
function detectTableLine(line: string, nextLine: string): { isTable: boolean; confidence: number } {
  let score = 0;

  // Check for multiple aligned numbers
  const numberMatches = line.match(/\$?[\d,]+\.?\d*[KMB%]?/g);
  if (numberMatches && numberMatches.length >= 3) {
    score += 0.4; // Multiple numbers = likely table row
  }

  // Check for consistent spacing (indicates columns)
  const hasConsistentSpacing = /\s{2,}/.test(line); // 2+ spaces = column separator
  if (hasConsistentSpacing && numberMatches) {
    score += 0.3;
  }

  // Check for year/period patterns
  if (/\b(FY[-\s]?\d{1,2}|20\d{2}|LTM|TTM)\b/i.test(line)) {
    score += 0.3;
  }

  // Check for financial keywords
  if (/(revenue|ebitda|sales|profit|margin|growth)/i.test(line)) {
    score += 0.2;
  }

  // Bonus: Next line also looks like a table
  if (nextLine && /\$?[\d,]+\.?\d*[KMB%]?/.test(nextLine)) {
    score += 0.2;
  }

  return {
    isTable: score > 0.5,
    confidence: Math.min(score, 1.0)
  };
}

/**
 * Enhance text by preserving spacing in table regions
 */
export function preprocessText(text: string): PreprocessedText {
  const tableRegions = identifyTableRegions(text);

  // For now, return original text with identified regions
  // In the future, could normalize spacing, align columns, etc.

  return {
    original: text,
    enhanced: text, // TODO: Apply enhancement algorithms
    tableRegions,
    metadata: {
      likelyTableCount: tableRegions.length,
      preservedAlignment: true
    }
  };
}

/**
 * Extract just the table regions as separate texts
 */
export function extractTableTexts(preprocessed: PreprocessedText): string[] {
  return preprocessed.tableRegions
    .filter(region => region.type === 'table' && region.confidence > 0.6)
    .map(region => region.content);
}
```
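To see how the scoring heuristics behave, here is a standalone copy of `detectTableLine` (it is module-private above, so it is re-declared here for illustration) run against a table-like row and a narrative sentence:

```typescript
// Standalone copy of the detectTableLine heuristics, for illustration only.
function detectTableLine(line: string, nextLine: string): { isTable: boolean; confidence: number } {
  let score = 0;
  const numberMatches = line.match(/\$?[\d,]+\.?\d*[KMB%]?/g);
  if (numberMatches && numberMatches.length >= 3) score += 0.4;          // several numbers
  if (/\s{2,}/.test(line) && numberMatches) score += 0.3;                // column-like spacing
  if (/\b(FY[-\s]?\d{1,2}|20\d{2}|LTM|TTM)\b/i.test(line)) score += 0.3; // period labels
  if (/(revenue|ebitda|sales|profit|margin|growth)/i.test(line)) score += 0.2; // keywords
  if (nextLine && /\$?[\d,]+\.?\d*[KMB%]?/.test(nextLine)) score += 0.2; // table continues
  return { isTable: score > 0.5, confidence: Math.min(score, 1.0) };
}

const tableRow = detectTableLine(
  "Revenue   $45.2M   $52.8M   $61.2M",
  "EBITDA   $8.5M   $10.2M   $12.1M"
);
const narrative = detectTableLine(
  "The company operates in three regions.",
  "Management has a strong track record."
);
```

Numbers plus column spacing plus a financial keyword push the first line well past the 0.5 threshold; the narrative line scores zero.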
### 1.2: Enhance Financial Table Parser

**File**: `backend/src/services/financialTableParser.ts`

**Add new patterns to catch more variations:**
```typescript
// ENHANCED: More flexible period token regex (add around line 21).
// Note: JavaScript/TypeScript regexes have no free-spacing `x` flag, so the
// pattern is written on one line, with the alternatives documented here:
//   FY[-\s]?\d{1,2}            FY-1, FY 2, FY1, FY23, etc.
//   (FY[-\s]?)?20\d{2}[A-Z]*   2021, FY2022A, etc.
//   LTM|TTM                    trailing-twelve-month labels
//   CY\d{2}                    CY21, CY22
//   Q[1-4]\s*(FY|CY)?\d{2}     Q1 FY23, Q4CY22
const PERIOD_TOKEN_REGEX =
  /\b(?:FY[-\s]?\d{1,2}|(?:FY[-\s]?)?20\d{2}[A-Z]*|LTM|TTM|CY\d{2}|Q[1-4]\s*(?:FY|CY)?\d{2})\b/gi;

// ENHANCED: Better money regex to catch more formats (update line 22).
// Alternatives: "$1,234.5M" | "1,234.5M" | "(1,234.5M)" (negative) | plain numbers
const MONEY_REGEX =
  /(?:\$\s*[\d,]+(?:\.\d+)?(?:\s*[KMB])?|[\d,]+(?:\.\d+)?\s*[KMB]|\([\d,]+(?:\.\d+)?(?:\s*[KMB])?\)|[\d,]+(?:\.\d+)?)/g;

// ENHANCED: Better percentage regex (update line 23).
// Alternatives: "12.5%" or "(12.5%)" | "12.5 pct" | NM / N/A markers
const PERCENT_REGEX =
  /(?:\(?[\d,]+\.?\d*\s*%\)?|[\d,]+\.?\d*\s*pct|NM|N\/A)/gi;
```
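As a quick sanity check, a compact variant of the period-token alternatives (an illustration, not the production constant) picks up the usual header labels:

```typescript
// Compact illustration of the period-token alternatives described above.
const PERIOD_TOKEN =
  /\b(?:FY[-\s]?\d{1,2}|(?:FY[-\s]?)?20\d{2}[A-Z]*|LTM|TTM|CY\d{2}|Q[1-4]\s*(?:FY|CY)?\d{2})\b/gi;

// A sample header row mixing several labeling conventions.
const header = "FY2021 FY2022 FY2023 LTM TTM CY21 Q1 FY23";
const tokens = header.match(PERIOD_TOKEN) ?? [];
```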
**Add multi-pass header detection:**
```typescript
// ADD after line 278 (after current header detection)

// ENHANCED: Multi-pass header detection if first pass failed
if (bestHeaderIndex === -1) {
  logger.info('First pass header detection failed, trying relaxed patterns');

  // Second pass: Look for ANY line with 3+ numbers and a year pattern
  for (let i = 0; i < lines.length; i++) {
    const line = lines[i];
    const hasYearPattern = /20\d{2}|FY|LTM|TTM/i.test(line);
    const numberCount = (line.match(/[\d,]+/g) || []).length;

    if (hasYearPattern && numberCount >= 3) {
      // Look at the next 10 lines for financial keywords
      const lookAhead = lines.slice(i + 1, i + 11).join(' ');
      const hasFinancialKeywords = /revenue|ebitda|sales|profit/i.test(lookAhead);

      if (hasFinancialKeywords) {
        logger.info('Relaxed header detection found candidate', {
          headerIndex: i,
          headerLine: line.substring(0, 100)
        });

        // Try to parse this line as a header
        const tokens = tokenizePeriodHeaders(line);
        if (tokens.length >= 2) {
          bestHeaderIndex = i;
          bestBuckets = yearTokensToBuckets(tokens);
          bestHeaderScore = 50; // Lower confidence than primary detection
          break;
        }
      }
    }
  }
}
```
**Add fuzzy row matching:**
```typescript
// ENHANCED: Add after line 354 (in the row matching loop)
// If the exact match fails, try fuzzy matching

if (!ROW_MATCHERS[field].test(line)) {
  // Try fuzzy matching (partial matches, typos)
  const fuzzyMatch = fuzzyMatchFinancialRow(line, field);
  if (!fuzzyMatch) continue;
}

// ADD this helper function
function fuzzyMatchFinancialRow(line: string, field: string): boolean {
  const lineLower = line.toLowerCase();

  switch (field) {
    case 'revenue':
      return /rev\b|sales|top.?line/.test(lineLower);
    case 'ebitda':
      return /ebit|earnings.*operations|operating.*income/.test(lineLower);
    case 'grossProfit':
      return /gross.*profit|gp\b/.test(lineLower);
    case 'grossMargin':
      return /gross.*margin|gm\b|gross.*%/.test(lineLower);
    case 'ebitdaMargin':
      return /ebitda.*margin|ebitda.*%|margin.*ebitda/.test(lineLower);
    case 'revenueGrowth':
      return /revenue.*growth|growth.*revenue|rev.*growth|yoy|y.y/.test(lineLower);
    default:
      return false;
  }
}
```
---

## Phase 2: Enhanced LLM Context Delivery (2-3 hours)

### 2.1: Financial Section Prioritization

**File**: `backend/src/services/optimizedAgenticRAGProcessor.ts`

**Improve the `prioritizeFinancialChunks` method (around line 1265):**
```typescript
// ENHANCED: Much more aggressive financial chunk prioritization
private prioritizeFinancialChunks(chunks: ProcessingChunk[]): ProcessingChunk[] {
  const scoredChunks = chunks.map(chunk => {
    const content = chunk.content.toLowerCase();
    let score = 0;

    // TIER 1: Strong financial indicators (high score)
    const tier1Patterns = [
      /financial\s+summary/i,
      /historical\s+financials/i,
      /financial\s+performance/i,
      /income\s+statement/i,
      /financial\s+highlights/i,
    ];
    tier1Patterns.forEach(pattern => {
      if (pattern.test(content)) score += 100;
    });

    // TIER 2: Contains both periods AND metrics (very likely a financial table)
    const hasPeriods = /\b(20[12]\d|FY[-\s]?\d{1,2}|LTM|TTM)\b/i.test(content);
    const hasMetrics = /(revenue|ebitda|sales|profit|margin)/i.test(content);
    const hasNumbers = /\$[\d,]+|[\d,]+[KMB]/i.test(content);

    if (hasPeriods && hasMetrics && hasNumbers) {
      score += 80; // Very likely a financial table
    } else if (hasPeriods && hasMetrics) {
      score += 50;
    } else if (hasPeriods && hasNumbers) {
      score += 30;
    }

    // TIER 3: Multiple financial keywords
    const financialKeywords = [
      'revenue', 'ebitda', 'gross profit', 'margin', 'sales',
      'operating income', 'net income', 'cash flow', 'growth'
    ];
    const keywordMatches = financialKeywords.filter(kw => content.includes(kw)).length;
    score += keywordMatches * 5;

    // TIER 4: Has year progression (2021, 2022, 2023)
    const years = content.match(/20[12]\d/g);
    if (years && years.length >= 3) {
      score += 25; // Sequential years = likely financial table
    }

    // TIER 5: Multiple currency values
    const currencyMatches = content.match(/\$[\d,]+(?:\.\d+)?[KMB]?/gi);
    if (currencyMatches) {
      score += Math.min(currencyMatches.length * 3, 30);
    }

    // TIER 6: Section type boost
    if (chunk.sectionType && /financial|income|statement/i.test(chunk.sectionType)) {
      score += 40;
    }

    return { chunk, score };
  });

  // Sort by score and return
  const sorted = scoredChunks.sort((a, b) => b.score - a.score);

  // Log top financial chunks for debugging
  logger.info('Financial chunk prioritization results', {
    topScores: sorted.slice(0, 5).map(s => ({
      chunkIndex: s.chunk.chunkIndex,
      score: s.score,
      preview: s.chunk.content.substring(0, 100)
    }))
  });

  return sorted.map(s => s.chunk);
}
```
### 2.2: Increase Context for Financial Pass

**File**: `backend/src/services/optimizedAgenticRAGProcessor.ts`

**Update Pass 1 to use more chunks and larger context:**
```typescript
// ENHANCED: Update line 1259 (extractPass1CombinedMetadataFinancial)
// Change from 7 chunks to 12 chunks, and increase the character limit

const maxChunks = 12; // Was 7 - give the LLM more context for financials
const maxCharsPerChunk = 3000; // Was 1500 - don't truncate tables as aggressively

// And update line 1595 in extractWithTargetedQuery
const maxCharsPerChunk = options?.isFinancialPass ? 3000 : 1500;
```
### 2.3: Enhanced Financial Extraction Prompt

**File**: `backend/src/services/optimizedAgenticRAGProcessor.ts`

**Update the Pass 1 query (around lines 1196-1240) to be more explicit:**
```typescript
// ENHANCED: Much more detailed extraction instructions
const query = `Extract deal information, company metadata, and COMPREHENSIVE financial data.

CRITICAL FINANCIAL TABLE EXTRACTION INSTRUCTIONS:

I. LOCATE FINANCIAL TABLES
Look for sections titled: "Financial Summary", "Historical Financials", "Financial Performance",
"Income Statement", "P&L", "Key Metrics", "Financial Highlights", or similar.

Financial tables typically appear in these formats:

FORMAT 1 - Row-based:
                FY 2021    FY 2022    FY 2023    LTM
Revenue         $45.2M     $52.8M     $61.2M     $58.5M
Revenue Growth  N/A        16.8%      15.9%      (4.4%)
EBITDA          $8.5M      $10.2M     $12.1M     $11.5M

FORMAT 2 - Column-based:
Metric             | Value
-------------------|---------
FY21 Revenue       | $45.2M
FY22 Revenue       | $52.8M
FY23 Revenue       | $61.2M

FORMAT 3 - Inline:
Revenue grew from $45.2M in FY2021 to $52.8M in FY2022 (+16.8%) and $61.2M in FY2023 (+15.9%)

II. EXTRACTION RULES

1. PERIOD IDENTIFICATION
   - FY-3, FY-2, FY-1 = Three most recent FULL fiscal years (not projections)
   - LTM/TTM = Most recent 12-month period
   - Map year labels: If you see "FY2021, FY2022, FY2023, LTM Sep'23", then:
     * FY2021 → fy3
     * FY2022 → fy2
     * FY2023 → fy1
     * LTM Sep'23 → ltm

2. VALUE EXTRACTION
   - Extract EXACT values as shown: "$45.2M", "16.8%", etc.
   - Preserve formatting: "$45.2M" not "45.2" or "45200000"
   - Include negative indicators: "(4.4%)" or "-4.4%"
   - Use "N/A" or "NM" if explicitly stated (not "Not specified")

3. METRIC IDENTIFICATION
   - Revenue = "Revenue", "Net Sales", "Total Sales", "Top Line"
   - EBITDA = "EBITDA", "Adjusted EBITDA", "Adj. EBITDA"
   - Margins = Look for "%" after metric name
   - Growth = "Growth %", "YoY", "Y/Y", "Change %"

4. DEAL OVERVIEW
   - Extract: company name, industry, geography, transaction type
   - Extract: employee count, deal source, reason for sale
   - Extract: CIM dates and metadata

III. QUALITY CHECKS

Before submitting your response:
- [ ] Did I find at least 3 distinct fiscal periods?
- [ ] Do I have Revenue AND EBITDA for at least 2 periods?
- [ ] Did I preserve exact number formats from the document?
- [ ] Did I map the periods correctly (newest = fy1, oldest = fy3)?

IV. WHAT TO DO IF TABLE IS UNCLEAR

If the table is hard to parse:
- Include the ENTIRE table section in your analysis
- Extract what you can with confidence
- Mark unclear values as "Not specified in CIM" only if truly absent
- DO NOT guess or interpolate values

V. ADDITIONAL FINANCIAL DATA

Also extract:
- Quality of earnings notes
- EBITDA adjustments and add-backs
- Revenue growth drivers
- Margin trends and analysis
- CapEx requirements
- Working capital needs
- Free cash flow comments`;
```
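The period-mapping rule in section II.1 can also be captured in a small post-processing helper. This is a sketch: it assumes column labels arrive oldest-to-newest (as in the row-based format above), and `PeriodSlots` is a hypothetical type, not an existing schema:

```typescript
// Hypothetical slot type matching the fy3/fy2/fy1/ltm convention above.
type PeriodSlots = { fy3?: string; fy2?: string; fy1?: string; ltm?: string };

// Map ordered column labels to slots: LTM/TTM columns go to `ltm`, and the
// remaining labels fill fy1 (newest) back through fy3 (oldest).
function mapPeriodLabels(labels: string[]): PeriodSlots {
  const slots: PeriodSlots = {};
  const fiscalYears = labels.filter(l => {
    if (/\b(LTM|TTM)\b/i.test(l)) { slots.ltm = l; return false; }
    return true;
  });
  const recent = fiscalYears.slice(-3); // at most three full fiscal years
  const names = ["fy3", "fy2", "fy1"] as const;
  recent.forEach((label, i) => {
    slots[names[names.length - recent.length + i]] = label;
  });
  return slots;
}
```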
---

## Phase 3: Hybrid Validation & Cross-Checking (1-2 hours)

### 3.1: Create Validation Layer

**File**: Create `backend/src/services/financialDataValidator.ts`
```typescript
import { logger } from '../utils/logger';
import type { ParsedFinancials } from './financialTableParser';
import type { CIMReview } from './llmSchemas';

export interface ValidationResult {
  isValid: boolean;
  confidence: number;
  issues: string[];
  corrections: ParsedFinancials;
}

/**
 * Cross-validate financial data from multiple sources
 */
export function validateFinancialData(
  regexResult: ParsedFinancials,
  llmResult: Partial<CIMReview>
): ValidationResult {
  const issues: string[] = [];
  const corrections: ParsedFinancials = { ...regexResult };
  let confidence = 1.0;

  // Extract LLM financials
  const llmFinancials = llmResult.financialSummary?.financials;

  if (!llmFinancials) {
    return {
      isValid: true,
      confidence: 0.5,
      issues: ['No LLM financial data to validate against'],
      corrections: regexResult
    };
  }

  // Validate each period
  const periods: Array<keyof ParsedFinancials> = ['fy3', 'fy2', 'fy1', 'ltm'];

  for (const period of periods) {
    const regexPeriod = regexResult[period];
    const llmPeriod = llmFinancials[period];

    if (!llmPeriod) continue;

    // Compare revenue
    if (regexPeriod.revenue && llmPeriod.revenue) {
      const match = compareFinancialValues(regexPeriod.revenue, llmPeriod.revenue);
      if (!match.matches) {
        issues.push(`${period} revenue mismatch: Regex="${regexPeriod.revenue}" vs LLM="${llmPeriod.revenue}"`);
        confidence -= 0.1;

        // Trust the LLM if the regex value looks suspicious
        if (match.llmMoreCredible) {
          corrections[period].revenue = llmPeriod.revenue;
        }
      }
    } else if (!regexPeriod.revenue && llmPeriod.revenue && llmPeriod.revenue !== 'Not specified in CIM') {
      // Regex missed it, LLM found it
      corrections[period].revenue = llmPeriod.revenue;
      issues.push(`${period} revenue: Regex missed, using LLM value: ${llmPeriod.revenue}`);
    }

    // Compare EBITDA
    if (regexPeriod.ebitda && llmPeriod.ebitda) {
      const match = compareFinancialValues(regexPeriod.ebitda, llmPeriod.ebitda);
      if (!match.matches) {
        issues.push(`${period} EBITDA mismatch: Regex="${regexPeriod.ebitda}" vs LLM="${llmPeriod.ebitda}"`);
        confidence -= 0.1;

        if (match.llmMoreCredible) {
          corrections[period].ebitda = llmPeriod.ebitda;
        }
      }
    } else if (!regexPeriod.ebitda && llmPeriod.ebitda && llmPeriod.ebitda !== 'Not specified in CIM') {
      corrections[period].ebitda = llmPeriod.ebitda;
      issues.push(`${period} EBITDA: Regex missed, using LLM value: ${llmPeriod.ebitda}`);
    }

    // Fill in other fields from the LLM if regex didn't get them
    const fields: Array<keyof typeof regexPeriod> = [
      'revenueGrowth', 'grossProfit', 'grossMargin', 'ebitdaMargin'
    ];

    for (const field of fields) {
      if (!regexPeriod[field] && llmPeriod[field] && llmPeriod[field] !== 'Not specified in CIM') {
        corrections[period][field] = llmPeriod[field];
      }
    }
  }

  logger.info('Financial data validation completed', {
    confidence,
    issueCount: issues.length,
    issues: issues.slice(0, 5)
  });

  return {
    isValid: confidence > 0.6,
    confidence,
    issues,
    corrections
  };
}

/**
 * Compare two financial values to see if they match
 */
function compareFinancialValues(
  value1: string,
  value2: string
): { matches: boolean; llmMoreCredible: boolean } {
  const clean1 = value1.replace(/[$,\s]/g, '').toUpperCase();
  const clean2 = value2.replace(/[$,\s]/g, '').toUpperCase();

  // Exact match
  if (clean1 === clean2) {
    return { matches: true, llmMoreCredible: false };
  }

  // Check if the numeric values are close (within 5%)
  const num1 = parseFinancialValue(value1);
  const num2 = parseFinancialValue(value2);

  if (num1 && num2) {
    const percentDiff = Math.abs((num1 - num2) / num1);
    if (percentDiff < 0.05) {
      // Values are close enough
      return { matches: true, llmMoreCredible: false };
    }

    // Large difference - trust the value with more precision
    const precision1 = (value1.match(/\./g) || []).length;
    const precision2 = (value2.match(/\./g) || []).length;

    return {
      matches: false,
      llmMoreCredible: precision2 > precision1
    };
  }

  return { matches: false, llmMoreCredible: false };
}

/**
 * Parse a financial value string to a number
 */
function parseFinancialValue(value: string): number | null {
  const clean = value.replace(/[$,\s]/g, '');

  let multiplier = 1;
  if (/M$/i.test(clean)) {
    multiplier = 1000000;
  } else if (/K$/i.test(clean)) {
    multiplier = 1000;
  } else if (/B$/i.test(clean)) {
    multiplier = 1000000000;
  }

  const numStr = clean.replace(/[MKB]/i, '');
  const num = parseFloat(numStr);

  return isNaN(num) ? null : num * multiplier;
}
```
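A quick standalone check of the value-normalization behaviour (the same logic as `parseFinancialValue` above, re-declared here since it is module-private, with the suffix match anchored to the end of the string):

```typescript
// Standalone copy of parseFinancialValue, for illustration only.
function parseFinancialValue(value: string): number | null {
  const clean = value.replace(/[$,\s]/g, "");
  let multiplier = 1;
  if (/M$/i.test(clean)) multiplier = 1_000_000;
  else if (/K$/i.test(clean)) multiplier = 1_000;
  else if (/B$/i.test(clean)) multiplier = 1_000_000_000;
  const num = parseFloat(clean.replace(/[MKB]$/i, ""));
  return isNaN(num) ? null : num * multiplier;
}
```

Dollar signs, commas, and whitespace are stripped; a trailing K/M/B scales the result; anything non-numeric (like "N/A") maps to `null` so the validator skips it.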
### 3.2: Integrate Validation into Processing

**File**: `backend/src/services/optimizedAgenticRAGProcessor.ts`

**Add after line 1137 (after merging partial results):**
```typescript
// ENHANCED: Cross-validate regex and LLM results
if (deterministicFinancials) {
  logger.info('Validating deterministic financials against LLM results');

  const { validateFinancialData } = await import('./financialDataValidator');
  const validation = validateFinancialData(deterministicFinancials, mergedData);

  logger.info('Validation results', {
    documentId,
    isValid: validation.isValid,
    confidence: validation.confidence,
    issueCount: validation.issues.length
  });

  // Use validated/corrected data
  if (validation.confidence > 0.7) {
    deterministicFinancials = validation.corrections;
    logger.info('Using validated corrections', {
      documentId,
      corrections: validation.corrections
    });
  }

  // Merge validated data
  this.mergeDeterministicFinancialData(mergedData, deterministicFinancials, documentId);
} else {
  logger.info('No deterministic financial data to validate', { documentId });
}
```
---

## Phase 4: Text Preprocessing Integration (1 hour)

### 4.1: Apply Preprocessing to Document AI Text

**File**: `backend/src/services/documentAiProcessor.ts`

**Add preprocessing before passing to RAG:**
```typescript
// ADD import at top
import { preprocessText, extractTableTexts } from '../utils/textPreprocessor';

// UPDATE line 83 (processWithAgenticRAG method)
private async processWithAgenticRAG(documentId: string, extractedText: string): Promise<any> {
  try {
    logger.info('Processing extracted text with Agentic RAG', {
      documentId,
      textLength: extractedText.length
    });

    // ENHANCED: Preprocess text to identify table regions
    const preprocessed = preprocessText(extractedText);

    logger.info('Text preprocessing completed', {
      documentId,
      tableRegionsFound: preprocessed.tableRegions.length,
      likelyTableCount: preprocessed.metadata.likelyTableCount
    });

    // Extract table texts separately for better parsing
    const tableSections = extractTableTexts(preprocessed);

    // Import and use the optimized agentic RAG processor
    const { optimizedAgenticRAGProcessor } = await import('./optimizedAgenticRAGProcessor');

    const result = await optimizedAgenticRAGProcessor.processLargeDocument(
      documentId,
      extractedText,
      {
        preprocessedData: preprocessed, // Pass preprocessing results
        tableSections: tableSections    // Pass isolated table texts
      }
    );

    return result;
  } catch (error) {
    // ... existing error handling
  }
}
```
---

## Expected Results

### Current State (Baseline):
```
Financial data extraction rate: 10-20%
Typical result: "Not specified in CIM" for most fields
```

### After Phase 1 (Enhanced Regex):
```
Financial data extraction rate: 35-45%
Improvement: Better pattern matching catches more tables
```

### After Phase 2 (Enhanced LLM):
```
Financial data extraction rate: 65-75%
Improvement: LLM sees financial tables more reliably
```

### After Phase 3 (Validation):
```
Financial data extraction rate: 75-85%
Improvement: Cross-validation fills gaps and corrects errors
```

### After Phase 4 (Preprocessing):
```
Financial data extraction rate: 80-90%
Improvement: Table structure preservation helps both regex and LLM
```
---
|
||
|
||
## Implementation Priority
|
||
|
||
### Start Here (Highest ROI):
|
||
1. **Phase 2.1** - Financial Section Prioritization (30 min, +30% accuracy)
|
||
2. **Phase 2.2** - Increase LLM Context (15 min, +15% accuracy)
|
||
3. **Phase 2.3** - Enhanced Prompt (30 min, +20% accuracy)
|
||
|
||
**Total: 1.5 hours for ~50-60% improvement**
|
||
|
||
### Then Do:
|
||
4. **Phase 1.2** - Enhanced Parser Patterns (1 hour, +10% accuracy)
|
||
5. **Phase 3.1-3.2** - Validation (1.5 hours, +10% accuracy)
|
||
|
||
**Total: 4 hours for ~70-80% improvement**
|
||
|
||
### Optional:
|
||
6. **Phase 1.1, 4.1** - Text Preprocessing (2 hours, +10% accuracy)
|
||
|
||
---
|
||
|
||
## Testing Strategy
|
||
|
||
### Test 1: Baseline Measurement
|
||
```bash
|
||
# Process 10 CIMs and record extraction rate
|
||
npm run test:pipeline
|
||
# Record: How many financial fields are populated?
|
||
```
|
||
|
||
### Test 2: After Each Phase
|
||
```bash
|
||
# Same 10 CIMs, measure improvement
|
||
npm run test:pipeline
|
||
# Compare against baseline
|
||
```
|
||
|
||
### Test 3: Edge Cases
|
||
- PDFs with rotated pages
|
||
- PDFs with merged table cells
|
||
- PDFs with multi-line headers
|
||
- Narrative-only financials (no tables)
|
||
|
||
---
|
||
|
||
## Rollback Plan
|
||
|
||
Each phase is additive and can be disabled via feature flags:
|
||
|
||
```typescript
|
||
// config/env.ts
|
||
export const features = {
|
||
enhancedRegexParsing: process.env.ENHANCED_REGEX === 'true',
|
||
enhancedLLMContext: process.env.ENHANCED_LLM === 'true',
|
||
financialValidation: process.env.VALIDATE_FINANCIALS === 'true',
|
||
textPreprocessing: process.env.PREPROCESS_TEXT === 'true'
|
||
};
|
||
```
|
||
|
||
Set `ENHANCED_REGEX=false` to disable any phase.
---

## Success Metrics

| Metric | Current | Target | Measurement |
|--------|---------|--------|-------------|
| Financial data extracted | 10-20% | 80-90% | % of fields populated |
| Processing time | 45s | <60s | End-to-end time |
| False positives | Unknown | <5% | Manual validation |
| Column misalignment | ~50% | <10% | Check FY mapping |

---

## Next Steps

1. Implement Phase 2 (Enhanced LLM) first - biggest impact, lowest risk
2. Test with 5-10 real CIM documents
3. Measure improvement
4. If >70% accuracy, stop. If not, add Phase 1 and 3.
5. Keep Phase 4 as an optional enhancement

The LLM is actually very good at this - we just need to give it the right context!