cim_summary/HYBRID_SOLUTION.md
admin 9c916d12f4 feat: Production release v2.0.0 - Simple Document Processor
Major release with significant performance improvements and new processing strategy.

## Core Changes
- Implemented simple_full_document processing strategy (default)
- Full document → LLM approach: 1-2 passes, ~5-6 minutes processing time
- Achieved 100% completeness with 2 API calls (down from 5+)
- Removed redundant Document AI passes for faster processing

## Financial Data Extraction
- Enhanced deterministic financial table parser
- Improved FY3/FY2/FY1/LTM identification from varying CIM formats
- Automatic merging of parser results with LLM extraction

## Code Quality & Infrastructure
- Cleaned up debug logging (removed emoji markers from production code)
- Fixed Firebase Secrets configuration (using modern defineSecret approach)
- Updated OpenAI API key
- Resolved deployment conflicts (secrets vs environment variables)
- Added .env files to Firebase ignore list

## Deployment
- Firebase Functions v2 deployment successful
- All 7 required secrets verified and configured
- Function URL: https://api-y56ccs6wva-uc.a.run.app

## Performance Improvements
- Processing time: ~5-6 minutes (down from 23+ minutes)
- API calls: 1-2 (down from 5+)
- Completeness: 100% achievable
- LLM Model: claude-3-7-sonnet-latest

## Breaking Changes
- Default processing strategy changed to 'simple_full_document'
- RAG processor available as alternative strategy 'document_ai_agentic_rag'

## Files Changed
- 36 files changed, 5642 insertions(+), 4451 deletions(-)
- Removed deprecated documentation files
- Cleaned up unused services and models

This release represents a major refactoring focused on speed, accuracy, and maintainability.
2025-11-09 21:07:22 -05:00


Financial Data Extraction: Hybrid Solution

Better Regex + Enhanced LLM Approach

Philosophy

Rather than a major architectural refactor, this solution enhances what's already working:

  1. Smarter regex to catch more table patterns
  2. Better LLM context to ensure financial tables are always seen
  3. Hybrid validation where regex and LLM cross-check each other

Problem Analysis (Refined)

Current Issues:

  1. Regex is too strict - Misses valid table formats
  2. LLM gets incomplete context - Financial tables truncated or missing
  3. No cross-validation - Regex and LLM don't verify each other
  4. Table structure lost - But we can preserve it better with preprocessing

Key Insight:

The LLM is actually VERY good at understanding financial tables, even in messy text. We just need to:

  • Give it the RIGHT chunks (always include financial sections)
  • Give it MORE context (increase chunk size for financial data)
  • Give it BETTER formatting hints (preserve spacing/alignment where possible)

When to use this hybrid track: Rely on the telemetry described in FINANCIAL_EXTRACTION_ANALYSIS.md / IMPLEMENTATION_PLAN.md. If a document finishes Phase 1/2 processing with tablesFound === 0 or financialDataPopulated === false, route it through the hybrid steps below, so the extra cost is only paid when the structured-table path truly fails.
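The routing rule above can be written as a small predicate. The telemetry field names (`tablesFound`, `financialDataPopulated`) come from the analysis docs; the `PipelineTelemetry` interface itself is an assumption for illustration:

```typescript
// Sketch: decide whether a document should take the hybrid track.
// Field names follow FINANCIAL_EXTRACTION_ANALYSIS.md; the interface
// shape is illustrative, not the real pipeline type.
interface PipelineTelemetry {
  tablesFound: number;
  financialDataPopulated: boolean;
}

function needsHybridFallback(t: PipelineTelemetry): boolean {
  // Only pay the extra LLM cost when the structured-table path failed.
  return t.tablesFound === 0 || t.financialDataPopulated === false;
}
```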


Solution Architecture

Three-Tier Extraction Strategy

Tier 1: Enhanced Regex Parser (Fast, Deterministic)
         ↓ (if successful)
         ✓ Use regex results
         ↓ (if incomplete/failed)

Tier 2: LLM with Enhanced Context (Powerful, Flexible)
         ↓ (extract from full financial sections)
         ✓ Fill in gaps from Tier 1
         ↓ (if still missing data)

Tier 3: LLM Deep Dive (Focused, Exhaustive)
         ↓ (targeted re-scan of entire document)
         ✓ Final gap-filling
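The cascade above can be sketched as a gap-filling fallback. The `Financials` type and tier signatures are illustrative stand-ins, not the real pipeline API:

```typescript
// Sketch of the three-tier fallback: each tier runs only if the
// previous one left gaps, and earlier tiers always win on conflicts.
type Financials = { revenue?: string; ebitda?: string };

const isComplete = (f: Financials): boolean => Boolean(f.revenue && f.ebitda);

// Earlier tiers win; later tiers only fill fields the earlier ones missed.
const fillGaps = (base: Financials, extra: Financials): Financials =>
  ({ ...extra, ...base });

function extractTiered(
  tier1: () => Financials,  // enhanced regex parser
  tier2: () => Financials,  // LLM on financial sections
  tier3: () => Financials   // targeted full-document re-scan
): Financials {
  let result = tier1();
  if (isComplete(result)) return result;
  result = fillGaps(result, tier2());
  if (isComplete(result)) return result;
  return fillGaps(result, tier3());
}
```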

Implementation Plan

Phase 1: Enhanced Regex Parser (2-3 hours)

1.1: Improve Text Preprocessing

Goal: Preserve table structure better before regex parsing

File: Create backend/src/utils/textPreprocessor.ts

/**
 * Enhanced text preprocessing to preserve table structures
 * Attempts to maintain column alignment from PDF extraction
 */

export interface PreprocessedText {
  original: string;
  enhanced: string;
  tableRegions: TextRegion[];
  metadata: {
    likelyTableCount: number;
    preservedAlignment: boolean;
  };
}

export interface TextRegion {
  start: number;
  end: number;
  type: 'table' | 'narrative' | 'header';
  confidence: number;
  content: string;
}

/**
 * Identify regions that look like tables based on formatting patterns
 */
export function identifyTableRegions(text: string): TextRegion[] {
  const regions: TextRegion[] = [];
  const lines = text.split('\n');

  let currentRegion: TextRegion | null = null;
  let regionStart = 0;
  let linePosition = 0;

  for (let i = 0; i < lines.length; i++) {
    const line = lines[i];
    const nextLine = lines[i + 1] || '';

    const isTableLike = detectTableLine(line, nextLine);

    if (isTableLike.isTable && !currentRegion) {
      // Start new table region
      currentRegion = {
        start: linePosition,
        end: linePosition + line.length,
        type: 'table',
        confidence: isTableLike.confidence,
        content: line
      };
      regionStart = i;
    } else if (isTableLike.isTable && currentRegion) {
      // Extend current table region
      currentRegion.end = linePosition + line.length;
      currentRegion.content += '\n' + line;
      currentRegion.confidence = Math.max(currentRegion.confidence, isTableLike.confidence);
    } else if (!isTableLike.isTable && currentRegion) {
      // End table region
      if (currentRegion.confidence > 0.5 && (i - regionStart) >= 3) {
        regions.push(currentRegion);
      }
      currentRegion = null;
    }

    linePosition += line.length + 1; // +1 for newline
  }

  // Add final region if it exists (apply the same minimum-length check)
  if (currentRegion && currentRegion.confidence > 0.5 && (lines.length - regionStart) >= 3) {
    regions.push(currentRegion);
  }

  return regions;
}

/**
 * Detect if a line looks like part of a table
 */
function detectTableLine(line: string, nextLine: string): { isTable: boolean; confidence: number } {
  let score = 0;

  // Check for multiple aligned numbers
  const numberMatches = line.match(/\$?[\d,]+\.?\d*[KMB%]?/g);
  if (numberMatches && numberMatches.length >= 3) {
    score += 0.4; // Multiple numbers = likely table row
  }

  // Check for consistent spacing (indicates columns)
  const hasConsistentSpacing = /\s{2,}/.test(line); // 2+ spaces = column separator
  if (hasConsistentSpacing && numberMatches) {
    score += 0.3;
  }

  // Check for year/period patterns
  if (/\b(FY[-\s]?\d{1,2}|20\d{2}|LTM|TTM)\b/i.test(line)) {
    score += 0.3;
  }

  // Check for financial keywords
  if (/(revenue|ebitda|sales|profit|margin|growth)/i.test(line)) {
    score += 0.2;
  }

  // Bonus: Next line also looks like a table
  if (nextLine && /\$?[\d,]+\.?\d*[KMB%]?/.test(nextLine)) {
    score += 0.2;
  }

  return {
    isTable: score > 0.5,
    confidence: Math.min(score, 1.0)
  };
}

/**
 * Enhance text by preserving spacing in table regions
 */
export function preprocessText(text: string): PreprocessedText {
  const tableRegions = identifyTableRegions(text);

  // For now, return original text with identified regions
  // In the future, could normalize spacing, align columns, etc.

  return {
    original: text,
    enhanced: text, // TODO: Apply enhancement algorithms
    tableRegions,
    metadata: {
      likelyTableCount: tableRegions.length,
      preservedAlignment: true
    }
  };
}

/**
 * Extract just the table regions as separate texts
 */
export function extractTableTexts(preprocessed: PreprocessedText): string[] {
  return preprocessed.tableRegions
    .filter(region => region.type === 'table' && region.confidence > 0.6)
    .map(region => region.content);
}

1.2: Enhance Financial Table Parser

File: backend/src/services/financialTableParser.ts

Add new patterns to catch more variations:

// ENHANCED: More flexible period token regex (add around line 21)
// Note: JavaScript regexes do not support the free-spacing `x` flag,
// so each pattern is written on one line with its cases noted above it.
// Matches: FY-1, FY 2, 2021, FY2022A, LTM, TTM, CY21, Q1 FY23, Q4 2022
const PERIOD_TOKEN_REGEX =
  /\b(?:FY[-\s]?\d{1,2}|(?:FY[-\s]?)?20\d{2}[A-Z]*|LTM|TTM|CY\d{2}|Q[1-4]\s*(?:FY|CY)?\d{2,4})\b/gi;

// ENHANCED: Better money regex to catch more formats (update line 22)
// Matches: $1,234.5M | 1,234.5M | (1,234.5M) as negative | plain numbers
const MONEY_REGEX =
  /(?:\$\s*[\d,]+(?:\.\d+)?(?:\s*[KMB])?|[\d,]+(?:\.\d+)?\s*[KMB]|\([\d,]+(?:\.\d+)?(?:\s*[KMB])?\)|[\d,]+(?:\.\d+)?)/g;

// ENHANCED: Better percentage regex (update line 23)
// Matches: 12.5% or (12.5%) | 12.5 pct | NM / N/A
const PERCENT_REGEX = /(?:\(?[\d,]+\.?\d*\s*%\)?|[\d,]+\.?\d*\s*pct|NM|N\/A)/gi;
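A quick sanity check of the flexible period-token pattern against typical header strings. The regex is restated inline (JavaScript regexes have no free-spacing `x` flag, so the alternatives sit on one line); `tokenize` is a hypothetical helper:

```typescript
// Exercise the period-token pattern against realistic CIM header lines.
const PERIOD_TOKENS =
  /\b(?:FY[-\s]?\d{1,2}|(?:FY[-\s]?)?20\d{2}[A-Z]*|LTM|TTM|CY\d{2}|Q[1-4]\s*(?:FY|CY)?\d{2,4})\b/gi;

function tokenize(header: string): string[] {
  return header.match(PERIOD_TOKENS) ?? [];
}
```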

Add multi-pass header detection:

// ADD after line 278 (after current header detection)

// ENHANCED: Multi-pass header detection if first pass failed
if (bestHeaderIndex === -1) {
  logger.info('First pass header detection failed, trying relaxed patterns');

  // Second pass: Look for ANY line with 3+ numbers and a year pattern
  for (let i = 0; i < lines.length; i++) {
    const line = lines[i];
    const hasYearPattern = /20\d{2}|FY|LTM|TTM/i.test(line);
    const numberCount = (line.match(/[\d,]+/g) || []).length;

    if (hasYearPattern && numberCount >= 3) {
      // Look at next 10 lines for financial keywords
      const lookAhead = lines.slice(i + 1, i + 11).join(' ');
      const hasFinancialKeywords = /revenue|ebitda|sales|profit/i.test(lookAhead);

      if (hasFinancialKeywords) {
        logger.info('Relaxed header detection found candidate', {
          headerIndex: i,
          headerLine: line.substring(0, 100)
        });

        // Try to parse this as header
        const tokens = tokenizePeriodHeaders(line);
        if (tokens.length >= 2) {
          bestHeaderIndex = i;
          bestBuckets = yearTokensToBuckets(tokens);
          bestHeaderScore = 50; // Lower confidence than primary detection
          break;
        }
      }
    }
  }
}

Add fuzzy row matching:

// ENHANCED: Add after line 354 (in the row matching loop)
// If exact match fails, try fuzzy matching

if (!ROW_MATCHERS[field].test(line)) {
  // Try fuzzy matching (partial matches, typos)
  const fuzzyMatch = fuzzyMatchFinancialRow(line, field);
  if (!fuzzyMatch) continue;
}

// ADD this helper function
function fuzzyMatchFinancialRow(line: string, field: string): boolean {
  const lineLower = line.toLowerCase();

  switch (field) {
    case 'revenue':
      return /rev\b|sales|top.?line/.test(lineLower);
    case 'ebitda':
      return /ebit|earnings.*operations|operating.*income/.test(lineLower);
    case 'grossProfit':
      return /gross.*profit|gp\b/.test(lineLower);
    case 'grossMargin':
      return /gross.*margin|gm\b|gross.*%/.test(lineLower);
    case 'ebitdaMargin':
      return /ebitda.*margin|ebitda.*%|margin.*ebitda/.test(lineLower);
    case 'revenueGrowth':
      return /revenue.*growth|growth.*revenue|rev.*growth|yoy|y.y/.test(lineLower);
    default:
      return false;
  }
}

Phase 2: Enhanced LLM Context Delivery (2-3 hours)

2.1: Financial Section Prioritization

File: backend/src/services/optimizedAgenticRAGProcessor.ts

Improve the prioritizeFinancialChunks method (around line 1265):

// ENHANCED: Much more aggressive financial chunk prioritization
private prioritizeFinancialChunks(chunks: ProcessingChunk[]): ProcessingChunk[] {
  const scoredChunks = chunks.map(chunk => {
    const content = chunk.content.toLowerCase();
    let score = 0;

    // TIER 1: Strong financial indicators (high score)
    const tier1Patterns = [
      /financial\s+summary/i,
      /historical\s+financials/i,
      /financial\s+performance/i,
      /income\s+statement/i,
      /financial\s+highlights/i,
    ];
    tier1Patterns.forEach(pattern => {
      if (pattern.test(content)) score += 100;
    });

    // TIER 2: Contains both periods AND metrics (very likely financial table)
    const hasPeriods = /\b(20[12]\d|FY[-\s]?\d{1,2}|LTM|TTM)\b/i.test(content);
    const hasMetrics = /(revenue|ebitda|sales|profit|margin)/i.test(content);
    const hasNumbers = /\$[\d,]+|[\d,]+[KMB]/i.test(content);

    if (hasPeriods && hasMetrics && hasNumbers) {
      score += 80; // Very likely financial table
    } else if (hasPeriods && hasMetrics) {
      score += 50;
    } else if (hasPeriods && hasNumbers) {
      score += 30;
    }

    // TIER 3: Multiple financial keywords
    const financialKeywords = [
      'revenue', 'ebitda', 'gross profit', 'margin', 'sales',
      'operating income', 'net income', 'cash flow', 'growth'
    ];
    const keywordMatches = financialKeywords.filter(kw => content.includes(kw)).length;
    score += keywordMatches * 5;

    // TIER 4: Has year progression (2021, 2022, 2023)
    const years = content.match(/20[12]\d/g);
    if (years && years.length >= 3) {
      score += 25; // Sequential years = likely financial table
    }

    // TIER 5: Multiple currency values
    const currencyMatches = content.match(/\$[\d,]+(?:\.\d+)?[KMB]?/gi);
    if (currencyMatches) {
      score += Math.min(currencyMatches.length * 3, 30);
    }

    // TIER 6: Section type boost
    if (chunk.sectionType && /financial|income|statement/i.test(chunk.sectionType)) {
      score += 40;
    }

    return { chunk, score };
  });

  // Sort by score and return
  const sorted = scoredChunks.sort((a, b) => b.score - a.score);

  // Log top financial chunks for debugging
  logger.info('Financial chunk prioritization results', {
    topScores: sorted.slice(0, 5).map(s => ({
      chunkIndex: s.chunk.chunkIndex,
      score: s.score,
      preview: s.chunk.content.substring(0, 100)
    }))
  });

  return sorted.map(s => s.chunk);
}

2.2: Increase Context for Financial Pass

File: backend/src/services/optimizedAgenticRAGProcessor.ts

Update Pass 1 to use more chunks and larger context:

// ENHANCED: Update line 1259 (extractPass1CombinedMetadataFinancial)
// Change from 7 chunks to 12 chunks, and increase character limit

const maxChunks = 12; // Was 7 - give LLM more context for financials
const maxCharsPerChunk = 3000; // Was 1500 - don't truncate tables as aggressively

// And update line 1595 in extractWithTargetedQuery
const maxCharsPerChunk = options?.isFinancialPass ? 3000 : 1500;

2.3: Enhanced Financial Extraction Prompt

File: backend/src/services/optimizedAgenticRAGProcessor.ts

Update the Pass 1 query (around line 1196-1240) to be more explicit:

// ENHANCED: Much more detailed extraction instructions
const query = `Extract deal information, company metadata, and COMPREHENSIVE financial data.

CRITICAL FINANCIAL TABLE EXTRACTION INSTRUCTIONS:

I. LOCATE FINANCIAL TABLES
Look for sections titled: "Financial Summary", "Historical Financials", "Financial Performance",
"Income Statement", "P&L", "Key Metrics", "Financial Highlights", or similar.

Financial tables typically appear in these formats:

FORMAT 1 - Row-based:
                    FY 2021    FY 2022    FY 2023    LTM
Revenue             $45.2M     $52.8M     $61.2M     $58.5M
Revenue Growth       N/A        16.8%      15.9%     (4.4%)
EBITDA              $8.5M      $10.2M     $12.1M     $11.5M

FORMAT 2 - Column-based:
Metric              | Value
-------------------|---------
FY21 Revenue       | $45.2M
FY22 Revenue       | $52.8M
FY23 Revenue       | $61.2M

FORMAT 3 - Inline:
Revenue grew from $45.2M in FY2021 to $52.8M in FY2022 (+16.8%) and $61.2M in FY2023 (+15.9%)

II. EXTRACTION RULES

1. PERIOD IDENTIFICATION
   - FY-3, FY-2, FY-1 = Three most recent FULL fiscal years (not projections)
   - LTM/TTM = Most recent 12-month period
   - Map year labels: If you see "FY2021, FY2022, FY2023, LTM Sep'23", then:
     * FY2021 → fy3
     * FY2022 → fy2
     * FY2023 → fy1
     * LTM Sep'23 → ltm

2. VALUE EXTRACTION
   - Extract EXACT values as shown: "$45.2M", "16.8%", etc.
   - Preserve formatting: "$45.2M" not "45.2" or "45200000"
   - Include negative indicators: "(4.4%)" or "-4.4%"
   - Use "N/A" or "NM" if explicitly stated (not "Not specified")

3. METRIC IDENTIFICATION
   - Revenue = "Revenue", "Net Sales", "Total Sales", "Top Line"
   - EBITDA = "EBITDA", "Adjusted EBITDA", "Adj. EBITDA"
   - Margins = Look for "%" after metric name
   - Growth = "Growth %", "YoY", "Y/Y", "Change %"

4. DEAL OVERVIEW
   - Extract: company name, industry, geography, transaction type
   - Extract: employee count, deal source, reason for sale
   - Extract: CIM dates and metadata

III. QUALITY CHECKS

Before submitting your response:
- [ ] Did I find at least 3 distinct fiscal periods?
- [ ] Do I have Revenue AND EBITDA for at least 2 periods?
- [ ] Did I preserve exact number formats from the document?
- [ ] Did I map the periods correctly (newest = fy1, oldest = fy3)?

IV. WHAT TO DO IF TABLE IS UNCLEAR

If the table is hard to parse:
- Include the ENTIRE table section in your analysis
- Extract what you can with confidence
- Mark unclear values as "Not specified in CIM" only if truly absent
- DO NOT guess or interpolate values

V. ADDITIONAL FINANCIAL DATA

Also extract:
- Quality of earnings notes
- EBITDA adjustments and add-backs
- Revenue growth drivers
- Margin trends and analysis
- CapEx requirements
- Working capital needs
- Free cash flow comments`;
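The period-mapping rule in section II.1 can be expressed as a small helper. This is a sketch that assumes uniformly formatted, lexically sortable labels such as `FY2021`; in the pipeline the mapping is done by the LLM following the prompt, not by code:

```typescript
// Map fiscal-year labels to fy3/fy2/fy1 slots, newest = fy1, oldest = fy3,
// per rule II.1. LTM is handled separately. Assumes uniform labels like
// "FY2021" so lexical sort equals chronological sort.
function mapPeriods(yearLabels: string[]): Record<'fy3' | 'fy2' | 'fy1', string> {
  const sorted = [...yearLabels].sort();
  const [fy3, fy2, fy1] = sorted.slice(-3);
  return { fy3, fy2, fy1 };
}
```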

Phase 3: Hybrid Validation & Cross-Checking (1-2 hours)

3.1: Create Validation Layer

File: Create backend/src/services/financialDataValidator.ts

import { logger } from '../utils/logger';
import type { ParsedFinancials } from './financialTableParser';
import type { CIMReview } from './llmSchemas';

export interface ValidationResult {
  isValid: boolean;
  confidence: number;
  issues: string[];
  corrections: ParsedFinancials;
}

/**
 * Cross-validate financial data from multiple sources
 */
export function validateFinancialData(
  regexResult: ParsedFinancials,
  llmResult: Partial<CIMReview>
): ValidationResult {
  const issues: string[] = [];
  const corrections: ParsedFinancials = structuredClone(regexResult); // deep copy: later per-period fixes must not mutate regexResult
  let confidence = 1.0;

  // Extract LLM financials
  const llmFinancials = llmResult.financialSummary?.financials;

  if (!llmFinancials) {
    return {
      isValid: true,
      confidence: 0.5,
      issues: ['No LLM financial data to validate against'],
      corrections: regexResult
    };
  }

  // Validate each period
  const periods: Array<keyof ParsedFinancials> = ['fy3', 'fy2', 'fy1', 'ltm'];

  for (const period of periods) {
    const regexPeriod = regexResult[period];
    const llmPeriod = llmFinancials[period];

    if (!llmPeriod) continue;

    // Compare revenue
    if (regexPeriod.revenue && llmPeriod.revenue) {
      const match = compareFinancialValues(regexPeriod.revenue, llmPeriod.revenue);
      if (!match.matches) {
        issues.push(`${period} revenue mismatch: Regex="${regexPeriod.revenue}" vs LLM="${llmPeriod.revenue}"`);
        confidence -= 0.1;

        // Trust LLM if regex value looks suspicious
        if (match.llmMoreCredible) {
          corrections[period].revenue = llmPeriod.revenue;
        }
      }
    } else if (!regexPeriod.revenue && llmPeriod.revenue && llmPeriod.revenue !== 'Not specified in CIM') {
      // Regex missed it, LLM found it
      corrections[period].revenue = llmPeriod.revenue;
      issues.push(`${period} revenue: Regex missed, using LLM value: ${llmPeriod.revenue}`);
    }

    // Compare EBITDA
    if (regexPeriod.ebitda && llmPeriod.ebitda) {
      const match = compareFinancialValues(regexPeriod.ebitda, llmPeriod.ebitda);
      if (!match.matches) {
        issues.push(`${period} EBITDA mismatch: Regex="${regexPeriod.ebitda}" vs LLM="${llmPeriod.ebitda}"`);
        confidence -= 0.1;

        if (match.llmMoreCredible) {
          corrections[period].ebitda = llmPeriod.ebitda;
        }
      }
    } else if (!regexPeriod.ebitda && llmPeriod.ebitda && llmPeriod.ebitda !== 'Not specified in CIM') {
      corrections[period].ebitda = llmPeriod.ebitda;
      issues.push(`${period} EBITDA: Regex missed, using LLM value: ${llmPeriod.ebitda}`);
    }

    // Fill in other fields from LLM if regex didn't get them
    const fields: Array<keyof typeof regexPeriod> = [
      'revenueGrowth', 'grossProfit', 'grossMargin', 'ebitdaMargin'
    ];

    for (const field of fields) {
      if (!regexPeriod[field] && llmPeriod[field] && llmPeriod[field] !== 'Not specified in CIM') {
        corrections[period][field] = llmPeriod[field];
      }
    }
  }

  logger.info('Financial data validation completed', {
    confidence,
    issueCount: issues.length,
    issues: issues.slice(0, 5)
  });

  return {
    isValid: confidence > 0.6,
    confidence,
    issues,
    corrections
  };
}

/**
 * Compare two financial values to see if they match
 */
function compareFinancialValues(
  value1: string,
  value2: string
): { matches: boolean; llmMoreCredible: boolean } {
  const clean1 = value1.replace(/[$,\s]/g, '').toUpperCase();
  const clean2 = value2.replace(/[$,\s]/g, '').toUpperCase();

  // Exact match
  if (clean1 === clean2) {
    return { matches: true, llmMoreCredible: false };
  }

  // Check if numeric values are close (within 5%)
  const num1 = parseFinancialValue(value1);
  const num2 = parseFinancialValue(value2);

  if (num1 && num2) {
    const percentDiff = Math.abs((num1 - num2) / num1);
    if (percentDiff < 0.05) {
      // Values are close enough
      return { matches: true, llmMoreCredible: false };
    }

    // Large difference - trust value with more precision
    const precision1 = (value1.match(/\./g) || []).length;
    const precision2 = (value2.match(/\./g) || []).length;

    return {
      matches: false,
      llmMoreCredible: precision2 > precision1
    };
  }

  return { matches: false, llmMoreCredible: false };
}

/**
 * Parse a financial value string to number
 */
function parseFinancialValue(value: string): number | null {
  const clean = value.replace(/[$,\s]/g, '');

  let multiplier = 1;
  if (/M$/i.test(clean)) {
    multiplier = 1000000;
  } else if (/K$/i.test(clean)) {
    multiplier = 1000;
  } else if (/B$/i.test(clean)) {
    multiplier = 1000000000;
  }

  const numStr = clean.replace(/[MKB]/i, '');
  const num = parseFloat(numStr);

  return isNaN(num) ? null : num * multiplier;
}
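To illustrate the 5% tolerance, here is `parseFinancialValue` restated (with the unit suffix anchored to the end of the string, a small assumption) plus a hypothetical `roughlyEqual` wrapper mirroring the comparison above:

```typescript
// Restated parser: normalize "$45.2M" / "45,200K" style values to numbers.
function parseFinancialValue(value: string): number | null {
  const clean = value.replace(/[$,\s]/g, '');
  let multiplier = 1;
  if (/M$/i.test(clean)) multiplier = 1_000_000;
  else if (/K$/i.test(clean)) multiplier = 1_000;
  else if (/B$/i.test(clean)) multiplier = 1_000_000_000;
  const num = parseFloat(clean.replace(/[MKB]$/i, ''));
  return isNaN(num) ? null : num * multiplier;
}

// Hypothetical wrapper: treat two values as matching within 5%.
function roughlyEqual(a: string, b: string): boolean {
  const n1 = parseFinancialValue(a);
  const n2 = parseFinancialValue(b);
  if (n1 === null || n2 === null || n1 === 0) return false;
  return Math.abs((n1 - n2) / n1) < 0.05;
}
```

Note how the tolerance makes unit differences harmless: `"$45.2M"` and `"45,200K"` normalize to the same number.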

3.2: Integrate Validation into Processing

File: backend/src/services/optimizedAgenticRAGProcessor.ts

Add after line 1137 (after merging partial results):

// ENHANCED: Cross-validate regex and LLM results
if (deterministicFinancials) {
  logger.info('Validating deterministic financials against LLM results');

  const { validateFinancialData } = await import('./financialDataValidator');
  const validation = validateFinancialData(deterministicFinancials, mergedData);

  logger.info('Validation results', {
    documentId,
    isValid: validation.isValid,
    confidence: validation.confidence,
    issueCount: validation.issues.length
  });

  // Use validated/corrected data
  if (validation.confidence > 0.7) {
    deterministicFinancials = validation.corrections;
    logger.info('Using validated corrections', {
      documentId,
      corrections: validation.corrections
    });
  }

  // Merge validated data
  this.mergeDeterministicFinancialData(mergedData, deterministicFinancials, documentId);
} else {
  logger.info('No deterministic financial data to validate', { documentId });
}

Phase 4: Text Preprocessing Integration (1 hour)

4.1: Apply Preprocessing to Document AI Text

File: backend/src/services/documentAiProcessor.ts

Add preprocessing before passing to RAG:

// ADD import at top
import { preprocessText, extractTableTexts } from '../utils/textPreprocessor';

// UPDATE line 83 (processWithAgenticRAG method)
private async processWithAgenticRAG(documentId: string, extractedText: string): Promise<any> {
  try {
    logger.info('Processing extracted text with Agentic RAG', {
      documentId,
      textLength: extractedText.length
    });

    // ENHANCED: Preprocess text to identify table regions
    const preprocessed = preprocessText(extractedText);

    logger.info('Text preprocessing completed', {
      documentId,
      tableRegionsFound: preprocessed.tableRegions.length,
      likelyTableCount: preprocessed.metadata.likelyTableCount
    });

    // Extract table texts separately for better parsing
    const tableSections = extractTableTexts(preprocessed);

    // Import and use the optimized agentic RAG processor
    const { optimizedAgenticRAGProcessor } = await import('./optimizedAgenticRAGProcessor');

    const result = await optimizedAgenticRAGProcessor.processLargeDocument(
      documentId,
      extractedText,
      {
        preprocessedData: preprocessed,  // Pass preprocessing results
        tableSections: tableSections      // Pass isolated table texts
      }
    );

    return result;
  } catch (error) {
    // ... existing error handling
  }
}

Expected Results

Current State (Baseline):

Financial data extraction rate: 10-20%
Typical result: "Not specified in CIM" for most fields

After Phase 1 (Enhanced Regex):

Financial data extraction rate: 35-45%
Improvement: Better pattern matching catches more tables

After Phase 2 (Enhanced LLM):

Financial data extraction rate: 65-75%
Improvement: LLM sees financial tables more reliably

After Phase 3 (Validation):

Financial data extraction rate: 75-85%
Improvement: Cross-validation fills gaps and corrects errors

After Phase 4 (Preprocessing):

Financial data extraction rate: 80-90%
Improvement: Table structure preservation helps both regex and LLM

Implementation Priority

Start Here (Highest ROI):

  1. Phase 2.1 - Financial Section Prioritization (30 min, +30% accuracy)
  2. Phase 2.2 - Increase LLM Context (15 min, +15% accuracy)
  3. Phase 2.3 - Enhanced Prompt (30 min, +20% accuracy)

Total: 1.5 hours for ~50-60% improvement

Then Do:

  1. Phase 1.2 - Enhanced Parser Patterns (1 hour, +10% accuracy)
  2. Phase 3.1-3.2 - Validation (1.5 hours, +10% accuracy)

Total: 4 hours for ~70-80% improvement

Optional:

  1. Phase 1.1, 4.1 - Text Preprocessing (2 hours, +10% accuracy)

Testing Strategy

Test 1: Baseline Measurement

# Process 10 CIMs and record extraction rate
npm run test:pipeline
# Record: How many financial fields are populated?

Test 2: After Each Phase

# Same 10 CIMs, measure improvement
npm run test:pipeline
# Compare against baseline

Test 3: Edge Cases

  • PDFs with rotated pages
  • PDFs with merged table cells
  • PDFs with multi-line headers
  • Narrative-only financials (no tables)
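The "how many financial fields are populated?" measurement can be scripted. A minimal sketch, assuming extracted records are flat string maps and that missing values use the "Not specified in CIM" sentinel:

```typescript
// Compute the "% of fields populated" extraction-rate metric.
// A field counts as populated when it is set and not the sentinel.
const SENTINEL = "Not specified in CIM";

function extractionRate(records: Array<Record<string, string | undefined>>): number {
  let populated = 0;
  let total = 0;
  for (const record of records) {
    for (const value of Object.values(record)) {
      total++;
      if (value && value !== SENTINEL) populated++;
    }
  }
  return total === 0 ? 0 : populated / total;
}
```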

Rollback Plan

Each phase is additive and can be disabled via feature flags:

// config/env.ts
export const features = {
  enhancedRegexParsing: process.env.ENHANCED_REGEX === 'true',
  enhancedLLMContext: process.env.ENHANCED_LLM === 'true',
  financialValidation: process.env.VALIDATE_FINANCIALS === 'true',
  textPreprocessing: process.env.PREPROCESS_TEXT === 'true'
};

Set the corresponding variable to false (for example, ENHANCED_REGEX=false) to disable an individual phase.
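Assuming the `features` object above, a phase can then be guarded at its call site. The parser signatures here are hypothetical:

```typescript
// Illustrative guard: fall back to the existing parser when the
// enhanced-regex phase is flagged off. `Features` mirrors config/env.ts.
interface Features { enhancedRegexParsing: boolean }

function parseTables(
  text: string,
  features: Features,
  parsers: { enhanced: (t: string) => string[]; legacy: (t: string) => string[] }
): string[] {
  return features.enhancedRegexParsing ? parsers.enhanced(text) : parsers.legacy(text);
}
```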


Success Metrics

| Metric | Current | Target | Measurement |
|---|---|---|---|
| Financial data extracted | 10-20% | 80-90% | % of fields populated |
| Processing time | 45s | <60s | End-to-end time |
| False positives | Unknown | <5% | Manual validation |
| Column misalignment | ~50% | <10% | Check FY mapping |

Next Steps

  1. Implement Phase 2 (Enhanced LLM) first - biggest impact, lowest risk
  2. Test with 5-10 real CIM documents
  3. Measure improvement
  4. If accuracy exceeds 70%, stop; if not, add Phases 1 and 3
  5. Keep Phase 4 as optional enhancement

The LLM is actually very good at this - we just need to give it the right context!