Major release with significant performance improvements and a new processing strategy.

## Core Changes
- Implemented simple_full_document processing strategy (default)
- Full document → LLM approach: 1-2 passes, ~5-6 minutes processing time
- Achieved 100% completeness with 2 API calls (down from 5+)
- Removed redundant Document AI passes for faster processing

## Financial Data Extraction
- Enhanced deterministic financial table parser
- Improved FY3/FY2/FY1/LTM identification from varying CIM formats
- Automatic merging of parser results with LLM extraction

## Code Quality & Infrastructure
- Cleaned up debug logging (removed emoji markers from production code)
- Fixed Firebase Secrets configuration (using modern defineSecret approach)
- Updated OpenAI API key
- Resolved deployment conflicts (secrets vs environment variables)
- Added .env files to Firebase ignore list

## Deployment
- Firebase Functions v2 deployment successful
- All 7 required secrets verified and configured
- Function URL: https://api-y56ccs6wva-uc.a.run.app

## Performance Improvements
- Processing time: ~5-6 minutes (down from 23+ minutes)
- API calls: 1-2 (down from 5+)
- Completeness: 100% achievable
- LLM Model: claude-3-7-sonnet-latest

## Breaking Changes
- Default processing strategy changed to 'simple_full_document'
- RAG processor available as alternative strategy 'document_ai_agentic_rag'

## Files Changed
- 36 files changed, 5642 insertions(+), 4451 deletions(-)
- Removed deprecated documentation files
- Cleaned up unused services and models

This release represents a major refactoring focused on speed, accuracy, and maintainability.

---
Financial Data Extraction Issue: Root Cause Analysis & Solution
Executive Summary
Problem: Financial data showing "Not specified in CIM" even when tables exist in the PDF.
Root Cause: Document AI's structured table data is being completely ignored in favor of flattened text, causing the parser to fail.
Impact: ~80-90% of financial tables fail to parse correctly.
Current Pipeline Analysis
Stage 1: Document AI Processing ✅ (Working but underutilized)
```typescript
// documentAiProcessor.ts:408-482
private async processWithDocumentAI() {
  const [result] = await this.documentAiClient.processDocument(request);
  const { document } = result;

  // ✅ Extracts structured tables
  const tables = document.pages?.flatMap(page =>
    page.tables?.map(table => ({
      rows: table.headerRows?.length || 0,               // ❌ Only counting!
      columns: table.bodyRows?.[0]?.cells?.length || 0   // ❌ Not using!
    }))
  );

  // ❌ PROBLEM: Only returns flat text, throws away table structure
  return { text: document.text, entities, tables, pages };
}
```
What Document AI Actually Provides:
- `document.pages[].tables[]` - Fully structured tables with:
  - `headerRows[]` - Column headers with cell text via layout anchors
  - `bodyRows[]` - Data rows with aligned cell values
  - `layout` - Text positions in the original document
  - `cells[]` - Individual cell data with rowSpan/colSpan
What We're Using: Only document.text (flattened)
Stage 2: Text Extraction ❌ (Losing structure)
```typescript
// documentAiProcessor.ts:151-207
const extractedText = await this.extractTextFromDocument(fileBuffer, fileName, mimeType);
// Returns: "FY-3 FY-2 FY-1 LTM Revenue $45.2M $52.8M $61.2M $58.5M EBITDA $8.5M..."
// Lost: Column alignment, row structure, table boundaries
```
Original PDF Table:
```
                 FY-3      FY-2      FY-1      LTM
Revenue          $45.2M    $52.8M    $61.2M    $58.5M
Revenue Growth   N/A       16.8%     15.9%     (4.4)%
EBITDA           $8.5M     $10.2M    $12.1M    $11.5M
EBITDA Margin    18.8%     19.3%     19.8%     19.7%
```
What Parser Receives (flattened):
```
FY-3 FY-2 FY-1 LTM Revenue $45.2M $52.8M $61.2M $58.5M Revenue Growth N/A 16.8% 15.9% (4.4)% EBITDA $8.5M $10.2M $12.1M $11.5M EBITDA Margin 18.8% 19.3% 19.8% 19.7%
```
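The failure mode is easy to reproduce. A minimal sketch (the `extractValueTokens` helper and its regex are illustrative assumptions, not code from the repo): once the table is flattened, the value tokens arrive as one undifferentiated stream, and nothing ties `$45.2M` to FY-3 rather than FY-2.

```typescript
// Hypothetical helper: pull currency/percent/N-A tokens out of flattened text.
function extractValueTokens(flat: string): string[] {
  // Matches tokens like "$45.2M", "16.8%", "(4.4)%", "N/A"
  const re = /\$[\d,.]+[KMB]?|\(?\d+(?:\.\d+)?\)?%|N\/A/g;
  return flat.match(re) ?? [];
}

const flattened =
  "FY-3 FY-2 FY-1 LTM Revenue $45.2M $52.8M $61.2M $58.5M " +
  "Revenue Growth N/A 16.8% 15.9% (4.4)% " +
  "EBITDA $8.5M $10.2M $12.1M $11.5M " +
  "EBITDA Margin 18.8% 19.3% 19.8% 19.7%";

// 16 values for 4 metrics x 4 periods, but no column boundaries survive:
// the parser must guess which token belongs to which period.
const tokens = extractValueTokens(flattened);
```

With only positional guessing available, any misdetected header or merged row shifts every subsequent assignment by one column.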
Stage 3: Deterministic Parser ❌ (Fighting lost structure)
```typescript
// financialTableParser.ts:181-406
export function parseFinancialsFromText(fullText: string): ParsedFinancials {
  // 1. Find header line with year tokens (FY-3, FY-2, etc.)
  //    ❌ PROBLEM: Years might be on different lines now

  // 2. Look for revenue/EBITDA rows within 20 lines
  //    ❌ PROBLEM: Row detection works, but...

  // 3. Extract numeric tokens and assign to columns
  //    ❌ PROBLEM: Can't determine which number belongs to which column!
  //    Numbers are just in sequence: $45.2M $52.8M $61.2M $58.5M
  //    Are these revenues for FY-3, FY-2, FY-1, LTM? Or something else?

  // Result: Returns empty {} or incorrect mappings
}
```
Failure Points:
1. Header Detection (lines 197-278): Requires period tokens in ONE line
   - Flattened text scatters tokens across multiple lines
   - Scoring system can't find tables with both revenue AND EBITDA
2. Column Alignment (lines 160-179): Assumes tokens map to buckets by position
   - No way to know which token belongs to which column
   - Whitespace-based alignment is lost
3. Multi-line Tables: Financial tables often span multiple lines per row
   - Parser combines 2-3 lines but still can't reconstruct columns
Stage 4: LLM Extraction ⚠️ (Limited context)
```typescript
// optimizedAgenticRAGProcessor.ts:1552-1641
private async extractWithTargetedQuery() {
  // 1. RAG selects ~7 most relevant chunks
  // 2. Each chunk truncated to 1500 chars
  // 3. Total context: ~10,500 chars

  // ❌ PROBLEM: Financial tables might be:
  //    - Split across multiple chunks
  //    - Not in the top 7 most "similar" chunks
  //    - Truncated mid-table
  //    - Still in flattened format anyway
}
```
Unused Assets
1. Document AI Table Structure (BIGGEST MISS)
Location: Available in Document AI response but never used
What It Provides:
```typescript
document.pages[0].tables[0] = {
  layout: { /* table position */ },
  headerRows: [{
    cells: [
      { layout: { textAnchor: { start: 123, end: 127 } } }, // "FY-3"
      { layout: { textAnchor: { start: 135, end: 139 } } }, // "FY-2"
      // ...
    ]
  }],
  bodyRows: [{
    cells: [
      { layout: { textAnchor: { start: 200, end: 207 } } }, // "Revenue"
      { layout: { textAnchor: { start: 215, end: 222 } } }, // "$45.2M"
      { layout: { textAnchor: { start: 230, end: 237 } } }, // "$52.8M"
      // ...
    ]
  }]
}
```
How to Use:
```typescript
function getTableText(layout, documentText) {
  const start = layout.textAnchor.textSegments[0].startIndex;
  const end = layout.textAnchor.textSegments[0].endIndex;
  return documentText.substring(start, end);
}
```
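A worked usage of the anchor-slicing idea with mocked shapes (the interfaces and sample text are invented for illustration; real Document AI layouts carry more fields):

```typescript
// Mocked shapes, trimmed to the fields the slicing logic touches.
interface TextSegment { startIndex?: number; endIndex: number }
interface Layout { textAnchor: { textSegments: TextSegment[] } }

function getTableText(layout: Layout, documentText: string): string {
  const seg = layout.textAnchor.textSegments[0];
  if (!seg) return "";
  // startIndex is omitted by Document AI when it is 0, hence the fallback.
  return documentText.substring(seg.startIndex ?? 0, seg.endIndex);
}

const documentText = "Acme CIM ... FY-3 FY-2 ... Revenue $45.2M ...";

// Build an anchor that points at the "Revenue" cell in the sample text.
const start = documentText.indexOf("Revenue");
const cellLayout: Layout = {
  textAnchor: { textSegments: [{ startIndex: start, endIndex: start + "Revenue".length }] },
};
```

Every cell string in the examples above is recovered this way: the anchor indices slice directly into `document.text`.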
2. Financial Extractor Utility
Location: src/utils/financialExtractor.ts (lines 1-159)
Features:
- Robust column splitting: `/\s{2,}|\t/` (2+ spaces or tabs)
- Clean value parsing with K/M/B multipliers
- Percentage and negative number handling
- Better than the current parser, but still operates on flat text
Status: Never imported or used anywhere in the codebase
Root Cause Summary
| Issue | Impact | Severity |
|---|---|---|
| Document AI table structure ignored | 100% structure loss | 🔴 CRITICAL |
| Only flat text used for parsing | Parser can't align columns | 🔴 CRITICAL |
| financialExtractor.ts not used | Missing better parsing logic | 🟡 MEDIUM |
| RAG chunks miss complete tables | LLM has incomplete data | 🟡 MEDIUM |
| No table-aware chunking | Financial sections fragmented | 🟡 MEDIUM |
Baseline Measurements & Instrumentation
Before changing the pipeline, capture hard numbers so we can prove the fix works and spot remaining gaps. Add the following telemetry to the processing result (also referenced in IMPLEMENTATION_PLAN.md):
```typescript
metadata: {
  tablesFound: structuredTables.length,
  financialTablesIdentified: structuredTables.filter(isFinancialTable).length,
  structuredParsingUsed: Boolean(deterministicFinancialsFromTables),
  textParsingFallback: !deterministicFinancialsFromTables,
  financialDataPopulated: hasPopulatedFinancialSummary(result)
}
```
Baseline checklist (run on ≥20 recent CIM uploads):
- Count how many documents have `tablesFound > 0` but `financialDataPopulated === false`.
- Record the average/median `tablesFound`, `financialTablesIdentified`, and the current financial fill rate.
- Log sample `documentId`s where `tablesFound === 0` (helps scope Phase 3 hybrid work).
Paste the aggregated numbers back into this doc so Success Metrics are grounded in actual data rather than estimates.
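To turn that checklist into numbers, a small aggregation sketch could look like this (the `DocTelemetry` shape mirrors the metadata block above; `summarizeBaseline` is a hypothetical helper, not existing code):

```typescript
interface DocTelemetry {
  documentId: string;
  tablesFound: number;
  financialDataPopulated: boolean;
}

function summarizeBaseline(docs: DocTelemetry[]) {
  // Documents where Document AI found tables but no financials were extracted:
  // these are the direct evidence that structure is being thrown away.
  const tablesButNoData = docs.filter(
    (d) => d.tablesFound > 0 && !d.financialDataPopulated
  );
  const counts = docs.map((d) => d.tablesFound).sort((a, b) => a - b);
  const median =
    counts.length % 2
      ? counts[(counts.length - 1) / 2]
      : (counts[counts.length / 2 - 1] + counts[counts.length / 2]) / 2;
  return {
    total: docs.length,
    tablesButNoData: tablesButNoData.length,
    zeroTableDocs: docs.filter((d) => d.tablesFound === 0).map((d) => d.documentId),
    medianTablesFound: median,
    fillRate: docs.filter((d) => d.financialDataPopulated).length / docs.length,
  };
}

const sample: DocTelemetry[] = [
  { documentId: "a", tablesFound: 3, financialDataPopulated: true },
  { documentId: "b", tablesFound: 0, financialDataPopulated: false },
  { documentId: "c", tablesFound: 2, financialDataPopulated: false },
];
const summary = summarizeBaseline(sample);
```

The `zeroTableDocs` list is exactly the set that will need the hybrid fallback described later.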
Recommended Solution Architecture
Phase 1: Use Document AI Table Structure (HIGHEST IMPACT)
Implementation:
```typescript
// NEW: documentAiProcessor.ts
interface StructuredTable {
  headers: string[];
  rows: string[][];
  position: { page: number; confidence: number };
}

private extractStructuredTables(document: any, text: string): StructuredTable[] {
  const tables: StructuredTable[] = [];
  for (const page of document.pages || []) {
    for (const table of page.tables || []) {
      // Extract headers
      const headers = table.headerRows?.[0]?.cells?.map(cell =>
        this.getTextFromLayout(cell.layout, text)
      ) || [];

      // Extract data rows
      const rows = table.bodyRows?.map(row =>
        row.cells.map(cell => this.getTextFromLayout(cell.layout, text))
      ) || [];

      tables.push({ headers, rows, position: { page: page.pageNumber, confidence: 0.9 } });
    }
  }
  return tables;
}

private getTextFromLayout(layout: any, documentText: string): string {
  const segments = layout.textAnchor?.textSegments || [];
  if (segments.length === 0) return '';
  const start = parseInt(segments[0].startIndex || '0');
  const end = parseInt(segments[0].endIndex || documentText.length.toString());
  return documentText.substring(start, end);
}
```
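One caveat worth noting: a Document AI text anchor can contain more than one textSegment, and in JSON responses the int64 indices arrive as strings, so a more defensive variant would concatenate every segment rather than reading only the first. A sketch under those assumptions:

```typescript
interface Segment { startIndex?: string | number; endIndex?: string | number }

// Concatenate the text of every segment in an anchor, tolerating
// string-encoded indices and an omitted startIndex (which means 0).
function textFromSegments(segments: Segment[], documentText: string): string {
  return segments
    .map((s) => {
      const start = Number(s.startIndex ?? 0);
      const end = Number(s.endIndex ?? documentText.length);
      return documentText.substring(start, end);
    })
    .join("")
    .trim();
}
```

Cells that wrap across lines in the PDF are the typical multi-segment case, so this guard matters precisely for the multi-line financial rows the parser currently struggles with.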
Return Enhanced Output:
```typescript
interface DocumentAIOutput {
  text: string;
  entities: Array<any>;
  tables: StructuredTable[]; // ✅ Now usable!
  pages: Array<any>;
  mimeType: string;
}
```
Phase 2: Financial Table Classifier
Purpose: Identify which tables are financial data
```typescript
// NEW: services/financialTableClassifier.ts
export function isFinancialTable(table: StructuredTable): boolean {
  const headerText = table.headers.join(' ').toLowerCase();
  const firstRowText = table.rows[0]?.join(' ').toLowerCase() || '';

  // Check for year/period indicators
  const hasPeriods = /fy[-\s]?\d{1,2}|20\d{2}|ltm|ttm|ytd/.test(headerText);

  // Check for financial metrics
  const hasMetrics = /(revenue|ebitda|sales|profit|margin|cash flow)/i.test(
    table.rows.slice(0, 5).join(' ')
  );

  // Check for currency values
  const hasCurrency = /\$[\d,]+|\d+[km]|\d+\.\d+%/.test(firstRowText);

  return hasPeriods && (hasMetrics || hasCurrency);
}
```
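A quick sanity check of the classifier on sample tables (a self-contained restatement of `isFinancialTable`, with invented sample data; real thresholds may need tuning against the baseline corpus):

```typescript
interface StructuredTable { headers: string[]; rows: string[][] }

// Self-contained restatement of the classifier sketched above.
function isFinancialTable(table: StructuredTable): boolean {
  const headerText = table.headers.join(" ").toLowerCase();
  const firstRowText = (table.rows[0] ?? []).join(" ").toLowerCase();
  const hasPeriods = /fy[-\s]?\d{1,2}|20\d{2}|ltm|ttm|ytd/.test(headerText);
  const hasMetrics = /revenue|ebitda|sales|profit|margin|cash flow/i.test(
    table.rows.slice(0, 5).map((r) => r.join(" ")).join(" ")
  );
  const hasCurrency = /\$[\d,]+|\d+[km]|\d+\.\d+%/.test(firstRowText);
  return hasPeriods && (hasMetrics || hasCurrency);
}

const financials: StructuredTable = {
  headers: ["", "FY-3", "FY-2", "FY-1", "LTM"],
  rows: [["Revenue", "$45.2M", "$52.8M", "$61.2M", "$58.5M"]],
};
const orgChart: StructuredTable = {
  headers: ["Name", "Title"],
  rows: [["Jane Doe", "CEO"]],
};
```

Both signals must fire (periods in the header, plus metrics or currency in the body), which keeps ordinary tables like org charts out of the financial path.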
Phase 3: Enhanced Financial Parser
Use structured tables instead of flat text:
```typescript
// UPDATED: financialTableParser.ts
export function parseFinancialsFromStructuredTable(
  table: StructuredTable
): ParsedFinancials {
  const result: ParsedFinancials = { fy3: {}, fy2: {}, fy1: {}, ltm: {} };

  // 1. Parse headers to identify periods
  const buckets = yearTokensToBuckets(
    table.headers.map(h => normalizePeriodToken(h))
  );

  // 2. For each row, identify the metric
  for (const row of table.rows) {
    const metricName = row[0].toLowerCase();
    const values = row.slice(1); // Skip first column (metric name)

    // 3. Match metric to field
    for (const [field, matcher] of Object.entries(ROW_MATCHERS)) {
      if (matcher.test(metricName)) {
        // 4. Assign values to buckets (GUARANTEED ALIGNMENT!)
        buckets.forEach((bucket, index) => {
          if (bucket && values[index]) {
            result[bucket][field] = values[index];
          }
        });
      }
    }
  }
  return result;
}
```
Key Improvement: Column alignment is guaranteed because:
- Headers and values come from the same table structure
- Index positions are preserved
- No string parsing or whitespace guessing needed
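The guaranteed alignment can be shown on the sample table from earlier. A sketch in which `toBucket` is a simplified stand-in for the `normalizePeriodToken`/`yearTokensToBuckets` pair in financialTableParser.ts:

```typescript
type Bucket = "fy3" | "fy2" | "fy1" | "ltm" | null;

// Simplified stand-in: map a header cell to a period bucket.
function toBucket(header: string): Bucket {
  const h = header.toLowerCase().replace(/\s+/g, "");
  if (/fy-?3/.test(h)) return "fy3";
  if (/fy-?2/.test(h)) return "fy2";
  if (/fy-?1/.test(h)) return "fy1";
  if (/ltm|ttm/.test(h)) return "ltm";
  return null;
}

const headers = ["", "FY-3", "FY-2", "FY-1", "LTM"];
const revenueRow = ["Revenue", "$45.2M", "$52.8M", "$61.2M", "$58.5M"];

// Headers and values share index positions, so the mapping is positional,
// not guessed: buckets = [null, "fy3", "fy2", "fy1", "ltm"].
const buckets = headers.map(toBucket);
const revenue: Record<string, string> = {};
buckets.forEach((bucket, i) => {
  if (bucket && revenueRow[i]) revenue[bucket] = revenueRow[i];
});
```

Because the metric-name column maps to a `null` bucket, it drops out automatically; no whitespace heuristics are involved.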
Phase 4: Table-Aware Chunking
Store financial tables as special chunks:
```typescript
// UPDATED: optimizedAgenticRAGProcessor.ts
private async createIntelligentChunks(
  text: string,
  documentId: string,
  tables: StructuredTable[]
): Promise<ProcessingChunk[]> {
  const chunks: ProcessingChunk[] = [];

  // 1. Create dedicated chunks for financial tables
  for (const table of tables.filter(isFinancialTable)) {
    chunks.push({
      id: `${documentId}-financial-table-${chunks.length}`,
      content: this.formatTableAsMarkdown(table),
      chunkIndex: chunks.length,
      sectionType: 'financial-table',
      metadata: {
        isFinancialTable: true,
        tablePosition: table.position,
        structuredData: table // ✅ Preserve structure!
      }
    });
  }

  // 2. Continue with normal text chunking
  // ...
}

private formatTableAsMarkdown(table: StructuredTable): string {
  const header = `| ${table.headers.join(' | ')} |`;
  const separator = `| ${table.headers.map(() => '---').join(' | ')} |`;
  const rows = table.rows.map(row => `| ${row.join(' | ')} |`);
  return [header, separator, ...rows].join('\n');
}
```
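For reference, what the markdown formatting produces on a small sample (a self-contained restatement of `formatTableAsMarkdown` above):

```typescript
interface StructuredTable { headers: string[]; rows: string[][] }

// Restated from the chunking code: render a structured table as a
// GitHub-flavored markdown table so the LLM sees explicit columns.
function formatTableAsMarkdown(table: StructuredTable): string {
  const header = `| ${table.headers.join(" | ")} |`;
  const separator = `| ${table.headers.map(() => "---").join(" | ")} |`;
  const rows = table.rows.map((row) => `| ${row.join(" | ")} |`);
  return [header, separator, ...rows].join("\n");
}

const md = formatTableAsMarkdown({
  headers: ["Metric", "FY-1", "LTM"],
  rows: [["Revenue", "$61.2M", "$58.5M"]],
});
// md:
// | Metric | FY-1 | LTM |
// | --- | --- | --- |
// | Revenue | $61.2M | $58.5M |
```

Markdown is a deliberate choice here: column boundaries survive chunking and truncation far better than whitespace-aligned text.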
Phase 5: Priority Pinning for Financial Chunks
Ensure financial tables always included in LLM context:
```typescript
// UPDATED: optimizedAgenticRAGProcessor.ts
private async extractPass1CombinedMetadataFinancial() {
  // 1. Find all financial table chunks
  const financialTableChunks = chunks.filter(
    c => c.metadata?.isFinancialTable === true
  );

  // 2. PIN them to always be included
  return await this.extractWithTargetedQuery(
    documentId,
    text,
    chunks,
    query,
    targetFields,
    7,
    financialTableChunks // ✅ Always included!
  );
}
```
Implementation Phases & Priorities
Phase 1: Quick Win (1-2 hours) - RECOMMENDED START
Goal: Use Document AI tables immediately (matches IMPLEMENTATION_PLAN.md Phase 1)
Planned changes:
- Extract structured tables in `documentAiProcessor.ts`.
- Pass tables (and metadata) to `optimizedAgenticRAGProcessor`.
- Emit dedicated financial-table chunks that preserve structure.
- Pin financial chunks so every RAG/LLM pass sees them.
Expected Improvement: 60-70% accuracy gain (verify via new instrumentation).
Phase 2: Enhanced Parsing (2-3 hours)
Goal: Deterministic extraction from structured tables before falling back to text (see IMPLEMENTATION_PLAN.md Phase 2).
Planned changes:
- Implement `parseFinancialsFromStructuredTable()` and reuse existing deterministic merge paths.
- Add a classifier that flags which structured tables are financial.
- Update merge logic to favor structured data yet keep the text/LLM fallback.
Expected Improvement: 85-90% accuracy (subject to measured baseline).
Phase 3: LLM Optimization (1-2 hours)
Goal: Better context for LLM when tables are incomplete or absent (aligns with HYBRID_SOLUTION.md Phase 2/3).
Planned changes:
- Format tables as markdown and raise chunk limits for financial passes.
- Prioritize and pin financial chunks in `extractPass1CombinedMetadataFinancial`.
- Inject explicit "find the table" instructions into the prompt.
Expected Improvement: 90-95% accuracy when Document AI tables exist; otherwise falls back to the hybrid regex/LLM path.
Phase 4: Integration & Testing (2-3 hours)
Goal: Ensure backward compatibility and document measured improvements
Planned changes:
- Keep the legacy text parser as a fallback whenever `tablesFound === 0`.
- Capture the telemetry outlined earlier and publish before/after numbers.
- Test against a labeled CIM set covering: clean tables, multi-line rows, scanned PDFs (no structured tables), and partial data cases.
Handling Documents With No Structured Tables
Even after Phases 1-2, some CIMs (e.g., scans or image-only tables) will have `tablesFound === 0`. When that happens:
- Trigger the enhanced preprocessing + regex route from `HYBRID_SOLUTION.md` (Phase 1).
- Surface an explicit warning in metadata/logs so analysts know the deterministic path was skipped.
- Feed the isolated table text (if any) plus surrounding context into the LLM with the financial prompt upgrades from Phase 3.
This ensures the hybrid approach only engages when the Document AI path truly lacks structured tables, keeping maintenance manageable while covering the remaining gap.
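The routing decision reduces to a single guard. A sketch with hypothetical names (the hybrid path is the one described in `HYBRID_SOLUTION.md`; none of these identifiers exist in the codebase yet):

```typescript
type ExtractionPath = "structured-tables" | "hybrid-regex-llm";

interface ProcessingMetadata {
  tablesFound: number;
  warnings: string[];
}

function chooseExtractionPath(meta: ProcessingMetadata): ExtractionPath {
  if (meta.tablesFound > 0) {
    // Deterministic path: parse Document AI structured tables first.
    return "structured-tables";
  }
  // Scanned/image-only CIM: fall back to the hybrid preprocessing + regex
  // route, and surface a warning so analysts know structure was unavailable.
  meta.warnings.push("No structured tables found; deterministic path skipped");
  return "hybrid-regex-llm";
}
```

Keeping the guard this small is the point: the hybrid machinery only engages when Document AI genuinely returned no tables.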
Success Metrics
| Metric | Current | Phase 1 | Phase 2 | Phase 3 |
|---|---|---|---|---|
| Financial data extracted | 10-20% | 60-70% | 85-90% | 90-95% |
| Tables identified | 0% | 80% | 90% | 95% |
| Column alignment accuracy | 10% | 95% | 98% | 99% |
| Processing time | 45s | 42s | 38s | 35s |
Code Quality Improvements
Current Issues:
- ❌ Document AI tables extracted but never used
- ❌ `financialExtractor.ts` exists but never imported
- ❌ Parser assumes flat text has structure
- ❌ No table-specific chunking strategy
After Implementation:
- ✅ Full use of Document AI's structured data
- ✅ Multi-tier extraction strategy (structured → fallback → LLM)
- ✅ Table-aware chunking and RAG
- ✅ Guaranteed column alignment
- ✅ Better error handling and logging
Alternative Approaches Considered
Option 1: Better Regex Parsing (REJECTED)
Reason: Can't solve the fundamental problem of lost structure
Option 2: Use Only LLM (REJECTED)
Reason: Expensive, slower, less accurate than structured extraction
Option 3: Replace Document AI (REJECTED)
Reason: Document AI works fine, we're just not using it properly
Option 4: Manual Table Markup (REJECTED)
Reason: Not scalable, requires user intervention
Conclusion
The issue is NOT a parsing problem or an LLM problem.
The issue is an architecture problem: We're extracting structured tables from Document AI and then throwing away the structure.
The fix is simple: Use the data we're already getting.
Recommended action: Implement Phase 1 (Quick Win) immediately for 60-70% improvement, then evaluate if Phases 2-3 are needed based on results.