cim_summary/IMPLEMENTATION_PLAN.md
admin 9c916d12f4 feat: Production release v2.0.0 - Simple Document Processor
Major release with significant performance improvements and new processing strategy.

## Core Changes
- Implemented simple_full_document processing strategy (default)
- Full document → LLM approach: 1-2 passes, ~5-6 minutes processing time
- Achieved 100% completeness with 2 API calls (down from 5+)
- Removed redundant Document AI passes for faster processing

## Financial Data Extraction
- Enhanced deterministic financial table parser
- Improved FY3/FY2/FY1/LTM identification from varying CIM formats
- Automatic merging of parser results with LLM extraction

## Code Quality & Infrastructure
- Cleaned up debug logging (removed emoji markers from production code)
- Fixed Firebase Secrets configuration (using modern defineSecret approach)
- Updated OpenAI API key
- Resolved deployment conflicts (secrets vs environment variables)
- Added .env files to Firebase ignore list

## Deployment
- Firebase Functions v2 deployment successful
- All 7 required secrets verified and configured
- Function URL: https://api-y56ccs6wva-uc.a.run.app

## Performance Improvements
- Processing time: ~5-6 minutes (down from 23+ minutes)
- API calls: 1-2 (down from 5+)
- Completeness: 100% achievable
- LLM Model: claude-3-7-sonnet-latest

## Breaking Changes
- Default processing strategy changed to 'simple_full_document'
- RAG processor available as alternative strategy 'document_ai_agentic_rag'

## Files Changed
- 36 files changed, 5642 insertions(+), 4451 deletions(-)
- Removed deprecated documentation files
- Cleaned up unused services and models

This release represents a major refactoring focused on speed, accuracy, and maintainability.
2025-11-09 21:07:22 -05:00


Financial Data Extraction: Implementation Plan

Overview

This document provides a step-by-step implementation plan to fix the financial data extraction issue by utilizing Document AI's structured table data.


Timeline: 1-2 hours
Expected Improvement: 60-70% accuracy gain
Risk: Low - additive changes, no breaking modifications

Step 1.1: Update DocumentAIOutput Interface

File: backend/src/services/documentAiProcessor.ts

Current (lines 15-25):

interface DocumentAIOutput {
  text: string;
  entities: Array<{...}>;
  tables: Array<any>;  // ❌ Just counts, no structure
  pages: Array<any>;
  mimeType: string;
}

Updated:

export interface StructuredTable {
  headers: string[];
  rows: string[][];
  position: {
    pageNumber: number;
    confidence: number;
  };
  rawTable?: any; // Keep original for debugging
}

interface DocumentAIOutput {
  text: string;
  entities: Array<{...}>;
  tables: StructuredTable[];  // ✅ Full structure
  pages: Array<any>;
  mimeType: string;
}

Step 1.2: Add Table Text Extraction Helper

File: backend/src/services/documentAiProcessor.ts Location: Add after line 51 (after constructor)

/**
 * Extract text from a Document AI layout object using text anchors
 * Based on Google's best practices: https://cloud.google.com/document-ai/docs/handle-response
 */
private getTextFromLayout(layout: any, documentText: string): string {
  try {
    const textAnchor = layout?.textAnchor;
    if (!textAnchor?.textSegments || textAnchor.textSegments.length === 0) {
      return '';
    }

    // Concatenate all segments (Document AI can split one logical run of text
    // across several segments, so reading only the first can truncate cells)
    let extracted = '';
    for (const segment of textAnchor.textSegments) {
      const startIndex = parseInt(segment.startIndex || '0', 10);
      const endIndex = parseInt(segment.endIndex || documentText.length.toString(), 10);

      // Validate indices before slicing
      if (startIndex < 0 || endIndex > documentText.length || startIndex >= endIndex) {
        logger.warn('Invalid text anchor indices', { startIndex, endIndex, docLength: documentText.length });
        continue;
      }

      extracted += documentText.substring(startIndex, endIndex);
    }

    return extracted.trim();
  } catch (error) {
    logger.error('Failed to extract text from layout', {
      error: error instanceof Error ? error.message : String(error),
      layout
    });
    return '';
  }
}
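As a quick sanity check, the anchor arithmetic can be exercised in isolation. The sketch below uses simplified shapes and a hypothetical sample string (not the production types) to mirror the indexing logic above:

```typescript
// Minimal illustration of how a Document AI text anchor maps back into the
// full document text. Real responses carry indices as strings, hence parseInt.
interface TextSegment { startIndex?: string; endIndex?: string; }

function sliceByAnchor(documentText: string, segment: TextSegment): string {
  const start = parseInt(segment.startIndex ?? '0', 10);
  const end = parseInt(segment.endIndex ?? String(documentText.length), 10);
  // Out-of-range or inverted indices yield an empty string, as in the method above
  if (start < 0 || end > documentText.length || start >= end) return '';
  return documentText.substring(start, end).trim();
}

const doc = 'Revenue  $45.2M  EBITDA  $8.5M';
console.log(sliceByAnchor(doc, { startIndex: '0', endIndex: '7' })); // 'Revenue'
```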

Step 1.3: Add Structured Table Extraction

File: backend/src/services/documentAiProcessor.ts Location: Add after getTextFromLayout method

/**
 * Extract structured tables from Document AI response
 * Preserves column alignment and table structure
 */
private extractStructuredTables(document: any, documentText: string): StructuredTable[] {
  const tables: StructuredTable[] = [];

  try {
    const pages = document.pages || [];
    logger.info('Extracting structured tables from Document AI response', {
      pageCount: pages.length
    });

    for (const page of pages) {
      const pageTables = page.tables || [];
      const pageNumber = page.pageNumber || 0;

      logger.info('Processing page for tables', {
        pageNumber,
        tableCount: pageTables.length
      });

      for (let tableIndex = 0; tableIndex < pageTables.length; tableIndex++) {
        const table = pageTables[tableIndex];

        try {
          // Extract headers from first header row
          const headers: string[] = [];
          if (table.headerRows && table.headerRows.length > 0) {
            const headerRow = table.headerRows[0];
            for (const cell of headerRow.cells || []) {
              const cellText = this.getTextFromLayout(cell.layout, documentText);
              headers.push(cellText);
            }
          }

          // Extract data rows
          const rows: string[][] = [];
          for (const bodyRow of table.bodyRows || []) {
            const row: string[] = [];
            for (const cell of bodyRow.cells || []) {
              const cellText = this.getTextFromLayout(cell.layout, documentText);
              row.push(cellText);
            }
            if (row.length > 0) {
              rows.push(row);
            }
          }

          // Only add tables with content
          if (headers.length > 0 || rows.length > 0) {
            tables.push({
              headers,
              rows,
              position: {
                pageNumber,
                confidence: table.confidence || 0.9
              },
              rawTable: table // Keep for debugging
            });

            logger.info('Extracted structured table', {
              pageNumber,
              tableIndex,
              headerCount: headers.length,
              rowCount: rows.length,
              headers: headers.slice(0, 10) // Log first 10 headers
            });
          }
        } catch (tableError) {
          logger.error('Failed to extract table', {
            pageNumber,
            tableIndex,
            error: tableError instanceof Error ? tableError.message : String(tableError)
          });
        }
      }
    }

    logger.info('Structured table extraction completed', {
      totalTables: tables.length
    });

  } catch (error) {
    logger.error('Failed to extract structured tables', {
      error: error instanceof Error ? error.message : String(error)
    });
  }

  return tables;
}

Step 1.4: Update processWithDocumentAI to Use Structured Tables

File: backend/src/services/documentAiProcessor.ts Location: Update lines 462-482

Current:

// Extract tables
const tables = document.pages?.flatMap(page =>
  page.tables?.map(table => ({
    rows: table.headerRows?.length || 0,
    columns: table.bodyRows?.[0]?.cells?.length || 0
  })) || []
) || [];

Updated:

// Extract structured tables with full content
const tables = this.extractStructuredTables(document, text);

Step 1.5: Pass Tables to Agentic RAG Processor

File: backend/src/services/documentAiProcessor.ts Location: Update line 337 (processLargeDocument call)

Current:

const result = await optimizedAgenticRAGProcessor.processLargeDocument(
  documentId,
  extractedText,
  {}
);

Updated:

const result = await optimizedAgenticRAGProcessor.processLargeDocument(
  documentId,
  extractedText,
  {
    structuredTables: documentAiOutput.tables || []
  }
);

Step 1.6: Update Agentic RAG Processor Signature

File: backend/src/services/optimizedAgenticRAGProcessor.ts Location: Update lines 41-48

Current:

async processLargeDocument(
  documentId: string,
  text: string,
  options: {
    enableSemanticChunking?: boolean;
    enableMetadataEnrichment?: boolean;
    similarityThreshold?: number;
  } = {}
)

Updated:

async processLargeDocument(
  documentId: string,
  text: string,
  options: {
    enableSemanticChunking?: boolean;
    enableMetadataEnrichment?: boolean;
    similarityThreshold?: number;
    structuredTables?: StructuredTable[];
  } = {}
)

Step 1.7: Add Import for StructuredTable Type

File: backend/src/services/optimizedAgenticRAGProcessor.ts Location: Add to imports at top (around line 1-6)

import type { StructuredTable } from './documentAiProcessor';

Step 1.8: Create Financial Table Identifier

File: backend/src/services/optimizedAgenticRAGProcessor.ts Location: Add after line 503 (after calculateCosineSimilarity)

/**
 * Identify if a structured table contains financial data
 * Uses heuristics to detect financial tables vs. other tables
 */
private isFinancialTable(table: StructuredTable): boolean {
  const headerText = table.headers.join(' ').toLowerCase();
  const allRowsText = table.rows.map(row => row.join(' ').toLowerCase()).join(' ');

  // Check for year/period indicators in headers
  const hasPeriods = /fy[-\s]?\d{1,2}|20\d{2}|ltm|ttm|ytd|cy\d{2}|q[1-4]/i.test(headerText);

  // Check for financial metrics in rows
  const financialMetrics = [
    'revenue', 'sales', 'ebitda', 'ebit', 'profit', 'margin',
    'gross profit', 'operating income', 'net income', 'cash flow',
    'earnings', 'assets', 'liabilities', 'equity'
  ];
  const hasFinancialMetrics = financialMetrics.some(metric =>
    allRowsText.includes(metric)
  );

  // Check for currency/percentage values
  const hasCurrency = /\$[\d,]+(?:\.\d+)?[kmb]?|\d+(?:\.\d+)?%/i.test(allRowsText);

  // A financial table should have periods AND (metrics OR currency values)
  const isFinancial = hasPeriods && (hasFinancialMetrics || hasCurrency);

  if (isFinancial) {
    logger.info('Identified financial table', {
      headers: table.headers,
      rowCount: table.rows.length,
      pageNumber: table.position.pageNumber
    });
  }

  return isFinancial;
}

/**
 * Format a structured table as markdown for better LLM comprehension
 * Preserves column alignment and makes tables human-readable
 */
private formatTableAsMarkdown(table: StructuredTable): string {
  const lines: string[] = [];

  // Add header row
  if (table.headers.length > 0) {
    lines.push(`| ${table.headers.join(' | ')} |`);
    lines.push(`| ${table.headers.map(() => '---').join(' | ')} |`);
  }

  // Add data rows
  for (const row of table.rows) {
    lines.push(`| ${row.join(' | ')} |`);
  }

  return lines.join('\n');
}
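Both helpers are pure functions over the table shape, so they can be unit-tested in isolation. The sketch below (a simplified `StructuredTable`, logger omitted, hypothetical sample data) shows the heuristic and the markdown output on a toy table:

```typescript
// Standalone sketch of the two helpers above, with the logger and position
// metadata stripped so the heuristics can be tested on their own.
interface SimpleTable { headers: string[]; rows: string[][]; }

function looksFinancial(table: SimpleTable): boolean {
  const headerText = table.headers.join(' ').toLowerCase();
  const bodyText = table.rows.map(r => r.join(' ').toLowerCase()).join(' ');
  // Periods in headers AND (metric names OR currency/percent values) in the body
  const hasPeriods = /fy[-\s]?\d{1,2}|20\d{2}|ltm|ttm|ytd|cy\d{2}|q[1-4]/i.test(headerText);
  const hasMetrics = ['revenue', 'ebitda', 'gross profit', 'net income'].some(m => bodyText.includes(m));
  const hasCurrency = /\$[\d,]+(?:\.\d+)?[kmb]?|\d+(?:\.\d+)?%/i.test(bodyText);
  return hasPeriods && (hasMetrics || hasCurrency);
}

function toMarkdown(table: SimpleTable): string {
  const lines: string[] = [];
  if (table.headers.length > 0) {
    lines.push(`| ${table.headers.join(' | ')} |`);
    lines.push(`| ${table.headers.map(() => '---').join(' | ')} |`);
  }
  for (const row of table.rows) lines.push(`| ${row.join(' | ')} |`);
  return lines.join('\n');
}

const sample: SimpleTable = {
  headers: ['Metric', 'FY22', 'FY23', 'LTM'],
  rows: [['Revenue', '$45.2M', '$52.8M', '$58.1M']]
};
console.log(looksFinancial(sample)); // true
console.log(toMarkdown(sample));
```

A roster table like `['Name', 'Role']` has no period headers and is rejected, which is exactly the false-positive case the heuristic is designed to filter out.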

Step 1.9: Update Chunk Creation to Include Financial Tables

File: backend/src/services/optimizedAgenticRAGProcessor.ts Location: Update createIntelligentChunks method (lines 115-158)

Add after line 118:

// Extract structured tables from options
const structuredTables = (options as any)?.structuredTables || [];

Add after line 119 (inside the method, before semantic chunking):

// PRIORITY: Create dedicated chunks for financial tables
if (structuredTables.length > 0) {
  logger.info('Processing structured tables for chunking', {
    documentId,
    tableCount: structuredTables.length
  });

  for (let i = 0; i < structuredTables.length; i++) {
    const table = structuredTables[i];
    const isFinancial = this.isFinancialTable(table);

    // Format table as markdown for better readability
    const markdownTable = this.formatTableAsMarkdown(table);

    chunks.push({
      id: `${documentId}-table-${i}`,
      content: markdownTable,
      chunkIndex: chunks.length,
      startPosition: -1, // Tables don't have text positions
      endPosition: -1,
      sectionType: isFinancial ? 'financial-table' : 'table',
      metadata: {
        isStructuredTable: true,
        isFinancialTable: isFinancial,
        tableIndex: i,
        pageNumber: table.position.pageNumber,
        headerCount: table.headers.length,
        rowCount: table.rows.length,
        structuredData: table // Preserve original structure
      }
    });

    logger.info('Created chunk for structured table', {
      documentId,
      tableIndex: i,
      isFinancial,
      chunkId: chunks[chunks.length - 1].id,
      contentPreview: markdownTable.substring(0, 200)
    });
  }
}

Step 1.10: Pin Financial Tables in Extraction

File: backend/src/services/optimizedAgenticRAGProcessor.ts Location: Update extractPass1CombinedMetadataFinancial method (around line 1190-1260)

Add before the return statement (around line 1259):

// Identify and pin financial table chunks to ensure they're always included
const financialTableChunks = chunks.filter(
  chunk => chunk.metadata?.isFinancialTable === true
);

logger.info('Financial table chunks identified for pinning', {
  documentId,
  financialTableCount: financialTableChunks.length,
  chunkIds: financialTableChunks.map(c => c.id)
});

// Combine deterministic financial chunks with structured table chunks
const allPinnedChunks = [
  ...pinnedChunks,
  ...financialTableChunks
];

Update the return statement to use allPinnedChunks:

return await this.extractWithTargetedQuery(
  documentId,
  text,
  financialChunks,
  query,
  targetFields,
  7,
  allPinnedChunks  // ✅ Now includes both deterministic and structured tables
);

Testing Phase 1

Test 1.1: Verify Table Extraction

# Monitor logs for table extraction
cd backend
npm run dev

# Look for log entries:
# - "Extracting structured tables from Document AI response"
# - "Extracted structured table"
# - "Identified financial table"

Test 1.2: Upload a CIM Document

# Upload a test document and check processing
curl -X POST http://localhost:8080/api/documents/upload \
  -F "file=@test-cim.pdf" \
  -H "Authorization: Bearer YOUR_TOKEN"

Test 1.3: Verify Financial Data Populated

Check the database or API response for:

  • financialSummary.financials.fy3.revenue - Should have values
  • financialSummary.financials.fy2.ebitda - Should have values
  • NOT "Not specified in CIM" for fields that exist in tables
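This check can be scripted. The sketch below is hypothetical (the helper is not part of the codebase; field paths follow the example payloads later in this plan) and scans a `financialSummary.financials` object for leftover placeholders:

```typescript
// Sketch: walk a financials payload and list fields still set to the
// "Not specified in CIM" placeholder. Field paths assumed from this plan.
function findUnpopulated(financials: Record<string, Record<string, string>>): string[] {
  const missing: string[] = [];
  for (const [period, fields] of Object.entries(financials)) {
    for (const [field, value] of Object.entries(fields)) {
      if (value === 'Not specified in CIM') missing.push(`${period}.${field}`);
    }
  }
  return missing;
}

console.log(findUnpopulated({
  fy3: { revenue: '$45.2M', ebitda: 'Not specified in CIM' },
  fy2: { revenue: '$52.8M', ebitda: '$10.2M' }
}));
// [ 'fy3.ebitda' ]
```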

Test 1.4: Check Logs for Success Indicators

# Should see:
✅ "Identified financial table" - confirms tables detected
✅ "Created chunk for structured table" - confirms chunking worked
✅ "Financial table chunks identified for pinning" - confirms pinning worked
✅ "Deterministic financial data merged successfully" - confirms data merged

Baseline & Post-Change Metrics

Collect before/after numbers so we can validate the expected accuracy lift and know when to pull in the hybrid fallback:

  1. Instrument the processing metadata (see FINANCIAL_EXTRACTION_ANALYSIS.md) with tablesFound, financialTablesIdentified, structuredParsingUsed, textParsingFallback, and financialDataPopulated.
  2. Run ≥20 recent CIMs through the current pipeline and record aggregate stats (mean/median for the above plus sample documentIds with tablesFound === 0).
  3. Repeat after deploying the Phase 1 and Phase 2 changes; paste the numbers back into the analysis doc so the Success Criteria reference real data instead of estimates.
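The aggregate stats in step 2 can be computed with a small script. This sketch assumes records shaped like the "Key Metrics to Track" block later in this plan (the `summarize` helper and sample data are hypothetical):

```typescript
// Aggregate per-run processing metadata into the baseline stats from step 2.
interface RunMetadata {
  documentId: string;
  tablesFound: number;
  financialTablesIdentified: number;
  financialDataPopulated: boolean;
}

function summarize(runs: RunMetadata[]) {
  const counts = runs.map(r => r.tablesFound).sort((a, b) => a - b);
  const mean = counts.reduce((s, n) => s + n, 0) / counts.length;
  const mid = Math.floor(counts.length / 2);
  const median = counts.length % 2 ? counts[mid] : (counts[mid - 1] + counts[mid]) / 2;
  return {
    meanTablesFound: mean,
    medianTablesFound: median,
    populatedRate: runs.filter(r => r.financialDataPopulated).length / runs.length,
    // Sample documentIds with tablesFound === 0, per step 2
    zeroTableDocs: runs.filter(r => r.tablesFound === 0).map(r => r.documentId)
  };
}

const stats = summarize([
  { documentId: 'doc-a', tablesFound: 4, financialTablesIdentified: 1, financialDataPopulated: true },
  { documentId: 'doc-b', tablesFound: 0, financialTablesIdentified: 0, financialDataPopulated: false },
  { documentId: 'doc-c', tablesFound: 2, financialTablesIdentified: 1, financialDataPopulated: true }
]);
console.log(stats.zeroTableDocs); // [ 'doc-b' ]
```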

Expected Results After Phase 1

Before Phase 1:

{
  "financialSummary": {
    "financials": {
      "fy3": {
        "revenue": "Not specified in CIM",
        "ebitda": "Not specified in CIM"
      },
      "fy2": {
        "revenue": "Not specified in CIM",
        "ebitda": "Not specified in CIM"
      }
    }
  }
}

After Phase 1:

{
  "financialSummary": {
    "financials": {
      "fy3": {
        "revenue": "$45.2M",
        "revenueGrowth": "N/A",
        "ebitda": "$8.5M",
        "ebitdaMargin": "18.8%"
      },
      "fy2": {
        "revenue": "$52.8M",
        "revenueGrowth": "16.8%",
        "ebitda": "$10.2M",
        "ebitdaMargin": "19.3%"
      }
    }
  }
}

Phase 2: Enhanced Deterministic Parsing (Optional)

Timeline: 2-3 hours
Expected Additional Improvement: +15-20% accuracy
Trigger: If Phase 1 results are below 70% accuracy

Step 2.1: Create Structured Table Parser

File: Create backend/src/services/structuredFinancialParser.ts

import { logger } from '../utils/logger';
import type { StructuredTable } from './documentAiProcessor';
import type { ParsedFinancials, FinancialPeriod } from './financialTableParser';

/**
 * Parse financials directly from Document AI structured tables
 * This is more reliable than parsing from flattened text
 */
export function parseFinancialsFromStructuredTable(
  table: StructuredTable
): ParsedFinancials {
  const result: ParsedFinancials = {
    fy3: {},
    fy2: {},
    fy1: {},
    ltm: {}
  };

  try {
    // 1. Identify period columns from headers
    const periodMapping = mapHeadersToPeriods(table.headers);

    logger.info('Structured table period mapping', {
      headers: table.headers,
      periodMapping
    });

    // 2. Process each row to extract metrics
    for (let rowIndex = 0; rowIndex < table.rows.length; rowIndex++) {
      const row = table.rows[rowIndex];
      if (row.length === 0) continue;

      const metricName = row[0].toLowerCase();

      // Match against known financial metrics
      const fieldName = identifyMetricField(metricName);
      if (!fieldName) continue;

      // 3. Assign values to correct periods
      periodMapping.forEach((period, columnIndex) => {
        if (!period) return; // Skip unmapped columns

        const value = row[columnIndex]; // periodMapping aligns with headers, so the metric-name column (index 0) maps to null and is skipped above
        if (!value || value.trim() === '') return;

        // 4. Validate value type matches field
        if (isValidValueForField(value, fieldName)) {
          result[period][fieldName] = value.trim();

          logger.debug('Mapped structured table value', {
            period,
            field: fieldName,
            value: value.trim(),
            row: rowIndex,
            column: columnIndex
          });
        }
      });
    }

    logger.info('Structured table parsing completed', {
      fy3: result.fy3,
      fy2: result.fy2,
      fy1: result.fy1,
      ltm: result.ltm
    });

  } catch (error) {
    logger.error('Failed to parse structured financial table', {
      error: error instanceof Error ? error.message : String(error)
    });
  }

  return result;
}

/**
 * Map header columns to financial periods (fy3, fy2, fy1, ltm)
 */
function mapHeadersToPeriods(headers: string[]): Array<keyof ParsedFinancials | null> {
  const periodMapping: Array<keyof ParsedFinancials | null> = [];

  for (const header of headers) {
    const normalized = header.trim().toUpperCase().replace(/\s+/g, '');
    let period: keyof ParsedFinancials | null = null;

    // Check for LTM/TTM
    if (normalized.includes('LTM') || normalized.includes('TTM')) {
      period = 'ltm';
    }
    // Check for year patterns
    else if (/FY[-\s]?1$|FY[-\s]?2024|2024/.test(normalized)) {
      period = 'fy1'; // Most recent full year
    }
    else if (/FY[-\s]?2$|FY[-\s]?2023|2023/.test(normalized)) {
      period = 'fy2'; // Second most recent year
    }
    else if (/FY[-\s]?3$|FY[-\s]?2022|2022/.test(normalized)) {
      period = 'fy3'; // Third most recent year
    }
    // Generic FY pattern - assign based on position
    else if (/FY\d{2}/.test(normalized)) {
      // Will be assigned based on relative position
      period = null; // Handle in second pass
    }

    periodMapping.push(period);
  }

  // Second pass: fill in generic FY columns based on position
  // Most recent on right, oldest on left (common CIM format)
  let fyIndex = 1;
  for (let i = periodMapping.length - 1; i >= 0; i--) {
    if (periodMapping[i] === null && /FY/i.test(headers[i])) {
      if (fyIndex === 1) periodMapping[i] = 'fy1';
      else if (fyIndex === 2) periodMapping[i] = 'fy2';
      else if (fyIndex === 3) periodMapping[i] = 'fy3';
      fyIndex++;
    }
  }

  return periodMapping;
}

/**
 * Identify which financial field a metric name corresponds to
 */
function identifyMetricField(metricName: string): keyof FinancialPeriod | null {
  const name = metricName.toLowerCase();

  if (/^revenue|^net sales|^total sales|^top\s+line/.test(name)) {
    return 'revenue';
  }
  if (/gross\s*profit/.test(name)) {
    return 'grossProfit';
  }
  if (/gross\s*margin/.test(name)) {
    return 'grossMargin';
  }
  if (/ebitda\s*margin|adj\.?\s*ebitda\s*margin/.test(name)) {
    return 'ebitdaMargin';
  }
  if (/ebitda|adjusted\s*ebitda|adj\.?\s*ebitda/.test(name)) {
    return 'ebitda';
  }
  if (/revenue\s*growth|yoy|y\/y|year[-\s]*over[-\s]*year/.test(name)) {
    return 'revenueGrowth';
  }

  return null;
}

/**
 * Validate that a value is appropriate for a given field
 */
function isValidValueForField(value: string, field: keyof FinancialPeriod): boolean {
  const trimmed = value.trim();

  // Margin and growth fields should have %
  if (field.includes('Margin') || field.includes('Growth')) {
    return /\d/.test(trimmed) && (trimmed.includes('%') || trimmed.toLowerCase() === 'n/a');
  }

  // Revenue, profit, EBITDA should have $ or numbers
  if (['revenue', 'grossProfit', 'ebitda'].includes(field)) {
    return /\d/.test(trimmed) && (trimmed.includes('$') || /\d+[KMB]/i.test(trimmed));
  }

  return /\d/.test(trimmed);
}
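The positional fallback in mapHeadersToPeriods is easiest to see on a concrete header row. This standalone sketch (generic-FY handling only, hypothetical headers) reproduces the right-to-left assignment:

```typescript
// Standalone sketch of the positional fallback: generic "FYxx" columns are
// assigned fy1..fy3 right-to-left, since CIMs usually put the most recent
// year in the rightmost column.
type Period = 'fy1' | 'fy2' | 'fy3' | 'ltm' | null;

function mapGenericFyHeaders(headers: string[]): Period[] {
  const mapping: Period[] = headers.map(h => {
    const n = h.trim().toUpperCase();
    if (n.includes('LTM') || n.includes('TTM')) return 'ltm';
    return null; // generic FY columns get resolved positionally below
  });
  let fyIndex = 1;
  for (let i = mapping.length - 1; i >= 0; i--) {
    if (mapping[i] === null && /FY/i.test(headers[i])) {
      if (fyIndex <= 3) mapping[i] = `fy${fyIndex}` as Period;
      fyIndex++;
    }
  }
  return mapping;
}

console.log(mapGenericFyHeaders(['Metric', 'FY22', 'FY23', 'FY24', 'LTM']));
// [ null, 'fy3', 'fy2', 'fy1', 'ltm' ]
```

Note the metric-name column stays `null`, which is why value lookups in Step 2.1 must index rows by the same column position as the headers.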

Step 2.2: Integrate Structured Parser

File: backend/src/services/optimizedAgenticRAGProcessor.ts Location: Update multi-pass extraction (around line 1063-1088)

Add import:

import { parseFinancialsFromStructuredTable } from './structuredFinancialParser';

Update financial extraction logic (around line 1066-1088):

// Try structured table parsing first (most reliable)
try {
  const structuredTables = (options as any)?.structuredTables || [];
  const financialTables = structuredTables.filter((t: StructuredTable) => this.isFinancialTable(t));

  if (financialTables.length > 0) {
    logger.info('Attempting structured table parsing', {
      documentId,
      financialTableCount: financialTables.length
    });

    // Try each financial table until we get good data
    for (const table of financialTables) {
      const parsedFromTable = parseFinancialsFromStructuredTable(table);

      if (this.hasStructuredFinancialData(parsedFromTable)) {
        deterministicFinancials = parsedFromTable;
        deterministicFinancialChunk = this.buildDeterministicFinancialChunk(documentId, parsedFromTable);

        logger.info('Structured table parsing successful', {
          documentId,
          tableIndex: financialTables.indexOf(table),
          fy3: parsedFromTable.fy3,
          fy2: parsedFromTable.fy2,
          fy1: parsedFromTable.fy1,
          ltm: parsedFromTable.ltm
        });
        break; // Found good data, stop trying tables
      }
    }
  }
} catch (structuredParserError) {
  logger.warn('Structured table parsing failed, falling back to text parser', {
    documentId,
    error: structuredParserError instanceof Error ? structuredParserError.message : String(structuredParserError)
  });
}

// Fallback to text-based parsing if structured parsing failed
if (!deterministicFinancials) {
  try {
    const { parseFinancialsFromText } = await import('./financialTableParser');
    const parsedFinancials = parseFinancialsFromText(text);
    // ... existing code
  } catch (parserError) {
    // ... existing error handling
  }
}

Rollback Plan

If Phase 1 causes issues:

Quick Rollback (5 minutes)

git checkout HEAD -- backend/src/services/documentAiProcessor.ts
git checkout HEAD -- backend/src/services/optimizedAgenticRAGProcessor.ts
npm run build
npm start

Add environment variable to control new behavior:

// backend/src/config/env.ts
export const config = {
  features: {
    useStructuredTables: process.env.USE_STRUCTURED_TABLES === 'true'
  }
};

Then wrap new code:

if (config.features.useStructuredTables) {
  // Use structured tables
} else {
  // Use old flat text approach
}

Success Criteria

Phase 1 Success:

  • 60%+ of CIM documents have populated financial data (validated via new telemetry)
  • No regression in processing time (< 10% increase acceptable)
  • No errors in table extraction pipeline
  • Structured tables logged in console

Phase 2 Success:

  • 85%+ of CIM documents have populated financial data or fall back to the hybrid path when tablesFound === 0
  • Column alignment accuracy > 95%
  • Reduction in "Not specified in CIM" responses

Monitoring & Debugging

Key Metrics to Track

// Add to processing result
metadata: {
  tablesFound: number;
  financialTablesIdentified: number;
  structuredParsingUsed: boolean;
  textParsingFallback: boolean;
  financialDataPopulated: boolean;
}

Log Analysis Queries

# Find documents with no tables
grep "totalTables: 0" backend.log

# Find failed table extractions
grep "Failed to extract table" backend.log

# Find successful financial extractions
grep "Structured table parsing successful" backend.log

Next Steps After Implementation

  1. Run on historical documents: Reprocess 10-20 existing CIMs to compare before/after
  2. A/B test: Process new documents with both old and new system, compare results
  3. Tune thresholds: Adjust financial table identification heuristics based on results
  4. Document findings: Update this plan with actual results and lessons learned
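For the A/B test in step 2, a small helper can quantify the lift. This sketch is hypothetical (shapes and helper name are not from the codebase) and counts fields the new pipeline populated that the old one left as the placeholder:

```typescript
// Sketch: count fields that moved from the "Not specified in CIM" placeholder
// in the old run to a real value in the new run.
type PeriodFields = Record<string, string>;
type Financials = Record<string, PeriodFields>;

function countNewlyPopulated(oldRun: Financials, newRun: Financials): number {
  let count = 0;
  for (const period of Object.keys(newRun)) {
    for (const field of Object.keys(newRun[period])) {
      const before = oldRun[period]?.[field];
      const after = newRun[period][field];
      if (before === 'Not specified in CIM' && after !== 'Not specified in CIM') count++;
    }
  }
  return count;
}

console.log(countNewlyPopulated(
  { fy3: { revenue: 'Not specified in CIM', ebitda: 'Not specified in CIM' } },
  { fy3: { revenue: '$45.2M', ebitda: 'Not specified in CIM' } }
)); // 1
```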

Resources